Posts by Mad_Max

1) Message boards : Number crunching : Waste of resource. (Message 106192)
Posted 14 May 2022 by Mad_Max
Post:
The number of "runs" is really big: from tens of thousands to hundreds of thousands of runs per model, and in some cases more than a million. Increasing the number of runs in one task by increasing the target computation time does not affect anything - the total number is regulated by the number of computed and validated results on the server side.
When the scientists decide that they have received enough runs for a particular model/goal, they simply cancel all remaining tasks of the same series that are still in the queue, including ones that have already been downloaded to volunteer PCs (you have probably noticed this at some point: part of the tasks from a series that ran without problems or errors is suddenly canceled by a command from the server, shown as "server abort" in the WU status).


P.S.
In any case, even if we generate some "excessive" results, it will not be a complete waste of resources, because with the pseudo-random search approach used in R@H, more runs/passes give a better (more accurate) result. The cut-off point (when to say enough is enough) is rather arbitrary. It is usually chosen when the improvement in accuracy has become quite insignificant and it is no longer practical to keep spending large amounts of computing power on it. But an insignificant improvement does not mean no improvement at all.
2) Message boards : Number crunching : Level 3 cache requirements? (Message 106191)
Posted 14 May 2022 by Mad_Max
Post:
Using an additional ("virtual") thread always decreases the performance of the other running threads.
That is normal, because the threads share the same compute units in the same physical core.

If the slowdown of a single thread is not very big (in the 10-30% range), then it is normal and there is nothing to do about it.
Only if the slowdown of a single thread is very big, around twice as slow or worse (so the total throughput of all threads combined stops improving or even decreases), does it indicate that something went wrong, such as insufficient cache size or other issues.

But a low to moderate slowdown when using virtual threads is both normal and inevitable.
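
The rule of thumb above can be checked with simple arithmetic. This is only a minimal sketch: it assumes two threads per physical core and that both threads slow down by the same fraction, and the function name is made up for illustration.

```python
def smt_total_throughput(per_thread_slowdown: float, threads_per_core: int = 2) -> float:
    """Throughput of one core running `threads_per_core` tasks,
    relative to the same core running a single task (= 1.0)."""
    if not 0.0 <= per_thread_slowdown < 1.0:
        raise ValueError("slowdown must be in the range [0, 1)")
    return threads_per_core * (1.0 - per_thread_slowdown)


if __name__ == "__main__":
    for slowdown in (0.10, 0.30, 0.50, 0.60):
        total = smt_total_throughput(slowdown)
        verdict = "gain" if total > 1.0 else "no gain or a loss"
        print(f"{slowdown:.0%} slower per thread -> {total:.2f}x total ({verdict})")
```

With a 10-30% per-thread slowdown the core still does clearly more work in total; at 50% the gain is gone, and beyond that SMT costs throughput.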
3) Message boards : Number crunching : Rosetta stops the use of my 2nd GPU (Message 106190)
Posted 14 May 2022 by Mad_Max
Post:
It is probably due to a combination of a large work cache in your BOINC settings and short deadlines for R@H WUs (an R@H server-side setting).
This can drive BOINC into "panic" (high priority) mode: it allocates all CPU resources to a single project in an attempt to avoid missing deadlines,
including the CPU core(s) allocated for GPU support.

It is a stupid design decision (because it greatly decreases total performance by cutting off all GPU computation), but the BOINC scheduler has worked this way for years.

Possible workarounds:
1 - reduce the cache size setting and/or abort excess WUs from the "hoarding" project in the BOINC WU queue.
2 - use app_config.xml to limit the maximum number of running instances for one project
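
For the second workaround, a minimal sketch of generating such a file is shown here. It relies on the `project_max_concurrent` element of app_config.xml; the limit of 4 and the helper name are only assumptions for illustration. The resulting text would be saved as app_config.xml in the project's folder inside the BOINC data directory, and the client then has to re-read its config files or be restarted.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_running: int) -> str:
    """Return app_config.xml text that caps the number of tasks
    running at the same time for one project."""
    root = ET.Element("app_config")
    ET.SubElement(root, "project_max_concurrent").text = str(max_running)
    ET.indent(root)  # pretty-printing, needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 4 is an arbitrary example value, pick one that leaves a core free for the GPU.
    print(build_app_config(4))
```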
4) Message boards : Number crunching : Excessive workunit fetch (Message 106180)
Posted 11 May 2022 by Mad_Max
Post:
Yes, it is already included in BOINC ver. 7.20.0 (and later).

But v. 7.20 itself is not finished yet.
5) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 103893)
Posted 24 Dec 2021 by Mad_Max
Post:
I had gone as far to untick all the disk space boxes to give it unlimited use of the disk
The boxes aren't tickable, they require values. And one value in any one of the options overrides the values in any of the other two when it comes to what disk space is actually available.

They are quite tickable - there are checkboxes to the left of each value box which turn off the corresponding limit.

I was referring to the web based settings.
If you've only got one system, local Setting are ok. More than one, web based settings make life much easier.

The web based settings also have the same checkboxes for disk and network usage limits as the local settings do. At least here, in the web based BOINC settings on the Rosetta server.
6) Message boards : Number crunching : How to determine what future work is coming this way from Robetta (Message 103890)
Posted 24 Dec 2021 by Mad_Max
Post:
P.S.

I saw messages in another topic saying that the project has simply run out of regular (4.20) Rosetta WUs and that the few million WUs in the main work queue are all Python tasks.
But I see that this is not the case. New regular WUs continue to be created and distributed by the server every day, only sometimes at a greatly reduced pace (you need to be lucky to get some, but it is still possible).
And after monitoring the server status, it looks like this pace is limited by the factors I described in the previous post.
7) Message boards : Number crunching : How to determine what future work is coming this way from Robetta (Message 103889)
Posted 24 Dec 2021 by Mad_Max
Post:
I think there is still plenty of work for regular Rosetta in the queue.
And the work shortage is just another example of misconfiguration and poor maintenance by the project's tech staff (admins/programmers).

It looks like the work generator daemons are configured to maintain about 5000 tasks in the "ready to send" state (taking them from the large work queue displayed on the home page when needed), but this target is set as a combined number,
ignoring the fact that the project now has two fundamentally different types of tasks that are not mixed with each other.

There are two separate work generators. One creates regular Rosetta WUs only, and the second creates Python vBox WUs only.
You can see them on server status page: https://boinc.bakerlab.org/rosetta/server_status.php
rah_make_work_rosetta - regular WUs generator
rah_make_work_rosetta_python_projects - Python WUs generator

Python WUs are processed by volunteer computers at a much lower pace (due to much higher system requirements, the absence of vBox on some BOINC client installations, some users who do NOT want vBox tasks and have disabled them, etc.).
As a result, after some time the current work cache (Tasks ready to send) ends up occupied by vBox Python WUs only, and no regular Rosetta tasks are available to download.
But the regular WU generator does not kick in, because the target work cache size is already filled (by Python tasks). It produces new work only after some Python tasks have been distributed to clients, "freeing up space" for regular tasks.

So now regular Rosetta performance and work supply are limited by the performance of the much slower Python work queue.
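
The effect described above can be reproduced with a toy model. This is only a sketch of my guess at the behavior, not the real server code: the demand numbers, the even split of free space between the two generators and the function name are all assumptions.

```python
def regular_rate(shared_target: bool, steps: int = 200, target: int = 5000,
                 regular_demand: int = 500, python_demand: int = 20) -> int:
    """Toy model of the "ready to send" cache fed by two work generators.

    Each step clients download up to `regular_demand` regular tasks and up to
    `python_demand` Python tasks, then the generators top the cache up.
    Returns the number of regular tasks handed out in the last step.
    """
    regular = python = target // 2
    sent_regular = 0
    for _ in range(steps):
        # Clients take what they can from each task type.
        sent_regular = min(regular, regular_demand)
        regular -= sent_regular
        python -= min(python, python_demand)
        if shared_target:
            # Assumption: one combined target, free space split evenly.
            free = max(0, target - (regular + python))
            regular += free // 2
            python += free - free // 2
        else:
            # Each generator keeps its own half of the cache filled.
            regular = max(regular, target // 2)
            python = max(python, target // 2)
    return sent_regular


if __name__ == "__main__":
    print("combined target:", regular_rate(shared_target=True), "regular tasks per step")
    print("separate targets:", regular_rate(shared_target=False), "regular tasks per step")
```

In this model the combined target lets slow Python tasks pile up in the cache until regular tasks are handed out only about as fast as Python tasks leave, while separate targets keep regular work flowing at full demand.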
8) Message boards : Number crunching : Excessive workunit fetch (Message 103762)
Posted 7 Dec 2021 by Mad_Max
Post:
He said it about regular (non-Python/VirtualBox) tasks. RAM usage of regular R@H tasks lies in the 0.7-3 GB range for almost all tasks (>95%), and usually in the 0.7-1.5 GB range (>70%).
So it is very far from 8 GB per task.
9) Message boards : Number crunching : Excessive workunit fetch (Message 103653)
Posted 2 Dec 2021 by Mad_Max
Post:
It is a confirmed BOINC (not Rosetta) bug: excessive workunit fetch for a project if the <max_concurrent> setting is used.

It was identified about a year ago.
One of the latest bug reports to the BOINC developers: https://github.com/BOINC/boinc/issues/4322

And finally a patch for this problem is in the works now: https://github.com/BOINC/boinc/pull/4592
It will be included in one of the future BOINC releases.

For now there are a few workarounds until a fixed version of BOINC is available:
1 - avoid using the <max_concurrent> setting completely
OR
2 - set the work cache size to a really low value
OR
3 - run R@H together with other projects that have a stable WU supply, WITHOUT the <max_concurrent> setting applied to these "spare" projects (only to R@H)

The bug itself is a wrong calculation of the amount of work BOINC already has in its queue for projects limited by the <max_concurrent> setting.
BOINC counts only up to <max_concurrent> tasks in the queue and ignores the rest. So the calculated work queue size does not increase no matter how many tasks BOINC has already downloaded for such a project, and the client falls into an endless loop of fetching new work.
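
A toy model of this miscalculation is sketched below. It is NOT the real BOINC scheduler code, just an illustration of the logic described above: all names and numbers (2-day cache, 8-hour tasks, 4 tasks per fetch) are assumptions, and the loop is capped at `max_rounds` so the buggy variant terminates.

```python
def queued_seconds(task_runtimes, max_concurrent=None, buggy=False):
    """Estimated amount of queued work in seconds.
    The buggy variant counts only the first `max_concurrent` tasks."""
    if buggy and max_concurrent is not None:
        task_runtimes = task_runtimes[:max_concurrent]
    return sum(task_runtimes)


def fetch_rounds(buggy, cache_target_s=2 * 86400, task_s=8 * 3600,
                 max_concurrent=2, tasks_per_fetch=4, max_rounds=20):
    """Keep requesting work while the estimated queue is below the cache target.
    Returns (number of fetches, number of tasks in the queue)."""
    queue = []
    rounds = 0
    while rounds < max_rounds and queued_seconds(queue, max_concurrent, buggy) < cache_target_s:
        queue.extend([task_s] * tasks_per_fetch)
        rounds += 1
    return rounds, len(queue)


if __name__ == "__main__":
    print("correct estimate:", fetch_rounds(buggy=False))
    print("buggy estimate:  ", fetch_rounds(buggy=True))
```

With the correct estimate the client stops once the queue covers the cache size. With the buggy one the estimate never grows past `max_concurrent` tasks, so fetching continues until the artificial cap is hit.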
10) Message boards : Number crunching : Some of the erroneous WUs are listed as successful and even pass validation (Message 102326)
Posted 31 Jul 2021 by Mad_Max
Post:
It did not do any useful work.
Such WUs error out just a few minutes after starting, before the computation of the very first decoy/model is completed and results are saved. The logs suggest it happens because the R@H app cannot open/load all the input data needed for the computation (possibly a configuration error at the WU generation stage?).
But they still somehow slip through the server validator and are marked as Success/Valid.
11) Message boards : Number crunching : Some of the erroneous WUs are listed as successful and even pass validation (Message 101729)
Posted 5 May 2021 by Mad_Max
Post:
I spotted some WUs which completely failed but have "Success" status, and the server even validates them and grants credit for such WUs with critical errors.

Name pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_3cm5iv5y_1391022_1_0
Workunit 1229929395
Created 5 May 2021, 2:48:08 UTC
Sent 5 May 2021, 4:49:41 UTC
Report deadline 8 May 2021, 4:49:41 UTC
Received 5 May 2021, 5:48:12 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x00000000)
Run time 7 min 18 sec
CPU time 6 min 49 sec
Validate state Valid
Credit 3.27



While the logs clearly show fatal errors:
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pre_helix_boinc_v1.xml @helix_design.flags -in:file:silent pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_3cm5iv5y.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_3cm5iv5y.zip @pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_3cm5iv5y.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 2578584
Using database: database_357d5d93529_n_methylminirosetta_database

ERROR: [ERROR] Unable to open constraints file: bc1dd6b031238f177cab303f1b5a3aef_0001.MSAcst
ERROR:: Exit from: ..\..\..\src\core\scoring\constraints\ConstraintIO.cc line: 457
08:41:23 (4920): called boinc_finish(0)

</stderr_txt>
]]>


Links to tasks

https://boinc.bakerlab.org/rosetta/result.php?resultid=1376543173

https://boinc.bakerlab.org/rosetta/result.php?resultid=1375927242

https://boinc.bakerlab.org/rosetta/result.php?resultid=1375899817
12) Message boards : Number crunching : Rosetta needs 6675.72 MB RAM: is the restriction really needed? (Message 101728)
Posted 5 May 2021 by Mad_Max
Post:

Indeed Windows and most Linux flavors allocate a swap file the same size as physical memory. For some reason, probably historical, the Raspberry Pi foundation allocate 100MB to this day. Increasing it should allow it to download tasks although I haven't tried it.


Not sure about Linux, but on Windows BOINC counts only real RAM, and it does not matter how much swap space you allocate. It won't help with these abnormal R@H RAM requirements.
I still (for more than a month already) see errors like
05-May-2021 07:48:46 [Rosetta@home] Scheduler request completed: got 0 new tasks
05-May-2021 07:48:46 [Rosetta@home] No tasks sent
05-May-2021 07:48:46 [Rosetta@home] Rosetta needs 6675.72 MB RAM but only 6117.47 MB is available for use.
on computers with 8 GB of RAM + 8 GB of swap space.
And after such tasks are finally downloaded, they usually use less than 1 GB of RAM per task, and the computer runs up to 4-8 R@H tasks simultaneously without any problems.
But for the last month it usually cannot get any, because the server thinks there is not enough RAM for even one task and refuses to send any work.

Pure stupidity.
13) Message boards : Number crunching : Rosetta needs 6675.72 MB RAM: is the restriction really needed? (Message 101357)
Posted 18 Apr 2021 by Mad_Max
Post:
Total project RAC is already falling too (down about 30% in the last week) as this problem starts to affect many computers, which cannot get ANY R@H work for long periods of time because of the abnormal RAM requirements of SOME work in the server queue.
14) Message boards : Number crunching : Rosetta needs 6675.72 MB RAM: is the restriction really needed? (Message 101296)
Posted 14 Apr 2021 by Mad_Max
Post:
To get new WUs, the PC needs 6675.72 MB.
Is the restriction really needed?
No.
There is a problem with the configuration of many of the current Work Units, they are requesting more RAM & Disk space than they require, making it impossible for some systems to process them.

It's even worse - such bad WUs block the receipt and processing of normal WUs. Every time such a task with incorrectly set memory requirements comes up in the distribution queue (on the server side), the server refuses to issue ANY tasks to the machine, including normally configured ones.
In recent days, my 8 GB computers have often spent all their time processing backup projects instead of R@H,
because they cannot receive tasks from Rosetta: the server refuses to issue any other tasks until someone else (with a sufficiently large amount of free RAM) picks up all the tasks with abnormal RAM requests from the server queue. But by that time, my machines have already filled their work queues with WUs from backup projects.
15) Message boards : Number crunching : 12 CPU WUs (Message 97702)
Posted 27 Jun 2020 by Mad_Max
Post:
What are you talking about?

There is no such thing as a multi-threaded Rosetta app, so all WUs are single-threaded ONLY.
It is not possible to create such WUs because the current R@H application does not support MT processing.
16) Message boards : Number crunching : L3 Cache is not a problem, CPU Performance Counter proves it (Message 97694)
Posted 27 Jun 2020 by Mad_Max
Post:
Could you perhaps also measure the power consumption difference between running 8 WU vs. 16 WU on your Ryzen? My theory is that if the CPU is blocked for longer, it may consume a bit less energy. This would also increase the credit/watt ratio more than 17% then..

An interesting and simple test indeed.
Here are the results. I do not have external hardware tools to measure power consumption right now, so I used internal CPU monitoring: CPU package power (SMU) values collected by HWiNFO64.

With 16 WUs running in parallel it shows 72 W average CPU package power (I waited a couple of minutes to collect average values).
With only 8 WUs running in parallel it drops to 60 W.
CPU temperature also slowly decreased by about 3 degrees, confirming the power monitoring values.

CPU frequency and voltage stayed the same during the comparison (3.34 GHz and 1.02 V - stock values for this CPU, no manual tuning).
And SMT was not entirely disabled. I just reduced the number of running WUs without a system reboot, so the additional 8 SMT threads were still present in the system, just not used for computation.

So "there is no magic" (c) - the more real work is done, the more energy is consumed by the CPU, given that the CPU stays the same.

And it is an almost perfectly linear correlation in my case: 16 WUs running on 8 cores produce about 17% more total computation throughput and consume about the same 17% more power (actually about 20%: 72/60 = 1.2).
So the credit/watt ratio stays about the same. But only CPU-wise.
There is some additional power consumption (RAM, disk, motherboard components) which should be unaffected or only negligibly affected, so the total system is a bit more energy efficient if it runs all 16 WUs on all threads with SMT compared to just 8 WUs.

And it is definitely more efficient from a "credit/$ cost of the system" point of view, as the use of SMT costs nothing.
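
The credit/watt comparison above can be written out as a small calculation. It uses the figures from this post (72 W vs. 60 W, +17% throughput). The 40 W for the rest of the system is purely a hypothetical value, since I did not measure it, and the function names are made up for illustration.

```python
def efficiency_ratio(throughput_gain, cpu_w_smt, cpu_w_no_smt, other_w=0.0):
    """Credit per watt with SMT relative to without SMT (1.0 = the same).
    `other_w` is power drawn by the rest of the system, assumed constant."""
    with_smt = (1.0 + throughput_gain) / (cpu_w_smt + other_w)
    without_smt = 1.0 / (cpu_w_no_smt + other_w)
    return with_smt / without_smt


def break_even_other_w(throughput_gain, cpu_w_smt, cpu_w_no_smt):
    """Non-CPU power above which running with SMT is more efficient overall."""
    return (cpu_w_smt - (1.0 + throughput_gain) * cpu_w_no_smt) / throughput_gain


if __name__ == "__main__":
    print("CPU only:      ", round(efficiency_ratio(0.17, 72, 60), 3))
    # 40 W for RAM, disk and motherboard is a made-up example value.
    print("whole system:  ", round(efficiency_ratio(0.17, 72, 60, other_w=40), 3))
    print("break-even (W):", round(break_even_other_w(0.17, 72, 60), 1))
```

CPU-only, the ratio comes out slightly below 1, which matches "about the same". As soon as the rest of the system draws more than roughly 10-11 W, running all 16 threads is the more efficient option under these numbers.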

P.S.
Note: the +17% performance gain from SMT was measured a few months ago, while the power comparison was made today. The current SMT boost may be slightly different because different tasks are being processed in the R@H queue. It is still the same Rosetta and BOINC, but different tasks/protein targets alter the work/load profile slightly.

For a direct comparison and an accurate energy efficiency calculation, 2 new performance tests are needed (with 8 and 16 WUs running), but that takes a lot of time and I do not have enough to spare currently.
17) Message boards : Number crunching : L3 Cache is not a problem, CPU Performance Counter proves it (Message 97690)
Posted 27 Jun 2020 by Mad_Max
Post:

ANY CPU running R@H with default 8 hours target runtime will give about 2.5-3 GFLOPS as average processing rate: 80000/(8*60*60). Regardless of real CPU speed. Pentium 4 and Core i9 have similar values because FLOPS count is fixed and runtime is fixed too.
If you change target CPU time - you will get significant change in "average processing rate" reported by BOINC.
Which would explain some of the APR values i've seen on some systems.

If the FLOPs values used for the the wu.fpops_est were set to proportionally match the Target CPU runtime (eg 2hr Runtime- wu.fpops_est, 4hr Runtime- wu.fpops_est * 2, 8hr Runtime- wu.fpops_est * 4, 36hr Runtime wu.fpops_est * 18) then the APRs would be more representative of computation done, as would the Credit awarded. Tasks that run longer or shorter than the Target CPU time will still cause variations.
But the Credit awarded & APR would be a lot more representative of the processing a given CPU has done, and initial Estimated completion times for new Tasks and particularly new applications shouldn't be nearly as far out as they presently are.

Yes, I agree this should be fixed, at least by a simple multiplier as a fast/simple solution. E.g. 10 000 GFLOPs per hour of target CPU time, to be in line with the current baseline of 80 000 GFLOPs and the default 8 hr target runtime.

It will both improve credit calculation accuracy and help the BOINC client a LOT to adapt estimated completion times and queue size faster when the user changes the target CPU time setting. Without it, BOINC corrects these values slowly and only after some WUs have finished, as it is not aware that WUs have suddenly become longer or shorter.
With wu.fpops_est scaled in proportion to the target CPU time, the BOINC client will know in advance that new WUs will be shorter or longer: right after downloading them and before even starting to process the first one.
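
To make the numbers concrete, here is a minimal sketch of both cases. It assumes, as discussed in this thread, that the reported average processing rate is roughly the estimated operation count divided by the actual runtime, and that the fixed estimate is 80 000 GFLOPs. The function names are made up for illustration.

```python
def reported_apr_gflops(fpops_est_gflop: float, runtime_hours: float) -> float:
    """Average processing rate implied by a FLOP estimate and a runtime."""
    return fpops_est_gflop / (runtime_hours * 3600)


def scaled_fpops_est(target_hours: float, gflop_per_hour: float = 10_000) -> float:
    """Proposed estimate: proportional to the target CPU time."""
    return gflop_per_hour * target_hours


if __name__ == "__main__":
    for hours in (2, 4, 8, 16):
        fixed = reported_apr_gflops(80_000, hours)
        scaled = reported_apr_gflops(scaled_fpops_est(hours), hours)
        print(f"{hours:>2} h target: fixed estimate {fixed:5.2f} GFLOPS, "
              f"scaled estimate {scaled:5.2f} GFLOPS")
```

With the fixed estimate the reported rate swings by a factor of several just from changing the target runtime on the same CPU, while with the scaled estimate it stays at the same value for every target time.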
18) Message boards : Number crunching : L3 Cache is not a problem, CPU Performance Counter proves it (Message 97689)
Posted 27 Jun 2020 by Mad_Max
Post:
And Runtime is no good as Runtimes are fixed. Credit is no good as it is based on APR.
And are Decoys even a useful indicator for the same type of Task? For a given CPU- a 2hr Target CPU time produces 10 Decoys. Would a 4 hour runtime produce 20? 8hrs 40?
And for a different CPU on the very same Task, if a 2hr runtime time produced 20 Decoys, would a 4 hour runtime produce 40 etc? Is this what happens?

Then you've got different tasks that might produce 10 times as many, or 10 times less Decoys for the same Runtime. Hence why the number FLOPs done was used to determine the work done by a CPU (although of course not all FLOPs are equal, some have larger or smaller execution overheads than others so often some sort of scaling factor is required to smooth those things out).


Sorry for the late reply (I forgot to subscribe to the thread initially).

Runtimes are NOT fixed. The target CPU time is fixed in the settings, yes. But actual run times can vary significantly from it. Some tasks end prematurely - usually if there are errors during processing or the WU hits the internal "max decoy limit" (each WU contains an instruction to stop processing and send the results to the server once a set number of decoys has been generated, even if it has not reached the target CPU time).
On the other hand, some WUs exceed the target CPU time significantly - usually this happens when a WU works on really hard/big models, where generating one decoy can take a few hours of CPU work. The target CPU time is checked only between decoys. The CPU time trigger does not interrupt the calculation of an already started decoy until it is fully finished or the watchdog kicks in (it is usually set to target CPU time + 4 hours) and aborts the task.

That is why I count actual CPU time: take some (more is better) completed WUs, sum up all the CPU time they used, and sum up all the credit they generated. Divide the total credit by the total CPU time consumed, and you get a fairly accurate estimate of real host performance without waiting a LONG time for the average indicator (RAC) to stabilize.
I usually grab (copy & paste) all recent WUs from the result tables into an Excel/Calc spreadsheet.

And about decoys - based on my observations, yes, there is an almost linear relation: e.g. double the CPU runtime of a WU and it will produce about twice the number of decoys,
with the same type of WU and the same hardware, of course.
Moreover, the number of decoys generated is the main factor for calculating credit for a successfully completed task on the server after reporting. It is a simple formula:
Cr granted = decoy count in reported WU x "price" of one decoy.
Host CPU benchmarks and APR are used to determine that "decoy price", but the server uses average values collected from many (not sure how many, probably all) hosts contributing to the same target/work type. It is NOT like BOINC WU "wingmen" (such a scheme is used in WCG, for example). For R@H it is all hosts getting WUs of the same type/batch (usually hundreds or even a few thousand hosts on large batches). So all "anomalies" in benchmarks and APR are smoothed out by large-scale averaging.

But for a particular host, only the number of successfully generated decoys determines how much credit it receives for completed WUs.
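
The credit-per-CPU-time estimate described earlier in this post can also be done outside a spreadsheet. A minimal sketch: the sample rows are invented numbers, not real results, and the function name is just for illustration.

```python
def credit_per_cpu_hour(results):
    """results: (cpu_seconds, credit) pairs taken from completed WUs.
    Returns total credit divided by total CPU hours."""
    rows = list(results)
    total_cpu_s = sum(cpu_s for cpu_s, _ in rows)
    total_credit = sum(credit for _, credit in rows)
    if total_cpu_s <= 0:
        raise ValueError("need at least one WU with non-zero CPU time")
    return total_credit / (total_cpu_s / 3600)


if __name__ == "__main__":
    # Hypothetical example rows: (CPU time in seconds, credit granted).
    sample = [(28_800, 320.0), (27_000, 290.0), (30_600, 350.0)]
    print(f"{credit_per_cpu_hour(sample):.1f} credits per CPU hour")
```

Summing first and dividing once weights every WU by its CPU time, so a few very short or very long tasks do not distort the estimate the way averaging per-task ratios would.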
19) Message boards : Number crunching : no new tasks? (Message 97667)
Posted 27 Jun 2020 by Mad_Max
Post:
As of this morning (US Central time), looks like the well has run dry again.

I'd like to figure out a work share setup that would allow WCG tasks to run to completion when the Rosetta tasks come back. I've never liked the "suspend one to do another" approach - I'd rather have what's in the queue finish up regardless of which project it belongs to.


There is an option in the BOINC computing preferences: "Switch between tasks every xxx minutes".
If you increase this value, BOINC will switch between projects less often.
And if you increase it above the average WU run time, BOINC should switch between projects only after fully completing the previous WU.

I have it set to 300 min (the default is 60 min), and usually BOINC finishes already started WCG WUs before switching back to R@H.

And work share is set to 100 for WCG (the default value) and 200 for R@H.
20) Message boards : Number crunching : COVID 19 WU Errors (Message 97665)
Posted 27 Jun 2020 by Mad_Max
Post:
There is a good update.

It looks like in the latest BOINC versions (v7.16.x and later) the BOINC developers finally fixed the "waiting for memory" behavior.
It no longer conflicts with the "leave non-GPU WU in memory while suspended" option.

Before the fix:
If BOINC suspended a task for exceeding the allowed RAM usage and the option "leave non-GPU WU in memory while suspended" was enabled, the task stayed in RAM while "waiting for memory". So any attempt by the BOINC client to free some RAM by suspending some of the running WUs led to the exact opposite effect: RAM usage only increased (because it was consumed by more and more tasks in the "waiting for memory" state) until some of the tasks, or the system as a whole, crashed due to out-of-RAM errors.

After the fix:
The BOINC client now ignores the "leave non-GPU WU in memory while suspended" option for tasks in the "waiting for memory" state and unloads them from RAM regardless of this option.
The option now applies only to tasks suspended for other reasons (such as a manual user request or the CPU switching to another project).

So I retract the recommendation I made in this post to avoid using this option on machines with limited memory - provided the BOINC client is updated to one of the latest versions (>= 7.16.x).

Also, it is no longer useful to limit the number of R@H tasks running in parallel via the <max_concurrent> </max_concurrent> setting in app_config.xml.
It is more efficient and easier to tune the "use at most" memory settings, and BOINC should do the rest automatically to stay within these values.

It should have been this way from the beginning, but unfortunately, due to some errors in the BOINC code, it did not work as expected.
But it works now!





©2022 University of Washington
https://www.bakerlab.org