Posts by Mad_Max

1) Message boards : Number crunching : Vbox_image (Message 109483) Posted 1 day ago by Mad_Max Post: If you are curious about the details, you can just look for yourself. This is a standard image compatible with Oracle VM VirtualBox, and if you start it from VirtualBox Manager (instead of BOINC) you can explore and see everything. I did this over a year ago when I was trying to optimize the running of R@H Python jobs. But I don't remember the details anymore, because nothing good came of it, so I just opted out of this type of task and deleted the downloaded VirtualBox images and VirtualBox itself, because no other DC project in which I participate uses it.
2) Message boards : Number crunching : Rosetta Beta 6.00 (Message 109245) Posted 13 May 2024 by Mad_Max Post: A bug report (in the unlikely event that one of the developers still reads the forum and decides to fix some of the bugs). On one of my computers (this one: https://boinc.bakerlab.org/rosetta/results.php?hostid=1211592) ALL tasks, without exception, for the Rosetta Beta 6.xx application end with an error within the 1st minute of operation. The error is always the same: Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address ........... (the address varies). This is a fairly old, but still decent and stable computer with an AMD Phenom II X6 processor (6 physical cores) and 16 GB of RAM. Tasks for Rosetta 4.x run on it without any problems (except for faulty tasks that cause errors on all computers, like the famous "CHI angle" or "residue LOWERCONNECT" errors), as do many (literally thousands of) tasks for several other DC projects: Einstein@Home, World Community Grid, SiDock@Home. This has been going on for several months now (starting from Rosetta Beta version 6.03, I think, because I have seen some valid WUs from early versions of the 6.x branch on this computer), killing many hundreds of WUs. During this time there were several reboots, and I reset the project twice (which includes re-downloading all executable and data files), with no change. I do not think the problem is in the software configuration, because my other computers have not just a similar but an almost identical setup (originally obtained by cloning the system disk from this one). The main difference is that the other computers have newer processors - also from AMD, but from newer generations: Ryzen 7 2700 and Ryzen 5 5600X - on which the Rosetta 6.xx application works without problems with the same software setup.
So one possible cause of the error is that the application tries to use one of the newer instruction sets (AVX maybe? or other newer CPU features that the old Phenom series lacks) without properly checking that they are available, which causes errors on older CPUs. It would be nice if some owner of an old processor (like a Phenom or Core 2 Duo/Quad) could check this.
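For anyone who wants to test the hypothesis above on an old CPU, here is a minimal sketch of the kind of availability check the post is talking about. It assumes a Linux-style /proc/cpuinfo "flags" line; the two sample flag lists are shortened, made-up illustrations (not dumps from real machines), and treating AVX as the required feature is only the guess from the post, not a confirmed Rosetta requirement.

```python
def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(flags, required) -> list:
    """List the required features that the CPU does not report."""
    return sorted(set(required) - set(flags))


# Hypothetical, abbreviated flag lines for illustration only.
OLD_CPU_SAMPLE = "flags : fpu sse sse2 pni sse4a"
NEW_CPU_SAMPLE = "flags : fpu sse sse2 pni ssse3 sse4_1 sse4_2 avx avx2"

# Assumption from the post: the application might need AVX.
REQUIRED = {"avx"}

for name, sample in (("old CPU", OLD_CPU_SAMPLE), ("new CPU", NEW_CPU_SAMPLE)):
    missing = missing_features(parse_cpu_flags(sample), REQUIRED)
    print(name, "-> missing:", missing if missing else "nothing")
```

On a real Linux host the text could be read from /proc/cpuinfo instead of the sample strings; an application that skips this kind of check and executes an unsupported instruction would be expected to crash rather than fall back to a slower code path.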
3) Message boards : Number crunching : Rosetta 4.1+ and 4.2+ (Message 108326) Posted 20 Apr 2023 by Mad_Max Post: Yes, I am also getting some of these errors now (and so are my wingmen on the failed WUs). It is an old error indeed, but I had not seen it for a long time (several months). And now it has come back again...
4) Message boards : Number crunching : Waste of resource. (Message 106192) Posted 14 May 2022 by Mad_Max Post: This number of "runs" is really big, from tens of thousands to hundreds of thousands of runs per model; in some cases it can be more than a million. The fact that we increase the number of runs in one task by increasing the target computation time does not affect anything - the total number is regulated by the number of computed and validated results on the server side. When the scientists decide that they have received enough runs for a particular model/goal, they simply cancel all remaining tasks of the same series that are still in the queue, including ones that have already been downloaded to volunteer PCs (you have probably noticed this at some point - when some tasks from a series that worked without problems or errors are suddenly canceled by a command from the server - "server abort" in the WU status). P.S. And in any case, even if we generate some "excessive" results, it will not be a complete waste of resources, because with the pseudo-random search approach used in R@H, more runs/passes = a better (more accurate) result. The cut-off point (when to say enough is enough) is rather arbitrary. It is usually chosen when the improvement in accuracy is already quite insignificant and it is not practical to keep spending large amounts of computing power on it. But insignificant does not mean nonexistent.
5) Message boards : Number crunching : Level 3 cache requirements? (Message 106191) Posted 14 May 2022 by Mad_Max Post: Using an additional ("virtual") thread always decreases the performance of the other running threads. This is normal, because they share the same compute units in the same physical core. If the slowdown of single-thread performance is not very big (in the 10-30% range), it is normal and there is nothing to do about it. Only if the slowdown of a single thread is very big, like twice as slow (so that the total throughput of all threads combined decreases), does it indicate that something went wrong, like insufficient cache size or other issues. But a low to moderate slowdown when using virtual threads is both normal and inevitable.
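The threshold in the post above can be put into numbers with a small sketch. It assumes the simplest possible model, which is not from the post: two threads per physical core, both slowed down by the same fraction compared with a single thread running alone on that core.

```python
def combined_throughput(per_thread_slowdown: float, threads_per_core: int = 2) -> float:
    """Total throughput of one core, relative to a single thread running alone.

    per_thread_slowdown is the fraction of speed each thread loses
    (0.30 means each thread runs at 70% of its solo speed).
    """
    return threads_per_core * (1.0 - per_thread_slowdown)


for slowdown in (0.10, 0.30, 0.50, 0.60):
    total = combined_throughput(slowdown)
    verdict = "gain" if total > 1.0 else "no gain or a loss"
    print(f"slowdown {slowdown:.0%}: combined throughput x{total:.2f} ({verdict})")
```

In this model a 10-30% per-thread slowdown still leaves the core doing more total work, while anything beyond 50% (each thread twice as slow) means the second thread is costing throughput instead of adding it.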
6) Message boards : Number crunching : Rosetta stops the use of my 2nd GPU (Message 106190) Posted 14 May 2022 by Mad_Max Post: It is probably due to the combination of a large work cache in your BOINC settings + short deadlines for R@H WUs (R@H server-side settings). It can drive BOINC into "panic" (high priority) mode - it allocates all CPU resources to a single project to try to avoid missing deadlines, including the CPU core(s) allocated for GPU support. It is a stupid design decision (because it greatly decreases total performance by cutting off all GPU computations), but it has been this way in the BOINC scheduler for years. Possible workarounds: 1 - reduce the cache size setting and/or abort excess WUs from the "hoarding" project in the BOINC WU queue. 2 - use app_config.xml to limit the max number of running instances for one project.
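As a rough illustration of workaround 2, the sketch below builds a minimal app_config.xml that uses BOINC's project_max_concurrent option. The limit of 4 is an arbitrary example value, not a recommendation from the post, and ET.indent needs Python 3.9 or newer.

```python
import xml.etree.ElementTree as ET


def build_app_config(project_max_concurrent: int) -> str:
    """Build app_config.xml text that caps running tasks for one project."""
    root = ET.Element("app_config")
    limit = ET.SubElement(root, "project_max_concurrent")
    limit.text = str(project_max_concurrent)
    ET.indent(root)  # pretty-print; available since Python 3.9
    return ET.tostring(root, encoding="unicode")


# Example value only: allow at most 4 tasks of this project to run at once.
config_text = build_app_config(4)
print(config_text)
```

The resulting text would be saved as app_config.xml in the project's folder inside the BOINC data directory (for Rosetta this is the projects/boinc.bakerlab.org_rosetta folder seen in task logs), after which the client has to re-read its config files or be restarted.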
7) Message boards : Number crunching : Excessive workunit fetch (Message 106180) Posted 11 May 2022 by Mad_Max Post: Yes, it is already included in BOINC ver. 7.20.0 (and later). But v. 7.20 itself is not finished yet.
8) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 103893) Posted 24 Dec 2021 by Mad_Max Post: "I had gone as far to untick all the disk space boxes to give it unlimited use of the disk" "The boxes aren't tickable, they require values. And one value in any one of the options overrides the values in any of the other two when it comes to what disk space is actually available." They are quite tickable - there are checkboxes to the left of each value box which turn off the corresponding limit. "I was referring to the web based settings. If you've only got one system, local Setting are ok. More than one, web based settings make life much easier." The web-based settings also have the same checkboxes for disk and network usage limits as the local settings do. At least the web-based BOINC settings here on the Rosetta server do.
9) Message boards : Number crunching : How to determine what future work is coming this way from Robetta (Message 103890) Posted 24 Dec 2021 by Mad_Max Post: P.S. I saw messages in another topic saying that the project has simply run out of all regular (4.20) Rosetta WUs and that all of the few million WUs in the main work queue are Python tasks only. But I see that this is not the case. New regular WUs continue to be created and distributed by the server every day, only sometimes at a greatly reduced pace (you need to be lucky to get some, but it is still possible). And after monitoring the server status, it looks like this pace is limited by the factors that I described in the previous post.
10) Message boards : Number crunching : How to determine what future work is coming this way from Robetta (Message 103889) Posted 24 Dec 2021 by Mad_Max Post: I think there is still plenty of work for regular Rosetta in the queue, and the work shortage is just another example of misconfiguration and poor maintenance by the project's tech staff (admins/programmers). It looks like the work generator daemons are configured to maintain about 5000 tasks in the "ready to send" state (taking them from the large work queue displayed on the home page when needed), but this target is set as a combined number, ignoring the fact that the project now has two fundamentally different types of tasks that are not mixed with each other. There are two separate work generators: one creates regular Rosetta WUs only and the other creates Python VirtualBox WUs only. You can see them on the server status page: https://boinc.bakerlab.org/rosetta/server_status.php rah_make_work_rosetta - regular WU generator; rah_make_work_rosetta_python_projects - Python WU generator. Python WUs are processed by volunteer computers at a much lower pace (due to much higher system requirements, the absence of VirtualBox on some BOINC client installations, some users NOT wanting VirtualBox tasks and disabling them, etc.). As a result, after some time the current work cache (Tasks ready to send) ends up occupied by Python VirtualBox WUs only, with no regular Rosetta tasks available to download. But the regular WU generator does not kick in, because the target work cache size is already filled (by Python tasks); it produces new work only after some Python tasks are distributed to clients, "freeing up space" for regular tasks. So now regular Rosetta performance and work supply are limited by the performance of the much slower Python work queue.
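The starvation effect described in the post above can be reproduced with a toy simulation. Everything in it is an assumption made for illustration: the demand figures are invented, and the rule that two generators split free slots equally is a guess, not the actual server logic. Only the combined target of about 5000 tasks comes from the post.

```python
def simulate(steps, target, regular_demand, python_demand, combined):
    """Return the number of regular tasks sent to clients in each step."""
    ready_regular = 0
    ready_python = 0
    sent_per_step = []
    for _ in range(steps):
        if combined:
            # One shared target: generators only fill whatever space is free.
            free = max(0, target - (ready_regular + ready_python))
            ready_regular += free // 2
            ready_python += free - free // 2
        else:
            # Separate targets: each task type is topped up on its own.
            half = target // 2
            ready_regular = max(ready_regular, half)
            ready_python = max(ready_python, half)
        # Clients download what they can; Python tasks move far more slowly.
        took = min(ready_regular, regular_demand)
        ready_regular -= took
        ready_python -= min(ready_python, python_demand)
        sent_per_step.append(took)
    return sent_per_step


# Invented demand numbers: 1000 regular and 50 Python tasks requested per step.
shared = simulate(200, 5000, 1000, 50, combined=True)
separate = simulate(200, 5000, 1000, 50, combined=False)
print("regular tasks sent per step, shared target:   ", shared[-1])
print("regular tasks sent per step, separate targets:", separate[-1])
```

With the shared target, the supply of regular tasks in this model settles at the rate at which Python tasks leave the queue, which is the bottleneck the post describes; with separate targets the regular demand is met in full.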
11) Message boards : Number crunching : Excessive workunit fetch (Message 103762) Posted 7 Dec 2021 by Mad_Max Post: He was talking about regular (non-Python/VirtualBox) tasks. RAM usage of regular R@H tasks lies in the 0.7-3 GB range for almost all tasks (>95%) and usually in the 0.7-1.5 GB range (>70%). So it is very far from 8 GB per task.
12) Message boards : Number crunching : Excessive workunit fetch (Message 103653) Posted 2 Dec 2021 by Mad_Max Post: It is a confirmed BOINC (not Rosetta) bug: excessive workunit fetch for a project when the <max_concurrent> setting is used. It was identified about a year ago. One of the latest bug reports to the BOINC developers: https://github.com/BOINC/boinc/issues/4322 And a patch for this problem is finally in the works: https://github.com/BOINC/boinc/pull/4592 It will be included in one of the future BOINC releases. Until a fixed version of BOINC is available, there are a few workarounds: 1 - avoid using the <max_concurrent> setting completely OR 2 - set the work cache size to a really low value OR 3 - run R@H alongside other projects with a stable WU supply and WITHOUT the <max_concurrent> setting applied to these "spare" projects (only to R@H). The bug itself is a wrong calculation of the amount of work BOINC already has in its queue for projects limited by the <max_concurrent> setting. It counts only up to <max_concurrent> tasks in the queue and ignores the rest. So the calculated work queue size does not increase no matter how many tasks BOINC has already downloaded for such a project, and BOINC falls into an endless loop of fetching new work.
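The mechanism described in the post above can be shown with a simplified model. This is not BOINC's actual work-fetch code: the function names, the 8-hour task length, the cache sizes and the safety limit of 1000 fetches are all made up for the illustration, and the real client also accounts for core counts and resource shares.

```python
def estimated_queue_seconds(task_runtimes, max_concurrent, buggy):
    """Estimate how much work is queued, in seconds of runtime."""
    counted = task_runtimes
    if buggy and max_concurrent is not None:
        # The bug as described: only the first max_concurrent tasks are counted.
        counted = task_runtimes[:max_concurrent]
    return sum(counted)


def tasks_fetched(target_seconds, task_seconds, max_concurrent, buggy, fetch_limit=1000):
    """Keep fetching tasks until the estimate reaches the cache target."""
    queue = []
    while (estimated_queue_seconds(queue, max_concurrent, buggy) < target_seconds
           and len(queue) < fetch_limit):
        queue.append(task_seconds)
    return len(queue)


EIGHT_HOURS = 8 * 3600
TWO_DAYS = 2 * 24 * 3600
ONE_DAY = 24 * 3600

print("correct estimate, 2-day cache:", tasks_fetched(TWO_DAYS, EIGHT_HOURS, 4, buggy=False))
print("buggy estimate, 2-day cache:  ", tasks_fetched(TWO_DAYS, EIGHT_HOURS, 4, buggy=True))
print("buggy estimate, 1-day cache:  ", tasks_fetched(ONE_DAY, EIGHT_HOURS, 4, buggy=True))
```

In this model the buggy estimate can never exceed the work of 4 tasks, so with a 2-day cache the loop only stops at the artificial fetch limit. With the smaller 1-day cache the target is reached before the cap matters, which is why workaround 2 helps.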
13) Message boards : Number crunching : Some of the erroneous WUs are listed as successful and even pass validation (Message 102326) Posted 31 Jul 2021 by Mad_Max Post: It did not do any useful work. Such WUs error out just a few minutes after starting, before the computation of the very first decoy/model is completed and its results are saved. The logs suggest this happens because the R@H app cannot open/load all the input data needed for the computation (possibly a configuration error at the WU generation stage?). But they still somehow slip through the server validator and are marked as Success/Valid.
14) Message boards : Number crunching : Some of the erroneous WUs are listed as successful and even pass validation (Message 101729) Posted 5 May 2021 by Mad_Max Post: I spotted some WUs which completely failed but have "Success" status, and the server even validates them and grants credit for such WUs with critical errors.
Name: pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_3cm5iv5y_1391022_1_0
Workunit: 1229929395
Created: 5 May 2021, 2:48:08 UTC
Sent: 5 May 2021, 4:49:41 UTC
Report deadline: 8 May 2021, 4:49:41 UTC
Received: 5 May 2021, 5:48:12 UTC
Server state: Over
Outcome: Success
Client state: Done
Exit status: 0 (0x00000000)
Run time: 7 min 18 sec
CPU time: 6 min 49 sec
Validate state: Valid
Credit: 3.27
While the logs clearly show fatal errors:
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pre_helix_boinc_v1.xml @helix_design.flags -in:file:silent pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_3cm5iv5y.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_3cm5iv5y.zip @pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_3cm5iv5y.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 2578584
Using database: database_357d5d93529_n_methyl\minirosetta_database
ERROR: [ERROR] Unable to open constraints file: bc1dd6b031238f177cab303f1b5a3aef_0001.MSAcst
ERROR:: Exit from: ..\..\..\src\core\scoring\constraints\ConstraintIO.cc line: 457
08:41:23 (4920): called boinc_finish(0)
</stderr_txt>
]]>
Links to tasks:
https://boinc.bakerlab.org/rosetta/result.php?resultid=1376543173 https://boinc.bakerlab.org/rosetta/result.php?resultid=1375927242 https://boinc.bakerlab.org/rosetta/result.php?resultid=1375899817
15) Message boards : Number crunching : Rosetta needs 6675.72 MB RAM: is the restriction really needed? (Message 101728) Posted 5 May 2021 by Mad_Max Post: "Indeed Windows and most Linux flavors allocate a swap file the same size as physical memory. For some reason, probably historical, the Raspberry Pi foundation allocate 100MB to this day. Increasing it should allow it to download tasks although I haven't tried it." Not sure about Linux, but on Windows BOINC counts only real RAM, and it does not matter how much swap space you allocate. It won't help with these abnormal R@H RAM requirements. I still (for more than a month already) see errors like
05-May-2021 07:48:46 [Rosetta@home] Scheduler request completed: got 0 new tasks
05-May-2021 07:48:46 [Rosetta@home] No tasks sent
05-May-2021 07:48:46 [Rosetta@home] Rosetta needs 6675.72 MB RAM but only 6117.47 MB is available for use.
on computers with 8 GB of RAM + 8 GB of swap space. And once such tasks are finally downloaded, they usually use less than 1 GB of RAM per task, and the computer runs up to 4-8 R@H tasks simultaneously without any problems. But for the last month it usually cannot get any, because the server thinks that there is not enough RAM for even one task and refuses to send any work. Pure stupidity.
16) Message boards : Number crunching : Rosetta needs 6675.72 MB RAM: is the restriction really needed? (Message 101357) Posted 18 Apr 2021 by Mad_Max Post: Total project RAC is already falling too (down about 30% in the last week) as this problem starts to affect many computers, which cannot get ANY R@H work for long periods of time because of the abnormal RAM requirements of SOME work in the server queue.
17) Message boards : Number crunching : Rosetta needs 6675.72 MB RAM: is the restriction really needed? (Message 101296) Posted 14 Apr 2021 by Mad_Max Post: "To get new WUs, the PC needs 6675.72 MB. Is the restriction really needed?" "No. There is a problem with the configuration of many of the current Work Units, they are requesting more RAM & Disk space than they require, making it impossible for some systems to process them." It is even worse - such bad WUs block the receipt and processing of normal WUs. Every time such a task with incorrectly set memory requirements comes up in the distribution queue (on the server side), the server refuses to issue ANY tasks to the machine, including normally configured ones. In recent days, my 8 GB computers have often spent all their time processing backup projects instead of R@H, because they cannot receive tasks from Rosetta: the server refuses to issue any other tasks until someone else (with a sufficiently large amount of free RAM) picks up all these tasks with abnormal RAM requests from the server queue. But by that time, my machines have already filled their work queues with WUs from backup projects.
18) Message boards : Number crunching : 12 CPU WUs (Message 97702) Posted 27 Jun 2020 by Mad_Max Post: What are you talking about? There is no such thing as a multi-threaded Rosetta app, so all WUs are single-threaded ONLY. It is not possible to create such WUs because the current R@H application does not support MT processing.
19) Message boards : Number crunching : L3 Cache is not a problem, CPU Performance Counter proves it (Message 97694) Posted 27 Jun 2020 by Mad_Max Post: "Could you perhaps also measure the power consumption difference between running 8 WU vs. 16 WU on your Ryzen? My theory is that if the CPU is blocked for longer, it may consume a bit less energy. This would also increase the credit/watt ratio more than 17% then.." An interesting and simple test indeed. Here are the results. I do not have external hardware tools to measure power consumption right now, so I used internal CPU monitoring: the CPU package power (SMU) values collected by HWiNFO64. With 16 WUs running in parallel it shows 72 W average CPU package power (I waited a couple of minutes to collect average values). With only 8 WUs running in parallel it drops to 60 W. CPU temperature also slowly decreased by about 3 degrees, confirming the power monitoring values. CPU frequency and voltage stayed the same during the comparison (3.34 GHz and 1.02 V - stock values for this CPU, no manual tuning). And SMT was not entirely disabled; I just reduced the number of running WUs without a system reboot, so the additional 8 SMT threads were still present in the system, just not used for computation. So "there is no magic" (c) - the more real work is done, the more energy the CPU consumes, given that the CPU stays the same. And the correlation is almost perfectly linear in my case: 16 WUs running on 8 cores produce about 17% more total computational throughput and consume about the same 17% more power (about 20% actually: 72/60 = ~1.2). So the credit/watt ratio stays about the same. But only CPU-wise. There is some additional power consumption (RAM, disk, motherboard components) which should be unaffected or affected only negligibly, so the total system is a bit more energy efficient if it runs all 16 WUs on all threads with SMT compared to just 8 WUs.
And it is definitely more efficient from a "credit/$ cost of the system" point of view, as using SMT costs nothing. P.S. Note: the +17% performance gain from SMT was measured a few months ago, while the power comparison was made today. The current SMT boost may be slightly different due to different tasks being processed in the R@H queue. It is still the same Rosetta and BOINC, but different tasks/protein targets alter the work/load profile slightly. A direct comparison and an accurate energy efficiency calculation would need 2 new performance tests (with 8 and 16 WUs running), but that takes a lot of time and I do not have enough to spare currently.
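The credit/watt arithmetic from the post above can be checked with a short sketch. The 60 W, 72 W and +17% figures are the ones quoted in the post; the 40 W for the rest of the system (RAM, disk, motherboard) is a purely hypothetical number, since the post gives no measurement for it.

```python
THROUGHPUT_GAIN = 1.17   # 16 WUs vs 8 WUs, from the post
CPU_WATTS_8_WU = 60.0    # CPU package power with 8 WUs, from the post
CPU_WATTS_16_WU = 72.0   # CPU package power with 16 WUs, from the post


def efficiency_ratio(other_watts: float = 0.0) -> float:
    """Work per watt with 16 WUs divided by work per watt with 8 WUs."""
    return THROUGHPUT_GAIN * (CPU_WATTS_8_WU + other_watts) / (CPU_WATTS_16_WU + other_watts)


cpu_only = efficiency_ratio()                     # CPU package power only
whole_system = efficiency_ratio(other_watts=40.0)  # hypothetical 40 W for everything else

# Constant extra draw at which running with SMT breaks even on work per watt.
break_even_watts = (CPU_WATTS_16_WU - THROUGHPUT_GAIN * CPU_WATTS_8_WU) / (THROUGHPUT_GAIN - 1.0)

print(f"CPU only:     x{cpu_only:.3f}")
print(f"whole system: x{whole_system:.3f}")
print(f"break-even constant draw: {break_even_watts:.1f} W")
```

Counting the CPU alone, the ratio comes out slightly below 1, matching the "about the same" verdict; as soon as the constant draw of the other components exceeds roughly 10-11 W, running all 16 threads gives more work per watt for the system as a whole.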
20) Message boards : Number crunching : L3 Cache is not a problem, CPU Performance Counter proves it (Message 97690) Posted 27 Jun 2020 by Mad_Max Post: "ANY CPU running R@H with default 8 hours target runtime will give about 2.5-3 GFLOPS as average processing rate: 80000/(8*60*60). Regardless of real CPU speed. Pentium 4 and Core i9 have similar values because FLOPS count is fixed and runtime is fixed too. If you change target CPU time - you will get significant change in 'average processing rate' reported by BOINC." "Which would explain some of the APR values i've seen on some systems. If the FLOPs values used for the wu.fpops_est were set to proportionally match the Target CPU runtime (eg 2hr Runtime- wu.fpops_est, 4hr Runtime- wu.fpops_est * 2, 8hr Runtime- wu.fpops_est * 4, 36hr Runtime wu.fpops_est * 18) then the APRs would be more representative of computation done, as would the Credit awarded. Tasks that run longer or shorter than the Target CPU time will still cause variations. But the Credit awarded & APR would be a lot more representative of the processing a given CPU has done, and initial Estimated completion times for new Tasks and particularly new applications shouldn't be nearly as far out as they presently are." Yes, I agree this should be fixed, at least with a simple multiplier as a fast and simple solution. E.g. 10 000 GFLOPs per hour of target CPU time, to be in line with the current baseline of 80000 GFLOPs and the default 8 hr target runtime. It will both improve the accuracy of the credit calculation and help the BOINC client a LOT to adapt estimated completion times and queue size faster when the user changes the target CPU time setting. Without it, BOINC corrects these values slowly: it is not aware that WUs have suddenly become longer or shorter, and only sees it after some WUs have already finished.
With wu.fpops_est altered in proportion to the target CPU time, the BOINC client will know in advance that new WUs will be shorter or longer: right after downloading them and before even starting to process the first one.
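The effect of the proposed multiplier can be worked through with the numbers from the post (an 80000 GFLOP estimate and the default 8-hour target). The sketch uses the same simplified formula as the post, estimate divided by runtime; the real BOINC client and server compute the average processing rate in a more involved way, and the helper names here are made up for the illustration.

```python
BASELINE_GFLOP = 80_000.0       # fixed wu.fpops_est baseline, from the post
DEFAULT_TARGET_HOURS = 8.0      # default target CPU runtime, from the post
GFLOP_PER_TARGET_HOUR = BASELINE_GFLOP / DEFAULT_TARGET_HOURS  # 10 000 per hour


def apparent_rate_gflops(fpops_est_gflop: float, runtime_hours: float) -> float:
    """Simplified 'average processing rate': estimated work divided by runtime."""
    return fpops_est_gflop / (runtime_hours * 3600.0)


def scaled_fpops_est(target_hours: float) -> float:
    """Proposed fix: make the estimate proportional to the target runtime."""
    return GFLOP_PER_TARGET_HOUR * target_hours


for hours in (2.0, 8.0, 36.0):
    fixed = apparent_rate_gflops(BASELINE_GFLOP, hours)
    scaled = apparent_rate_gflops(scaled_fpops_est(hours), hours)
    print(f"{hours:4.0f} h target: fixed estimate {fixed:6.2f} GFLOPS, "
          f"scaled estimate {scaled:5.2f} GFLOPS")
```

With the fixed estimate the apparent rate swings with the chosen target runtime even though the CPU is the same, while the scaled estimate gives the same rate for every target, so the client's completion-time estimates would no longer depend on that setting.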
