Message boards : Number crunching : COVID 19 WU Errors
Author | Message |
---|---|
TheMoD Send message Joined: 10 Feb 06 Posts: 3 Credit: 839,827 RAC: 125 |
Hello everybody, I have problems with the COVID 19 workunits. They run too long and all produce an error: Calculation error Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x757D4192 What can I do? TheMoD |
Falconet Send message Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 1,235 |
These work units are using huge amounts of RAM, over 1 GB per task. You need to allow more memory or run fewer Rosetta workunits. |
TheMoD Send message Joined: 10 Feb 06 Posts: 3 Credit: 839,827 RAC: 125 |
Thanks a lot. I use a Celeron 4 core processor with 4GB RAM. So far there have never been any problems with Rosetta or any other application. What would I have to change to make COVID WUs work? Actually, no work units should be delivered that do not match the requirements, right? Greetings TheMoD |
Falconet Send message Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 1,235 |
Run fewer concurrent Rosetta@home workunits. These COVID-19 WUs are using a lot of RAM. You can do it manually or by editing a .xml file, though I'm afraid I do not recall how. Hopefully someone can help you further. |
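The .xml file mentioned above is most likely BOINC's per-project `app_config.xml`, which accepts a `project_max_concurrent` element that caps how many tasks of that project run at once. The following is a minimal sketch that only builds the file contents; the helper name `build_app_config` and the cap of 2 are illustrative assumptions, not something stated in this thread. The file would normally be saved in the project's folder inside the BOINC data directory (the stderr output later in this thread shows that folder as `projects/boinc.bakerlab.org_rosetta`), and the client should pick it up after a restart or after re-reading its config files.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_concurrent: int) -> str:
    """Return the text of an app_config.xml that caps concurrent tasks.

    build_app_config is a hypothetical helper for illustration;
    <app_config> and <project_max_concurrent> are BOINC element names.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    root = ET.Element("app_config")
    ET.SubElement(root, "project_max_concurrent").text = str(max_concurrent)
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(root, encoding="unicode")


# Example: allow at most 2 Rosetta tasks at a time (an assumed value
# for a 4 GB machine; adjust to the memory actually available).
config_text = build_app_config(2)
print(config_text)
```

Writing `config_text` to `app_config.xml` in the project folder is left out on purpose, since the location of the BOINC data directory differs between installations.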
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
It seems that systems with 1GB of memory per CPU are having problems running some of the tasks being sent. You might try running World Community Grid with the same resource share as R@h. WCG tasks tend to be lower memory usage, so running both projects typically results in a mix of low and high memory tasks. Another approach is to reduce the number of CPUs that BOINC is allowed to use. This is a setting in your preferences. Rosetta Moderator: Mod.Sense |
TheMoD Send message Joined: 10 Feb 06 Posts: 3 Credit: 839,827 RAC: 125 |
Thank you for your answers, I'll try it. Have a nice weekend |
Miklos M Send message Joined: 8 Dec 13 Posts: 29 Credit: 5,277,251 RAC: 0 |
At the rate I am going it could take days for each task to get done. These are fast computers with fast cpu cards. Also, the elapsed time moves inaccurately, showing 5 minutes after several hours. Time remaining is about 4-6 hours. What am I doing wrong here? Thank you, Miklos |
Falconet Send message Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 1,235 |
The WUs should run for the duration set on this page: https://boinc.bakerlab.org/rosetta/prefs.php?subset=project I think the cut-off limit is 4 hours beyond whatever was set on that page, so if you have the default 10 hours, they should go for no more than 14 CPU hours. That is my understanding. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Falconet's description of the "watch-dog" is correct. So, from your description, it sounds like tasks are running longer than your preference. When this happens, as the estimated runtime gets under 5 minutes, things are adjusted to correctly indicate that forward progress is being made, but since it has no better estimate to show you, it scales time exponentially into those last 5 minutes. There is nothing that you need to do. If the work unit is hitting a long-running model that is causing it to run long, that gets reported back with the result so the algorithm being used can be reviewed. Rosetta Moderator: Mod.Sense |
Miklos M Send message Joined: 8 Dec 13 Posts: 29 Credit: 5,277,251 RAC: 0 |
I tweaked my CPU's power setting to 90%, and I am not running too many tasks at once. Though it is a tiny bit faster now, it still looks like it will take over 10 hours to finish each task. The other two computers are just as fast as this one: GenuineIntel Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz [Family 6 Model 85 Stepping 4] (36 processors) [3] NVIDIA GeForce RTX 2080 Ti (4095MB) driver: 43521 Linux Ubuntu Ubuntu 18.04.4 LTS [5.3.0-40-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)] Can the run times be sped up? Thank you, Miklos |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
There is a Rosetta preference (configured from the website rather than the BOINC preferences on your machine) where you can define your workunit runtime preference... if that is what you meant. Rosetta Moderator: Mod.Sense |
Miklos M Send message Joined: 8 Dec 13 Posts: 29 Credit: 5,277,251 RAC: 0 |
Did someone here say that 4 hours is the maximum run time, after which the WU errors out? |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
No, 4 hours is not the maximum time. Yes, the watch-dog will kick in and clean up the WU if it runs longer than 4 hours more than the runtime preference. The runtime preference is between 1 and 24 hours. Rosetta Moderator: Mod.Sense |
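The watch-dog rule described in this thread (runtime preference between 1 and 24 hours, with clean-up once a task runs more than 4 hours past that preference) can be written out as a small calculation. This is only a sketch of the arithmetic as stated by the posters; the function and constant names are made up for illustration and do not come from Rosetta or BOINC.

```python
WATCHDOG_GRACE_HOURS = 4  # extra time allowed beyond the preference, per this thread


def watchdog_cutoff_hours(runtime_preference_hours: float) -> float:
    """Return the CPU hours after which the watch-dog would end a task.

    Hypothetical helper: it only encodes "preference + 4 hours" and the
    1 to 24 hour preference range mentioned in the posts above.
    """
    if not 1 <= runtime_preference_hours <= 24:
        raise ValueError("runtime preference must be between 1 and 24 hours")
    return runtime_preference_hours + WATCHDOG_GRACE_HOURS


# A 10 hour preference gives a 14 hour cut-off, matching Falconet's example.
print(watchdog_cutoff_hours(10))
```

So 4 hours is the margin on top of the preference, not the maximum run time itself.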
Dayle Send message Joined: 6 Jan 14 Posts: 13 Credit: 792,486 RAC: 3,118 |
Just resubscribed to this project on a reliable PC that's been running Rosetta software on World Community Grid (their Microbiome Immunity Project) without issue. Set WU size to 24 hours and ran a mix of the two feeds, weighted 50-50. PC has 32 Threads and 16 gig of RAM. When I went to bed last night there were two gigs of free system memory and a 16 GB page file just in case. Looks like at some point there was a spike in RAM usage (while otherwise idle), and 5 work units errored without credit. Total loss: two days, four hours of work on a modern system (plus five more hours of WCG tasks). Maybe nothing over time but quite painful all at once, and not a great trend if it continues. One of the failures didn't mention RAM, just "finish file present too long". I'm hypothesizing that this task encountered a problem and got bigger and bigger, crashing the rest? Output text is below. It's also possible the crash took place when minirosetta tasks finished and were replaced by full size COVID tasks. If anybody has any thoughts, they'd be appreciated. 
Thanks, Dayle Task 1134921561 Name rb_03_27_19542_19448_ab_t000__h002_robetta_IGNORE_THE_REST_11_09_903961_5_0 Workunit 1022160017 Created 27 Mar 2020, 20:53:25 UTC Sent 27 Mar 2020, 21:14:52 UTC Report deadline 4 Apr 2020, 21:14:52 UTC Received 28 Mar 2020, 22:40:43 UTC Server state Over Outcome Computation error Client state Compute error Exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT Computer ID 3925665 Run time 21 hours 14 min 43 sec CPU time 21 hours 14 min 43 sec Validate state Invalid Credit 0.00 Device peak FLOPS 4.49 GFLOPS Application version Rosetta v4.07 windows_x86_64 Peak working set size 1,283.20 MB Peak swap size 1,491.86 MB Peak disk usage 492.17 MB Stderr output <core_client_version>7.14.2</core_client_version> <![CDATA[ <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_x86_64.exe @rb_03_27_19542_19448_ab_t000__h002_robetta_FLAGS -in::file::fasta t000__h002.fasta -psipred_ss2 t000__h002.spider3_ss2 -kill_hairpins t000__h002.nobuformat.spider3_ss2 -abinitio::use_filters true -in:file:boinc_wu_zip rb_03_27_19542_19448_ab_t000__h002_robetta.zip -frag3 rb_03_27_19542_19448_ab_t000__h002_robetta.200.3mers.index.gz -fragA rb_03_27_19542_19448_ab_t000__h002_robetta.200.9mers.index.gz -fragB rb_03_27_19542_19448_ab_t000__h002_robetta.200.11mers.index.gz -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2413747 Starting watchdog... Watchdog active. ====================================================== DONE :: 1 starting structures 76648.4 cpu seconds This process generated 5 decoys from 5 attempts ====================================================== BOINC :: WS_max 1.34554e+09 BOINC :: Watchdog shutting down... 13:39:55 (14096): called boinc_finish(0) </stderr_txt> <message> finish file present too long</message> ]]> |
Miklos M Send message Joined: 8 Dec 13 Posts: 29 Credit: 5,277,251 RAC: 0 |
Thank you for clarifying it. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
@Dayle the machine registered with your account shows it has 32GB of memory. Are you saying that BOINC is configured to use only half of the memory? It seems the COVID tasks are great lovers of memory. I'm unclear why the BOINC Manager is having trouble making things work. And the Project Team will be tagging these tasks as being more memory intensive in the future. Obviously they are the new kids on the block here. So there are still tweaks to be made in how the WUs are created that will help things run smoother. ...but yes, it's possible one or more minirosetta tasks finished and were replaced by one or more full size COVID tasks. Just by what I'm seeing people reporting, I'm saying that the COVID tasks need 2GB per active thread. In your case, running 50/50 with WCG is a great idea, because WCG typically has tasks that run in a much smaller memory footprint. However, even if you get a balance of 16 WCG threads and 16 R@h threads, you still push hard against my 2GB per thread observation (and if you are only allowing BOINC to use 16GB, then this would still be too much). I am hopeful that the BOINC Manager's ability to suspend a task in a "waiting for memory" state will be more stable once all of the COVID WUs have the higher memory requirement defined in them. There is a computing preference, in the disk and memory tab, that is checked to "leave applications in memory while suspended". When a task gets to a "waiting for memory" state, it is "suspended" by the BOINC Manager. I wonder if the BOINC Manager hesitates to act when the WU is not near a checkpoint and will not be left in memory while suspended. I also point out that things do not truly stay "in memory"; instead they go out to your swap space. And I recommend that folks with multiple projects, or in these high memory consumption cases, check the box and keep applications in memory while suspended. Because you mentioned it, a 16GB page file starts to sound rather small as well. 
But I would think Windows would have been spitting messages if that were filling. Rosetta Moderator: Mod.Sense |
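The 2GB per active thread observation above turns into a quick budget check. The sketch below assumes that figure as a rule of thumb only (it is Mod.Sense's observation, not a published requirement); the function name and the `boinc_fraction` parameter are hypothetical and stand in for whatever share of RAM BOINC is allowed to use.

```python
def max_tasks_for_memory(total_ram_gb: float,
                         boinc_fraction: float = 1.0,
                         gb_per_task: float = 2.0) -> int:
    """Estimate how many high-memory tasks fit in the RAM BOINC may use.

    Hypothetical helper; gb_per_task defaults to the 2 GB per thread
    observation from this thread.
    """
    usable_gb = total_ram_gb * boinc_fraction
    return int(usable_gb // gb_per_task)


# 16 R@h threads at 2 GB each need 32 GB, so they only just fit in 32 GB
# and do not fit if BOINC is limited to 16 GB.
print(max_tasks_for_memory(32))  # 16
print(max_tasks_for_memory(16))  # 8
```

This ignores memory used by the operating system and by tasks from other projects, so the real headroom is smaller than these numbers suggest.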
ktamail666 Send message Joined: 25 Jun 06 Posts: 1 Credit: 379,335 RAC: 0 |
I had similar "out of memory" issues, but I don't understand why, because my machine has 32GB of RAM. BOINC is allowed to use 26GB: 24-Mar-2020 19:09:18 [---] max memory usage when active: 26168.39 MB 24-Mar-2020 19:09:18 [---] max memory usage when idle: 26168.39 MB This machine has 6 CPU cores and runs 6 WUs at the same time. https://boinc.bakerlab.org/rosetta/result.php?resultid=1136619062 https://boinc.bakerlab.org/rosetta/result.php?resultid=1135909251 https://boinc.bakerlab.org/rosetta/result.php?resultid=1133534586 If I calculate with a peak RAM usage of 1.5GB * 6 cores, that is still just 9GB of memory. Of course, 32-bit applications are able to use 4GB per process. Currently I have limited the run time to 1 hour to avoid big WU losses. But as you can see, the OOM happened in 1136619062 at minute 47. Does the Linux version use 64-bit, or just a 32-bit wrapper, as I read in another thread? What do you think about these memory issues? |
Miklos M Send message Joined: 8 Dec 13 Posts: 29 Credit: 5,277,251 RAC: 0 |
I got two units running so far for a day plus 10 hours and still less than 70% finished, with the estimated time to go 19 hours. Keep running or abort? |
Dayle Send message Joined: 6 Jan 14 Posts: 13 Credit: 792,486 RAC: 3,118 |
Hello Mod.Sense, Thanks for taking the time to investigate this. When I posted, I had 16 GB of memory in my system, and BOINC was allowed to use 90% of memory. I have since cannibalized memory from another system, which is why it's now showing 32 GB. The memory is mismatched, and even though it's DDR4 it's showing speeds lower than what I thought was possible for that standard (1067 MHz). I also updated Rosetta to a one third share, with WCG at two thirds. BOINC doesn't seem to care, and is running only Rosetta on all 32 threads, as if to make up for lost time. Since adding 16 more gigabytes of memory, I've still lost a task to OOM errors. I've always left applications in memory while suspended. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"I also updated Rosetta to a one third share, with WCG at two thirds." Give BOINC Manager some time to get used to the new project resource share. It will balance out when the work cache is refreshed. Yes, there are OOM errors occurring. I am not certain when the WU configuration to indicate they require more memory will roll out, nor can I say I know exactly what BOINC Manager will do with the better info on the tasks. But I am hoping things settle down next week. Rosetta Moderator: Mod.Sense |
©2024 University of Washington
https://www.bakerlab.org