Message boards : Number crunching : 23.3 GB RAM per Rosetta WU
Author | Message |
---|---|
Aurum Send message Joined: 12 Jul 17 Posts: 32 Credit: 38,158,977 RAC: 0 |
true_relax_SAVE_ALL_OUT_1044288_60_1 is using 100% of the RAM on its host, which has 24 GB. When it arrived last night, 23 WUs were running on that computer. Then they started going into "Suspended: Waiting for Memory" mode one by one. This morning all 22 WUs that were crowded out had managed to complete, but no new ones can run and 45 WUs are Ready to Run. Is this where Rosetta is headed, one WU per computer??? |
Aurum Send message Joined: 12 Jul 17 Posts: 32 Credit: 38,158,977 RAC: 0 |
It gets better, my wingman UH UIT HPC erred out with 258 GB RAM. Looking for a new wingman. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1164499996 https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=4011154 |
Falconet Send message Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 1,506 |
The wingman should have taken 8 hours to finish (28800 seconds from the log, his other valids confirm the 8 hour preference) yet it took 18 hours until it errored out so it sounds like the 4 hour cut-off by the watchdog didn't even work? That wingman host even had another task with the same issue: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1164500140 Bad batch? Probably best to go ahead and abort, I suppose. |
Aurum Send message Joined: 12 Jul 17 Posts: 32 Credit: 38,158,977 RAC: 0 |
It's at 62.6% and 7.5 hours. I'll let it run and see what happens. Fine with me if they do a server abort. |
Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
"sounds like the 4 hour cut-off by the watchdog didn't even work?" Watchdog is 10 hours. It’s the 36000s on top of the target 28800s in the log output: `BOINC:: CPU time: 64803.2s, 36000s + 28800s [2020-11-29 23:26:16:] :: BOINC` |
Falconet Send message Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 1,506 |
My bad. Thanks! |
amgthis Send message Joined: 25 Mar 06 Posts: 81 Credit: 203,879,282 RAC: 0 |
I just noticed that this morning. Basically locked up my box. lol Memory leak??? I'm deleting all I see. |
Aurum Send message Joined: 12 Jul 17 Posts: 32 Credit: 38,158,977 RAC: 0 |
Just happened to notice this big boy was at 98.6%, so I watched it finish. It hit 100% and I lost contact with Rig-33. When it returned, this WU was tagged as Aborted, but not by me. I don't recall what they call that last step between 100% and Uploading, rollup maybe. But it may have needed additional RAM, of which there was none, since it had already consumed everything that computer had except the 16 GB swap file, which I did not see it use. Hopefully the responsible party will fix the bug. I saw memory leak mentioned. If it was a memory leak, wouldn't all WUs run by rosetta_4.20_x86_64-pc-linux-gnu suffer from it??? Is it possible to have a WU specific memory leak??? |
Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
"Is it possible to have a WU specific memory leak???" Definitely. Rosetta is a huge toolbox of modelling protocols and analysis methods. It’s also massively data-driven: each work unit comes with its own set of input values, giving it a starting point different from all others. So it’s certainly plausible that one type of task could be using modules that others don’t, or that one work unit has inputs that lead an algorithm to become unstable. |
Cygnus X-1 Send message Joined: 20 Feb 06 Posts: 3 Credit: 5,390,955 RAC: 0 |
Is this happening to anyone else with more than enough RAM to run one of these? Rosetta is currently running on only 3 out of 16 threads. As the other regular tasks finish, new ones are not being started. I'm aware that this happens when there is not enough RAM; I've seen it on my other systems with less RAM with the horns5 tasks. But in this case I have a little over 20 GB of RAM still available. |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1679 Credit: 17,818,615 RAC: 22,741 |
I had a true_fold_and_dock_SAVE_ALL_OUT_1044288_598 that errored out after 4hrs 35 to 38min for myself & my wingman. It only used just under 4GB of RAM. `<core_client_version>7.16.11</core_client_version> <![CDATA[ <message> - exit code 1 (0x1)</message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe @true_fold_and_dock_flags -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip fold_and_dock_true_data.zip -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3049654 Using database: database_357d5d93529_n_methylminirosetta_database bad torsion type for JumpAtom: 1 ERROR:: Exit from: ..\..\..\src\core\kinematics\tree\JumpAtom.cc line: 103 BOINC:: Error reading and gzipping output datafile: default.out 12:45:58 (22644): called boinc_finish(1) </stderr_txt> ]]>` Grant Darwin NT |
Cygnus X-1 Send message Joined: 20 Feb 06 Posts: 3 Credit: 5,390,955 RAC: 0 |
I'm now down to that task being the only one running. It's now at about 90% done after running for over 16 hours (I'm using the default 8 hours). |
Aurum Send message Joined: 12 Jul 17 Posts: 32 Credit: 38,158,977 RAC: 0 |
Got another one. This is an E5-2697 v4 18c32t and it's using 21 of 32 GB RAM. It's the only WU left running and has also used 4.8 of 16 GB Swap. Aborted. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1164500044 true_relax_SAVE_ALL_OUT_1044287_62 true_relax_SAVE_ALL_OUT_1044288_60 |
Cygnus X-1 Send message Joined: 20 Feb 06 Posts: 3 Credit: 5,390,955 RAC: 0 |
Well, it finished running after 18 hours. It looked like it was successful too, but it actually ended with an error: https://boinc.bakerlab.org/rosetta/result.php?resultid=1300333774 It was also sent to someone's Raspberry Pi with 4GB of RAM. https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=5847663 That's not going to end well. |
Aurum Send message Joined: 12 Jul 17 Posts: 32 Credit: 38,158,977 RAC: 0 |
true_relax_SAVE_ALL_OUT_1044287_54 true_relax_SAVE_ALL_OUT_1044287_62 true_relax_SAVE_ALL_OUT_1044288_60 |
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,277,304 RAC: 1,589 |
"Just happened to notice this big boy was at 98.6% so I watched it finish. It hit 100% and I lost contact with Rig-33. When it returned this WU was tagged as Aborted but not by me." Easily, if the leak is in a section of the program that only that WU uses. |
wolfman1360 Send message Joined: 18 Feb 17 Posts: 72 Credit: 18,450,036 RAC: 0 |
Thanks for the heads up. Going through my tasks to search for these. Hopefully these get yanked from the server sooner rather than later. A lot of my machines can handle around 2 GB per thread. The Ryzen 3700X can handle about 4, as can a few others. |
Aurum Send message Joined: 12 Jul 17 Posts: 32 Credit: 38,158,977 RAC: 0 |
true_relax_SAVE_ALL_OUT_1044287_24 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1164498828 true_relax_SAVE_ALL_OUT_1044287_54 true_relax_SAVE_ALL_OUT_1044287_62 true_relax_SAVE_ALL_OUT_1044288_60 |
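
The watchdog arithmetic discussed in this thread (a target run time of 28800 s plus a 36000 s allowance, matching the roughly 18 hours the stuck tasks ran) can be checked with a short Python sketch. This is only an illustration, not Rosetta or BOINC code: the function name `watchdog_cutoff` is hypothetical, and it assumes the cut-off is simply the sum of the `-cpu_run_time` and `-boinc::cpu_run_timeout` values, as the log line quoted above suggests.

```python
import re


def watchdog_cutoff(command: str) -> tuple[int, int, int]:
    """Return (target, allowance, cutoff) in CPU seconds.

    Assumption: the watchdog ends a task once CPU time exceeds
    -cpu_run_time plus -boinc::cpu_run_timeout, as inferred from the
    "36000s + 28800s" log line in this thread.
    """
    target = re.search(r"-cpu_run_time\s+(\d+)", command)
    allowance = re.search(r"-boinc::cpu_run_timeout\s+(\d+)", command)
    if target is None or allowance is None:
        raise ValueError("missing -cpu_run_time or -boinc::cpu_run_timeout")
    t, a = int(target.group(1)), int(allowance.group(1))
    return t, a, t + a


# Flags excerpted from the stderr output posted in this thread.
flags = (
    "-nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 "
    "-boinc::watchdog -boinc::cpu_run_timeout 36000"
)
target, allowance, cutoff = watchdog_cutoff(flags)
print(f"target {target / 3600:.0f} h + allowance {allowance / 3600:.0f} h "
      f"= cut-off {cutoff / 3600:.0f} h ({cutoff} s)")
```

With the flags shown, the computed cut-off is 64800 s, which lines up with the 64803.2 s of CPU time reported in the wingman's log.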
©2024 University of Washington
https://www.bakerlab.org