23.3 GB RAM per Rosetta WU

Message boards : Number crunching : 23.3 GB RAM per Rosetta WU

Aurum
Joined: 12 Jul 17
Posts: 32
Credit: 38,158,977
RAC: 1,011
Message 99821 - Posted: 30 Nov 2020, 14:44:39 UTC

true_relax_SAVE_ALL_OUT_1044288_60_1 is using 100% of the RAM on its host that has 24 GB RAM.
When it arrived last night, 23 WUs were running on that computer. Then they started going into "Suspended: Waiting for Memory" mode one by one. This morning all 22 WUs that it crowded out had managed to complete, but no new ones can run and 45 WUs are Ready to Run.
Is this where Rosetta is headed, one WU per computer???
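A rough way to picture the situation described above is the sketch below. It is a simplified greedy model, not BOINC's actual scheduling logic: the function name, the task names, and the 1.0 GB size of the small tasks are assumptions made for illustration. Only the 23.3 GB task and the 24 GB host come from this thread.

```python
def split_by_memory(task_sizes_gb, ram_limit_gb):
    """Greedy fit: a task runs if it still fits in RAM, otherwise it waits."""
    running, waiting, used = [], [], 0.0
    for name, size in task_sizes_gb:
        if used + size <= ram_limit_gb:
            running.append(name)
            used += size
        else:
            waiting.append(name)
    return running, waiting

# One 23.3 GB task (from the thread) plus 22 hypothetical 1.0 GB tasks.
tasks = [("true_relax_big", 23.3)] + [(f"small_{i}", 1.0) for i in range(22)]
running, waiting = split_by_memory(tasks, ram_limit_gb=24.0)

print(running)       # only the big task fits
print(len(waiting))  # every small task waits for memory
```

Under these assumptions a single oversized task is enough to keep everything else waiting, which matches the "Waiting for Memory" behaviour reported here.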

Aurum
Joined: 12 Jul 17
Posts: 32
Credit: 38,158,977
RAC: 1,011
Message 99822 - Posted: 30 Nov 2020, 14:52:19 UTC

It gets better: my wingman, UH UIT HPC, erred out with 258 GB RAM. Looking for a new wingman.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1164499996
https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=4011154

Falconet
Joined: 9 Mar 09
Posts: 350
Credit: 1,017,068
RAC: 357
Message 99825 - Posted: 30 Nov 2020, 16:43:23 UTC

The wingman should have taken 8 hours to finish (28800 seconds from the log; his other valids confirm the 8 hour preference), yet it took 18 hours until it errored out. So it sounds like the 4 hour cut-off by the watchdog didn't even work?

That wingman host even had another task with the same issue: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1164500140

Bad batch? Probably best to go ahead and abort, I suppose.

Aurum
Joined: 12 Jul 17
Posts: 32
Credit: 38,158,977
RAC: 1,011
Message 99826 - Posted: 30 Nov 2020, 16:46:54 UTC

It's at 62.6% and 7.5 hours. I'll let it run and see what happens.
Fine with me if they do a server abort.

Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 99827 - Posted: 30 Nov 2020, 16:52:27 UTC - in response to Message 99825.  

Quoting Message 99825: "sounds like the 4 hour cut-off by the watchdog didn't even work?"
Watchdog is 10 hours. It’s the 36000s on top of the target 28800s in the log output:
BOINC:: CPU time: 64803.2s, 36000s + 28800s[2020-11-29 23:26:16:] :: BOINC
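The arithmetic behind that log line can be checked with a short sketch. It assumes, going only by the log output quoted here, that the watchdog ends a task once its CPU time passes the target run time plus the timeout. The function name is made up for illustration.

```python
def watchdog_cutoff_seconds(target_run_time_s, timeout_s):
    """CPU time after which the watchdog ends the task (assumed: target + timeout)."""
    return target_run_time_s + timeout_s

target = 28800       # 8 hour run time preference, from the log
timeout = 36000      # 10 hour watchdog allowance, from the log
cpu_time = 64803.2   # CPU time reported in the log line

cutoff = watchdog_cutoff_seconds(target, timeout)
print(cutoff / 3600)      # cutoff in hours
print(cpu_time > cutoff)  # whether the task ran past the cutoff
```

With these numbers the cutoff comes to 64800 s, i.e. 18 hours, which is the run time the wingman's task reached before erroring out.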

Falconet
Joined: 9 Mar 09
Posts: 350
Credit: 1,017,068
RAC: 357
Message 99829 - Posted: 30 Nov 2020, 17:01:43 UTC - in response to Message 99827.  
Last modified: 30 Nov 2020, 17:02:55 UTC

My bad.
Thanks!

amgthis
Joined: 25 Mar 06
Posts: 81
Credit: 203,879,282
RAC: 0
Message 99833 - Posted: 30 Nov 2020, 18:42:59 UTC - in response to Message 99821.  

I just noticed that this morning. Basically locked up my box. lol
Memory leak??? I'm deleting all I see.

Aurum
Joined: 12 Jul 17
Posts: 32
Credit: 38,158,977
RAC: 1,011
Message 99840 - Posted: 30 Nov 2020, 21:38:11 UTC

Just happened to notice this big boy was at 98.6% so I watched it finish. It hit 100% and I lost contact with Rig-33. When it returned this WU was tagged as Aborted but not by me.
I don't recall what they call that last step between 100% and Uploading, rollup maybe. But it may have needed additional RAM, of which there was none, since it had already consumed everything that computer had except the 16 GB swap file, which I did not see it use.
Hopefully the responsible party will fix the bug.

I saw memory leak mentioned. If it was a memory leak wouldn't all WUs run by rosetta_4.20_x86_64-pc-linux-gnu suffer from it??? Is it possible to have a WU specific memory leak???

Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 99842 - Posted: 30 Nov 2020, 22:13:27 UTC - in response to Message 99840.  

Quoting Message 99840: "Is it possible to have a WU specific memory leak???"
Definitely. Rosetta is a huge toolbox of modelling protocols and analysis methods. It’s also massively data-driven: each work unit comes with its own set of input values, giving it a starting point different from all others. So it’s certainly plausible that one type of task could be using modules that others don’t, or that one work unit has inputs that lead an algorithm to become unstable.
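As a toy illustration of that point (not Rosetta's code, which is C++), the sketch below has two hypothetical protocol steps sharing one program. Only one of them keeps references to its scratch buffers, so only work units that use that step grow in memory. All names are invented for the example.

```python
_cache = []  # module-level list that is never cleared: this is the "leak"

def relax_step(structure):
    """Hypothetical step that forgets to release its scratch data."""
    scratch = [0.0] * 1000
    _cache.append(scratch)  # reference kept forever, memory never freed
    return structure

def fold_step(structure):
    """Hypothetical step with no leak: scratch is freed on return."""
    scratch = [0.0] * 1000
    return structure

def run_work_unit(protocol, steps):
    """Run one protocol repeatedly and count the buffers it left behind."""
    before = len(_cache)
    for _ in range(steps):
        protocol("model")
    return len(_cache) - before

leaked_by_fold = run_work_unit(fold_step, 500)
leaked_by_relax = run_work_unit(relax_step, 500)
print(leaked_by_fold, leaked_by_relax)
```

Both "work units" run the same executable, yet only the one that exercises the leaky step accumulates memory.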

Cygnus X-1
Joined: 20 Feb 06
Posts: 3
Credit: 5,390,955
RAC: 0
Message 99845 - Posted: 1 Dec 2020, 3:51:32 UTC

Is this happening to anyone else with more than enough RAM to run one of these?

Rosetta is currently running on only 3 out of 16 threads. As the other regular tasks finish, new ones are not being started.

I'm aware that this happens when there is not enough RAM. I've seen it on my other systems with less RAM when running the horns5 tasks, but in this case I have a little over 20 GB of RAM still available.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1481
Credit: 14,594,347
RAC: 15,036
Message 99846 - Posted: 1 Dec 2020, 6:07:57 UTC

I had a true_fold_and_dock_SAVE_ALL_OUT_1044288_598 that errored out after 4hrs 35 to 38min for myself & my wingman. It only used just under 4GB of RAM.
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
 - exit code 1 (0x1)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe @true_fold_and_dock_flags -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip fold_and_dock_true_data.zip -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3049654
Using database: database_357d5d93529_n_methyl\minirosetta_database
bad torsion type for JumpAtom: 1

ERROR:: Exit from: ..\..\..\src\core\kinematics\tree\JumpAtom.cc line: 103
BOINC:: Error reading and gzipping output datafile: default.out
12:45:58 (22644): called boinc_finish(1)

</stderr_txt>
]]>
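The run time settings can be read straight off the command line in that output. The sketch below is only an illustration: it uses an abbreviated copy of the command, and the helper name `flag_value` is hypothetical. It pulls out `-cpu_run_time` and `-boinc::cpu_run_timeout` with the standard `re` module.

```python
import re

# Abbreviated copy of the command line from the stderr output above.
command = (
    "rosetta_4.20_windows_x86_64.exe @true_fold_and_dock_flags "
    "-nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 "
    "-checkpoint_interval 120 -boinc::watchdog -boinc::cpu_run_timeout 36000"
)

def flag_value(cmd, flag):
    """Return the integer following '-<flag>' in cmd, or None if absent."""
    match = re.search(r"(?<!\S)-" + re.escape(flag) + r"\s+(\d+)", cmd)
    return int(match.group(1)) if match else None

run_time = flag_value(command, "cpu_run_time")
timeout = flag_value(command, "boinc::cpu_run_timeout")
print(run_time, timeout)
print((run_time + timeout) / 3600)  # hours, if the two values simply add up
```

If the two values simply add up, as the earlier log line in this thread suggests, the limit would be 18 hours. This task stopped after about 4.6 hours, so the failure looks like the "bad torsion type" error rather than a watchdog timeout.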

Grant
Darwin NT

Cygnus X-1
Joined: 20 Feb 06
Posts: 3
Credit: 5,390,955
RAC: 0
Message 99850 - Posted: 1 Dec 2020, 12:48:24 UTC

I'm now down to that task being the only one running. It's now about 90% done after running for over 16 hours (I'm using the default 8 hours).

Aurum
Joined: 12 Jul 17
Posts: 32
Credit: 38,158,977
RAC: 1,011
Message 99852 - Posted: 1 Dec 2020, 14:39:21 UTC
Last modified: 1 Dec 2020, 14:52:12 UTC

Got another one. This is an E5-2697 v4 18c32t and it's using 21 of 32 GB RAM. It's the only WU left running and has also used 4.8 of 16 GB Swap. Aborted.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1164500044
true_relax_SAVE_ALL_OUT_1044287_62
true_relax_SAVE_ALL_OUT_1044288_60

Cygnus X-1
Joined: 20 Feb 06
Posts: 3
Credit: 5,390,955
RAC: 0
Message 99855 - Posted: 1 Dec 2020, 15:46:53 UTC

Well, it finished running after 18 hours. It looked like it was successful too, but it actually ended with an error:

https://boinc.bakerlab.org/rosetta/result.php?resultid=1300333774

It was also sent to someone's Raspberry Pi with 4 GB of RAM.

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=5847663

That's not going to end well.

Aurum
Joined: 12 Jul 17
Posts: 32
Credit: 38,158,977
RAC: 1,011
Message 99856 - Posted: 1 Dec 2020, 15:57:12 UTC

true_relax_SAVE_ALL_OUT_1044287_54
true_relax_SAVE_ALL_OUT_1044287_62
true_relax_SAVE_ALL_OUT_1044288_60

robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,843,555
RAC: 1,697
Message 99858 - Posted: 1 Dec 2020, 17:17:48 UTC - in response to Message 99840.  

Quoting Message 99840: "If it was a memory leak wouldn't all WUs run by rosetta_4.20_x86_64-pc-linux-gnu suffer from it??? Is it possible to have a WU specific memory leak???"

Easily, if the leak is in a section of the program that only that WU uses.

wolfman1360
Joined: 18 Feb 17
Posts: 72
Credit: 18,450,036
RAC: 0
Message 99867 - Posted: 2 Dec 2020, 1:33:10 UTC

Thanks for the heads up.
Going through my tasks to search for these.
Hopefully these get yanked from the server sooner rather than later. A lot of my machines can handle around 2 GB per thread. The Ryzen 3700X can handle about 4 GB, as can a few others.

Aurum
Joined: 12 Jul 17
Posts: 32
Credit: 38,158,977
RAC: 1,011
Message 99870 - Posted: 2 Dec 2020, 3:30:13 UTC

true_relax_SAVE_ALL_OUT_1044287_24
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1164498828
true_relax_SAVE_ALL_OUT_1044287_54
true_relax_SAVE_ALL_OUT_1044287_62
true_relax_SAVE_ALL_OUT_1044288_60

©2024 University of Washington
https://www.bakerlab.org