More checkpointing problems

Message boards : Number crunching : More checkpointing problems

Profile shanen
Joined: 16 Apr 14
Posts: 187
Credit: 11,417,016
RAC: 7,306
Message 89028 - Posted: 30 May 2018, 12:14:41 UTC

This time it's the tasks named nRoCM....
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

Profile shanen
Joined: 16 Apr 14
Posts: 187
Credit: 11,417,016
RAC: 7,306
Message 89066 - Posted: 6 Jun 2018, 11:40:17 UTC

Got another nRoCM... task with over three hours of uncheckpointed work on it, and I want to shut down the computer now. That and the cursed 3-day deadline tasks are making this project into too much of a headache, notwithstanding having passed 7 million points...
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

Profile [VENETO] boboviz
Joined: 1 Dec 05
Posts: 934
Credit: 3,585,450
RAC: 2,004
Message 89068 - Posted: 6 Jun 2018, 15:11:08 UTC - in response to Message 89066.  

Got another nRoCM... task with over three hours of uncheckpointed work on it, and I want to shut down the computer now....


Rosetta runs many different, heterogeneous simulations. For some it's possible to checkpoint, for others it isn't (for example, 1 decoy in 3 hours).
That's normal.

Mod.Sense
Volunteer moderator
Project administrator
Joined: 22 Aug 06
Posts: 3545
Credit: 0
RAC: 0
Message 89074 - Posted: 7 Jun 2018, 17:40:17 UTC

Some protocols do checkpoint within a model as well. But the additional coding required to do such additional checkpoints is often not done as the protocols are first being developed.
Rosetta Moderator: Mod.Sense
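
As a rough illustration of what "checkpointing within a model" involves, here is a minimal Python sketch. It is not Rosetta's code: the function names, the JSON state format, and the `stop_after` parameter (used only to simulate a shutdown) are all assumptions made for the example. The idea is to save the loop state periodically with an atomic rename, so that a restart resumes from the last saved step instead of from zero.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic on the same filesystem, so a crash never leaves a half-written file.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(path, n_steps, every, stop_after=None):
    """Run a toy computation, checkpointing every `every` steps.

    stop_after (hypothetical) simulates the machine being shut down
    after that many steps in this run, without a final checkpoint.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return state  # simulated shutdown: work since the last checkpoint is lost
        state["total"] += state["step"]  # stand-in for the real work
        state["step"] += 1
        done_this_run += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

With `every=3` and a simulated shutdown after 5 steps, the restart resumes from step 3, so only two steps are repeated. The extra coding mentioned above is deciding what belongs in `state` for a real protocol, which is much harder than in this toy loop.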

Profile shanen
Joined: 16 Apr 14
Posts: 187
Credit: 11,417,016
RAC: 7,306
Message 89079 - Posted: 8 Jun 2018, 8:41:04 UTC - in response to Message 89068.  

Once again I'm trying to shut down the computer and there's a task with a lot of uncheckpointed work. It's an rb... this time, which sometimes happens.

It might be "normal", but it's an excuse and everyone has 'em and they all... I certainly hope the code is properly checked and tested on the real scientific results side, but on the volunteer side, it sure looks like they aren't particularly competent coders. As I've noted before, if I were still refereeing papers for the journals, and someone submitted a paper that was based on results from rosetta@home, I would be extremely curious and concerned about the quality of the code.

Another interpretation is that they just don't care about how much of the donors' efforts and electricity they waste. If they actually did care, they would actually be able to see the results of reduced throughput for tasks with long checkpoints. In some cases, a computer could get stuffed with tasks that never make progress, constantly restarting until they get killed for passing their deadlines.

Right now I just nuke 3-day deadline tasks and nRoCM tasks on sight, as long as they haven't done much work. That way I eliminate most of the problems in advance, at the cost of wasting some bandwidth for discarded data.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

Profile [VENETO] boboviz
Joined: 1 Dec 05
Posts: 934
Credit: 3,585,450
RAC: 2,004
Message 89081 - Posted: 8 Jun 2018, 16:13:21 UTC - in response to Message 89079.  

It might be "normal", but it's an excuse and everyone has 'em and they all... I certainly hope the code is properly checked and tested on the real scientific results side, but on the volunteer side, it sure looks like they aren't particularly competent coders. As I've noted before, if I were still refereeing papers for the journals, and someone submitted a paper that was based on results from rosetta@home, I would be extremely curious and concerned about the quality of the code.


Sometimes it is simply IMPOSSIBLE to have checkpoints.
Other projects, for example, use a virtual machine to solve that problem.

Profile shanen
Joined: 16 Apr 14
Posts: 187
Credit: 11,417,016
RAC: 7,306
Message 89096 - Posted: 10 Jun 2018, 23:52:44 UTC
Last modified: 10 Jun 2018, 23:55:50 UTC

Almost 10 hours of work on that task. Refused to checkpoint at any point, so all of the work was apparently held in memory with NO intermediate results. Suddenly ended with a computation error and presumably no credit received.

Not motivating.

As a volunteer the demotivating part might be my main concern, but as a wannabe or former or retired scientist of some sort, my primary concern is actually what it says about the quality of the code. GIGO is not the only way to produce worthless results. Even the best data with bad analysis or with programming flaws will also produce garbage.

Right now I have another task that looks extremely similar to the one that just died in spasms of computation error. I am NOT predicting a happy ending for it.

By the way, it was also one of those especially troublesome 3-day-deadline tasks. At this point I think it's looking like it's in a race condition between timing out, blowing up in a computation error, or perhaps getting aborted by the project. (Just saw one of those hit a checkpointed task with 3 hours of work that was apparently tossed.) This rush task stuff reminds me of "More haste, less speed."
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

rjs5
Joined: 22 Nov 10
Posts: 250
Credit: 8,037,564
RAC: 0
Message 89100 - Posted: 11 Jun 2018, 21:33:45 UTC - in response to Message 89096.  

Almost 10 hours of work on that task. Refused to checkpoint at any point, so all of the work was apparently held in memory with NO intermediate results. Suddenly ended with a computation error and presumably no credit received.

Not motivating.

As a volunteer the demotivating part might be my main concern, but as a wannabe or former or retired scientist of some sort, my primary concern is actually what it says about the quality of the code. GIGO is not the only way to produce worthless results. Even the best data with bad analysis or with programming flaws will also produce garbage.

Right now I have another task that looks extremely similar to the one that just died in spasms of computation error. I am NOT predicting a happy ending for it.

By the way, it was also one of those especially troublesome 3-day-deadline tasks. At this point I think it's looking like it's in a race condition between timing out, blowing up in a computation error, or perhaps getting aborted by the project. (Just saw one of those hit a checkpointed task with 3 hours of work that was apparently tossed.) This rush task stuff reminds me of "More haste, less speed."


A number of the WU I looked at were failing with an Out of Memory error. That failure will be preceded by paging that will cause all the jobs to run VERY slowly and also take a VERY long time to complete enough work for the program to think that it needs to do a checkpoint. It may have spent a long time crunching, but it was spending all its time accessing the disk. This has happened to me in the past with more memory than you have.

You can use the Windows Task Manager to monitor memory usage and disk activity. On Ubuntu you can use "vmstat 1".

Computer 1758415 Memory 3956.48 MB (4 processors) Win10

rjs5
Joined: 22 Nov 10
Posts: 250
Credit: 8,037,564
RAC: 0
Message 89102 - Posted: 12 Jun 2018, 4:29:31 UTC - in response to Message 89100.  

I just noticed this Rosetta job and it was consuming 751.12MB of memory.

Rosetta@home 4.07 Rosetta DRH_curve_X_h24_l2_h28_l3_13785_1_2_loop_73_0001_one_capped_0001_fragments_relax_SAVE_ALL_OUT_655868_14_0 751.12 MB

Profile [VENETO] boboviz
Joined: 1 Dec 05
Posts: 934
Credit: 3,585,450
RAC: 2,004
Message 89105 - Posted: 12 Jun 2018, 10:31:48 UTC - in response to Message 89102.  

I just noticed this Rosetta job and it was consuming 751.12MB of memory.


I have some WUs using over 1.2 GB of RAM....

rjs5
Joined: 22 Nov 10
Posts: 250
Credit: 8,037,564
RAC: 0
Message 89106 - Posted: 12 Jun 2018, 13:47:36 UTC - in response to Message 89105.  

I just noticed this Rosetta job and it was consuming 751.12MB of memory.


I have some WUs using over 1.2 GB of RAM....


I just looked at 2 that are running right now and they are taking 680.61MB and 949.69MB.
If you don't have 1GB/Rosetta WU on your system, the multiple jobs will consume all of physical memory and start paging.
Systems are typically designed to allow/support memory requirements that are TWICE the physical memory size, BUT when you start executing "OFF DISK" ... jobs will run many, many times slower. Best to pick a more well behaved project to crunch.

NOTE that any one of the admins can run a script and identify machines in trouble and send messages to the owners ... OR the developers can identify the problem and fix it.


Rosetta@home 4.07 Rosetta rb_06_11_344_504__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_670132_2542_0 Running 680.61 MB
Rosetta@home 4.07 Rosetta rb_06_11_344_504__t000__4_C1_SAVE_ALL_OUT_IGNORE_THE_REST_670132_6336_0 Running 949.69 MB
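
The 1 GB per work unit rule of thumb can be turned into a quick back-of-the-envelope check. The sketch below is only an illustration: the function name and the `reserved_mb` allowance for the operating system are assumptions, not anything BOINC or Rosetta provides.

```python
def max_tasks_in_ram(physical_mb, per_task_mb, reserved_mb=0.0):
    """Number of tasks of a given size that fit in physical memory without paging.

    reserved_mb is a hypothetical allowance for the OS and other programs.
    """
    usable = physical_mb - reserved_mb
    if usable <= 0 or per_task_mb <= 0:
        return 0
    return int(usable // per_task_mb)


# The 4-processor machine quoted earlier in the thread reports 3956.48 MB.
print(max_tasks_in_ram(3956.48, 1024))                     # at 1 GB per task
print(max_tasks_in_ram(3956.48, 949.69, reserved_mb=1024))  # largest task above, 1 GB kept for the OS
```

Under these assumptions a 4-processor machine with under 4 GB cannot hold four such tasks at once, which is the situation where paging would be expected to start.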

Jim1348
Joined: 19 Jan 06
Posts: 301
Credit: 9,534,921
RAC: 17,015
Message 89107 - Posted: 12 Jun 2018, 15:06:24 UTC - in response to Message 89106.  

There was a suggestion a while ago to have the option to run large memory work units, as by a checkbox on the preferences page. A lot of us supported it, but Rosetta decided to tame their work units instead. But it seems that they creep up in memory usage from time to time. That is OK with me, but they need to monitor their stuff and take the appropriate action, whatever it is.

rjs5
Joined: 22 Nov 10
Posts: 250
Credit: 8,037,564
RAC: 0
Message 89108 - Posted: 12 Jun 2018, 20:23:37 UTC - in response to Message 89107.  

There was a suggestion a while ago to have the option to run large memory work units, as by a checkbox on the preferences page. A lot of us supported it, but Rosetta decided to tame their work units instead. But it seems that they creep up in memory usage from time to time. That is OK with me, but they need to monitor their stuff and take the appropriate action, whatever it is.


The Rosetta structure chosen is to bundle up all the code for all the models in one binary. It makes for a sparse CPU execution loop and requires more memory PAGES than individual binaries. Since they chose the bundled binary approach, it is difficult for them to control the system demands and performance.

Running 11 Rosetta WU on my Fedora 27 box, you can see that they "typically" consume 400MB - 1GB. 1GB range memory requirement seems to be the rule rather than the exception.
I have not gathered the data to make a guess what the culprit option or condition is. I have a couple ideas, but they all imply developers who do not completely understand what they are doing.


"top ic" command sorted by M(emory) (clipped to show Rosetta WU)

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23557 boinc 39 19 1072080 823140 75816 R 95.7 5.0 195:26.88 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_06_11_3+
22056 boinc 39 19 1060828 809004 69432 R 98.0 4.9 266:15.05 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -relax::minimize_bond_lengths 1 -frag3 00001.+
22071 boinc 39 19 1026124 774204 69308 R 99.7 4.7 247:04.42 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -abinitio::rsd_wt_helix 0.5 -frag9 00001.200.+
22075 boinc 39 19 811068 751228 88700 R 96.7 4.6 241:42.56 ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_06_+
22077 boinc 39 19 978008 726072 69292 R 90.4 4.4 238:35.55 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -frag9 00001.200.9mers -optimization::default+
26011 boinc 39 19 786288 725100 88580 R 97.7 4.4 50:54.43 ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu -run:protocol jd2_scripting @P16917_group+
22064 boinc 39 19 976800 724668 69100 R 99.3 4.4 261:03.21 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -abinitio::rsd_wt_helix 0.5 -ex2aro 1 -relax:+
25660 boinc 39 19 894232 642400 69452 R 96.0 3.9 70:13.54 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -relax::dualspace 1 -out:file:silent_struct_t+
22067 boinc 39 19 650420 588960 88224 R 99.0 3.6 248:00.18 ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_06_+
22073 boinc 39 19 403668 342504 87380 R 91.0 2.1 242:34.70 ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu -frag3 00001.200.3mers -frag9 00001.200.9+
22054 boinc 39 19 389228 329200 87352 R 97.7 2.0 271:14.86 ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu -frag3 00001.200.3mers -frag9 00001.200.9+
22059 boinc 39 19 373852 313404 87400 R 99.3 1.9 262:33.95 ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 0+
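
To total the resident memory in a listing like the one above, a small Python sketch along these lines may help. It assumes the column order shown in the header (RES is the sixth field) and that RES is printed as a plain number of KiB, as it is here; top can also print suffixed values such as "1.2g", which this sketch does not handle. The function name is made up for the example.

```python
def resident_kib(top_lines):
    """Sum the RES column (KiB) over process lines from top output.

    Assumes columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND.
    Lines that do not start with a numeric PID (such as the header) are skipped.
    """
    total = 0
    for line in top_lines:
        fields = line.split()
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        if not fields[5].isdigit():
            continue  # suffixed values like "1.2g" are not handled in this sketch
        total += int(fields[5])
    return total


sample = [
    "PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND",
    "23557 boinc 39 19 1072080 823140 75816 R 95.7 5.0 195:26.88 rosetta_4.07",
    "22059 boinc 39 19 373852 313404 87400 R 99.3 1.9 262:33.95 minirosetta_3.78",
]
print(resident_kib(sample) / 1024, "MiB")
```

Note that RES includes shared pages (the SHR column), so summing it across processes somewhat overstates the true physical memory in use.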

Profile shanen
Joined: 16 Apr 14
Posts: 187
Credit: 11,417,016
RAC: 7,306
Message 89127 - Posted: 20 Jun 2018, 14:11:33 UTC

Just confirmed a new version of the checkpointing problem. I had suspected something along those lines. It was a 3-day rb... unit this time. The Properties showed that it had about 30 minutes until it would finish, but it had been checkpointed 00:00 minutes ago. Usually that's supposed to mean it just finished checkpointing, but the value didn't change over several minutes. Highly suspicious. So I went ahead and shut down the machine anyway, and sure enough, it was the status % that was correct, and after I booted the machine the next time, it suddenly was 4 hours from completion--which basically guarantees the task will miss its deadline.

Do I need to say again that the 3-day deadline is fundamentally unreasonable, and much less reasonable when the checkpointing code is buggy, too?

Right now I suspect a lot of these rush units are really caused by what I regard as essentially bad project management and buggy programming. I would send along the details, but right now this Linux box is also unable to open the BOINC Manager. Happens pretty often, and I'm pretty sure the trick is to get it open (on both of my Linux boxen) before the Rosetta tasks have eaten up too much memory. For a long time I thought that was a BOINC-level problem, but considering some of the memory allocation problems mentioned elsewhere in this thread, I'm leaning back towards the caused-by-Rosetta hypothesis.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

Profile shanen
Joined: 16 Apr 14
Posts: 187
Credit: 11,417,016
RAC: 7,306
Message 89128 - Posted: 20 Jun 2018, 14:15:30 UTC - in response to Message 89127.  

Managed to capture the Properties after all. Looks like it may have been a regular unit, but if so, it must have been delayed by intervening 3-day tasks:

Application: Rosetta Mini 3.78
Name: rb_06_13_83780_125820__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_670302_390
State: Running
Received: Thu 14 Jun 2018 08:28:54 AM JST
Report deadline: Fri 22 Jun 2018 08:28:53 AM JST
Estimated computation size: 80,000 GFLOPs
CPU time: 03:55:32
CPU time since checkpoint: 00:00:42
Elapsed time: 04:01:49
Estimated time remaining: 03:54:42
Fraction done: 48.720%
Virtual memory size: 348.68 MB
Working set size: 290.45 MB
Directory: slots/3
Process ID: 1485
Progress rate: 11.520% per hour
Executable: minirosetta_3.78_x86_64-pc-linux-gnu
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

Mod.Sense
Volunteer moderator
Project administrator
Joined: 22 Aug 06
Posts: 3545
Credit: 0
RAC: 0
Message 89130 - Posted: 21 Jun 2018, 15:04:19 UTC

Please review what CASP13 is, and what the timeframe is between the issuance of a protein target and the delivery deadline for the predicted model, before posting further rants about 3-day work units and project management. I'm sure you think that posting the same complaint several times a week somehow strengthens your case or bolsters support for your stance. It doesn't.

The project sends some tasks with 3 day deadlines. If this causes problems in your operating environments, then R@h is not an appropriate BOINC project for you.

CASP13
Rosetta Moderator: Mod.Sense

Profile shanen
Joined: 16 Apr 14
Posts: 187
Credit: 11,417,016
RAC: 7,306
Message 89132 - Posted: 21 Jun 2018, 22:36:50 UTC - in response to Message 89130.  

NOT a constructive reply. As if you appreciate your volunteers.

Actually, what it most reminds me of is spineless chicken hawks who thank me for my service. That is NOT why I enlisted, and I do NOT care about your gratitude or pretenses of gratitude or even the opposite in this case. If you served, then you know why and we don't have to thank each other. If you didn't serve when you could have, then I mostly doubt you have any understanding of what service is or why people should do it. (I'm NOT limiting that to military service, by the way. That's another newfangled form of fake patriotism.)

The kindest thing I can say is that 3-day deadlines are bad service in some form, and I don't care about your whiny excuses. Reminds me of an old military expression, which in the cleaned up version goes "Excuses are like armpits. Everyone's got 'em and they all stink."

Oh yeah. Two more things. (1) Large numbers of computation errors, mostly at the beginning and it seems more often under Linux, and (2) Eight more hours of computation lost due to the checkpointing problems.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

rjs5
Joined: 22 Nov 10
Posts: 250
Credit: 8,037,564
RAC: 0
Message 89136 - Posted: 22 Jun 2018, 15:24:29 UTC - in response to Message 89132.  
Last modified: 22 Jun 2018, 16:03:29 UTC

IMO, (based on the problems you have described) the "checkpoint problem" you are seeing is NOT a "checkpoint problem".

When I looked at the WU with Compute Errors, they all had "Out of Memory" errors. I looked at the "nRoCM_*" results and they all had "Out of Memory" in the stderr output file.
It is a problem with Rosetta needing more memory than available on the machine.

When the system runs low on PHYSICAL memory, the machine will PAGE out CODE/DATA to disk and allocate that PHYSICAL memory to the other job. When this condition happens and gets worse, the machine is BUSY, but it is not accomplishing any work. Since it is not accomplishing any work, the job will not NEED to checkpoint. No progress has been made.

When a machine gets into this condition (executing off of DISK instead of MEMORY), WU will not make progress .... WU will not complete .... following WU will not start .... and TIME OUT. It is pretty tough for the machine to heal by itself.

This PAGING condition will greatly accelerate the hardware wear and then failure of the disk drive ... SSD or HDD drives.
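
On Linux, one way to check whether a machine is actually paging is to compare the `pswpin` and `pswpout` counters in `/proc/vmstat` at two points in time. The Python sketch below is only an illustration under that assumption; the function names are made up for the example.

```python
def parse_vmstat(text):
    """Parse 'name value' lines (the /proc/vmstat format) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_pages_moved(before, after):
    """Return (pages swapped in, pages swapped out) between two snapshots."""
    swapped_in = after.get("pswpin", 0) - before.get("pswpin", 0)
    swapped_out = after.get("pswpout", 0) - before.get("pswpout", 0)
    return swapped_in, swapped_out


def read_vmstat(path="/proc/vmstat"):
    """Take one snapshot of the kernel counters (Linux only)."""
    with open(path) as f:
        return parse_vmstat(f.read())
```

Taking two `read_vmstat()` snapshots a few seconds apart and passing them to `swap_pages_moved` gives a rough picture: counts that keep climbing while the Rosetta tasks run would be consistent with the paging described above, while counts that stay at zero would point elsewhere.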

----
Maybe I can help. If you are not interested, that is fine too.
Which machine is struggling the most? Let's figure out what the problem is.

Profile shanen
Joined: 16 Apr 14
Posts: 187
Credit: 11,417,016
RAC: 7,306
Message 89145 - Posted: 25 Jun 2018, 9:31:41 UTC

Here's another good example of the new checkpointing problem, though perhaps it's better to describe it as possible lost work. I noticed that the CPU time is also frozen, though the elapsed time is increasing. Based on prior experience with these, the checkpoint will never take place, and the task will never be completed no matter how long it runs. Buggy, buggy, buggy.

Application: Rosetta Mini 3.78
Name: rb_06_06_83627_125669__t000__0_C2_SAVE_ALL_OUT_IGNORE_THE_REST_669729_480
State: Running
Received: Fri 22 Jun 2018 08:20:14 AM JST
Report deadline: Sat 30 Jun 2018 08:20:13 AM JST
Estimated computation size: 80,000 GFLOPs
CPU time: 02:25:26
CPU time since checkpoint: 00:00:00
Elapsed time: 08:39:27
Estimated time remaining: 03:31:59
Fraction done: 20.198%
Virtual memory size: 155.29 MB
Working set size: 51.39 MB
Directory: slots/4
Process ID: 2359
Progress rate: 2.160% per hour
Executable: minirosetta_3.78_x86_64-pc-linux-gnu
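
One number worth pulling out of a Properties dump like this is the ratio of CPU time to elapsed time. The Python sketch below is just a convenience for that arithmetic; the function names are made up for the example.

```python
def hms_to_seconds(text):
    """Convert an 'HH:MM:SS' string, as shown in BOINC task Properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_time, elapsed_time):
    """Fraction of wall-clock time the task actually spent on the CPU."""
    elapsed = hms_to_seconds(elapsed_time)
    if elapsed == 0:
        return 0.0
    return hms_to_seconds(cpu_time) / elapsed


# Values from the Properties above, and from the task in Message 89128.
print(round(cpu_efficiency("02:25:26", "08:39:27"), 3))
print(round(cpu_efficiency("03:55:32", "04:01:49"), 3))
```

For this task the ratio is roughly 0.28, against roughly 0.97 for the task in Message 89128. A ratio that low means the process spent most of its wall-clock time waiting rather than computing. That is consistent with the paging explanation given earlier in the thread, though it does not prove it.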

At the same time I notice this machine has a couple of computation error tasks. Let's see if I can catch their Properties, too...


Application: Rosetta 4.07
Name: DRH_curve_X_h30_l3_h23_l2_16685_3_2_loop_21_0001_one_capped_0001_fragments_fold_SAVE_ALL_OUT_663453_44
State: Computation error
Received: Mon 25 Jun 2018 06:15:24 PM JST
Report deadline: Tue 03 Jul 2018 06:15:24 PM JST
Estimated computation size: 80,000 GFLOPs
CPU time: ---
Elapsed time: ---
Executable: rosetta_4.07_x86_64-pc-linux-gnu


Application: Rosetta 4.07
Name: DRH_curve_X_h19_l3_h26_l3_06738_1_loop_63_0001_one_capped_0001_fragments_fold_SAVE_ALL_OUT_666406_131
State: Computation error
Received: Mon 25 Jun 2018 06:15:24 PM JST
Report deadline: Tue 03 Jul 2018 06:15:24 PM JST
Estimated computation size: 80,000 GFLOPs
CPU time: ---
Elapsed time: 00:00:09
Executable: rosetta_4.07_x86_64-pc-linux-gnu

Also several more of those appeared, all DRH tasks. Buggy, buggy, buggy.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

Profile shanen
Joined: 16 Apr 14
Posts: 187
Credit: 11,417,016
RAC: 7,306
Message 89151 - Posted: 25 Jun 2018, 23:17:08 UTC

And from a Windows 10 machine, a 4-hour computation error that is probably a checkpointing error in disguise, since it happened when the machine was booted after being shut down. Perhaps diagnostic that another task from the same sub-project managed to complete in just over 4 hours?

Can't paste the Properties from Windows 10. Not even as an image.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)



©2019 University of Washington
http://www.bakerlab.org