Message boards : Number crunching : More checkpointing problems
Author | Message |
---|---|
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
This time it's the tasks named nRoCM.... #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Got another nRoCM... task with over three hours of uncheckpointed work on it, and I want to shut down the computer now. That and the cursed 3-day deadline tasks are making this project into too much of a headache, notwithstanding having passed 7 million points... #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,622,132 RAC: 9,522 |
Got another nRoCM... task with over three hours of uncheckpointed work on it, and I want to shut down the computer now.... Rosetta runs many different, heterogeneous simulations. For some it's possible to checkpoint; for others it is not (for example, 1 decoy in 3 hours). It's normal. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Some protocols do checkpoint within a model as well. But the additional coding required to do such additional checkpoints is often not done as the protocols are first being developed. Rosetta Moderator: Mod.Sense |
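To make "checkpointing within a model" concrete, here is a minimal sketch of the general technique in Python. It is not Rosetta's or BOINC's code; the file name, the `run_model` loop, and the toy "work" it does are all hypothetical. The idea is only that state is written to disk every so often, so a restart resumes from the last saved step instead of from zero.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "value": 0}
    with open(path) as handle:
        return json.load(handle)


def run_model(path, total_steps, checkpoint_every, stop_after=None):
    """Toy model loop (hypothetical): adds up step numbers, saving state periodically.

    stop_after simulates the user shutting the machine down mid-run.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return state  # interrupted: work since the last checkpoint is lost
        state["value"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

With `checkpoint_every=10`, a run interrupted after 25 steps restarts from step 20, so only 5 steps are repeated. With no checkpoint inside the loop, the same interruption would throw away all 25.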
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Once again I'm trying to shut down the computer and there's a task with a lot of uncheckpointed work. It's an rb... this time, which sometimes happens. It might be "normal", but it's an excuse and everyone has 'em and they all... I certainly hope the code is properly checked and tested on the real scientific results side, but on the volunteer side, it sure looks like they aren't particularly competent coders. As I've noted before, if I were still refereeing papers for the journals, and someone submitted a paper that was based on results from rosetta@home, I would be extremely curious and concerned about the quality of the code. Another interpretation is that they just don't care about how much of the donors' efforts and electricity they waste. If they actually did care, they would actually be able to see the results of reduced throughput for tasks with long checkpoints. In some cases, a computer could get stuffed with tasks that never make progress, constantly restarting until they get killed for passing their deadlines. Right now I just nuke 3-day deadline tasks and nRoCM tasks on sight, as long as they haven't done much work. That way I eliminate most of the problems in advance, at the cost of wasting some bandwidth for discarded data. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,622,132 RAC: 9,522 |
It might be "normal", but it's an excuse and everyone has 'em and they all... I certainly hope the code is properly checked and tested on the real scientific results side, but on the volunteer side, it sure looks like they aren't particularly competent coders. As I've noted before, if I were still refereeing papers for the journals, and someone submitted a paper that was based on results from rosetta@home, I would be extremely curious and concerned about the quality of the code. Sometimes it is simply IMPOSSIBLE to have checkpoints. Other projects, for example, use a virtual machine to solve that problem. |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Almost 10 hours of work on that task. Refused to checkpoint at any point, so all of the work was apparently held in memory with NO intermediate results. Suddenly ended with a computation error and presumably no credit received. Not motivating. As a volunteer the demotivating part might be my main concern, but as a wannabe or former or retired scientist of some sort, my primary concern is actually what it says about the quality of the code. GIGO is not the only way to produce worthless results. Even the best data with bad analysis or with programming flaws will also produce garbage. Right now I have another task that looks extremely similar to the one that just died in spasms of computation error. I am NOT predicting a happy ending for it. By the way, it was also one of those especially troublesome 3-day-deadline tasks. At this point I think it's looking like it's in a race condition between timing out, blowing up in a computation error, or perhaps getting aborted by the project. (Just saw one of those hit a checkpointed task with 3 hours of work that was apparently tossed.) This rush task stuff reminds me of "More haste, less speed." #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,051,657 RAC: 8,071 |
Almost 10 hours of work on that task. Refused to checkpoint at any point, so all of the work was apparently held in memory with NO intermediate results. Suddenly ended with a computation error and presumably no credit received. A number of the WUs I looked at were failing with an Out of Memory error. That failure will be preceded by paging that will cause all the jobs to run VERY slowly and also take a VERY long time to complete enough work for the program to think that it needs to do a checkpoint. It may have spent a long time crunching, but it was spending all its time accessing the disk. This has happened to me in the past with more memory than you have. You can use the Windows Task Manager to monitor memory usage and disk activity. On Ubuntu you can use "vmstat 1". Computer 1758415 Memory 3956.48 MB (4 processors) Win10 |
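For anyone who wants to automate the "vmstat 1" check, the sketch below parses vmstat-style text and flags samples with swap-in (`si`) or swap-out (`so`) activity. The sample text is made up for illustration and is not from any machine in this thread; the column layout assumed is the usual procps one, and the zero threshold is an arbitrary assumption.

```python
# Hypothetical vmstat output: the first sample is healthy, the second is swapping heavily.
SAMPLE = """
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0 812340  40212 901200    0    0    12    30  410  820 97  2  1  0  0
 1  5 914200  51200   1024  60300 4200 3900  4300  4000  900 1500  8  6 10 76  0
"""


def parse_vmstat(text):
    """Turn vmstat text into a list of dicts keyed by the column names."""
    rows = [line.split() for line in text.strip().splitlines()]
    # The column-name row is the one that contains both 'si' and 'so'.
    header_index = next(
        i for i, cols in enumerate(rows) if "si" in cols and "so" in cols
    )
    header = rows[header_index]
    samples = []
    for cols in rows[header_index + 1:]:
        # Skip repeated headers and anything that is not a full numeric row.
        if len(cols) == len(header) and all(c.isdigit() for c in cols):
            samples.append(dict(zip(header, map(int, cols))))
    return samples


def is_paging(samples, threshold=0):
    """True if any sample shows swap-in or swap-out above the threshold."""
    return any(s["si"] > threshold or s["so"] > threshold for s in samples)
```

Sustained nonzero `si`/`so` together with a high `wa` (I/O wait) value is the pattern described above: the machine is busy, but mostly with the disk.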
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,051,657 RAC: 8,071 |
I just noticed this Rosetta job and it was consuming 751.12MB of memory. Rosetta@home 4.07 Rosetta DRH_curve_X_h24_l2_h28_l3_13785_1_2_loop_73_0001_one_capped_0001_fragments_relax_SAVE_ALL_OUT_655868_14_0 751.12 MB |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,622,132 RAC: 9,522 |
I just noticed this Rosetta job and it was consuming 751.12MB of memory. I have some WUs using over 1.2 GB of RAM.... |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,051,657 RAC: 8,071 |
I just noticed this Rosetta job and it was consuming 751.12MB of memory. I just looked at 2 that are running right now and they are taking 680.61MB and 949.69MB. If you don't have 1GB/Rosetta WU on your system, the multiple jobs will consume all of physical memory and start paging. Systems are typically designed to allow/support memory requirements that are TWICE the physical memory size, BUT when you start executing "OFF DISK" ... jobs will run many, many times slower. Best to pick a more well behaved project to crunch. NOTE that any one of the admins can run a script and identify machines in trouble and send messages to the owners ... OR the developers can identify the problem and fix it. Rosetta@home 4.07 Rosetta rb_06_11_344_504__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_670132_2542_0 Running 680.61 MB Rosetta@home 4.07 Rosetta rb_06_11_344_504__t000__4_C1_SAVE_ALL_OUT_IGNORE_THE_REST_670132_6336_0 Running 949.69 MB |
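The 1GB-per-WU rule of thumb can be turned into a quick budget check. This is only a sketch: the 3956.48 MB / 4 processor figures and the per-task sizes come from posts in this thread, while the 1024 MB reserved for the OS and other programs is an assumption you should adjust for your own machine.

```python
def max_concurrent_tasks(total_mb, per_task_mb, reserve_mb=1024.0, cpus=None):
    """How many tasks fit in physical memory without forcing the system to page.

    reserve_mb is memory left for the OS and other programs (assumed value).
    """
    usable_mb = max(total_mb - reserve_mb, 0.0)
    by_memory = int(usable_mb // per_task_mb)
    return by_memory if cpus is None else min(by_memory, cpus)


# Machine quoted earlier in the thread: 3956.48 MB, 4 processors.
print(max_concurrent_tasks(3956.48, 1024.0, cpus=4))   # tasks near 1 GB each
print(max_concurrent_tasks(3956.48, 680.61, cpus=4))   # tasks like the smaller rb_ job
```

Under these assumptions, a 4-core machine with about 4 GB can hold only two 1 GB tasks comfortably, so running one task per core would push it into paging.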
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
There was a suggestion a while ago to have the option to run large memory work units, as by a checkbox on the preferences page. A lot of us supported it, but Rosetta decided to tame their work units instead. But it seems that they creep up in memory usage from time to time. That is OK with me, but they need to monitor their stuff and take the appropriate action, whatever it is. |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,051,657 RAC: 8,071 |
There was a suggestion a while ago to have the option to run large memory work units, as by a checkbox on the preferences page. A lot of us supported it, but Rosetta decided to tame their work units instead. But it seems that they creep up in memory usage from time to time. That is OK with me, but they need to monitor their stuff and take the appropriate action, whatever it is. The Rosetta structure chosen is to bundle up all the code for all the models in one binary. It makes for a sparse CPU execution loop and requires more memory PAGES than individual binaries. Since they chose the bundled binary approach, it is difficult for them to control the system demands and performance. Running 11 Rosetta WU on my Fedora 27 box, you can see that they "typically" consume 400MB - 1GB. 1GB range memory requirement seems to be the rule rather than the exception. I have not gathered the data to make a guess what the culprit option or condition is. I have a couple ideas, but they all imply developers who do not completely understand what they are doing. 
"top ic" command sorted by M(emory) (clipped to show Rosetta WU)
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23557 boinc 39 19 1072080 823140 75816 R 95.7 5.0 195:26.88 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_06_11_3+
22056 boinc 39 19 1060828 809004 69432 R 98.0 4.9 266:15.05 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -relax::minimize_bond_lengths 1 -frag3 00001.+
22071 boinc 39 19 1026124 774204 69308 R 99.7 4.7 247:04.42 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -abinitio::rsd_wt_helix 0.5 -frag9 00001.200.+
22075 boinc 39 19 811068 751228 88700 R 96.7 4.6 241:42.56 ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_06_+
22077 boinc 39 19 978008 726072 69292 R 90.4 4.4 238:35.55 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -frag9 00001.200.9mers -optimization::default+
26011 boinc 39 19 786288 725100 88580 R 97.7 4.4 50:54.43 ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu -run:protocol jd2_scripting @P16917_group+
22064 boinc 39 19 976800 724668 69100 R 99.3 4.4 261:03.21 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -abinitio::rsd_wt_helix 0.5 -ex2aro 1 -relax:+
25660 boinc 39 19 894232 642400 69452 R 96.0 3.9 70:13.54 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -relax::dualspace 1 -out:file:silent_struct_t+
22067 boinc 39 19 650420 588960 88224 R 99.0 3.6 248:00.18 ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_06_+
22073 boinc 39 19 403668 342504 87380 R 91.0 2.1 242:34.70 ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu -frag3 00001.200.3mers -frag9 00001.200.9+
22054 boinc 39 19 389228 329200 87352 R 97.7 2.0 271:14.86 ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu -frag3 00001.200.3mers -frag9 00001.200.9+
22059 boinc 39 19 373852 313404 87400 R 99.3 1.9 262:33.95 ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 0+ |
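As a quick check on the "400MB - 1GB" estimate, the RES column of the listing above can simply be added up. The values below are copied from that listing (top reports RES in KiB by default); the 700 MiB cut-off used to count "large" tasks is an arbitrary assumption for illustration.

```python
# RES column (KiB) for the 12 Rosetta processes in the top listing above.
RES_KIB = [
    823140, 809004, 774204, 751228, 726072, 725100,
    724668, 642400, 588960, 342504, 329200, 313404,
]

KIB_PER_MIB = 1024
KIB_PER_GIB = 1024 * 1024


def summarize(res_kib, large_mib=700):
    """Return total, mean, and the number of tasks at or above large_mib."""
    total_kib = sum(res_kib)
    return {
        "tasks": len(res_kib),
        "total_gib": total_kib / KIB_PER_GIB,
        "mean_mib": total_kib / len(res_kib) / KIB_PER_MIB,
        "large_tasks": sum(1 for r in res_kib if r >= large_mib * KIB_PER_MIB),
    }


summary = summarize(RES_KIB)
print(summary)
```

On this snapshot the resident memory adds up to roughly 7.2 GiB across the listed processes, with more than half of them above 700 MiB each. That is one sample from one machine, so treat it as an illustration rather than a measurement of Rosetta in general.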
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Just confirmed a new version of the checkpointing problem. I had suspected something along those lines. It was a 3-day rb... unit this time. The Properties showed that it had about 30 minutes until it would finish, but it had been checkpointed 00:00 minutes ago. Usually that's supposed to mean it just finished checkpointing, but the value didn't change over several minutes. Highly suspicious. So I went ahead and shut down the machine anyway, and sure enough, it was the status % that was correct, and after I booted the machine the next time, it suddenly was 4 hours from completion--which basically guarantees the task will miss its deadline. Do I need to say again that the 3-day deadline is fundamentally unreasonable, and even less reasonable when the checkpointing code is buggy, too? Right now I suspect a lot of these rush units are really caused by what I regard as essentially bad project management and buggy programming. I would send along the details, but right now this Linux box is also unable to open the BOINC Manager. Happens pretty often, and I'm pretty sure the trick is to get it open (on both of my Linux boxen) before the Rosetta tasks have eaten up too much memory. For a long time I thought that was a BOINC-level problem, but considering some of the memory allocation problems mentioned elsewhere in this thread, I'm leaning back towards the caused-by-Rosetta hypothesis. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Managed to capture the Properties after all. Looks like it may have been a regular unit, but if so, it must have been delayed by intervening 3-day tasks: Application Rosetta Mini 3.78 Name rb_06_13_83780_125820__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_670302_390 State Running Received Thu 14 Jun 2018 08:28:54 AM JST Report deadline Fri 22 Jun 2018 08:28:53 AM JST Estimated computation size 80,000 GFLOPs CPU time 03:55:32 CPU time since checkpoint 00:00:42 Elapsed time 04:01:49 Estimated time remaining 03:54:42 Fraction done 48.720% Virtual memory size 348.68 MB Working set size 290.45 MB Directory slots/3 Process ID 1485 Progress rate 11.520% per hour Executable minirosetta_3.78_x86_64-pc-linux-gnu #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Please review what CASP13 is, and what the timeframe is between issuance of a protein target and the delivery deadline for the predicted model, before posting further rants about 3 day work units and project management. I'm sure you think that posting the same complaint several times a week somehow strengthens your case or bolsters support for your stance. It doesn't. The project sends some tasks with 3 day deadlines. If this causes problems in your operating environments, then R@h is not an appropriate BOINC project for you. CASP13 Rosetta Moderator: Mod.Sense |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
NOT a constructive reply. As if you appreciate your volunteers. Actually, what it most reminds me of is spineless chicken hawks who thank me for my service. That is NOT why I enlisted, and I do NOT care about your gratitude or pretenses of gratitude or even the opposite in this case. If you served, then you know why and we don't have to thank each other. If you didn't serve when you could have, then I mostly doubt you have any understanding of what service is or why people should do it. (I'm NOT limiting that to military service, by the way. That's another newfangled form of fake patriotism.) The kindest thing I can say is that 3-day deadlines are bad service in some form, and I don't care about your whiny excuses. Reminds me of an old military expression, which in the cleaned up version goes "Excuses are like armpits. Everyone's got 'em and they all stink." Oh yeah. Two more things. (1) Large numbers of computation errors, mostly at the beginning and it seems more often under Linux, and (2) Eight more hours of computation lost due to the checkpointing problems. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,051,657 RAC: 8,071 |
IMO (based on the problems you have described), the "checkpoint problem" you are seeing is NOT a "checkpoint problem". When I looked at the WUs with Compute Errors, they all had "Out of Memory" errors. I looked at the "nRoCM_*" results and they all had "Out of Memory" in the stderr output file. It is a problem with Rosetta needing more memory than is available on the machine. When the system runs low on PHYSICAL memory, the machine will PAGE out CODE/DATA to disk and allocate that PHYSICAL memory to the other job. When this condition happens and gets worse, the machine is BUSY, but it is not accomplishing any work. Since it is not accomplishing any work, the job will not NEED to checkpoint. No progress has been made. When a machine gets into this condition (executing off of DISK instead of MEMORY), WUs will not make progress .... WUs will not complete .... following WUs will not start .... and TIME OUT. It is pretty tough for the machine to heal by itself. This PAGING condition will greatly accelerate hardware wear and then failure of the disk drive ... SSD or HDD. ---- Maybe I can help. If you are not interested, that is fine too. Which machine is struggling the most? Let's figure out what the problem is. |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Here's another good example of the new checkpointing problem, though perhaps it's better described as potentially lost work. I noticed that the CPU time is also frozen, though the elapsed time is increasing. Based on prior experience with these, the checkpoint will never take place, and the task will never be completed no matter how long it runs. Buggy, buggy, buggy. Application Rosetta Mini 3.78 Name rb_06_06_83627_125669__t000__0_C2_SAVE_ALL_OUT_IGNORE_THE_REST_669729_480 State Running Received Fri 22 Jun 2018 08:20:14 AM JST Report deadline Sat 30 Jun 2018 08:20:13 AM JST Estimated computation size 80,000 GFLOPs CPU time 02:25:26 CPU time since checkpoint 00:00:00 Elapsed time 08:39:27 Estimated time remaining 03:31:59 Fraction done 20.198% Virtual memory size 155.29 MB Working set size 51.39 MB Directory slots/4 Process ID 2359 Progress rate 2.160% per hour Executable minirosetta_3.78_x86_64-pc-linux-gnu At the same time I notice this machine has a couple of computation error tasks. Let's see if I can catch their Properties, too... Application Rosetta 4.07 Name DRH_curve_X_h30_l3_h23_l2_16685_3_2_loop_21_0001_one_capped_0001_fragments_fold_SAVE_ALL_OUT_663453_44 State Computation error Received Mon 25 Jun 2018 06:15:24 PM JST Report deadline Tue 03 Jul 2018 06:15:24 PM JST Estimated computation size 80,000 GFLOPs CPU time --- Elapsed time --- Executable rosetta_4.07_x86_64-pc-linux-gnu Application Rosetta 4.07 Name DRH_curve_X_h19_l3_h26_l3_06738_1_loop_63_0001_one_capped_0001_fragments_fold_SAVE_ALL_OUT_666406_131 State Computation error Received Mon 25 Jun 2018 06:15:24 PM JST Report deadline Tue 03 Jul 2018 06:15:24 PM JST Estimated computation size 80,000 GFLOPs CPU time --- Elapsed time 00:00:09 Executable rosetta_4.07_x86_64-pc-linux-gnu Also several more of those appeared, all DRH tasks. Buggy, buggy, buggy. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
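One number worth pulling out of these Properties dumps is the ratio of CPU time to elapsed time. The sketch below computes it for the two running tasks posted in this thread; the `cpu_efficiency` and `looks_stalled` helpers and the 0.5 cut-off are hypothetical, not anything BOINC itself reports.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string, as shown in the Properties dialog, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_time, elapsed_time):
    """Fraction of wall-clock time the task actually spent on the CPU."""
    return to_seconds(cpu_time) / to_seconds(elapsed_time)


def looks_stalled(cpu_time, elapsed_time, threshold=0.5):
    """Flag tasks that spend most of their elapsed time waiting (threshold is assumed)."""
    return cpu_efficiency(cpu_time, elapsed_time) < threshold


# Values copied from the Properties posted in this thread.
print(cpu_efficiency("03:55:32", "04:01:49"))  # rb_06_13_... task
print(cpu_efficiency("02:25:26", "08:39:27"))  # rb_06_06_... task
```

The earlier rb_06_13 task spent about 97% of its elapsed time on the CPU, while the rb_06_06 task spent about 28%. A low ratio like that is consistent with a process waiting on disk or memory rather than computing, though on its own it does not prove which.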
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
And from a Windows 10 machine, a 4-hour computation error that is probably a checkpointing error in disguise, since it happened when the machine was booted after being shut down. Perhaps diagnostic that another task from the same sub-project managed to complete in just over 4 hours? Can't paste the Properties from Windows 10. Not even as an image. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
©2024 University of Washington
https://www.bakerlab.org