More checkpointing problems

Message boards : Number crunching : More checkpointing problems

rjs5

Joined: 22 Nov 10
Posts: 273
Credit: 23,051,657
RAC: 8,071
Message 89152 - Posted: 26 Jun 2018, 2:44:15 UTC - in response to Message 89145.  

This computer 2283771 is running Ubuntu 18.04, 4GB memory and has the GLIBC 2.27 problem.
This is a Rosetta link problem.
They know about the problem and are likely looking at the problem, but it is NOT a checkpointing problem.


If you look at the task details, you will find the STDERR output, and in it you will see the ASSERT error.
---------------------------------------------------------------------------------------------------------
rosetta_4.07_x86_64-pc-linux-gnu: loadlocale.c:129: _nl_intern_locale_data: Assertion `cnt < (sizeof (_nl_value_type_LC_TIME) / sizeof (_nl_value_type_LC_TIME[0]))' failed.


DRH_curve_X_h19_l3_h26_l3_06738_1_loop_63_0001_one_capped_0001_fragments_fold_SAVE_ALL_OUT_666406_131
Stderr output
<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -out:file:silent_struct_type binary -beta 1 -abinitio::rg_reweight 0.5 -ex2aro 1 -ignore_unrecognized_res 1 -abinitio::rsd_wt_loop 0.5 -in:file:native 00001.pdb -abinitio::increase_cycles 10 -abinitio::rsd_wt_helix 0.5 -relax::minimize_bond_angles 1 -ex1 1 -frag9 00001.200.9mers -abinitio::fastrelax 1 -frag3 00001.200.3mers -relax::minimize_bond_lengths 1 -abinitio::detect_disulfide_before_relax 1 -abinitio::use_filters false -beta_cart 1 -relax::dualspace 1 -relax::default_repeats 2 -optimization::default_max_cycles 200 -in:file:boinc_wu_zip DRH_curve_X_h19_l3_h26_l3_06738_1_loop_63_0001_one_capped_0001_fragments_data.zip -out:file:silent default.out -silent_gz 1 -mute all -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3122350
rosetta_4.07_x86_64-pc-linux-gnu: loadlocale.c:129: _nl_intern_locale_data: Assertion `cnt < (sizeof (_nl_value_type_LC_TIME) / sizeof (_nl_value_type_LC_TIME[0]))' failed.
SIGABRT: abort called
Stack trace (17 frames):




Here's another good example of the new checkpointing problem, though perhaps it's better to describe it as possible lost work. I noticed that the CPU time is also frozen, though the elapsed time is increasing. Based on prior experience with these ones, the checkpoint will never take place, and the task will never be completed no matter how long it runs. Buggy, buggy, buggy.

Application
Rosetta Mini 3.78
Name
rb_06_06_83627_125669__t000__0_C2_SAVE_ALL_OUT_IGNORE_THE_REST_669729_480
State
Running
Received
Fri 22 Jun 2018 08:20:14 AM JST
Report deadline
Sat 30 Jun 2018 08:20:13 AM JST
Estimated computation size
80,000 GFLOPs
CPU time
02:25:26
CPU time since checkpoint
00:00:00
Elapsed time
08:39:27
Estimated time remaining
03:31:59
Fraction done
20.198%
Virtual memory size
155.29 MB
Working set size
51.39 MB
Directory
slots/4
Process ID
2359
Progress rate
2.160% per hour
Executable
minirosetta_3.78_x86_64-pc-linux-gnu
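
One way to put a number on the stall in the listing above is to compare CPU time with elapsed time. This is only a minimal sketch, not anything BOINC provides: the helper names `hms_to_seconds` and `cpu_efficiency` are made up for illustration, and the two times are copied from the Properties listing.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_time: str, elapsed_time: str) -> float:
    """Fraction of wall-clock time the task actually spent on the CPU."""
    elapsed = hms_to_seconds(elapsed_time)
    if elapsed == 0:
        return 0.0  # nothing has run yet, so avoid dividing by zero
    return hms_to_seconds(cpu_time) / elapsed


# Values taken from the rb_06_06_... task listed above.
ratio = cpu_efficiency("02:25:26", "08:39:27")
print(f"CPU efficiency: {ratio:.1%}")
```

For this task the ratio comes out at roughly 28%, which matches the observation that CPU time is frozen while elapsed time keeps growing.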

At the same time I notice this machine has a couple of computation error tasks. Let's see if I can catch their Properties, too...


Application
Rosetta 4.07
Name
DRH_curve_X_h30_l3_h23_l2_16685_3_2_loop_21_0001_one_capped_0001_fragments_fold_SAVE_ALL_OUT_663453_44
State
Computation error
Received
Mon 25 Jun 2018 06:15:24 PM JST
Report deadline
Tue 03 Jul 2018 06:15:24 PM JST
Estimated computation size
80,000 GFLOPs
CPU time
---
Elapsed time
---
Executable
rosetta_4.07_x86_64-pc-linux-gnu


Application
Rosetta 4.07
Name
DRH_curve_X_h19_l3_h26_l3_06738_1_loop_63_0001_one_capped_0001_fragments_fold_SAVE_ALL_OUT_666406_131
State
Computation error
Received
Mon 25 Jun 2018 06:15:24 PM JST
Report deadline
Tue 03 Jul 2018 06:15:24 PM JST
Estimated computation size
80,000 GFLOPs
CPU time
---
Elapsed time
00:00:09
Executable
rosetta_4.07_x86_64-pc-linux-gnu

Also several more of those appeared, all DRH tasks. Buggy, buggy, buggy.
ID: 89152 · Rating: 0
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 89251 - Posted: 9 Jul 2018, 21:27:54 UTC - in response to Message 89152.  

In case it isn't clear enough, I'm trying not to care more than the project is worth. These days I have doubts it is worth too much.

Actually my reason for visiting today was not checkpointing problems, though they persist and are still annoying. On the machine that has the most constraints, I just periodically check the status, and if all of the active tasks have recently checkpointed, then I jump on the opportunity to shut down the machine. When I can't and still get forced, I'm trying to use the sleep solution.
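
The manual routine described above (check every active task, shut down only if all of them checkpointed recently) could be sketched like this. It is a hypothetical helper, not a BOINC feature: the `Task` class, the `safe_to_shut_down` name, and the 300-second threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cpu_since_checkpoint_s: int  # seconds of CPU time since the last checkpoint


def safe_to_shut_down(tasks: list[Task], max_loss_s: int = 300) -> bool:
    """True when no running task would lose more than max_loss_s seconds of work."""
    return all(t.cpu_since_checkpoint_s <= max_loss_s for t in tasks)


running = [Task("task_a", 45), Task("task_b", 210)]
print(safe_to_shut_down(running))   # both tasks checkpointed within 5 minutes

running.append(Task("task_c", 4519))
print(safe_to_shut_down(running))   # task_c would lose over an hour of work
```

The threshold is the amount of work one is willing to throw away per task; a stricter or looser value only changes `max_loss_s`.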

So back to today's problem. Frequent computation errors on DRH tasks. Perhaps Linux specific? I initially thought it was something I was doing, but now I don't think so. Just another bug of some sort.

Since this is a kind of catchall thread (though I did search for more relevant threads to use instead), I'll go ahead and wonder aloud about the "Aborted by project" tasks. There were a bunch of those a while back, then they seem to have gone away, but now they seem to be returning. Definitely a waste of bandwidth to send me the data and then abort the task from their end... Or maybe it's a race condition between volunteers?
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
ID: 89251 · Rating: 0
4LG5zSZM7uiF1nVGZVqTRrjkXA6i
Joined: 7 Mar 10
Posts: 14
Credit: 111,252,570
RAC: 0
Message 89252 - Posted: 9 Jul 2018, 21:42:21 UTC - in response to Message 89251.  

You can look at the WU and see if someone returned a result. Many times, the original WU was sent, no response by the deadline, so it gets sent back out. However, the original computer could have been crunching it, it finished it and returned it after the deadline, so that means the resent WU should be canceled since a result was received. So while you view it as wasted bandwidth, having you crunch it would be wasted computing. Which would you rather have; wasted bandwidth or 12 computing hours?
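
The sequence described above can be written down as a toy model. This is not the BOINC scheduler's code; the `Copy` class, the `on_result_returned` function, and the hour-based timestamps are hypothetical and only illustrate why a resent copy ends up "Aborted by project".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Copy:
    host: str
    deadline: float                       # hypothetical timestamps, in hours
    returned_at: Optional[float] = None   # None while the copy is outstanding
    cancelled: bool = False


def on_result_returned(copies: list[Copy], host: str, now: float) -> None:
    """Record a returned result, then cancel every copy still outstanding."""
    for c in copies:
        if c.host == host:
            c.returned_at = now
    for c in copies:
        if c.returned_at is None:
            c.cancelled = True


# Host A misses its deadline at hour 72, so the WU is resent to host B.
copies = [Copy("A", deadline=72.0), Copy("B", deadline=144.0)]
# Host A finishes late, at hour 80; B's copy is no longer needed.
on_result_returned(copies, "A", now=80.0)
print(copies[1].cancelled)
```

In this model host B pays only the download; the alternative would be B spending its full run time on a result the project already has.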
ID: 89252 · Rating: 0
rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,051,657
RAC: 8,071
Message 89254 - Posted: 10 Jul 2018, 17:10:35 UTC - in response to Message 89251.  

In case it isn't clear enough, I'm trying not to care more than the project is worth. These days I have doubts it is worth too much.

Actually my reason for visiting today was not checkpointing problems, though they persist and are still annoying. On the machine that has the most constraints, I just periodically check the status, and if all of the active tasks have recently checkpointed, then I jump on the opportunity to shut down the machine. When I can't and still get forced, I'm trying to use the sleep solution.

So back to today's problem. Frequent computation errors on DRH tasks. Perhaps Linux specific? I initially thought it was something I was doing, but now I don't think so. Just another bug of some sort.

Since this is a kind of catchall thread (though I did search for more relevant threads to use instead), I'll go ahead and wonder aloud about the "Aborted by project" tasks. There were a bunch of those a while back, then they seem to have gone away, but now they seem to be returning. Definitely a waste of bandwidth to send me the data and then abort the task from their end... Or maybe it's a race condition between volunteers?



In case it isn't clear enough, the "checkpointing problems" are not "checkpointing" problems, but problems with your machine. They will persist until you fix it. Only you can help you.

If you would give a MACHINE and WU number, you would get some help.

I suspect that your DRH problems are happening on your Ubuntu machines. If so ...

Ubuntu 18.04 machines have a newer version of GLIBC that is incompatible with the statically linked Rosetta 4.07. Every machine on earth has that problem. Any Linux distribution with the newer GLIBC will have this problem ... Ubuntu 18.04, Fedora 28, ....
There might be some LOCALE settings that can be configured to avoid this, but no one has sat down and figured them out.

If the problem WU is a "rosetta_4.07_x86_64-pc-linux-gnu" WU on your Ubuntu 18.04 machine, it will ALWAYS fail the same as it does on every other Ubuntu 18.04 machine. There is an incompatibility with the GLIBC libraries when STATICALLY linking like Rosetta does.

If you look at the STDERR file returned with the WU, you will see the error:
rosetta_4.07_x86_64-pc-linux-gnu: loadlocale.c:129: _nl_intern_locale_data: Assertion `cnt < (sizeof (_nl_value_type_LC_TIME) / sizeof (_nl_value_type_LC_TIME[0]))' failed.
SIGABRT: abort called
Like:
https://boinc.bakerlab.org/result.php?resultid=1014041100
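
To tell this failure apart from a genuine checkpointing problem, the STDERR text can be checked for the assertion shown above. This is a minimal sketch: the function name `is_glibc_locale_abort` and the choice of marker strings are assumptions, and the sample text is the message quoted in this post.

```python
# Substrings that together identify the assertion quoted above.
GLIBC_LOCALE_MARKERS = ("loadlocale.c", "_nl_intern_locale_data", "Assertion")


def is_glibc_locale_abort(stderr_text: str) -> bool:
    """True when the stderr text contains the glibc locale assertion."""
    return all(marker in stderr_text for marker in GLIBC_LOCALE_MARKERS)


sample = (
    "rosetta_4.07_x86_64-pc-linux-gnu: loadlocale.c:129: "
    "_nl_intern_locale_data: Assertion `cnt < (sizeof (_nl_value_type_LC_TIME) "
    "/ sizeof (_nl_value_type_LC_TIME[0]))' failed.\n"
    "SIGABRT: abort called\n"
)

print(is_glibc_locale_abort(sample))
print(is_glibc_locale_abort("some unrelated error text"))
```

A task whose STDERR matches is failing at startup because of the library mismatch, so no amount of run time or checkpoint tuning will change the outcome.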


I see no problems on the Windows machines other than you seem to be caching more work than your machine can complete.
ID: 89254 · Rating: 0
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 89255 - Posted: 10 Jul 2018, 18:00:12 UTC - in response to Message 89254.  

Ubuntu 18.04 machines have a newer version of GLIBC that is incompatible with the statically linked Rosetta 4.07. Every machine on earth has that problem. Any Linux distribution with the newer GLIBC will have this problem ... Ubuntu 18.04, Fedora 28, ....
There might be some LOCALE settings that can be configured to avoid this, but no one has sat down and figured them out.

Is this the problem you are referring to?
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=12242&postid=88954#88954

If so, you are too modest. It fixed it for me.
ID: 89255 · Rating: 0
4LG5zSZM7uiF1nVGZVqTRrjkXA6i
Joined: 7 Mar 10
Posts: 14
Credit: 111,252,570
RAC: 0
Message 89256 - Posted: 10 Jul 2018, 21:00:41 UTC - in response to Message 89254.  


I see no problems on the Windows machines other than you seem to be caching more work than your machine can complete.


I'd say a lot of people fall into that category. Every resend that is sent and later cancelled because the original host returned a late result falls into it, especially when the WU deadline is 7 days.

I do agree with you in that he says he has an issue with 3-day deadline WU's. It sounds like he doesn't keep his machines running constantly and shuts them down. With that said, the OP just needs to figure out what kind of buffer to run for his machines for how long they will be powered on. Given that the default is to run a WU for ~12 hours of CPU time, that is where he should start. If his machine only runs for 8 hours a day, then he should be using at most a 1 day buffer. I personally run a low buffer. If there is an outage that lasts for say 6 hours, I'll have an issue. Those are rare for the most part and if there is no work, other projects will get the computing resources. No big deal. People that have a buffer where their machine cannot complete a WU in a week need to seriously shrink their buffer.
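
The buffer arithmetic above can be checked with a few lines. This is a rough sketch under stated assumptions: the function name `days_to_finish` is made up, the 4-core machine is a hypothetical example, and the model assumes every core is fully busy whenever the machine is on.

```python
def days_to_finish(queued_tasks: int, cpu_hours_per_task: float,
                   cores: int, hours_on_per_day: float) -> float:
    """Calendar days needed to work through the whole queue."""
    total_cpu_hours = queued_tasks * cpu_hours_per_task
    cpu_hours_per_day = cores * hours_on_per_day
    return total_cpu_hours / cpu_hours_per_day


# Hypothetical 4-core machine, powered on 8 hours a day, ~12 CPU hours per WU.
print(days_to_finish(8, 12, 4, 8))    # 8 queued tasks
print(days_to_finish(24, 12, 4, 8))   # 24 queued tasks
```

With these numbers 8 queued tasks take 3 days, while 24 take 9 days, which is past a 7-day deadline. The last tasks in a buffer that large would be crunched for nothing or never started.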
ID: 89256 · Rating: 0
rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,051,657
RAC: 8,071
Message 89257 - Posted: 10 Jul 2018, 21:30:43 UTC - in response to Message 89255.  

Ubuntu 18.04 machines have a newer version of GLIBC that is incompatible with the statically linked Rosetta 4.07. Every machine on earth has that problem. Any Linux distribution with the newer GLIBC will have this problem ... Ubuntu 18.04, Fedora 28, ....
There might be some LOCALE settings that can be configured to avoid this, but no one has sat down and figured them out.

Is this the problem you are referring to?
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=12242&postid=88954#88954

If so, you are too modest. It fixed it for me.


That is the problem. I tried it out on a Virtualbox installation of 18.04 and it did not work for me. Maybe I botched something up.
thanks
ID: 89257 · Rating: 0
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 89357 - Posted: 30 Jul 2018, 22:12:23 UTC

Look, I'm just reporting the problems. It would be nice if they got fixed, but I don't really care. Not sure I ever cared regarding Rosetta, but I can say that I used to care more when I was running WCG and their inability to fix similar problems was probably the main reason I stopped running their projects. Only about a million units of work there, while I'm approaching 8 million on this project.

I continue to believe that the #1 cause of problems and lost work is the use of short-deadline tasks. I do NOT feel any urgency. Just annoyance.

From a scientific perspective, what worries me is NOT the obvious bugs or even the appearance of bugginess if I'm misunderstanding what is going on. What bothers me is that it looks like sloppy coding practices, mostly at the Rosetta end, but also at the BOINC level. In one example discussed elsewhere in this thread, it should actually be a responsibility of the BOINC client to prevent attempted execution of tasks that are incompatible with the particular machine. Remember the first computer proof of the 4-color theorem? Retracted for bugs, though they fixed them later.

Rosetta should also have economic concerns about paying for wasted bandwidth. Downloading lots of data and getting no results is not helping anyone.

I am absolutely uninterested in wasting more of my time trying to tinker with the settings of my various machines to avoid the wastage. I am somewhat annoyed when I have "invested" in electricity and the resulting contribution is lost for reasons outside of my scope. Today's example is only 8 hours and 18 minutes of an rb task that has been stuck on Uploading for several days, and which has now gone past its deadline:

Application
Rosetta Mini 3.78
Name
rb_07_18_84731_126613_ab_stage0_t000___robetta_IGNORE_THE_REST_06_18_682267_4
State
Uploading
Received
Sat 21 Jul 2018 09:55:21 AM JST
Report deadline
Sun 29 Jul 2018 09:55:21 AM JST
Estimated computation size
80,000 GFLOPs
CPU time
07:36:03
Elapsed time
08:18:41
Executable
minirosetta_3.78_x86_64-pc-linux-gnu
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
ID: 89357 · Rating: 0
4LG5zSZM7uiF1nVGZVqTRrjkXA6i
Joined: 7 Mar 10
Posts: 14
Credit: 111,252,570
RAC: 0
Message 89358 - Posted: 30 Jul 2018, 22:39:42 UTC - in response to Message 89357.  

Some of what you're reporting is on your side though. So if it isn't your responsibility to fix it, then who?

So now you're complaining about the BOINC software that Rosetta@Home has no responsibility over.

Also, the first thing that should never be brought up is how much work you've done....there is always a bigger fish. For example, you are closing in on 8 million; that is about a month's worth of work for me. I don't seem to have all of the issues that you do. Could things be better? Sure. Could they be worse? Yes.

Yes, I think you have some misunderstandings of what is going on. When you're talking about a scientific perspective, is that on the computer science side or the field the scientists actually work in? You could have a computer science PhD make things work great...but the results may not meet what the scientists are actually looking for. A computer science PhD would not know the science of what they are creating for.

You have some Windows machines, correct? There is sloppy coding in Windows too.

Every so often a WU gets stuck in the upload state; I've had three or four out of over 77 million points worth. I'm not going to worry about it.
ID: 89358 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,621,941
RAC: 9,507
Message 89360 - Posted: 31 Jul 2018, 7:28:34 UTC - in response to Message 89108.  

The Rosetta structure chosen is to bundle up all the code for all the models in one binary. It makes for a sparse CPU execution loop and requires more memory PAGES than individual binaries. Since they chose the bundled binary approach, it is difficult for them to control the system demands and performance.


I'm not sure, but why not use app_plan classes??
ID: 89360 · Rating: 0
rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,051,657
RAC: 8,071
Message 89366 - Posted: 1 Aug 2018, 16:16:55 UTC - in response to Message 89360.  

The Rosetta structure chosen is to bundle up all the code for all the models in one binary. It makes for a sparse CPU execution loop and requires more memory PAGES than individual binaries. Since they chose the bundled binary approach, it is difficult for them to control the system demands and performance.


I'm not sure, but why not use app_plan classes??


I am not very familiar with app_plan classes, but anything that will help the compiler and linker clump the used code and data together is a win.
I recommended that the Rosetta developers add a dummy "4th dimension" to their 3-dimensional coordinate math so the compiler could take advantage of PACKED vector math instead of SCALAR. I think that would be done with a new class.
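
The padding idea can be illustrated at the data-layout level. Python itself will not emit packed vector instructions, so this sketch only shows the two properties that make the trick work: a zero fourth component leaves dot products unchanged, and four single-precision floats fill exactly 16 bytes, the width of one 128-bit SSE register. The helper names `pad_xyz` and `dot4` are hypothetical, and single precision is an assumption about the coordinate type.

```python
import struct


def pad_xyz(point):
    """Append a dummy fourth component so each point has four lanes."""
    x, y, z = point
    return (x, y, z, 0.0)


def dot4(a, b):
    """Dot product over all four lanes; the zero lane contributes nothing."""
    return sum(p * q for p, q in zip(a, b))


a = pad_xyz((1.0, 2.0, 3.0))
b = pad_xyz((4.0, 5.0, 6.0))
print(dot4(a, b))  # same value as the 3-component dot product

# Bytes per point for single-precision floats: 3 lanes versus 4 lanes.
print(struct.calcsize("<3f"), struct.calcsize("<4f"))
```

In a compiled language the 16-byte layout lets each point be loaded with one aligned vector load instead of three scalar loads.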
ID: 89366 · Rating: 0
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 89450 - Posted: 27 Aug 2018, 19:12:23 UTC

Hmm... Seems at least as relevant as the other "active" thread about fewer hosts. The checkpointing problems are continuing, though they do seem less severe these days. Recent ones have mostly involved the bad ol' rb tasks...

However today's proximate problem appears to be a lack of fresh tasks. Not yet critical, but the unreliable supply of work is why I have to keep larger buffers on my machines which then results in throwing away deadline-constrained tasks on slower machines which means that some of the project's bandwidth is being wasted... That used to be a concern, at least at the university level.

Anyway, the server status appears to be nominal. I've never been fully clear on the difference between the "Tasks ready to send" at the upper right and the "Unsent" tasks farther down the page, under the heading of "Tasks by application". The top number is 18,082, which seems to indicate that there is plenty of work to send and it's just not getting sent to my computer. In contrast, the lower numbers could mean that there is almost no work to send and I'm just not lucky enough to get any of it. Under Rosetta it only shows 22 in the Unsent column and Rosetta Mini has 0. Not even certain of this, but pretty sure that my machines are not eligible for the third application category, "Rosetta for Android", even though there are 9991 unsent tasks there.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
ID: 89450 · Rating: 0
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 89461 - Posted: 31 Aug 2018, 10:22:55 UTC

Tasks stopped flowing again after a period of flow.

Checkpointing problems roughly unchanged. Most of the problematic ones I've noticed are still rb... tasks.

I looked at a couple of other threads first in hopes of finding some explanation of the problems, but if there was any explanation of the fix, it would seem not so much.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
ID: 89461 · Rating: 0
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 89463 - Posted: 31 Aug 2018, 12:28:09 UTC - in response to Message 89461.  

I looked at a couple of other threads first in hopes of finding some explanation of the problems, but if there was any explanation of the fix, it would seem not so much.

I would like an explanation also. In particular, is it an operational problem, or are all the researchers still away on summer break?

We don't get much feedback for all our efforts. If we were a computer center, I think they would tell us when to turn off the machines.
ID: 89463 · Rating: 0
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 89483 - Posted: 4 Sep 2018, 2:22:39 UTC

Beware the wrath of PF units?

Sort of joking, but pretty sure that all time invested in the current sick puppy of the PF stripe is going to be wasted. Haven't seen too many of this kind of problem recently, but it has accumulated over 5 hours of run time without a checkpoint. It seems to be making progress, but more slowly than the normal PF tasks.

This is not a heavy use computer, so I probably can't run it long enough to find out for sure, but I'm pretty sure this story ends with failure at some point, and presumably no credit. Insult on the injury, or vice versa? In the heavy usage scenario it would just use up a lot of time until it runs past its deadline and dies for that reason, but in the low usage reality of this computer, it will almost surely get nuked after a reboot has zeroed it. (Obviously there is no reason to repeat the same mistake again and attempt to recompute work that will never earn credit.)

As I've said before, if the tasks are buggy in any visible ways, then that casts doubt on ALL the work of the project. The less visible bugs are the ones to worry about most, but the visible bugs are sufficient to prove the existence of bugs in the project code.

I'd paste the details (Properties) here, but no easy way to do so under Windows 10. I think that's mostly a BOINC-level problem...
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
ID: 89483 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 89492 - Posted: 6 Sep 2018, 21:00:58 UTC - in response to Message 89483.  

Beware the wrath of PF units?

Sort of joking, but pretty sure that all time invested in the current sick puppy of the PF stripe is going to be wasted. Haven't seen too many of this kind of problem recently, but it has accumulated over 5 hours of run time without a checkpoint. It seems to be making progress, but more slowly than the normal PF tasks.


What are you calling "normal PF tasks", as compared to the "sick puppy... PF"? There must be more to the names you are referring to.
Rosetta Moderator: Mod.Sense
ID: 89492 · Rating: 0
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 89493 - Posted: 7 Sep 2018, 1:59:59 UTC - in response to Message 89492.  

A normal PF unit is one that checkpoints on a reasonable schedule, while a sick puppy is one that can't checkpoint. I also regard tasks that run significantly longer than 8 hours as sick puppies, though this is less sick than the units that can't checkpoint.

Actually, the reason I dropped by today was to ask if there is some difference, some way to predict, which PF units are okay and which are bad. My newest theory is that I should let a possible sick-puppy task run for an hour, and if it hasn't checkpointed, then I should nuke it. That's for the machine that normally runs for short time periods and the rule only applies when there are only rb and PF units coming. Maybe I should reduce the time to 30 minutes?
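
The one-hour rule proposed above is easy to state as code. This is a sketch of the heuristic only, not a BOINC feature: the function names are made up, and the inputs are the "CPU time since checkpoint" strings shown in a task's Properties.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def should_abort(cpu_since_checkpoint: str, grace: str = "01:00:00") -> bool:
    """Flag a task that has gone at least `grace` without a checkpoint."""
    return hms_to_seconds(cpu_since_checkpoint) >= hms_to_seconds(grace)


print(should_abort("00:12:40"))                    # under the one-hour grace period
print(should_abort("01:15:19"))                    # over it
print(should_abort("00:45:00", grace="00:30:00"))  # stricter 30-minute variant
```

A shorter grace period wastes less time on bad tasks but risks killing healthy ones that are still in their startup phase, so the threshold is a trade-off rather than a fixed answer.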

If there are a mix of units coming, then the optimum algorithm appears to be to nuke the rb and PF units before they waste any run time at all, and just try to make sure the queue is full of units that are unlikely to be sick puppies. I already have a kill-on-sight policy for the short-deadline tasks.

Remember the objective: Avoid wasting computing time on tasks that earn no credit. From my side that seems to be the only metric I can apply. However there are still plenty of times when work appears to be wasted. However I also think computational efficiency should be one of the objectives of the rosetta project.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
ID: 89493 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,621,941
RAC: 9,507
Message 89494 - Posted: 7 Sep 2018, 6:50:02 UTC - in response to Message 89493.  

Remember the objective: Avoid wasting computing time on tasks that earn no credit. From my side that seems to be the only metric I can apply. However there are still plenty of times when work appears to be wasted. However I also think computational efficiency should be one of the objectives of the rosetta project.


I cannot understand why they cannot introduce the possibility, in the user's profile, to select which kind of simulation to run (with or without checkpoint, rb priority, etc.) in addition to the simple choice of duration.
ID: 89494 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2124
Credit: 41,226,850
RAC: 11,023
Message 89506 - Posted: 9 Sep 2018, 21:26:10 UTC - in response to Message 89492.  

Beware the wrath of PF units?

Sort of joking, but pretty sure that all time invested in the current sick puppy of the PF stripe is going to be wasted. Haven't seen too many of this kind of problem recently, but it has accumulated over 5 hours of run time without a checkpoint. It seems to be making progress, but more slowly than the normal PF tasks.

What are you calling "normal PF tasks", as compared to the "sick puppy... PF"? There must be more to the names you are referring to.

Sorry I've not been around too much recently, but I think I have one of these "sick puppy PF" tasks too


Application Rosetta 4.07
Name PF13731.5_jmps_aivan_SAVE_ALL_OUT_03_09_686650_5044
State Running
Received 08/09/2018 10:16:37
Report deadline 16/09/2018 10:16:36
Estimated computation size 80,000 GFLOPs
CPU time 08:38:41
CPU time since checkpoint 01:15:19
Elapsed time 14:30:57
Estimated time remaining 00:16:47
Fraction done 98.108%
Virtual memory size 417.46 MB
Working set size 419.03 MB
Directory slots/6
Process ID 7900
Progress rate 6.840% per hour
Executable rosetta_4.07_windows_x86_64.exe

It's not consuming very much memory (I have a separate 1.2 GB task as well but it's running fine).
What drew my attention to it is that my machine isn't running at 100%. Viewing the task manager on Windows 7, each one of my 8 tasks shows 13% of total CPU time being used except this one, showing just 2 or 3%. Instead of 100% CPU time being consumed by Rosetta it's 90-93%, which is unusual for me.
I'm running 6 other PF tasks and they're all running fine with no signs of slowdown except this one.
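
The task manager figures above are roughly consistent with one slow task out of eight, as a quick calculation shows. The function name is hypothetical and the model assumes one task per logical core, with each healthy task using its full 1/8 share (12.5%, displayed as 13%).

```python
def expected_total_cpu(cores: int, healthy_tasks: int, slow_task_share: float) -> float:
    """Total CPU percentage when all but one task use a full core."""
    per_core = 100.0 / cores
    return healthy_tasks * per_core + slow_task_share


low = expected_total_cpu(8, 7, 2.0)    # slow task at 2%
high = expected_total_cpu(8, 7, 3.0)   # slow task at 3%
print(low, high)
```

Seven healthy tasks plus one at 2 to 3% gives a total of about 90%, close to the 90-93% observed.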

It'll finish soon (hopefully) and will be this one I think

PF13731.5_jmps_aivan_SAVE_ALL_OUT_03_09_686650_5044_0
The "good" PF tasks are in a different number range if that makes any difference
PF06980.10
PF06980.10
PF06650.11
PF04620.11
PF09362.9
PF10124.8

Don't know if any of that helps. It's not the first I've seen, but it does seem to be quite rare. I let them run to completion anyway.
ID: 89506 · Rating: 0
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 89507 - Posted: 9 Sep 2018, 21:52:02 UTC - in response to Message 89506.  

I've actually started looking at the stats. Easier just now since there are nothing but PF units. I have two primary machines that are running twelve tasks between them, and usually 25% to 33% are in the sick puppy category. I have one on this machine that has over 8 hours of computation without a checkpoint. Just my feeling, but I think it will finish around 12 hours, but I doubt it will get the extra 50% of work points that it should get for the extra time...

There's a second task here that's just about to hit two hours without a checkpoint. I'm pretty sure that qualifies as another sick puppy. On the other machine... Two of the four appear to be sick puppies. Grand total is 4/12 sick puppies for the 33% reading, which is typical.

For my machines that run for less than 8 hours at a time, there is no reason to attempt running a sick puppy, but the question is "How soon can I be sure it's a sick puppy and abort it?" There's a startup period when normal tasks aren't checkpointed, but it seems to be variable. There may also be cases of sick puppies that only checkpoint at random intervals, sometimes longer or shorter.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
ID: 89507 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org