Message boards : Number crunching : More checkpointing problems
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 22,754,653 RAC: 8,725 |
This computer 2283771 is running Ubuntu 18.04, 4GB memory and has the GLIBC 2.27 problem. This is a Rosetta link problem. They know about the problem and are likely looking at the problem, but it is NOT a checkpointing problem. If you look at the TASK DETAILS file, you will find the STDERR message, ... you will see the ASSERT error. --------------------------------------------------------------------------------------------------------- rosetta_4.07_x86_64-pc-linux-gnu: loadlocale.c:129: _nl_intern_locale_data: Assertion `cnt < (sizeof (_nl_value_type_LC_TIME) / sizeof (_nl_value_type_LC_TIME[0]))' failed. DRH_curve_X_h19_l3_h26_l3_06738_1_loop_63_0001_one_capped_0001_fragments_fold_SAVE_ALL_OUT_666406_131 Stderr output <core_client_version>7.9.3</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63)</message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -out:file:silent_struct_type binary -beta 1 -abinitio::rg_reweight 0.5 -ex2aro 1 -ignore_unrecognized_res 1 -abinitio::rsd_wt_loop 0.5 -in:file:native 00001.pdb -abinitio::increase_cycles 10 -abinitio::rsd_wt_helix 0.5 -relax::minimize_bond_angles 1 -ex1 1 -frag9 00001.200.9mers -abinitio::fastrelax 1 -frag3 00001.200.3mers -relax::minimize_bond_lengths 1 -abinitio::detect_disulfide_before_relax 1 -abinitio::use_filters false -beta_cart 1 -relax::dualspace 1 -relax::default_repeats 2 -optimization::default_max_cycles 200 -in:file:boinc_wu_zip DRH_curve_X_h19_l3_h26_l3_06738_1_loop_63_0001_one_capped_0001_fragments_data.zip -out:file:silent default.out -silent_gz 1 -mute all -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3122350 rosetta_4.07_x86_64-pc-linux-gnu: loadlocale.c:129: _nl_intern_locale_data: Assertion `cnt < (sizeof (_nl_value_type_LC_TIME) / sizeof 
(_nl_value_type_LC_TIME[0]))' failed. SIGABRT: abort called Stack trace (17 frames): Here's another good example of the new checkpointing problem, though perhaps it's better described as possible lost work. I noticed that the CPU time is also frozen, though the elapsed time is increasing. Based on prior experience with these, the checkpoint will never take place, and the task will never complete no matter how long it runs. Buggy, buggy, buggy. |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
In case it isn't clear enough, I'm trying not to care more than the project is worth. These days I have doubts it is worth too much. Actually my reason for visiting today was not checkpointing problems, though they persist and are still annoying. On the machine that has the most constraints, I just periodically check the status, and if all of the active tasks have recently checkpointed, then I jump on the opportunity to shut down the machine. When I can't, and still get forced to shut down, I'm trying to use the sleep solution. So back to today's problem: frequent computation errors on DRH tasks. Perhaps Linux specific? I initially thought it was something I was doing, but now I don't think so. Just another bug of some sort. Since this is a kind of catchall thread (though I did search for more relevant threads to use instead), I'll go ahead and wonder aloud about the "Aborted by project" tasks. There were a bunch of those a while back, then they seemed to have gone away, but now they seem to be returning. Definitely a waste of bandwidth to send me the data and then abort the task from their end... Or maybe it's a race condition between volunteers? #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
4LG5zSZM7uiF1nVGZVqTRrjkXA6i Send message Joined: 7 Mar 10 Posts: 14 Credit: 111,252,570 RAC: 0 |
You can look at the WU and see if someone returned a result. Many times the original WU was sent out, there was no response by the deadline, so it gets sent back out. However, the original computer could still have been crunching it; if it finished and returned the result after the deadline, the resent WU should be canceled since a result was already received. So while you view it as wasted bandwidth, having you crunch it would be wasted computing. Which would you rather have: wasted bandwidth, or 12 wasted computing hours? |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 22,754,653 RAC: 8,725 |
In case it isn't clear enough, I'm trying not to care more than the project is worth. These days I have doubts it is worth too much. In case it isn't clear enough, the "checkpointing problems" are not "checkpointing" problems, but problems with your machine. They will persist until you fix it. Only you can help you. If you would give a MACHINE and WU number, you would get some help. I suspect that your DRH problems are happening on your Ubuntu machines. If so ... Ubuntu 18.04 machines have a newer version of GLIBC that is incompatible with the statically linked Rosetta 4.07. Every machine on earth has that problem. Any Linux distribution with the newer GLIBC will have this problem ... Ubuntu 18.04, Fedora 28, .... There might be some LOCALE settings that can be configured to avoid this, but no one has set down and figured them out. If the problem WU is a "rosetta_4.07_x86_64-pc-linux-gnu" WU on your Ubuntu 18.04 machine, it will ALWAYS fail the same as it does on every other Ubuntu 18.04 machine. There is an incompatibility with the GLIBC libraries when STATICALLY linking like Rosetta does. If you look at the STDERR file returned with the WU, you will see the error: rosetta_4.07_x86_64-pc-linux-gnu: loadlocale.c:129: _nl_intern_locale_data: Assertion `cnt < (sizeof (_nl_value_type_LC_TIME) / sizeof (_nl_value_type_LC_TIME[0]))' failed. SIGABRT: abort called Like: https://boinc.bakerlab.org/result.php?resultid=1014041100 I see no problems on the Windows machines other than you seem to be caching more work that your machine can complete. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Ubuntu 18.04 machines have a newer version of GLIBC that is incompatible with the statically linked Rosetta 4.07. Every machine on earth has that problem. Any Linux distribution with the newer GLIBC will have this problem ... Ubuntu 18.04, Fedora 28, .... Is this the problem you are referring to? https://boinc.bakerlab.org/rosetta/forum_thread.php?id=12242&postid=88954#88954 If so, you are too modest. It fixed it for me. |
4LG5zSZM7uiF1nVGZVqTRrjkXA6i Send message Joined: 7 Mar 10 Posts: 14 Credit: 111,252,570 RAC: 0 |
I'd say a lot of people fall into that category; every resend that is sent and then later cancelled because the original host returned a late result falls into this category, especially when the WU deadline is 7 days. I do agree with you in that he says he has an issue with 3-day-deadline WUs. It sounds like he doesn't keep his machines running constantly and shuts them down. With that said, the OP just needs to figure out what kind of buffer to run for his machines given how long they will be powered on. Given that the default is to run a WU for ~12 hours of CPU time, that is where he should start. If his machine only runs for 8 hours a day, then he should be using at most a 1-day buffer. I personally run a low buffer. If there is an outage that lasts for, say, 6 hours, I'll have an issue. Those are rare for the most part, and if there is no work, other projects will get the computing resources. No big deal. People that have a buffer where their machine cannot complete a WU in a week need to seriously shrink their buffer. |
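The buffer arithmetic in the post above can be sketched as a quick back-of-the-envelope calculation; all the numbers here are illustrative assumptions, not measured values.

```shell
#!/bin/sh
# Buffer sizing sketch: a host powered on only part of the day completes
# fewer ~12-hour WUs, so its queue must stay small enough to beat the
# report deadline. All inputs below are assumed example values.

cores=4          # Rosetta tasks running concurrently
hours_on=8       # hours per day the machine is actually powered on
wu_hours=12      # default target CPU time per WU
deadline_days=7  # typical report deadline

# CPU-hours available per day divided by the cost of one WU (integer math)
tasks_per_day=$(( cores * hours_on / wu_hours ))
max_queue=$(( tasks_per_day * deadline_days ))

echo "completable per day: $tasks_per_day"
echo "largest queue that can meet the deadline: $max_queue"
```

With these numbers the host can finish about 2 tasks a day, so a queue of more than roughly 14 tasks guarantees missed deadlines.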
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 22,754,653 RAC: 8,725 |
Ubuntu 18.04 machines have a newer version of GLIBC that is incompatible with the statically linked Rosetta 4.07. Every machine on earth has that problem. Any Linux distribution with the newer GLIBC will have this problem ... Ubuntu 18.04, Fedora 28, .... That is the problem. I tried it out on a Virtualbox installation of 18.04 and it did not work for me. Maybe I botched something up. Thanks. |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Look, I'm just reporting the problems. It would be nice if they got fixed, but I don't really care. Not sure I ever cared regarding Rosetta, but I can say that I used to care more when I was running WCG and their inability to fix similar problems was probably the main reason I stopped running their projects. Only about a million units of work there, while I'm approaching 8 million on this project. I continue to believe that the #1 cause of problems and lost work is the use of short-deadline tasks. I do NOT feel any urgency. Just annoyance. From a scientific perspective, what worries me is NOT the obvious bugs or even the appearance of bugginess if I'm misunderstanding what is going on. What bothers me is that it looks like sloppy coding practices, mostly at the Rosetta end, but also at the BOINC level. In one example discussed elsewhere in this thread, it should actually be a responsibility of the BOINC client to prevent attempted execution of tasks that are incompatible with the particular machine. Remember the first computer proof of the 4-color theorem? Retracted for bugs, though they fixed them later. Rosetta should also have economic concerns about paying for wasted bandwidth. Downloading lots of data and getting no results is not helping anyone. I am absolutely uninterested in wasting more of my time trying to tinker with the settings of my various machines to avoid the wastage. I am somewhat annoyed when I have "invested" in electricity and the resulting contribution is lost for reasons outside of my scope. 
Today's example is only 8 hours and 18 minutes of an rb task that has been stuck on Uploading for several days, and which has now gone past its deadline:
Application: Rosetta Mini 3.78
Name: rb_07_18_84731_126613_ab_stage0_t000___robetta_IGNORE_THE_REST_06_18_682267_4
State: Uploading
Received: Sat 21 Jul 2018 09:55:21 AM JST
Report deadline: Sun 29 Jul 2018 09:55:21 AM JST
Estimated computation size: 80,000 GFLOPs
CPU time: 07:36:03
Elapsed time: 08:18:41
Executable: minirosetta_3.78_x86_64-pc-linux-gnu
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
4LG5zSZM7uiF1nVGZVqTRrjkXA6i Send message Joined: 7 Mar 10 Posts: 14 Credit: 111,252,570 RAC: 0 |
Some of what you're reporting is on your side though. So if it isn't your responsibility to fix it, then whose? And now you're complaining about the BOINC software, which Rosetta@Home has no responsibility over. Also, the first thing that should never be brought up is how much work you've done... there is always a bigger fish. For example, you are closing in on 8 million; that is about a month's worth of work for me, and I don't seem to have all of the issues that you do. Could things be better? Sure. Could they be worse? Yes. Yes, I think you have some misunderstandings of what is going on. When you're talking about a scientific perspective, is that on the computer science side, or the field the scientists actually work in? You could have a computer science PhD make things work great... but the results may not meet what the scientists are actually looking for; a computer science PhD would not know the science behind what they are coding. You have some Windows machines, correct? There is sloppy coding in Windows too. Every so often a WU gets stuck in the upload state; I've had three or four out of over 77 million points' worth. I'm not going to worry about it. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1985 Credit: 9,362,147 RAC: 7,841 |
The Rosetta structure chosen is to bundle up all the code for all the models in one binary. It makes for a sparse CPU execution loop and requires more memory PAGES than individual binaries. Since they chose the bundled binary approach, it is difficult for them to control the system demands and performance. I'm not sure, but why not use app_plan classes? |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 22,754,653 RAC: 8,725 |
The Rosetta structure chosen is to bundle up all the code for all the models in one binary. It makes for a sparse CPU execution loop and requires more memory PAGES than individual binaries. Since they chose the bundled binary approach, it is difficult for them to control the system demands and performance. I am not very familiar with app_plan classes, but anything that will help the compiler and linker clump the used code and data together is a win. I recommended that the Rosetta developers add a dummy "4th dimension" to their 3-dimensional coordinate math so the compiler could take advantage of PACKED vector math instead of SCALAR. I think that would be done with a new class. |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Hmm... Seems at least as relevant as the other "active" thread about fewer hosts. The checkpointing problems are continuing, though they do seem less severe these days. Recent ones have mostly involved the bad ol' rb tasks... However, today's proximate problem appears to be a lack of fresh tasks. Not yet critical, but the unreliable supply of work is why I have to keep larger buffers on my machines, which then results in throwing away deadline-constrained tasks on slower machines, which means that some of the project's bandwidth is being wasted... That used to be a concern, at least at the university level. Anyway, the server status appears to be nominal. I've never been fully clear on the difference between the "Tasks ready to send" at the upper right and the "Unsent" tasks farther down the page, under the heading of "Tasks by application". The top number is 18,082, which seems to indicate that there is plenty of work to send and it's just not getting sent to my computer. In contrast, the lower numbers could mean that there is almost no work to send and I'm just not lucky enough to get any of it. Under Rosetta it only shows 22 in the Unsent column, and Rosetta Mini has 0. Not even certain of this, but pretty sure that my machines are not eligible for the third application category, "Rosetta for Android", even though there are 9,991 unsent tasks there. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Tasks stopped flowing again after a period of flow. Checkpointing problems roughly unchanged. Most of the problematic ones I've noticed are still rb... tasks. I looked at a couple of other threads first in hopes of finding some explanation of the problems, but if there was any explanation or fix, I couldn't find it. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I looked at a couple of other threads first in hopes of finding some explanation of the problems, but if there was any explanation or fix, I couldn't find it. I would like an explanation also. In particular, is it an operational problem, or are all the researchers still away on summer break? We don't get much feedback for all our efforts. If we were a computer center, I think they would tell us when to turn off the machines. |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Beware the wrath of PF units? Sort of joking, but pretty sure that all time invested in the current sick puppy of the PF stripe is going to be wasted. Haven't seen too many of this kind of problem recently, but it has accumulated over 5 hours of run time without a checkpoint. It seems to be making progress, but more slowly than the normal PF tasks. This is not a heavy use computer, so I probably can't run it long enough to find out for sure, but I'm pretty sure this story ends with failure at some point, and presumably no credit. Insult on the injury, or vice versa? In the heavy usage scenario it would just use up a lot of time until it runs past its deadline and dies for that reason, but in the low usage reality of this computer, it will almost surely get nuked after a reboot has zeroed it. (Obviously there is no reason to repeat the same mistake again and attempt to recompute work that will never earn credit.) As I've said before, if the tasks are buggy in any visible ways, then that casts doubt on ALL the work of the project. The less visible bugs are the ones to worry about most, but the visible bugs are sufficient to prove the existence of bugs in the project code. I'd paste the details (Properties) here, but no easy way to do so under Windows 10. I think that's mostly a BOINC-level problem... #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Beware the wrath of PF units? What are you calling "normal PF tasks", as compared to the "sick puppy... PF"? There must be more to the names you are referring to. Rosetta Moderator: Mod.Sense |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
A normal PF unit is one that checkpoints on a reasonable schedule, while a sick puppy is one that can't checkpoint. I also regard tasks that run significantly longer than 8 hours as sick puppies, though that is less sick than the units that can't checkpoint. Actually, the reason I dropped by today was to ask if there is some difference, some way to predict, which PF units are okay and which are bad. My newest theory is that I should let a possible sick-puppy task run for an hour, and if it hasn't checkpointed, then I should nuke it. That's for the machine that normally runs for short time periods, and the rule only applies when there are only rb and PF units coming. Maybe I should reduce the time to 30 minutes? If there is a mix of units coming, then the optimum algorithm appears to be to nuke the rb and PF units before they waste any run time at all, and just try to make sure the queue is full of units that are unlikely to be sick puppies. I already have a kill-on-sight policy for the short-deadline tasks. Remember the objective: avoid wasting computing time on tasks that earn no credit. From my side that seems to be the only metric I can apply, yet there are still plenty of times when work appears to be wasted. I also think computational efficiency should be one of the objectives of the Rosetta project. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
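The one-hour rule described above could in principle be automated against the BOINC command-line client. This is a sketch under stated assumptions: the `boinccmd --get_tasks` field labels (`checkpoint CPU time`, `current CPU time`) are recalled from the client's output and should be verified against your own installation before trusting the parse; aborting would then be done separately with `boinccmd --task URL NAME abort`.

```shell
#!/bin/sh
# Sketch of the "nuke it after an hour with no checkpoint" heuristic.
# Reads boinccmd --get_tasks style output on stdin and prints the name of
# any task whose CPU time exceeds LIMIT seconds with no checkpoint yet.
LIMIT=3600

list_sick_puppies() {
    awk -v limit="$LIMIT" '
        function check() {
            # flag tasks past the limit that have never checkpointed
            if (cpu >= 0 && ckpt >= 0 && cpu > limit && ckpt == 0) {
                print name
                cpu = -1   # report each task at most once
            }
        }
        /^ *name:/                { name = $2; cpu = -1; ckpt = -1 }
        /^ *current CPU time:/    { cpu = $4 + 0; check() }
        /^ *checkpoint CPU time:/ { ckpt = $4 + 0; check() }
    '
}

# Intended use (field names unverified; check your client's output first):
# boinccmd --get_tasks | list_sick_puppies
```

A task could also legitimately show a zero checkpoint time during its startup period, so any real use would want a grace window rather than killing on the first reading.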
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1985 Credit: 9,362,147 RAC: 7,841 |
Remember the objective: Avoid wasting computing time on tasks that earn no credit. From my side that seems to be the only metric I can apply. However there are still plenty of times when work appears to be wasted. However I also think computational efficiency should be one of the objectives of the rosetta project. I cannot understand why they don't add an option to the user's profile to select which kind of simulation to run (with or without checkpoints, rb priority, etc.) in addition to the simple choice of duration. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2098 Credit: 40,822,968 RAC: 11,520 |
Beware the wrath of PF units? Sorry I've not been around too much recently, but I think I have one of these "sick puppy PF" tasks too
It's not consuming very much memory (I have a separate 1.2GB task as well, but it's running fine). What drew my attention to it is that my machine isn't running at 100%. Viewing the task manager on Windows 7, each one of my 8 tasks shows 13% of total CPU time being used except this one, showing just 2 or 3%. Instead of 100% CPU time being consumed by Rosetta it's 90-93%, which is unusual for me. I'm running 6 other PF tasks and they're all running fine with no signs of slowdown except this one. It'll finish soon (hopefully) and will be this one I think:
PF13731.5_jmps_aivan_SAVE_ALL_OUT_03_09_686650_5044_0
The "good" PF tasks are in a different number range, if that makes any difference:
PF06980.10
PF06980.10
PF06650.11
PF04620.11
PF09362.9
PF10124.8
Don't know if any of that helps. It's not the first I've seen, but it does seem to be quite rare. I let them run to completion anyway. |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
I've actually started looking at the stats. Easier just now since there is nothing but PF units. I have two primary machines that are running twelve tasks between them, and usually 25% to 33% are in the sick puppy category. I have one on this machine that has over 8 hours of computation without a checkpoint. Just my feeling, but I think it will finish around the 12-hour mark, though I doubt it will get the extra 50% of work points that it should get for the extra time... There's a second task here that's just about to hit two hours without a checkpoint. I'm pretty sure that qualifies as another sick puppy. On the other machine... Two of the four appear to be sick puppies. Grand total is 4/12 sick puppies for the 33% reading, which is typical. For my machines that run for less than 8 hours at a time, there is no reason to attempt running a sick puppy, but the question is "How soon can I be sure it's a sick puppy and abort it?" There's a startup period when normal tasks aren't checkpointed, but it seems to be variable. There may also be cases of sick puppies that only checkpoint at random intervals, sometimes longer or shorter. #1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech) |
©2024 University of Washington
https://www.bakerlab.org