Progress going backwards?

Message boards : Number crunching : Progress going backwards?

Profile shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 88010 - Posted: 4 Jan 2018, 7:10:56 UTC

Here's a glitch I haven't seen before. I happened to be looking at another one of those tasks that get stuck for over 8 hours. This machine has two of them right now, both approaching 12 hours. I wouldn't mind that so much, except that as near as I can tell, the 12-hour tasks don't get much, if any, extra credit for the time. Suddenly the progress of one of the tasks dropped by about 10 percentage points, from around 98% to 88%. After that, it started to make rapid upward progress, in the normal jumps rather than the 0.001% jumps of a stuck task. A few minutes later, both of them cleared, so I didn't see exactly what happened.

As noted before, the credit is not that important. However, the apparent bugginess of some parts of the system (for calculating the progress and the credit) casts doubt on the more important parts of the system that are supposed to be calculating significant scientific results. Why should other scientists believe that, just because the known bugs don't matter much, there aren't other bugs of greater consequence?
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 88014 - Posted: 5 Jan 2018, 0:05:01 UTC - in response to Message 88010.  

Which work units are having issues? It may be specific to the particular job(s): maybe a large protein, or particular score filters that are failing more often than usual for that protein. Or maybe BOINC checkpointing is not properly implemented for the particular protocol.

The core molecular modeling software is developed, tested, and used by many academic institutions around the world through the Rosetta Commons. Rosetta, including its source code, is freely available to academics. I would not jump to the conclusion that the issues you mention also reflect on the science.
Profile shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 88016 - Posted: 5 Jan 2018, 11:00:31 UTC - in response to Message 88014.  
Last modified: 5 Jan 2018, 11:01:14 UTC

Well, it might be a BOINC-level problem since it affects many of the work units, even including some of the rb units that used to run in 4 hours. These days there are occasional work units under 8 hours, but rarely. When I've looked at those fast units, the credit usually seems to be appropriately reduced in accord with their shorter run times.

For a while I was trying to see whether there was some specific project name associated with the 12-hour units, but I couldn't figure out any pattern. Some of the tasks just go into a slow progress mode with a remaining time of around 10-1/2 minutes. The progress advances in very small increments, usually 0.001% at a time, each taking about 10 or 15 seconds. The remaining time just stays constant, with an occasional 1-second flick to a smaller time: usually it goes down by one second and then flips right back up.

The checkpoint problems are different, but continuing. Haven't noticed as many of them these days as I used to, but I've also stopped paying so much attention. There are definitely times when I find that some task has not been checkpointed for a long time, but usually they are within 5 minutes, except at the beginning, when it often takes longer for the first checkpoint.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
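
As a rough illustration of the "slow progress mode" described in this post, the following sketch converts the reported step size (0.001% every 10 to 15 seconds) into an hourly rate, and into the time it would take to cover a remaining 2% at that pace. The function names are hypothetical, and the 2% figure is simply the gap left at the 98% mentioned in the first post; this is arithmetic on the numbers quoted in the thread, not a description of how BOINC or Rosetta computes progress.

```python
def stuck_rate_percent_per_hour(step_percent: float, seconds_per_step: float) -> float:
    """Percent of progress gained per hour when the display advances
    by step_percent once every seconds_per_step seconds."""
    steps_per_hour = 3600.0 / seconds_per_step
    return step_percent * steps_per_hour


def hours_to_cover(remaining_percent: float, step_percent: float,
                   seconds_per_step: float) -> float:
    """Hours needed to gain remaining_percent at the stuck-task pace."""
    rate = stuck_rate_percent_per_hour(step_percent, seconds_per_step)
    return remaining_percent / rate


if __name__ == "__main__":
    # Figures quoted in the thread: 0.001% per step, 10 to 15 seconds per step.
    for seconds in (10.0, 15.0):
        rate = stuck_rate_percent_per_hour(0.001, seconds)
        hours = hours_to_cover(2.0, 0.001, seconds)
        print(f"{seconds:.0f} s/step: {rate:.2f} %/hour, "
              f"{hours:.1f} hours to cover the last 2%")
```

At the quoted pace the display gains only about 0.24% to 0.36% per hour, so the last 2% alone would take roughly 5.6 to 8.3 hours.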
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 88018 - Posted: 5 Jan 2018, 14:26:44 UTC

Do you have a checkmark for the "Leave applications in memory while suspended" computing preference? If not, and BOINC Manager decides to transition to another project, the task may lose progress when suspended. I would expect such a loss would not be displayed until the task is resumed.
Rosetta Moderator: Mod.Sense
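
The mechanism suggested in this post can be sketched as a toy model: a task that is removed from memory while suspended can only restart from its last checkpoint, so any work done since that checkpoint is lost and the reported progress falls back when the task resumes. The `Task` class, its methods, and the 88%/98% figures (borrowed from the first post) are hypothetical and for illustration only; this is not the BOINC client's or Rosetta's actual code.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task model; progress is a fraction in [0, 1]."""
    progress: float = 0.0    # progress currently held in memory
    checkpoint: float = 0.0  # progress saved at the last checkpoint

    def work(self, amount: float) -> None:
        """Advance in-memory progress, capped at 100%."""
        self.progress = min(1.0, self.progress + amount)

    def write_checkpoint(self) -> None:
        """Persist the current progress."""
        self.checkpoint = self.progress

    def suspend_and_resume(self, keep_in_memory: bool) -> None:
        """If the task is not kept in memory, it restarts from the
        last checkpoint and work done since then is lost."""
        if not keep_in_memory:
            self.progress = self.checkpoint


def demo(keep_in_memory: bool) -> float:
    """Run to 88%, checkpoint, run to about 98%, then suspend and resume."""
    task = Task()
    task.work(0.88)
    task.write_checkpoint()
    task.work(0.10)
    task.suspend_and_resume(keep_in_memory)
    return task.progress


if __name__ == "__main__":
    print(f"kept in memory:     {demo(True):.0%}")
    print(f"removed from memory: {demo(False):.0%}")
```

In this model the drop only becomes visible on resume, which matches the expectation stated above that the loss would not be displayed until the task is resumed.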
Profile shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 88065 - Posted: 14 Jan 2018, 22:13:25 UTC - in response to Message 88018.  

I'm guessing you [Mod.Sense] mean "Leave non-GPU tasks in memory while suspended" on the "Disk and memory" tab. It is checked, but the task was NOT suspended when I noticed the drop in progress. I'm also hard pressed to imagine how it could have lost progress even if it had been suspended. Is the progress somehow related to time rather than being a simple metric of work completed? The remaining time is obviously related to time, and I can see how suspension might confuse that one, but it's already an obviously flaky and nonlinear metric of whatever it's estimating.

Current annoyance is actually the checkpointing, especially on this machine. Whenever I want to shut it down, it seems like at least one of the active tasks has a large time since the last checkpoint. Right now I have a task that is almost half finished as it approaches 5 hours, but it's been more than 1-1/2 hours since the last checkpoint. (This one is a nRoCM... task, if that's worth knowing.) Can't suspend this machine because it's a cross-booter and I need the other OS sometimes.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

©2024 University of Washington
https://www.bakerlab.org