More checkpoint problems

Message boards : Number crunching : More checkpoint problems

shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 77503 - Posted: 24 Sep 2014, 8:09:58 UTC
Last modified: 24 Sep 2014, 8:11:44 UTC

For example, a current tusc work unit has been running for an hour without committing a single checkpoint. On some of my machines I'll just put them to sleep when I notice these things, but that's not an option on this machine. (Security rules.) I've noticed that a lot of these work units have weird annotations. In this example, the current workunit name includes the words 'tusc closed IGNORE THE REST', which suggests I should be able to nuke it without a scientific loss? Or not? It certainly looks like my machine loses an hour of work if I nuke it.

P.S. In case it isn't obvious, I'd prefer the projects run without problems. These things most often call themselves to my attention when I see work units that are hung for several days, apparently restarting from zero each time the machine is booted.
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 77506 - Posted: 24 Sep 2014, 17:07:35 UTC - in response to Message 77503.  

the current workunit name includes the words 'tusc closed IGNORE THE REST', which suggests I should be able to nuke it without a scientific loss? Or not?


Rosetta task names are normally incomprehensible to anyone not intimately familiar with protein science and the Rosetta system. However, in general terms the names usually contain details of the protein being worked on and the investigation method used.

I am not sure what exactly "Ignore the rest" means but I suspect it is something like "save all results showing a particular characteristic and ignore the rest". You will also sometimes see instead "save all out", which I suspect is a less specific tool that records all data discovered in the task.

If you post a link to the task page or give the full task name there may be other elements that we can have a go at translating (assuming that one of the scientists doesn't jump in with the real answer).

These things most often call themselves to my attention when I see work units that are hung for several days, apparently restarting from zero each time the machine is booted.


When you see one hanging around for a few days or constantly losing progress, please report it in the Minirosetta 3.52 thread (or the equivalent thread for a later version of Minirosetta). Giving the task number or task name will also help the scientists trace the problem and fix it in the next version update.
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 77572 - Posted: 12 Oct 2014, 3:16:53 UTC - in response to Message 77506.  

Okay, just the ACK for now, but I'll try to keep my eyes open. I have discovered that the Properties button will tell you if there is a major discrepancy in the save time, but I suspect the deeper problem is that there is simply no progress being made...




©2024 University of Washington
https://www.bakerlab.org