Rosetta WU's restart

Questions and Answers : Windows : Rosetta WU's restart

Kim Schreiber

Joined: 29 Mar 09
Posts: 2
Credit: 1,675,649
RAC: 0
Message 63817 - Posted: 25 Oct 2009, 11:54:52 UTC

Can anybody tell me why my Rosetta WUs start all over again after my computer has been turned off? I have to finish a WU in one session if I don't want it to start from 0. Other projects' WUs continue from the point they had reached.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 63819 - Posted: 25 Oct 2009, 15:08:32 UTC

Preserving the work done so far is called checkpointing. Different types of Rosetta tasks checkpoint at different intervals. Overall, most tasks checkpoint about every 15 minutes of runtime.
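To make the idea concrete, here is a minimal Python sketch of checkpointing in general. It is not how Rosetta or BOINC implement it: the file name, the `models_done` field, and the `run_task` function are all hypothetical, and the "work" is a stand-in counter. It only illustrates why a task that has saved a checkpoint can resume where it left off, while a task that has not must start from 0.

```python
import json
import os
import tempfile

def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"models_done": 0}

def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint.

    The rename means a crash mid-write leaves the previous checkpoint intact.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def run_task(path, total_models, stop_after=None):
    """Compute models one at a time, checkpointing after each one.

    stop_after simulates the computer being switched off part-way through.
    Returns (models done in total, models computed during this run).
    """
    state = load_checkpoint(path)
    computed_this_run = 0
    while state["models_done"] < total_models:
        if stop_after is not None and computed_this_run >= stop_after:
            break
        state["models_done"] += 1  # stand-in for the real work on one model
        computed_this_run += 1
        save_checkpoint(path, state)
    return state["models_done"], computed_this_run
```

Calling `run_task` once with `stop_after=3` and then again without it on the same checkpoint file finishes only the remaining models on the second call. If the checkpoint file were never written, the second call would begin again from 0.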

If your computer is on, but perhaps set to only run BOINC when idle, I would suggest you also set your preference to leave tasks in memory while suspended. That way, even if you pop on and off of your computer, the work you've done so far stays in memory for when it can run again and eventually reach a checkpoint.
Rosetta Moderator: Mod.Sense
Mad_Max

Joined: 31 Dec 09
Posts: 207
Credit: 23,387,662
RAC: 11,688
Message 64875 - Posted: 9 Jan 2010, 11:10:35 UTC

It seems that some Rosetta WUs do not checkpoint at all, or do it incorrectly. For example, for a WU named "ha_notyr.....", "CPU time at last checkpoint" still shows "-----" (none) after several hours of computing. If I restart (or shut down) the computer while this WU is running, all results are lost, and after the restart the computation starts from the beginning.

Other WUs report checkpoints in the logs but do not actually checkpoint. For example, my computer crunched a WU named "lr_mix..." (example URL: https://boinc.bakerlab.org/rosetta/result.php?resultid=309128812) for about 3 hours. BOINC Manager showed "CPU time at last checkpoint" correctly (only a few minutes less than the total CPU time), and "show graphics" showed that 38 models were already done. I then shut down the computer, and the next day, when the computer and BOINC/Rosetta started again, this WU restarted from 0% ("show graphics" showed 0 models too), so I aborted it.

This is a significant problem for me, because I NEED to restart this computer from time to time, and it is always shut down while I sleep (because of the noise).
Besides, something similar also often occurs on automatic switching between projects when BOINC runs several projects at once (for example R@H and E@H).

What can be done about this (apart from giving up on Rosetta and moving to other projects)?

P.S.
Sorry for my English. I studied it only at basic school and have not had enough practice.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 64876 - Posted: 9 Jan 2010, 17:17:48 UTC

Mad Max, while you are correct that there are still some types of work units that only checkpoint after a model is completed, I believe your main problem is patience. It takes a work unit a minute or so to really get restarted. So, I think you just aborted it before it had a chance to wake up and realize it had already completed the 38 models. Either way, if you would let it run to completion rather than aborting it, and then post to the appropriate version's thread on the Number Crunching board with a link to the WU, that would be valuable information for the Project Team to have to resolve the problem. I'm sure they see lots of odd results (and aborts), but without that observation and knowledge of what causes them, it is often difficult to understand what areas require correction.
Rosetta Moderator: Mod.Sense
Mad_Max

Joined: 31 Dec 09
Posts: 207
Credit: 23,387,662
RAC: 11,688
Message 64878 - Posted: 9 Jan 2010, 18:40:20 UTC
Last modified: 9 Jan 2010, 18:49:33 UTC

I think I waited long enough: before I cancelled this WU it had time to calculate 2 more models, but they were 2 NEW models (the count went 0, 1, 2), with no trace of the 38 models calculated before the shutdown.
And in any case there is still the question of the other WU type, "ha_notyr...".
One of these is computing right now: BOINC Manager shows 77% progress and "show graphics" shows 297 calculated models, but this task has no checkpoints at all (as with the previous WUs of this type).
Look at the example (screenshot):


Would this be the correct thread for the discussion:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5186 (if I use minirosetta 2.03)?
Should I copy my "report" there?

P.S.
Is the screenshot above displayed properly?
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 64880 - Posted: 9 Jan 2010, 21:55:25 UTC

Yes, I see your screenshot just fine. Yes that would be the thread to post to. It sounds like you have something interesting there. The task is definitely "awake" when it's completed the next model.

You should be seeing a checkpoint saved at the end of every model. Are you familiar with setting options in the cc_config.xml file? I think there is a setting to debug the checkpointing. Perhaps an error was encountered when a checkpoint was attempted.
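As a rough illustration of the cc_config.xml suggestion, the Python sketch below builds a small configuration with a single logging flag switched on. The flag name `checkpoint_debug` is my recollection of the BOINC client's log flags and should be checked against the client configuration documentation for your version; `build_cc_config` is a hypothetical helper, not part of BOINC. The sketch only produces the XML text and does not write it anywhere.

```python
import xml.etree.ElementTree as ET

def build_cc_config(flags):
    """Build cc_config XML text with the given log flags set to 1 or 0."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name, enabled in flags.items():
        ET.SubElement(log_flags, name).text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")

# Assumption: "checkpoint_debug" is the flag that logs checkpoint events.
xml_text = build_cc_config({"checkpoint_debug": True})
print(xml_text)
```

The resulting text would be saved as cc_config.xml in the BOINC data directory, and the client would need to reread its configuration (or be restarted) before the extra messages appear in its log. If the flag works as assumed, a task that never produces a checkpoint message would point to the problem being in the task itself rather than in the shutdown procedure.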
Rosetta Moderator: Mod.Sense
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 64881 - Posted: 9 Jan 2010, 22:05:17 UTC

Any time you turn off your PC, you should first completely shut down the BOINC Manager. So that means right click the icon and Exit. This assures it has closed all of its files first. Is that what you've been doing?
Rosetta Moderator: Mod.Sense
Mad_Max

Joined: 31 Dec 09
Posts: 207
Credit: 23,387,662
RAC: 11,688
Message 64894 - Posted: 10 Jan 2010, 14:52:20 UTC
Last modified: 10 Jan 2010, 14:53:05 UTC

To Mod.Sense:
Yes, that is how I usually shut down (except when the computer hangs completely because of other processes running on it, or after a power failure). Moreover, by "restarts" I meant not only a hard reset of the computer, but also simply shutting down BOINC and starting it again. I did that several times on purpose to try to catch the problem, with the same results.
No, so far I know nothing about working with the cc_config.xml file.

For now I have worked around it like this: I set a small "CPU target time" (currently 2 hours instead of the 10 I used earlier) to keep possible losses of useful computation to a minimum.
But I think that is only a partial solution, and far from the best one...

P.S.
I have moved my "report" on the problem to the appropriate thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5186
So I suggest continuing further discussion there.




©2024 University of Washington
https://www.bakerlab.org