Message boards : Number crunching : Remembering last position after exit
Author | Message |
---|---|
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
Hi, I am sure this must have been raised elsewhere, but is there any way for me to force Rosetta to remember its last position before I exit the BOINC client for the day? Last night I was working on AB_CASP6_t248_465_6867_0, which had reached 18.53% complete and was part way through its 4th model. This morning - after starting BOINC again - I found it had dropped to 18.50% and a check of the graphics showed it had also dropped to step 0 of model 4. Whilst in this particular case it had only lost 15 minutes of processing time, I expect this loss is more significant when the number of users is factored in. I hope someone has the answer or can point me in the direction of an existing thread. Thanks. |
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
There's no way you can force it to. "Checkpoints" are built into the system, and when you restarted it after stopping, it fell back to the last checkpoint. Do you have the application "kept in memory"? |
anders n Send message Joined: 19 Sep 05 Posts: 403 Credit: 537,991 RAC: 0 |
Hi Murasaki, welcome to Rosetta. This thread hopefully gives you the info you seek. https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1449 Anders n |
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
Thanks for the information. I was thinking it was a problem with the checkpoints because the FAQs say "checkpoints occur each time the percentage advances" and the WU had lost 0.03%. However, if this is how checkpoints work then that is fine. I have now set my preferences to stay in memory and switch applications every 120 minutes as suggested in the FAQs, so hopefully the effect of this problem will reduce. Thanks again. |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
"checkpoints occur each time the percentage advances" I don't think that is strictly accurate. Especially now that they fairly recently made the percentage advance more frequently. Could you post a reference to where you saw that? Was that in the FAQs? It should probably be reworded. But yes, what you saw is normal. The project just released new code which does the "checkpoints" much more frequently. As a result you "only" lost a tiny fraction of a model, rather than an hour (or more) of work. And so this new software will improve the efficiency of most all the clients in the user base, YEAH! We should see credit issued per day take a nice upward trend this week. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
"checkpoints occur each time the percentage advances" I spotted it in the answer to: Why should I set BOINC to keep application in Memory? |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Thanks for the reference, it says (in part): However, if the application is removed from memory during an application swap, it will lose the work performed since the last checkpoint. **In the case of Rosetta the checkpoints occur each time the percentage advances.** If you do not keep the application in memory, and you set the swap interval to less than the time it takes your machine to reach a checkpoint, the work units can appear to be 'hung'. I suggest a moderator remove the sentence I highlighted in bold. It would be nice if we could get a clearer description of when the new checkpoints are now possible. Are they evenly dispersed across a WU? Or each "Stage"? Or based on how the energy estimates progress? But I think we just proved that the sentence as it stands now is not accurate. Thanks for posting back Murasaki. The project is already improved with you here :) Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
Thanks for the reference, it says (in part): The checkpoints occur "about" every 20 min. When a checkpoint occurs the percent will advance. If you stop the Work Unit in a way that removes it from memory, when it restarts it will fall back to the last checkpoint. The time of 20 min. is an estimate based on a project benchmark system. Faster systems do it more often than slower ones. This is precisely the behavior you have described. The Work Unit had checkpointed, it continued to process for some period of time short of the next checkpoint, and you removed it from memory. When you restarted it, it began processing at the last checkpoint, and the percent complete fell back accordingly. EDIT: OK, I see the problem. There are now smaller increments to the percent increase, so the percent can increase by .001. The checkpoints do not occur at that degree of fineness. The small increments were added as a diagnostic tool to identify the hang point on the Work Units. I will have to think about how this should read, because when the work unit checkpoints the percent does increase. But it is still possible for the percent to fall back if the work is interrupted. Moderator9 ROSETTA@home FAQ Moderator Contact |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Well, if I understand it correctly, they took me up on my suggestion. Leave the checkpoints in ALL WUs (not just the long ones), and if you hit one of these points where you are ABLE to checkpoint, and it's been more than 20 min since your last checkpoint, then do so. That was a balance between wasting resources writing all the checkpoints, and losing work due to not having enough. And it meant that the shorter proteins might go through 60% of a model before a checkpoint, and a longer protein might checkpoint after say 10% of a model. But I never heard how frequently they are ABLE to take a checkpoint. Here's Bin's post about it. This basically means you will checkpoint not MORE often than every 20 min (unless you hit the end of a model, I suppose)... but it doesn't say how long it might be before you reach "a stage where checkpointing is possible". Bottom line, how many checkpointable stages are there in a model? Or does it depend upon how large the protein is? Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
Well, if I understand it correctly, they took me up on my suggestion. Leave the checkpoints in ALL WUs (not just the long ones), and if you hit one of these points where you are ABLE to checkpoint, and it's been more than 20 min since your last checkpoint, then do so. It has a lot to do with the protein size. There is no set number of checkpoints for a model. It is also not a hard set time. My understanding is that it is based on a combination of factors, but it will generally work out to around 20 min. It varies significantly with machine speed. Moderator9 ROSETTA@home FAQ Moderator Contact |
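
For anyone who wants to see the rule discussed above in concrete form, here is a minimal sketch in Python. It is not Rosetta's or BOINC's actual code: the `WorkUnit` class, its method names, and the stage timings are all made up for illustration, and the 20-minute interval is taken from the "about 20 min" figure quoted in the thread. It only shows the two behaviors described: a checkpoint is written at a checkpointable stage if more than the interval has passed since the last one, and a restart from disk falls back to the last saved progress.

```python
from dataclasses import dataclass

# Assumed from the "about 20 min" figure in the thread; the real value may differ.
CHECKPOINT_INTERVAL_S = 20 * 60


@dataclass
class WorkUnit:
    """Hypothetical stand-in for a running work unit (not Rosetta's data model)."""
    progress: float = 0.0            # percent complete held in memory
    saved_progress: float = 0.0      # percent complete stored at the last checkpoint
    last_checkpoint_time: float = 0.0

    def reach_checkpointable_stage(self, now: float) -> bool:
        """Write a checkpoint only if more than the interval has elapsed."""
        if now - self.last_checkpoint_time > CHECKPOINT_INTERVAL_S:
            self.saved_progress = self.progress
            self.last_checkpoint_time = now
            return True
        return False

    def restart_from_disk(self) -> None:
        """Removed from memory and restarted: fall back to the last checkpoint."""
        self.progress = self.saved_progress


def simulate() -> tuple:
    """Replay a made-up run that ends like the one in the first post."""
    wu = WorkUnit()
    # (seconds since start, percent complete at that checkpointable stage)
    stages = [(600, 5.0), (1300, 11.0), (1900, 16.0), (2600, 18.50)]
    for now, percent in stages:
        wu.progress = percent
        wu.reach_checkpointable_stage(now)
    wu.progress = 18.53          # work continues past the last checkpoint
    before_exit = wu.progress
    wu.restart_from_disk()       # client exited without keeping the app in memory
    return before_exit, wu.progress
```

Calling `simulate()` replays a run where the percent shown before exit is higher than the percent shown after restart, because the work done since the last checkpoint was never written to disk. Keeping the application in memory avoids the `restart_from_disk` path for ordinary application swaps, but not for a full client exit.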
©2024 University of Washington
https://www.bakerlab.org