Why can't Rosetta checkpoint more often (compared to WCG)? +feedback

Message boards : Number crunching : Why can't Rosetta checkpoint more often (compared to WCG)? +feedback

John

Joined: 18 Mar 07
Posts: 24
Credit: 0
RAC: 0
Message 38182 - Posted: 23 Mar 2007, 20:52:26 UTC

With Rosetta, my first credit was invalidated after many hours. With WCG, the task showed "finished" but the website had not updated, so I quit distributed computing for a while (a couple of days). Then I came back to WCG to see if it had updated, and it had! After many hours, it had not reset the timer or the checkpoint (at least not significantly).

I'm wondering: can anyone explain how Rosetta's processing differs from WCG's FightAIDS@Home and Genome Comparison (the only two I've run)? Why can those two checkpoint at good intervals, while Rosetta sits at 1% for hours? If I exit and start it again, the timer resets to 0, and I have no idea whether the "actual" progress is 1% or 4 hours' worth.

As an analogy, Rosetta processing seems like a house of cards in the wind: it always needs your "shielding". On a PC, when the "shield" (RAM) goes away, the house of cards goes away too. Is that a fair picture of why Rosetta is quirky?
===
"Q: Progress Percent not advancing?
A: Rosetta recomputes the progress percent at the end of each model."
OK, so why does WCG seem to know how much is total/needed/done? Can someone explain the differences in the workloads?

Q: "To completion" time is going UP!
The answer is not normal. Come on, 1-second increments? How about recalculating it every ~10 minutes or so? That would avoid the jumpiness of download managers and still give an estimator that makes sense.
Nemesis
Joined: 12 Mar 06
Posts: 149
Credit: 21,395
RAC: 0
Message 38184 - Posted: 23 Mar 2007, 21:31:34 UTC - in response to Message 38182.  

You're singing my song!

Maybe this will become my personal crusade - to get the 1% and Completion Time problems fixed.

Right now, there has been no acknowledgement that the Rosetta programmers are working on it, or that they intend to work on it.

BTW, there is an entire thread devoted to this topic.

Nemesis n. A righteous infliction of retribution manifested by an appropriate agent.


John

Joined: 18 Mar 07
Posts: 24
Credit: 0
RAC: 0
Message 38185 - Posted: 23 Mar 2007, 21:45:50 UTC

I realize that.
But I don't think my question has been answered: why doesn't Rosetta checkpoint more often, and why does it reset the timer (and maybe everything else)? Someone said the timer reset is because of "dumping memory", but WCG is also set to "dump memory" and it doesn't reset the time.
Nemesis
Joined: 12 Mar 06
Posts: 149
Credit: 21,395
RAC: 0
Message 38186 - Posted: 23 Mar 2007, 22:24:31 UTC - in response to Message 38185.  

Because Rosetta doesn't checkpoint until the end of a model, a stopped task has to restart from the last completed model. If it is still in the first model, it restarts from the beginning, and the clock starts over as well.

I've never run WCG, but it sounds like it does a checkpoint and saves the crunching time info when you stop it. That's totally up to the science app programmers and how they decide to do it.
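
To make the restart behavior above concrete, here is a minimal Python sketch of per-model checkpointing. It is only an illustration of the idea, not Rosetta's actual code: the function name run_task, the JSON checkpoint file, and the stop_after_steps parameter (which simulates the user shutting down mid-model) are all made up for this example.

```python
import json
import os


def run_task(num_models, checkpoint_path, steps_per_model=1000, stop_after_steps=None):
    """Run models one at a time, saving progress only when a model finishes.

    Hypothetical illustration: `stop_after_steps` simulates an interruption.
    Returns the number of fully completed models.
    """
    # Resume from the last completed model if a checkpoint file exists.
    completed = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            completed = json.load(f)["completed_models"]

    steps_done = 0
    for model in range(completed, num_models):
        for _ in range(steps_per_model):
            if stop_after_steps is not None and steps_done >= stop_after_steps:
                # Interrupted mid-model: the partial work is simply lost.
                return completed
            steps_done += 1  # stand-in for one unit of real computation

        completed = model + 1
        # Write to a temp file and rename, so a crash never leaves a half-written file.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"completed_models": completed}, f)
        os.replace(tmp_path, checkpoint_path)

    return completed
```

If the run is interrupted during the first model, no checkpoint file exists yet, so the next start begins from zero. That is the "hours at 1%, then back to 0" case described in this thread.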

Nemesis n. A righteous infliction of retribution manifested by an appropriate agent.


John

Joined: 18 Mar 07
Posts: 24
Credit: 0
RAC: 0
Message 38187 - Posted: 23 Mar 2007, 22:28:45 UTC

So I'd like to hear from a Rosetta dev why they can't resume work (save often) in the middle of crunching. I mean, we can suspend BOINC and then resume a few minutes later, so why can't we suspend across a shutdown?
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 38196 - Posted: 23 Mar 2007, 23:25:54 UTC - in response to Message 38187.  

So I'd like to hear from a Rosetta dev why they can't resume work (save often) in the middle of a crunching...


See Bin Qian's comments from when checkpointing was originally added to Rosetta almost a year ago. As mentioned in that thread, the new version of BOINC also has new features that try to preempt one project and begin another only at a checkpoint.

Rosetta Moderator: Mod.Sense
John

Joined: 18 Mar 07
Posts: 24
Credit: 0
RAC: 0
Message 38233 - Posted: 24 Mar 2007, 14:37:36 UTC

Thanks for the direct reply, quoted from a year ago... So I'm guessing no progress there?


"Ok... why does WCG's seem to know "how much" is total/needed/done per "work unit"? Can someone explain the differences in the workloads"
Hello?

"Answer is not normal.. Come on, 1 second increments? How about recalculating it every ~10 minutes or something so that you won't have the randomness of download managers but still... a guesser that makes sense."

If any of the download managers behaved like this, they probably wouldn't be downloaded any more. I don't understand why, if Rosetta thinks a task takes 6 hours, it needs to count up in 1-second increments from 4 hours instead of just adjusting dynamically.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 38234 - Posted: 24 Mar 2007, 15:06:09 UTC

Well, Rosetta doesn't count in 1-second increments. It's not as if Rosetta made a calculation every second, showing the time remaining increasing. It's just BOINC showing you *its* guess based on the increasing CPU time. So, 1 second of time passing increases the CPU time used by ~1 second, and BOINC takes the total CPU time used so far, along with the % complete and its history of how long it took you to complete tasks in the past, and shows you the result as the estimated time to completion. You can see this better if you let a task run longer. Later in the run, when % completed is over 50%, the runtime still increases one second at a time, but the estimated time to completion doesn't change every second.

I mention this simply to point out that the numbers you are observing are a level removed from the numbers Rosetta's programs are working with. So it further complicates reaching the goal of a smoothly declining timeline.
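
As a rough illustration of that kind of estimate, here is a small Python sketch. It is NOT BOINC's actual formula, which I don't know in detail. The function name estimate_remaining, the parameter typical_total (a stand-in for the history of past task durations), and the way the two guesses are blended are all assumptions made for this example.

```python
def estimate_remaining(cpu_time, fraction_done, typical_total):
    """Guess the seconds remaining for a task (illustrative only).

    cpu_time:       CPU seconds used so far
    fraction_done:  progress reported by the science app, from 0.0 to 1.0
    typical_total:  hypothetical historical average of total CPU seconds per task
    """
    # Guess 1: history says the task should take `typical_total` in all.
    history_est = max(0.0, typical_total - cpu_time)
    if fraction_done <= 0.0:
        return history_est

    # Guess 2: extrapolate from the progress reported so far.
    progress_est = cpu_time * (1.0 - fraction_done) / fraction_done

    # Assumed blend: trust the progress-based guess more as the task advances.
    return fraction_done * progress_est + (1.0 - fraction_done) * history_est
```

With the percent complete stuck at 1% and a 4-hour history (14400 s), this sketch holds steady at about 14256 s until the CPU time passes 4 hours. After that, each extra second of CPU time adds about 0.99 s to the estimate. So a frozen percent complete is enough to make a derived "time to completion" creep upward without the science app computing anything every second.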

I don't know all the details about how the numbers get revised and how they are communicated back to the BOINC Manager. Nor am I the one who can improve how it works. I'm just trying to explain the parts that I can. The need for, and benefits of, improvement are pretty clear. So, I'm confident we will see some improvements in future releases.
Rosetta Moderator: Mod.Sense
John

Joined: 18 Mar 07
Posts: 24
Credit: 0
RAC: 0
Message 38236 - Posted: 24 Mar 2007, 15:25:45 UTC

Of course, it doesn't always tick up by 1 second; sometimes it goes down a few, though generally it goes up by 1 or more. I've never seen another time predictor (DVD burn, download, XP reformat) that constantly adds time to completion. And even if one did, it adjusted dynamically rather than ticking up toward what it thinks the time is.


"Ok... why does WCG's seem to know "how much" is total/needed/done per "work unit"? Can someone explain the differences in the workloads"
And any thoughts on this?
Angus

Joined: 17 Sep 05
Posts: 412
Credit: 321,053
RAC: 0
Message 38238 - Posted: 24 Mar 2007, 15:42:48 UTC - in response to Message 38234.  

And still, the developers and real project people are strangely silent on this.

No comments or acknowledgements in the 1% thread.

Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :)



"You can't fix stupid" (Ron White)
John

Joined: 18 Mar 07
Posts: 24
Credit: 0
RAC: 0
Message 38245 - Posted: 24 Mar 2007, 17:42:21 UTC

I'd also like to note that WCG on BOINC (not sure about United Devices) also has the "longer time to completion" problem. Yes, I realize that if I come back to the PC after, say, 2 hours it will obviously be lower, but it still makes no sense to tick up in basically +1 increments.
John

Joined: 18 Mar 07
Posts: 24
Credit: 0
RAC: 0
Message 38246 - Posted: 24 Mar 2007, 18:01:57 UTC

http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=11332
"Help Cure Muscular Dystrophy: When the PDB code / Protein Symbols in the 'I' screen left hand bottom change and the 2 proteins in the main graph assume the same colours. Colour changes are pure random, thus could on outside chance assume same colour even if checkpoint was reached. Watch the PDB code change for absolute indication! (See Sample Image and FAQ for description)

Genome Comparison: Approximately every 20 minutes (See Sample Image and FAQ for description)

Help Defeat Cancer: at 25% intervals - writes large files (See Sample Image and FAQ for description)

Human Proteome Folding 2: Occurs after each structure attempt. Look at the graphics, one can see how far along an attempt is. When the 3 line graphs reach the end of the X axis and restart at the left, the structure attempt is complete and a checkpoint occurs (See Sample Image for UD Agent, BOINC Agent and FAQ for description)

FightAIDS@Home: When the Best Energy C graph green line has reached the end and returns to the beginning, whilst rescaling the graph and adding a red line indicating the path of the previous attempt. (See Sample Image and FAQ for description)"

I would imagine that these workloads differ a good amount, yet each one is able to save progress in a mindful manner...
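
For comparison, a time-based scheme like the "approximately every 20 minutes" one quoted above could look roughly like the Python sketch below. This is only my guess at the general idea, not WCG's code. The function name, the state dictionary with its next_step key, and the injected save and clock callbacks are all hypothetical, and it assumes the whole working state can be serialized at any step, which may be exactly what is hard for some science apps.

```python
import pickle


def run_with_interval_checkpoints(state, step, total_steps, save, clock, interval=1200.0):
    """Advance `state` one step at a time, saving it every `interval` seconds.

    state: dict holding all working data, including "next_step" (hypothetical layout)
    step:  function(state, i) that performs step i in place
    save:  function(bytes) that stores a serialized snapshot
    clock: function() returning the current time in seconds
    Returns the number of checkpoints written.
    """
    last_save = clock()
    saves = 0
    for i in range(state["next_step"], total_steps):
        step(state, i)
        state["next_step"] = i + 1  # a restart would resume from here
        if clock() - last_save >= interval:
            save(pickle.dumps(state))
            last_save = clock()
            saves += 1
    return saves
```

The default interval of 1200 seconds matches the 20 minutes quoted for Genome Comparison. A restart would load the last snapshot and continue from its next_step, losing at most one interval of work.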
Rhiju
Volunteer moderator

Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 38254 - Posted: 24 Mar 2007, 20:34:36 UTC - in response to Message 38246.  

Hi John, Mod.Sense, and others:

Thanks for bringing this up. More checkpointing and better time-to-completion feedback were big causes of controversy last year (and again now!) -- we did put checkpointing into larger jobs, but never really addressed the problem of accurately estimating time to completion. We've been too busy getting rid of early bugs and putting new science modes into Rosetta! Things have settled down, though. The development team will discuss both issues early next week.

Thanks,
Rhiju

[STS]LoB

Joined: 18 Mar 07
Posts: 4
Credit: 673,181
RAC: 0
Message 38995 - Posted: 4 Apr 2007, 19:45:50 UTC - in response to Message 38254.  

Hey Rhiju, has Version 5.59 increased the checkpointing frequency? I ask because of the heavily increased rate of updates to the progress display (xx%)...


Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 38997 - Posted: 4 Apr 2007, 21:23:25 UTC

Version 5.59 tackled the % complete. Additional checkpoints will be added in the coming weeks. See Rhiju's post on the Ralph boards.
Rosetta Moderator: Mod.Sense
[STS]LoB

Joined: 18 Mar 07
Posts: 4
Credit: 673,181
RAC: 0
Message 38998 - Posted: 4 Apr 2007, 21:25:39 UTC - in response to Message 38997.  

Thanks!

©2024 University of Washington
https://www.bakerlab.org