More checkpoints please.

Jack Shaftoe
Joined: 30 Apr 06
Posts: 115
Credit: 1,307,916
RAC: 0
Message 27182 - Posted: 17 Sep 2006, 20:02:22 UTC
Last modified: 17 Sep 2006, 20:07:00 UTC

On a Windows system I run, I just opened BOINC Manager to see what was happening with the current tasks. I have 6-hour WUs, and the current WU was just over 4 hours done - plenty of pending work, seems to be working well. Then I made a mistake, and instead of closing BOINC Manager by hitting the X in the top right, I accidentally went File > Exit. Oops! I've done this with BOINC a few times over the past couple of months, so I knew I was going to lose some work for it (because it kills the app).

Unfortunately I dropped back 3 hours to just over 1 hour completed.

Please include more checkpoints in the next release - every 1 hour would be great. Leaving the app in memory is not an option for a fair number of my machines, since I don't want Rosetta eating up resources while other users need the hardware. I hate seeing that much work wasted!

Thanks for reading.

ST. (back on Rosetta now that it has cooled off around here)
Team Starfire World BOINC
The_Bad_Penguin
Joined: 5 Jun 06
Posts: 2751
Credit: 4,271,025
RAC: 0
Message 27218 - Posted: 17 Sep 2006, 23:36:32 UTC - in response to Message 27182.  
Last modified: 17 Sep 2006, 23:37:19 UTC

Even with leaving Rosetta in memory, I've had one or two instances of losing over an hour's work.

I "second" the request for more frequent checkpoints.

Leaving the app in memory is not an option for a fair number of my machines, since I don't want Rosetta eating up resources while other users need the hardware.

casio7131

Joined: 10 Oct 05
Posts: 35
Credit: 149,748
RAC: 0
Message 27240 - Posted: 18 Sep 2006, 2:47:19 UTC

checkpointing can only be done usefully at certain stages, so it's not simply a matter of adding in more frequent checkpoints.

from Message 14300

User-controllable checkpointing would be ideal. Unfortunately, the current checkpointing machinery can only run at certain stages of the modeling process. In a nutshell, the process has to reach a stage where the previous search history (which would be a huge amount of data if we recorded all of it) can be discarded, so that we only need to checkpoint a minimal amount of data for future searches. We cannot yet checkpoint at an arbitrary point in the modeling process.

I've implemented Feet1st's idea below: when the WU reaches a stage where checkpointing is possible, it checks how long it has been since the last checkpoint. If it has been over 20 minutes, it checkpoints.
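The rule described in the quoted message can be sketched roughly as follows. This is only an illustration, not Rosetta's actual code: the class name `StageCheckpointer`, the method `at_safe_stage`, and the in-memory `saved` list (standing in for a file on disk) are all hypothetical. Only the 20-minute threshold comes from the post.

```python
import time

# The 20-minute threshold mentioned in the quoted post.
CHECKPOINT_INTERVAL_S = 20 * 60


class StageCheckpointer:
    """Checkpoint only at safe stages, and only if enough time has passed."""

    def __init__(self, interval_s=CHECKPOINT_INTERVAL_S, clock=time.monotonic):
        self.interval_s = interval_s
        self.clock = clock
        self.last_checkpoint = clock()
        self.saved = []  # hypothetical stand-in for checkpoint files on disk

    def at_safe_stage(self, minimal_state):
        """Call whenever the search history can be discarded.

        Returns True if a checkpoint was taken, False if it was skipped
        because the previous checkpoint is still recent.
        """
        now = self.clock()
        if now - self.last_checkpoint > self.interval_s:
            self.saved.append(dict(minimal_state))
            self.last_checkpoint = now
            return True
        return False
```

Note that the application never checkpoints on a timer alone. The timer only decides whether a safe stage that has already been reached is worth using, which is why long stretches without a safe stage can still lose more than 20 minutes of work.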
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 27324 - Posted: 18 Sep 2006, 16:43:51 UTC

I'm all for more checkpoints as well. No one wants to lose an hour or more of work. However, as casio points out, it requires significant change to the application to do so, and such changes are risky, especially when your application is deployed across 50,000 machines. So, I would expect it to take significant time for any such change to come about.

Hopefully they can find improvements to the algorithms which will allow them to see further into the future about which paths are going to be fruitful and which will not. This will make everything more efficient AND create more frequent points in execution where the "previous searching history can be discarded".
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
mage492

Joined: 12 Apr 06
Posts: 48
Credit: 17,966
RAC: 0
Message 27378 - Posted: 18 Sep 2006, 19:13:17 UTC
Last modified: 18 Sep 2006, 19:15:11 UTC

Okay, the way I see this (If I'm incorrect, please let me know!), any program could theoretically be paused/restarted in the following way:

1. Write all variables, arrays, etc. to the hard disk (such as step number and the current state of the protein).

2. Write the last instruction completed to the hard disk.

3. Terminate the program.

It could be resumed by "filling in" the variables and starting right where it left off.
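The save-and-resume steps above can be sketched for a toy program. This is a minimal illustration under strong assumptions: the whole state fits in one small dictionary, and the "last instruction completed" is just a step counter. The function names (`save_checkpoint`, `load_checkpoint`, `run`) are hypothetical and have nothing to do with Rosetta or the BOINC API.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace is atomic, so a kill mid-write never leaves a half-written file.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, stop_after=None):
    """Toy 'simulation': sum the step numbers, resuming from a checkpoint if present."""
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)  # steps 1 and 2: variables plus position
        if stop_after is not None and state["step"] == stop_after:
            return None  # step 3: simulate the program being terminated
    return state["total"]
```

Calling `run(path, 10, stop_after=4)` and then `run(path, 10)` gives the same total as an uninterrupted run. It works here only because the entire state is two integers and the program has a single loop.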

I'm sure we've all played board games (such as Monopoly) and been interrupted. It's fairly common to write down how much money each person has, the positions of pieces/houses/hotels, and whose turn just took place, then leave. When you come back, you've written down enough information to easily re-assemble the game and continue where you left off.

The catch is that this would take significantly longer than the current method of checkpointing. There is more to write out to the hard drive. Also, you wouldn't want this to be a routine thing, as it would be a significant increase in overhead. Perhaps you could click a "Do Checkpoint NOW" button that would do this. That way, if you know you'll be restarting soon (if you've just done system updates, for example), you can tell it to checkpoint in this fashion, so that you don't lose any work.

Obviously, implementing this might be a challenge, but it's theoretically possible, I think. Looking at this description, though, perhaps this would be something BOINC would do, rather than Rosetta?

Edit: Minor spelling error
"There are obviously many things which we do not understand, and may never be able to."
Leela (From the Mac game "Marathon", released 1995)
dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 27404 - Posted: 18 Sep 2006, 19:55:31 UTC

Well, two different signals, two different traps, two different checkpoint types. Since I've never seen the code, I have an uninformed opinion.
dag
--Finding aliens is cool, but understanding the structure of proteins is useful.
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 27417 - Posted: 18 Sep 2006, 20:33:15 UTC

No, it is something Rosetta would do. You are absolutely correct, it is technically possible. But it makes the code more complex, and takes time away from devising better techniques for solving the protein structure. It makes your code run slower when you don't need all the checkpoints. It also makes your download slightly larger... there are lots of reasons, large and small, that it hasn't already been done.

You also have to keep in mind that the outline you present, to save all the variables etc., would have to be written into the program in MANY places. And the way programs work, variables and routines have scope. If "A" calls "B" calls "C" and you crunch in C for 90 minutes... it's not easy to obtain the variables that aren't scoped to that present call level. And to reinstate all the variables, you'd basically have to call into all the routines which had been active. It gets very complicated.
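A toy version of the A calls B calls C problem, to make the scoping point concrete. The routine names and loop sizes are made up for illustration. The point is that C can only record B's loop position because B passes it down explicitly, and that resuming means re-entering A and B with extra "resume" logic at every level.

```python
def routine_c(outer, inner_start, visited, checkpoints):
    """Innermost routine, where the long crunch happens."""
    for inner in range(inner_start, 3):
        visited.append((outer, inner))
        # C can only record B's loop index because B passed it in explicitly.
        checkpoints.append({"outer": outer, "inner": inner + 1})


def routine_b(visited, checkpoints, resume=None):
    outer_start = resume["outer"] if resume else 0
    for outer in range(outer_start, 2):
        # On resume, only the first re-entered iteration starts part-way through.
        inner_start = resume["inner"] if resume and outer == outer_start else 0
        routine_c(outer, inner_start, visited, checkpoints)


def routine_a(resume=None):
    """Top-level routine; a restart has to come back in through here."""
    visited, checkpoints = [], []
    routine_b(visited, checkpoints, resume)
    return visited, checkpoints
```

With two levels and two integers this is manageable. Every additional call level, local array, or partially built data structure adds another parameter to thread down and another resume branch to get right.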

I wonder! Would it be possible to somehow write out just the timestamps? Don't call it a checkpoint, but use this timestamp or some other means to, once you restart, calculate how much crunch time was lost and report that figure back with the results? That would give the project team a specific and concrete figure to look at to estimate the improvement in project efficiency if they had more checkpoints.

It would help get a big picture. If the data shows the "average" cruncher has the machine on 24 hours a day, and doesn't lose work, then at least you know. The other outcome would be that you find 10% of a day's work, project-wide, is lost.
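The timestamp idea could look something like the sketch below. Everything here is hypothetical: the file format, the function names, and the choice to record CPU seconds rather than wall-clock time. It only shows that a very cheap "heartbeat" write is enough to compute how much crunch time was thrown away on restart.

```python
import json
import os


def write_heartbeat(path, cpu_seconds):
    """Cheaply record how far the task has crunched; this is not a checkpoint."""
    with open(path, "w") as f:
        json.dump({"cpu_seconds": cpu_seconds}, f)


def lost_seconds_on_restart(heartbeat_path, checkpoint_cpu_seconds):
    """Crunch time that must be redone: last heartbeat minus last real checkpoint."""
    if not os.path.exists(heartbeat_path):
        return 0.0
    with open(heartbeat_path) as f:
        last_seen = json.load(f)["cpu_seconds"]
    return max(0.0, last_seen - checkpoint_cpu_seconds)
```

Using the numbers from the first post, a heartbeat at 4 hours and a last checkpoint at 1 hour would report 3 hours (10,800 seconds) lost, which is the figure that could be sent back with the result.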

I can tell you, I expected the TFLOPS of the project to jump about 10% when they implemented the checkpoints that they now have... and it didn't jump measurably at all. So my conclusion at the time was that most machines were already set to leave applications in memory, and were running all day without powering down the machine.

Footnote: I'm not a member of the project team. I've not seen the code either. But I am a computer programmer, and so I have a good grasp of how such things are done.
mage492

Joined: 12 Apr 06
Posts: 48
Credit: 17,966
RAC: 0
Message 27425 - Posted: 18 Sep 2006, 21:18:45 UTC
Last modified: 18 Sep 2006, 21:19:47 UTC

True. I guess we would have to see the project code to really make an informed decision. I like the idea of timestamps, though. Personally, I know that I don't lose much of anything, as both of my computers currently running Rosetta run 24-7 (neither is a completely dedicated cruncher, but they don't do anything else that's particularly intensive). But for family computers (especially if BOINC is only set up to run when a single user is logged in), the loss might be much worse.

Regarding variable scope and the need to add that code in several different places, here's another possibility. In Linux and OS X, I know that when a program crashes, it does a "core dump", where it writes the entire memory image of that program to disk (so that it can be run through a debugger, for example). Perhaps we could do a similar type of thing here. If we wrote the raw memory image to the hard drive, then later recalled it, that would take care of issues like variable scope, wouldn't it? This is an area I don't really know much about (I typically write database tools and small utilities), but I figured I'd throw the idea out there.

(Even if it turns out to only be a minor issue, it's still a fun theoretical challenge!)
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 27613 - Posted: 19 Sep 2006, 22:55:39 UTC

Taking pages of memory and dumping them out is one thing... putting them all back and saying to the operating system "ok now THAT is my stack" is yet another. Also, programs often use pointers, and you wouldn't have a means of correcting them to point to where the restored data resides.

Again... it is POSSIBLE to do... but very laborious, error-prone, and it would consume resources on the PC to perform. So, it is always a trade-off. Do you create more checkpoints and endure the performance hit on all of the machines crunching for the project? Or do you take the hit of losing more work than you might otherwise lose on those PCs that are not running 24/7? Keep in mind that more checkpoints still doesn't mean you don't lose any work. It is just a step to reduce how much work is lost. So my idea about attempting to quantify how much work is presently being lost is really the key starting point, I believe.
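The trade-off can be put into rough numbers. The model below is an assumption for illustration, not project data: interruptions are taken to land uniformly between checkpoints (so each one loses half an interval on average), and each checkpoint is given a fixed, made-up cost in seconds.

```python
def daily_cost_seconds(interval_s, checkpoint_cost_s, interruptions_per_day,
                       crunch_s_per_day=86400):
    """Seconds per day spent on checkpoint overhead plus expected lost work.

    Assumes each interruption loses, on average, half a checkpoint interval.
    """
    overhead = (crunch_s_per_day / interval_s) * checkpoint_cost_s
    expected_loss = interruptions_per_day * interval_s / 2
    return overhead + expected_loss
```

With a hypothetical 5-second checkpoint cost, a machine interrupted once a day pays 1,920 seconds with hourly checkpoints and 960 seconds with 20-minute checkpoints. A machine that is never interrupted pays 120 seconds versus 360 seconds, so for 24/7 hosts the extra checkpoints are pure overhead. Which case dominates project-wide is exactly what measuring the lost work would reveal.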
mage492

Joined: 12 Apr 06
Posts: 48
Credit: 17,966
RAC: 0
Message 27646 - Posted: 20 Sep 2006, 3:58:26 UTC

Feet1st, I agree on the theoretical vs. practical issue with checkpointing on-demand. I thought I had a workaround, with that last one, but back to the drawing board, it seems.

The timestamp idea sounds like a good one, as well. I think it would be especially useful if this information could be made available on the host information page. This might make troubleshooting some things easier (such as the "leave in memory" issue people sometimes have).
