Checkpointing under Rosetta Mini

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 60838 - Posted: 27 Apr 2009, 4:48:27 UTC - in response to Message 60831.  

Paul, is it just me? Or does minutes per task blow away the whole rationale for wanting to limit writes in the first place? I mean to say that it should be minutes since BOINC last wrote anything, systemwide, not on a per task basis.

Even with this CPUs times write time you're only getting the desired function ON AVERAGE, assuming that all tasks are attempting to checkpoint all the time (which could not be further from reality).

The age old balance problem ...

What to do and how often to do it.

The point of checkpointing is to be able to recover without losing too much time. That is balanced by the fact that most checkpoints are wasted because they will never be used. So the optimum strategy is to checkpoint based on how much work you are willing to redo. A 30 minute task is almost not worth bothering about because, what the heck, starting over is no big deal.

You can almost make that argument up to about an hour. After that ...

The issue kind of comes in when you consider that on an 8 core system a one hour checkpoint rate means that the machine as a whole is checkpointing, on average, once every 7.5 minutes (assuming running tasks more than an hour in length).
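To put numbers on that, here is a minimal sketch of the arithmetic. It assumes every running task checkpoints exactly once per interval and that the tasks are evenly staggered, which is an idealisation; the function name is made up for illustration.

```python
def effective_interval_minutes(per_task_interval_min: float, running_tasks: int) -> float:
    """Average time between checkpoint writes across the whole machine.

    Assumes each task checkpoints once per interval and the tasks are
    evenly staggered (an idealisation, not measured behaviour).
    """
    if running_tasks < 1:
        raise ValueError("need at least one running task")
    return per_task_interval_min / running_tasks


# One-hour per-task setting on machines of increasing width.
for cores in (1, 2, 4, 8, 16):
    print(f"{cores:>2} tasks -> one write every "
          f"{effective_interval_minutes(60, cores):.2f} min")
```

The 8-task row gives the 7.5 minutes quoted above; at 16 tasks the same setting works out to under 4 minutes between writes.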

That is why the science apps are supposed to check back for the time and checkpoint themselves. Really, allowing BOINC to tell the tasks to checkpoint makes the tasks harder to write ... the easy way is to set a limit and say, check back with me to see if it is time to checkpoint ... puts the common code in the client ... easier science applications.
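A rough sketch of that "check back with me" pattern, as a small simulation. The class and method names here are hypothetical stand-ins and are not the BOINC API; the point is only that the interval logic lives on the client side and the application just asks.

```python
class Client:
    """Hypothetical stand-in for the client's side of the handshake."""

    def __init__(self, min_interval_s: float):
        self.min_interval_s = min_interval_s
        self.last_checkpoint = {}  # task name -> time of its last checkpoint

    def time_to_checkpoint(self, task: str, now: float) -> bool:
        # The client only advises; the task decides whether to act on it.
        last = self.last_checkpoint.get(task, 0.0)
        return now - last >= self.min_interval_s

    def checkpoint_completed(self, task: str, now: float) -> None:
        self.last_checkpoint[task] = now


def run_task(client: Client, name: str, total_steps: int, step_s: float) -> list:
    """Science-app loop: do a unit of work, then ask whether to checkpoint."""
    writes = []
    now = 0.0
    for _ in range(total_steps):
        now += step_s  # simulated unit of work
        if client.time_to_checkpoint(name, now):
            writes.append(now)  # a real application would save its state here
            client.checkpoint_completed(name, now)
    return writes


print(run_task(Client(min_interval_s=60), "task_a", total_steps=30, step_s=10))
```

Note that the timer in this sketch is kept per task, which is exactly why the writes of several tasks do not line up with each other.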

But it also points up a flaw in BOINC that many of the design decisions are not scaling well as computers get faster and wider with more processing elements. I think we will find a new balance in time, but it is slow because, well, too many don't want to recognize the changes in the landscape ... also because you also have to have a more modern system to see the issues.

Don't know if I answered the question, or talked about everything else you did not want to know ... :)
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 60842 - Posted: 27 Apr 2009, 13:24:02 UTC
Last modified: 27 Apr 2009, 13:31:10 UTC

My point was simply that the "write at most" setting is apparently my only input into BOINC to inform it of my willingness to lose work. I'm going to base my setting for that upon how I use my machine. If I frequently turn it off or end BOINC, I'm going to tend to use a lower time setting there. And applications like Rosetta are going to write to disk regardless of the setting when they reach the end of a model (which is sometimes every 5 minutes or so). And I presume all applications write to disk upon completion of a task. So, if I want to reduce writes to my hard drive (or network drive, or flash drive), and I run 24x7, I might crank the "write at most" setting up to an hour.

But what did it really accomplish? If I've got 8 CPUs running, there is constantly going to be one of them with some reason to write to disk. And if they don't, the client will decide to update the state files or something.

Based on observation of the machine, BOINC will behave as though I've made no change to the setting at all. And I'm pretty sure BOINC doesn't even piggyback a checkpoint for one task onto the disk activity triggered by another task that has completed a model or reached task completion. So, I really don't see the point of giving me a dial that's not really hooked up to anything that is controlled. Why show me a tuner on a radio, if the receiver only gets one station?
Rosetta Moderator: Mod.Sense
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 60862 - Posted: 28 Apr 2009, 8:43:39 UTC

Hmm, OK, yes, well ...

The seed of the answer is in your summation.

If you want BOINC to respect your settings better, change projects ... :)

As you note, Rosetta ignores this setting under a number of conditions. On a "wide" system this means that the effective write rate is a lot higher than the desired rate.

I am having this argument on the mailing lists with regard to two other issues where the BOINC default behavior is suboptimal from a system perspective because the choice does not reflect current machine speeds or capacities. This is just more of the same.

In a three hour period (180 minutes), reflective of the concern you have, I counted this log message on each of my systems:

Request enforce CPU schedule: Checkpoint reached
86   379   129   92   302

That is one count per machine, with the machines roughly sorted from slowest to fastest.

This shows that on some of the systems I am doing a disk write at roughly 30 second intervals, which is kinda the issue that concerns you. Note that I did not track the non-checkpoint writes, and I am not even sure that they are noted as such by the science application.
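For anyone who wants to check the arithmetic, a quick sketch that turns the five counts above into an average interval per machine. It only divides the 180 minute window by each count; it says nothing about how evenly the checkpoints were actually spread.

```python
WINDOW_S = 180 * 60  # the three hour observation window, in seconds

# "Checkpoint reached" counts per machine, slowest to fastest, from the post.
counts = [86, 379, 129, 92, 302]


def mean_interval_s(count: int, window_s: int = WINDOW_S) -> float:
    """Average number of seconds between checkpoints over the window."""
    return window_s / count


for c in counts:
    print(f"{c:>4} checkpoints -> one every {mean_interval_s(c):.1f} s on average")
```

The busiest machine (379 checkpoints) comes out at roughly 28.5 seconds between checkpoints, and the quietest (86) at a little over two minutes.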

Sorry I don't have better news ... and even sorrier that the response to my notes has been for good old John McLeod VII to suggest that since I bring it up as an issue, and he and others don't want to hear about it, well, they are going to ignore me ... so what else is new ... sigh ...

Sad days ... no wonder BOINC does not progress very fast ...

Oh, and I looked at systems from both Dell and Apple that have 16 virtual CPUs, dual Xeons with 4 cores and HT capability. Think on that as a system with 3 GTX 295 cards ... 22 processing elements. Add in QCN and FreeHAL and you could have quite a number of tasks in flight at the same time. The disk would never quit. Of course, I am also thinking of getting one of the new solid state disks ... but I digress ...
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 60867 - Posted: 28 Apr 2009, 13:19:30 UTC

I don't even mind the exceptions to the rule. I just feel that once it DOES cut out and write to disk, it should write everything that has been buffered at that time, and then begin a new timer for my write interval at which point it will plan to write anything buffered between now and then, unless another exception occurs during the interval. Point being that the timer is on the disk drive, not on each of the many tasks running.
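A minimal sketch of the behaviour being asked for here, purely to illustrate the idea. It is not how BOINC works, and every name in it is hypothetical: one timer belongs to the disk, any forced write flushes everything that is buffered, and the timer restarts from that moment.

```python
class DiskWriteCoordinator:
    """Hypothetical coordinator: one timer for the disk, not one per task."""

    def __init__(self, interval_s: float):
        self.interval_s = interval_s
        self.last_flush = 0.0
        self.pending = set()  # tasks with buffered, unwritten state
        self.flushes = []     # (time, task names) for each disk write

    def buffer(self, task: str) -> None:
        """A task has state it would like written at the next opportunity."""
        self.pending.add(task)

    def _flush(self, now: float) -> None:
        # Write everything that is buffered in one go, then restart the timer.
        if self.pending:
            self.flushes.append((now, sorted(self.pending)))
            self.pending.clear()
        self.last_flush = now

    def tick(self, now: float) -> None:
        """Called periodically; flushes only when the disk timer has expired."""
        if now - self.last_flush >= self.interval_s:
            self._flush(now)

    def forced_write(self, task: str, now: float) -> None:
        """An exception to the rule, e.g. end of a model or task completion."""
        self.pending.add(task)
        self._flush(now)


coord = DiskWriteCoordinator(interval_s=3600)
coord.buffer("a")
coord.buffer("b")
coord.tick(600)               # timer not expired, nothing written
coord.forced_write("c", 900)  # c finishes a model; a and b ride along
coord.buffer("a")
coord.tick(3600)              # only 2700 s since the last write, still quiet
coord.tick(4500)              # a full interval since 900 s, a is written
print(coord.flushes)
```

In this run the disk is touched twice, at 900 s and at 4500 s, instead of once per task per interval.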
Rosetta Moderator: Mod.Sense
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 60881 - Posted: 29 Apr 2009, 7:54:21 UTC - in response to Message 60867.  

I don't even mind the exceptions to the rule. I just feel that once it DOES cut out and write to disk, it should write everything that has been buffered at that time, and then begin a new timer for my write interval at which point it will plan to write anything buffered between now and then, unless another exception occurs during the interval. Point being that the timer is on the disk drive, not on each of the many tasks running.

Um, yes, well ...

The problem is that the writes are from the science applications which are asynchronous with each other. The applications can talk to the BOINC Client, but not each other. So, they cannot synchronize.

Even then, the client only says, you should checkpoint now, not that you must checkpoint now.

Again, this is one of the places where the design does not scale well with system growth.

I see almost continual disk blinks which are almost certainly from the operation of BOINC as that is the only thing that is really running on those systems.