Checkpointing under Rosetta Mini

Message boards : Number crunching : Checkpointing under Rosetta Mini

SekeRob

Joined: 7 Sep 06
Posts: 35
Credit: 19,984
RAC: 0
Message 60621 - Posted: 14 Apr 2009, 11:14:55 UTC

Noting the denser checkpointing: the application does not appear to check with the core client whether writing a checkpoint is permitted under the preferences. Set "write to disk at most" to 5 minutes and the mini happily ignores it and writes one every few minutes. Other projects such as WCG respect the 5-minute restriction and write at the first checkpoint to occur after 5 minutes have passed. Client 6.6.23.
Coelum Non Animum Mutant, Qui Trans Mare Currunt
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 60625 - Posted: 14 Apr 2009, 14:27:42 UTC
Last modified: 14 Apr 2009, 14:33:39 UTC

forza, let's open a new thread for the checkpointing issue you describe since the enhanced checkpointing has been around for quite some time, and this could get to be an extended discussion.

Moved from Minirosetta v1.54 bug report thread
Rosetta Moderator: Mod.Sense
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 60626 - Posted: 14 Apr 2009, 14:28:39 UTC
Last modified: 14 Apr 2009, 14:34:48 UTC

I wanted to discuss this topic in more detail, so opened this new thread.

Could you describe how you are confirming when a checkpoint is taken? I've found that the cc_config checkpoint_debug messages are misleading. They show every time the application has requested a checkpoint, but the writes are actually buffered until the "write to disk at most" wait time is satisfied.

What I would suggest is reviewing the last-modified dates/times of the files in the slot directory the task is running in.
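A minimal sketch of that check, in Python. It lists each file in a slot directory with its last-modified time, newest first, so the timestamps can be compared against the checkpoint_debug log entries. The function names and the example path are assumptions for illustration; substitute the slot directory of the running task.

```python
import os
import time

def list_mtimes(slot_dir):
    """Return (mtime, filename) pairs for regular files in slot_dir, newest first."""
    entries = []
    for name in os.listdir(slot_dir):
        path = os.path.join(slot_dir, name)
        if os.path.isfile(path):
            entries.append((os.path.getmtime(path), name))
    return sorted(entries, reverse=True)

def print_mtimes(slot_dir):
    """Print one line per file: local timestamp, then file name."""
    for mtime, name in list_mtimes(slot_dir):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
        print(stamp, name)

# Example (hypothetical path): print_mtimes("slots/0")
```

Running it a few times over half an hour shows whether the files really change more often than the preference allows.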

What version of BOINC are you running?
Rosetta Moderator: Mod.Sense
SekeRob

Joined: 7 Sep 06
Posts: 35
Credit: 19,984
RAC: 0
Message 60627 - Posted: 14 Apr 2009, 16:45:26 UTC - in response to Message 60626.  

I wanted to discuss this topic in more detail, so opened this new thread.

Could you describe how you are confirming when a checkpoint is taken? I've found that the cc_config checkpoint_debug message are misleading. It shows every time the application has requested a checkpoint, but they are actually buffered until the "write to disk at most" wait time is satisfied.

What I would suggest is review of the last change date/times on the files in the slot directory the task is running in.

Well, that's the first I've heard of that one. For example, RICE and DDDT reach checkpoints every 1 to 2 minutes, but only write and log them, as said, after the set time.

AFAIK, the science app can include a call to check whether a checkpoint is permitted. QMC is another one that does not have the call in theirs, by their own confirmation when I last exchanged messages with them; they promised to implement it, which is now a good many months ago.

As for buffered and waiting, that is not the description I found. The checkpoint is skipped and the next one up will show, so if the checkpoint were to occur every 4 minutes and the "at most" is 5 minutes, the next checkpoint written to disk is the one occurring at the 8th minute.
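The skip behaviour described here can be sketched in a few lines of Python. This is only a model of the description, under the assumption that the interval timer starts when the task starts; the function name is made up.

```python
def written_checkpoints(attempt_every, min_interval, run_time):
    """Times (in minutes) at which a checkpoint attempt is actually written,
    when attempts that come too soon after the last write are skipped."""
    writes = []
    last_write = 0  # assumption: the timer starts when the task starts
    t = attempt_every
    while t <= run_time:
        if t - last_write >= min_interval:
            writes.append(t)
            last_write = t
        t += attempt_every
    return writes

# Attempts every 4 minutes, "at most" set to 5 minutes:
print(written_checkpoints(4, 5, 24))  # [8, 16, 24]
```

The attempt at minute 4 is skipped, so the first write lands at minute 8, and every second attempt is written after that.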

Mind you, I may have discovered an unrelated bug for science apps that do follow the rule of asking the core client (6.6.23). I just reported at the BOINC dev forum that projects are being preempted continuously, every minute, and that some projects treat the 5-minute "at most" on a quad as 4 x 5 minutes before writing again, meaning after 20 minutes.

Anyway, the point is, is your application asking the core client for permission?
Coelum Non Animum Mutant, Qui Trans Mare Currunt
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 60628 - Posted: 14 Apr 2009, 17:02:29 UTC

Anyway, the point is, is your application asking the core client for permission?


I believe it is. I am not a Rosetta coder to know for certain. This was posted during the testing on Ralph. I believe that the end of a model might be the only exception. I believe that is not really a checkpoint per se and that a write will always occur when the task reaches that point.

What are you seeing for file revision times in your slots directory?
Rosetta Moderator: Mod.Sense
SekeRob

Joined: 7 Sep 06
Posts: 35
Credit: 19,984
RAC: 0
Message 60629 - Posted: 14 Apr 2009, 17:42:14 UTC - in response to Message 60628.  
Last modified: 14 Apr 2009, 17:52:25 UTC

Anyway, the point is, is your application asking the core client for permission?


I believe it is. I am not a Rosetta coder to know for certain. This was posted during the testing on Ralph. I believe that the end of a model might be the only exception. I believe that is not really a checkpoint per se and that a write will always occur when the task reaches that point.

What are you seeing for file revision times in your slots directory?

The delayed writing for your app, while still writing them all, is certainly clarified there, given the seeming need to have them all in order to restore after a system restart, for instance.

No. Carefully watching the slot content, two slot files, default.out and rng.state.gz, get modified at exactly the time stamps recorded in the log, i.e. every few minutes, completely ignoring the "at most ..." setting. More interesting (disturbing), there are writes even when nothing is logged, more frequent than the checkpoint log entries and offset from them by minutes. A whole swat of files gets written with a new chk_chk1_1... through 15. I'd rather you kept that in memory too until checkpoint write time. Now I'm even more uncomfortable with the whirring than I was before, when I was just looking at the log frequency.

Edit: Fixed some typos and added the checkpoint log for a longer time frame:

14/04/2009 19.11.59 World Community Grid [checkpoint_debug] result E000490_575B_002a0s009_1 checkpointed
14/04/2009 19.12.08 rosetta@home [checkpoint_debug] result lr5_E_no_rama_04_intra_rep_rlbd_1ubi_SAVE_ALL_OUT_10755_841_0 checkpointed
14/04/2009 19.13.09 rosetta@home [checkpoint_debug] result lr5_E_no_rama_04_intra_rep_rlbd_1ubi_SAVE_ALL_OUT_10755_841_0 checkpointed
14/04/2009 19.15.31 World Community Grid [checkpoint_debug] result R00270_b00f5fa921c9e1c31699fcd04438d3aa_01_000_6 checkpointed
14/04/2009 19.16.49 rosetta@home [checkpoint_debug] result lr5_E_no_rama_04_intra_rep_rlbd_1ubi_SAVE_ALL_OUT_10755_841_0 checkpointed
14/04/2009 19.17.10 World Community Grid [checkpoint_debug] result HFCC_t1_00279513_TrkB_0002_0 checkpointed
14/04/2009 19.20.44 rosetta@home [checkpoint_debug] result lr5_E_no_rama_04_intra_rep_rlbd_1ubi_SAVE_ALL_OUT_10755_841_0 checkpointed
14/04/2009 19.26.52 rosetta@home [checkpoint_debug] result lr5_E_no_rama_04_intra_rep_rlbd_1ubi_SAVE_ALL_OUT_10755_841_0 checkpointed
14/04/2009 19.31.02 rosetta@home [checkpoint_debug] result lr5_E_no_rama_04_intra_rep_rlbd_1ubi_SAVE_ALL_OUT_10755_841_0 checkpointed
14/04/2009 19.34.21 rosetta@home [checkpoint_debug] result lr5_E_no_rama_04_intra_rep_rlbd_1ubi_SAVE_ALL_OUT_10755_841_0 checkpointed
14/04/2009 19.35.51 World Community Grid [checkpoint_debug] result R00270_b00f5fa921c9e1c31699fcd04438d3aa_01_000_6 checkpointed
14/04/2009 19.37.40 rosetta@home [checkpoint_debug] result lr5_E_no_rama_04_intra_rep_rlbd_1ubi_SAVE_ALL_OUT_10755_841_0 checkpointed
14/04/2009 19.38.57 World Community Grid [checkpoint_debug] result HFCC_t1_00279513_TrkB_0002_0 checkpointed
14/04/2009 19.41.24 rosetta@home [checkpoint_debug] result lr5_E_no_rama_04_intra_rep_rlbd_1ubi_SAVE_ALL_OUT_10755_841_0 checkpointed
14/04/2009 19.45.03 rosetta@home [checkpoint_debug] result lr5_E_no_rama_04_intra_rep_rlbd_1ubi_SAVE_ALL_OUT_10755_841_0 checkpointed
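To put numbers on a log like the one above, the gaps between consecutive checkpoint_debug entries can be computed per result. This is a rough Python sketch; the regular expression assumes the exact line layout shown above (day/month/year date, dotted time), and the function name is made up for illustration.

```python
import re
from datetime import datetime

# Timestamp, project name (may contain spaces), result name.
LINE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}\.\d{2}\.\d{2}) (.+?) "
    r"\[checkpoint_debug\] result (\S+) checkpointed$"
)

def checkpoint_gaps(lines):
    """Map each result name to the gaps (in seconds) between its logged checkpoints."""
    last_seen = {}
    gaps = {}
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # ignore lines that are not checkpoint_debug entries
        when = datetime.strptime(m.group(1), "%d/%m/%Y %H.%M.%S")
        result = m.group(3)
        if result in last_seen:
            delta = int((when - last_seen[result]).total_seconds())
            gaps.setdefault(result, []).append(delta)
        last_seen[result] = when
    return gaps

sample = [
    "14/04/2009 19.12.08 rosetta@home [checkpoint_debug] result lr5_E_no_rama_04_intra_rep_rlbd_1ubi_SAVE_ALL_OUT_10755_841_0 checkpointed",
    "14/04/2009 19.13.09 rosetta@home [checkpoint_debug] result lr5_E_no_rama_04_intra_rep_rlbd_1ubi_SAVE_ALL_OUT_10755_841_0 checkpointed",
    "14/04/2009 19.16.49 rosetta@home [checkpoint_debug] result lr5_E_no_rama_04_intra_rep_rlbd_1ubi_SAVE_ALL_OUT_10755_841_0 checkpointed",
]
print(checkpoint_gaps(sample))  # gaps of 61 and 220 seconds for the Rosetta task
```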
Coelum Non Animum Mutant, Qui Trans Mare Currunt
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 60636 - Posted: 14 Apr 2009, 23:03:19 UTC

As I said, the messages report when the application called the API to checkpoint. They do not indicate when data was written to disk. Some checkpoints store multiple files. So, between having multiple files, and having several checkpoints buffered into a single write, that may explain why you see "A whole swat of files gets written..." at once.

What have you set for your write at most setting? Are you expecting the C drive of a Windows PC to go idle and spin down to save power? In my experience, that will never happen, whether BOINC is running or not.

What BOINC version are you running?
What operating system are you using?
Rosetta Moderator: Mod.Sense
Paul D. Buck

Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 60641 - Posted: 15 Apr 2009, 6:55:42 UTC

The BOINC checkpoint interval is ADVICE that is not binding on the science application. The science application is recommended to check with the BOINC client to see whether the minimum interval has passed, and to write a checkpoint only if it has.

However, this is a non-binding recommendation by the client to the science application, and the protocols do not demand that the science application adhere to it.

In other words, science applications can checkpoint as they need and see fit and some do ...

Well behaved Science Applications wait and use the test ... but ...YMMV ...
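To illustrate the "ask first, then write" test, here is a small Python stand-in. In the BOINC C/C++ API the corresponding calls are, as far as I know, boinc_time_to_checkpoint() and boinc_checkpoint_completed(); the class and function below are hypothetical and not part of BOINC. They only show why the interval is advice: nothing stops an application from skipping the question.

```python
class CheckpointGate:
    """Hypothetical stand-in for the client's advice; not a BOINC API."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds
        self.last_write = 0.0

    def time_to_checkpoint(self, now):
        """True once the minimum interval since the last write has passed."""
        return now - self.last_write >= self.min_interval

    def checkpoint_completed(self, now):
        """Restart the interval timer after a write."""
        self.last_write = now


def run(gate, safe_points, well_behaved=True):
    """Return the times at which the application writes a checkpoint."""
    writes = []
    for now in safe_points:
        # A well-behaved app asks first; any other app just writes.
        if not well_behaved or gate.time_to_checkpoint(now):
            writes.append(now)
            gate.checkpoint_completed(now)
    return writes

points = [120, 240, 360, 480, 600, 720]  # seconds at which a checkpoint is possible
print(run(CheckpointGate(300), points))                      # [360, 720]
print(run(CheckpointGate(300), points, well_behaved=False))  # all six points
```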
SekeRob

Joined: 7 Sep 06
Posts: 35
Credit: 19,984
RAC: 0
Message 60642 - Posted: 15 Apr 2009, 8:33:33 UTC - in response to Message 60636.  

As I said, the messages report when the application called the API to checkpoint. They do not indicate when data was written to disk. Some checkpoints store multiple files. So, between having multiple files, and having several checkpoints buffered into a single write, that may explain why you see "A whole swat of files gets written..." at once.

What have you set for your write at most setting? Are you expecting the C drive of a Windows PC to go idle and spin down to save power? In my experience, that will never happen, whether BOINC is running or not.

What BOINC version are you running?
What operating system are you using?

For the answers, see the opening post you split off. Anyone who sets his disk to spin down shortens its life substantially, so no, mine always run at 100%.

Multiple observations, and a comparison with a QMC test job, confirm that this application does not behave as expected, and then some: it also writes files outside the time points at which checkpoints are logged, whereas any other project writes files at the checkpoints.
Coelum Non Animum Mutant, Qui Trans Mare Currunt
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 60644 - Posted: 15 Apr 2009, 12:56:53 UTC

Paul, that was the question, does Rosetta honor the advice from BOINC, or not?

I myself have never been clear on whether BOINC handles this setting very well. For example, if we set it to write at most every 5 minutes, is this for each active task? Or for all tasks the machine is running? If I have an 8 CPU machine running tasks that checkpoint every 8 minutes, how frequently should I expect disk writes to occur?

What about when a task completes? Does that count as a "checkpoint" and wait when the last write was too recent? Or does the write occur immediately to preserve the completed result?

The Mini version of Rosetta, especially in the latest releases, has rather dramatically improved the checkpointing being done. This means that you are much less likely to lose any significant amount of work when BOINC is ended or the machine is turned off. Prior to these enhancements, certain types of tasks could lose more than an hour of crunching without getting results written to disk to preserve them.
Rosetta Moderator: Mod.Sense
Paul D. Buck

Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 60648 - Posted: 15 Apr 2009, 15:03:04 UTC

The only way to know if RaH honors this setting is to have only RaH running and watch the lights ... :)

Well, you can also turn on the checkpoint debug flag and that may tell you also.

As I understand it, each task has its own "timer", so that there are, say, 8 minutes between "allowed" checkpoint writes. As a practical matter, on an 8 core system you will still see one write every minute or so on average. Since the BOINC client also writes the state file for a whole lot of reasons, it is almost impossible to quiet the system so that the disk can spin down.
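A quick Python sketch of that arithmetic: merge the write times of several tasks that each write on their own timer and look at the gaps system-wide. The evenly spread start offsets are an assumption made to keep the example simple; real tasks drift and bunch up, so this shows the average only.

```python
def system_write_times(n_tasks, interval, horizon):
    """Merged write times (minutes) when each of n_tasks writes every
    `interval` minutes, with start offsets spread evenly (an assumption)."""
    times = []
    for task in range(n_tasks):
        offset = task * interval / n_tasks
        t = offset + interval
        while t <= horizon:
            times.append(t)
            t += interval
    return sorted(times)

writes = system_write_times(8, 8, 24)  # 8 tasks, 8 minutes each, 24-minute window
gaps = [b - a for a, b in zip(writes, writes[1:])]
print(sum(gaps) / len(gaps))  # 1.0 minute between writes on average
```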

This is one of the issues that I have been trying to get attention applied to, because the number of "wide" systems is increasing and there are a number of interlocking issues that arise because of the things going on internally. For example, BOINC is deciding what tasks to run as often as 5-6 times a minute ... I grant that the time it takes seems to be "minor", but why is the model run this moment so much better than the one calculated a few seconds ago? ... I am still waiting on an explanation for that which makes sense.
Mod.Sense
Volunteer moderator

Send message
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 60649 - Posted: 15 Apr 2009, 15:35:24 UTC

Paul, see the third post to this thread. I've found the checkpoint debug messages report the attempts to checkpoint, not the physical flush to disk.
Rosetta Moderator: Mod.Sense
SekeRob

Joined: 7 Sep 06
Posts: 35
Credit: 19,984
RAC: 0
Message 60654 - Posted: 15 Apr 2009, 19:08:46 UTC - in response to Message 60649.  

Paul, see the third post to this thread. I've found the checkpoint debug messages report the attempts to checkpoint, not the physical flush to disk.

That is diametrically opposed to my findings.

I tested with a WCG beta of a new project, setting "write to disk at most" to 0 seconds: it logs a checkpoint every 7 seconds. Then I changed the setting to 60 seconds and restarted the client: log entries appear shortly after 60 seconds have passed. Changed to 5 minutes: checkpoint log entries appear shortly after 5 minutes have passed. Still client 6.6.23. The slot file timestamps consistently followed the log entry times.
Coelum Non Animum Mutant, Qui Trans Mare Currunt
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 60655 - Posted: 15 Apr 2009, 19:21:00 UTC

I apologize Forza. I wasn't clear on what your observations were. Perhaps it has changed with the 6.6 version as well.

Which WCG project checkpoints every 7 seconds?
Rosetta Moderator: Mod.Sense
Mike Tyka

Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 60658 - Posted: 15 Apr 2009, 22:26:14 UTC

Points about checkpointing.

a) The checkpoint interval affects the flush-to-disk of checkpoints. These can be multiple files! The log (as far as I understand) should reflect the flushing.
b) Rosetta tries not to flush to disk more often than the setting allows, but will sometimes flush more frequently to prevent excessive memory buildup.
c) When Rosetta actually finishes a decoy it will write to disk. This is completely unaffected by checkpointing and cannot be affected by user settings.
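As a rough illustration of how (a), (b) and (c) interact, here is a toy Python model. It is not Rosetta's actual code: the class, the byte threshold and the choice to restart the interval timer after every write are all assumptions made for the sketch.

```python
class CheckpointBuffer:
    """Hypothetical model of points (a)-(c); not Rosetta's implementation."""

    def __init__(self, min_interval, max_buffered_bytes):
        self.min_interval = min_interval              # the user's setting, seconds
        self.max_buffered_bytes = max_buffered_bytes  # assumed memory limit
        self.last_flush = 0.0
        self.pending = []   # sizes of checkpoints held in memory
        self.flushes = []   # (time, reason) for every disk write

    def _flush(self, now, reason):
        self.flushes.append((now, reason))
        self.pending.clear()
        self.last_flush = now  # assumption: any write restarts the timer

    def checkpoint(self, now, size):
        self.pending.append(size)
        if now - self.last_flush >= self.min_interval:
            self._flush(now, "interval")   # (a) the setting allows a write
        elif sum(self.pending) > self.max_buffered_bytes:
            self._flush(now, "memory")     # (b) too much held in memory

    def decoy_finished(self, now):
        self._flush(now, "decoy")          # (c) always written, setting ignored


buf = CheckpointBuffer(min_interval=300, max_buffered_bytes=100)
buf.checkpoint(60, 40)
buf.checkpoint(120, 40)
buf.checkpoint(180, 40)   # 120 bytes pending, over the limit
buf.checkpoint(500, 40)   # more than 300 s since the last flush
buf.decoy_finished(520)
print(buf.flushes)  # [(180, 'memory'), (500, 'interval'), (520, 'decoy')]
```

In this model two of the three writes happen sooner than the 5-minute setting would suggest, which is the kind of pattern a user watching the slot directory would notice.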

Which jobs are misbehaving on your machine?

Mike

http://beautifulproteins.blogspot.com/
http://www.miketyka.com/
SekeRob

Joined: 7 Sep 06
Posts: 35
Credit: 19,984
RAC: 0
Message 60663 - Posted: 16 Apr 2009, 9:27:14 UTC - in response to Message 60655.  

I apologize Forza. I wasn't clear on what your observations were. Perhaps it has changed with the 6.6 version as well.

Which WCG project checkpoints every 7 seconds?

ALL WCG projects behave like that, and most projects everywhere behave like that, making the write call; there has been no change between client 5, when the <options> for checkpoint_debug was added, and client 6. It so happened that this anonymous beta had a very short checkpoint interval, most convenient for testing. I commonly document this in WCG's FAQ section for folk who like to minimize suspension/progress loss.

Simply put, if someone sets 5 minutes or 10 minutes or the max of 999 seconds, the design is to have a log entry when a (significant/recovery point) disk save is made. From the three points above I gather this is not how miniRosetta is coded. For a small piece, the checkpoint is logged and written when it actually occurs, not when the minimum interval set in the client has passed. Beyond that, there are writes that are not logged. Personally, I think no one is interested in the log/write of the 2 small files, but rather in when the large dump is made, for that seems critical to establishing the recovery point. Anyway, now I know.
Coelum Non Animum Mutant, Qui Trans Mare Currunt
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 60666 - Posted: 16 Apr 2009, 14:39:19 UTC
Last modified: 16 Apr 2009, 14:42:22 UTC

Forza, thanks for studying this, and taking the time to report it (even through all of the questions).

I believe you may have part of it right and part a bit off. You see Rosetta has two times it may try to save. One is on a checkpoint, and checkpointing generally occurs every 15 minutes or so. The other is when a model is completed. I believe it is this second point that makes it a bit difficult to observe.

Different proteins and analysis techniques will produce completed models at rather dramatically different rates. Some take just 5 or 10 minutes, and others can take 2 hours.

So if you happen to be observing a protein that is running rapid models, it will be doing a lot of what Mike described as "c", and be writing perhaps more often than the preference. But if you happen to observe a protein with longer models, it will be doing periodic checkpoints which are not written to disk until the preference is reached ("a") or the amount of memory consumed by hanging on to them begins to get excessive and would begin slowing performance ("b").

I believe this "b" area is unique as well. Rosetta has significant memory requirements (minimum now recommended is 512MB), and to operate well within this footprint, there are cases when checkpoints are written more often. So, again, each protein differs and some will only require minor amounts of memory to hold the checkpoint data until the next write interval is permitted.

So, Rosetta is not blindly disregarding your preference. In fact it is working hard to honor it whenever possible. But it may not be possible as often as it is for other projects.
Rosetta Moderator: Mod.Sense
SekeRob

Joined: 7 Sep 06
Posts: 35
Credit: 19,984
RAC: 0
Message 60819 - Posted: 25 Apr 2009, 15:31:24 UTC - in response to Message 60666.  
Last modified: 25 Apr 2009, 15:35:59 UTC

All parlance aside, something changed in the client from 6.6.20, in a ludicrous way, and whilst reported at the Berkeley developers forums, no-one is home even after a repeat bump. Anyway, if you now set the client to permit 5 minutes write times and you have a quad core, it only allows 1 checkpoint per 20 minutes. So, here's how it now looks for minirosetta 1.54 on 6.6.24:

2009/04/25 16:00:16 rosetta@home Starting task frb_0_8_mike_chosen_cst_hb_t313__IGNORE_THE_REST_1BG2A_4_11064_44_0 using minirosetta version 154
2009/04/25 16:21:27 rosetta@home [checkpoint_debug] result frb_0_8_mike_chosen_cst_hb_t313__IGNORE_THE_REST_1BG2A_4_11064_44_0 checkpointed
2009/04/25 16:45:18 rosetta@home [checkpoint_debug] result frb_0_8_mike_chosen_cst_hb_t313__IGNORE_THE_REST_1BG2A_4_11064_44_0 checkpointed
2009/04/25 17:09:59 rosetta@home [checkpoint_debug] result frb_0_8_mike_chosen_cst_hb_t313__IGNORE_THE_REST_1BG2A_4_11064_44_0 checkpointed
2009/04/25 17:34:47 rosetta@home [checkpoint_debug] result frb_0_8_mike_chosen_cst_hb_t313__IGNORE_THE_REST_1BG2A_4_11064_44_0 checkpointed

mike chosen to only write about every 24 minutes. The slot file timestamps of the fastrelax chkpnt sets do not change in between ;>)
Coelum Non Animum Mutant, Qui Trans Mare Currunt
Paul D. Buck

Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 60822 - Posted: 25 Apr 2009, 18:45:25 UTC - in response to Message 60819.  

All parlance aside, something changed in the client from 6.6.20, in a ludicrous way, and whilst reported at the Berkeley developers forums, no-one is home even after a repeat bump. Anyway, if you now set the client to permit 5 minutes write times and you have a quad core, it only allows 1 checkpoint per 20 minutes. So, here's how it now looks for minirosetta 1.54 on 6.6.24:

2009/04/25 16:00:16 rosetta@home Starting task frb_0_8_mike_chosen_cst_hb_t313__IGNORE_THE_REST_1BG2A_4_11064_44_0 using minirosetta version 154
2009/04/25 16:21:27 rosetta@home [checkpoint_debug] result frb_0_8_mike_chosen_cst_hb_t313__IGNORE_THE_REST_1BG2A_4_11064_44_0 checkpointed
2009/04/25 16:45:18 rosetta@home [checkpoint_debug] result frb_0_8_mike_chosen_cst_hb_t313__IGNORE_THE_REST_1BG2A_4_11064_44_0 checkpointed
2009/04/25 17:09:59 rosetta@home [checkpoint_debug] result frb_0_8_mike_chosen_cst_hb_t313__IGNORE_THE_REST_1BG2A_4_11064_44_0 checkpointed
2009/04/25 17:34:47 rosetta@home [checkpoint_debug] result frb_0_8_mike_chosen_cst_hb_t313__IGNORE_THE_REST_1BG2A_4_11064_44_0 checkpointed

mike chosen to only write about every 24 minutes. The slot file timestamps of the fastrelax chkpnt sets do not change in between ;>)

I saw a note by someone indicating that they changed the checkpoint rule to spread it on multi-core systems so the time is now multiplied by the number of cores. So, on a 4 core system a 5 minute interval would indeed be 20 min per individual task.

The point being that they are at LONG last starting to recognize that on "wider" systems some of the options and parameters they opted for were sub-optimal for newer systems. Something I have been harping on for some time ... sigh ...

So, if you go back to the 60 second default (I think that was the default) you would see a write about once every 4 minutes per task, or once a minute ...
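The arithmetic, as a Python sketch. It assumes the rule really is "preference multiplied by number of cores", as described in the note mentioned above; that rule is taken from this thread and not checked against the client source, and the function names are made up.

```python
def per_task_interval(preference_s, n_cpus):
    """Effective minimum gap (seconds) between writes for one task, assuming
    the client multiplies the preference by the number of cores."""
    return preference_s * n_cpus

def average_system_gap(preference_s, n_cpus):
    """Average gap (seconds) between writes across n_cpus tasks that each
    write as often as allowed; the multiplication cancels out."""
    return per_task_interval(preference_s, n_cpus) / n_cpus

print(per_task_interval(300, 4) / 60)  # 20.0 minutes per task on a quad
print(per_task_interval(60, 4) / 60)   # 4.0 minutes per task on a quad
print(average_system_gap(60, 4))       # 60.0 seconds between writes system-wide
```

The 20-minute figure is in the same range as the roughly 24-minute gaps in the log quoted above, since a write can only happen at the first checkpoint attempt after the interval has passed.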

8 core systems are becoming more common, 4 cores are really the default mid-range, and 16 core systems are available (though I have not seen anyone bragging yet that they have one ... :)
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 60831 - Posted: 26 Apr 2009, 15:03:46 UTC

Paul, is it just me? Or does minutes per task blow away the whole rationale for wanting to limit writes in the first place? I mean to say that it should be minutes since BOINC last wrote anything, systemwide, not on a per task basis.

Even with this "CPUs times write time" rule, you're only getting the desired function ON AVERAGE, assuming that all tasks are attempting to checkpoint all the time (which could not be further from reality).
Rosetta Moderator: Mod.Sense
©2024 University of Washington
https://www.bakerlab.org