Restart failures (on Linux at least)

Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 66861 - Posted: 12 Jul 2010, 19:24:44 UTC

As I have mentioned before, I seem to have issues when bringing up a system and restarting BOINC/Rosetta from checkpoints.

The power company did me a big favor today and when I got home from work my clocks were flashing "12:00" and all my processors were off.

As expected, when I started bringing systems back up I saw a number of "computational errors". With the last few systems I took some time to dig into it a little, and I think I understand what may be going on.

However, without the source code I can't be sure.

With Linux, like most UNIX systems, when you write to a file the data is first buffered in memory for a period of time. This delay is typically on the order of seconds (the exact figure depends on the filesystem and the kernel writeback settings) and exists for performance reasons.

I suspect that when the systems go down hard, the power loss catches some of them in the middle of writing a checkpoint, and the checkpoint file gets trashed.

I am currently processing on more than 40 cores, each writing its own checkpoints, so that leaves a pretty big window of opportunity.

Can anyone tell me whether BOINC/Rosetta follows a checkpoint write with a flush() and an fsync()? A flush() on its own only hands the data to the kernel; it is the fsync() that causes it to be written to disk without further delay.
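For readers unfamiliar with the pattern being asked about, here is a minimal sketch of a crash-safe checkpoint write on a POSIX system. It is not taken from the BOINC or Rosetta source, which the poster did not have; the function name `write_checkpoint` and the `.ckpt-` prefix are made up for illustration. The idea is to write to a temporary file, force it to disk, and only then rename it over the old checkpoint, so a power cut leaves either the old file or the new one, never a half-written one.

```python
import os
import tempfile


def write_checkpoint(path: str, data: bytes) -> None:
    """Replace `path` with `data` so that a crash never leaves a partial file.

    Hypothetical helper for illustration; assumes a POSIX filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that the rename
    # below stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # user-space buffer -> kernel page cache
            os.fsync(f.fileno())  # kernel page cache -> storage device
        os.replace(tmp_path, path)  # atomic swap of old and new checkpoint
    except BaseException:
        # Clean up the temporary file if anything failed before the swap.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory as well so the rename itself survives a power loss.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Whether the data truly reaches the platters still depends on the drive honoring the sync request, so this narrows the window rather than guaranteeing anything about a given piece of hardware.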

I would also be interested to know whether the signal handler invoked when you do a "shutdown connected client" has any logic in it to ensure that checkpoints in progress are complete before killing the work unit.
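The kind of logic being asked about could look roughly like the sketch below. It is an illustration only, not BOINC's actual shutdown code; the class `CheckpointingWorker` and its method names are hypothetical. The signal handler does nothing but set a flag, and the main loop looks at that flag only after a checkpoint has been written in full, so a polite shutdown request can never interrupt a checkpoint halfway.

```python
import signal


class CheckpointingWorker:
    """Hypothetical worker that finishes the current checkpoint before stopping."""

    def __init__(self, save_checkpoint):
        self.save_checkpoint = save_checkpoint  # callable taking the step number
        self.stop_requested = False

    def handle_stop(self, signum, frame):
        # Do no I/O here: only record that a shutdown was requested.
        self.stop_requested = True

    def install(self):
        # Route SIGTERM and SIGINT to the flag-setting handler.
        signal.signal(signal.SIGTERM, self.handle_stop)
        signal.signal(signal.SIGINT, self.handle_stop)

    def run(self, total_steps):
        """Return the number of steps whose checkpoints were fully written."""
        done = 0
        for step in range(total_steps):
            # ... one unit of computation would go here ...
            self.save_checkpoint(step)  # runs to completion even if a signal arrives
            done += 1
            if self.stop_requested:     # checked only between checkpoints
                break
        return done
```

A worker would call `install()` once at startup and then `run(...)`. Note that this only helps with catchable signals such as SIGTERM; it cannot help with SIGKILL or with the power simply going out, which is the case that started this thread.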

Thanks

mikey
Joined: 5 Jan 06
Posts: 1898
Credit: 12,723,752
RAC: 682
Message 66875 - Posted: 13 Jul 2010, 10:02:07 UTC - in response to Message 66861.  

You might ask this on the BOINC developers email list; they are the writers of BOINC, and the projects just run it. Paul Buck might know the answer too.
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 66882 - Posted: 13 Jul 2010, 17:51:37 UTC
Last modified: 13 Jul 2010, 18:19:31 UTC

Since you have plenty of machines running, why not buy those emergency batteries that send a shutdown signal when the power goes out? I forgot what they are called, but I doubt they are expensive.

Edit: the name is UPS (uninterruptible power supply).



©2025 University of Washington
https://www.bakerlab.org