Message boards : Number crunching : Restart failures (on Linux at least)
Author | Message |
---|---|
Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
As I have mentioned before, I seem to have issues when bringing up a system and restarting BOINC/Rosetta from checkpoints. The power company did me a big favor today: when I got home from work my clocks were flashing "12:00" and all my processors were off. As expected, when I started bringing systems back up I saw a number of "computational errors". With the last few systems I took some time to dig into it a little, and I think I understand what may be going on. However, without the source code I can't be sure. With Linux, as with most UNIX systems, when you write to a file the data is buffered in memory for a period of time. This is done for performance reasons, and the delay is normally a matter of seconds. I suspect that when the systems go down hard, the outage is catching some of them in the middle of writing a checkpoint, and the checkpoint gets trashed. I am currently processing on more than 40 cores, so that leaves a pretty big window of opportunity. Can anyone tell me whether BOINC/Rosetta follows a checkpoint write with a flush() (and ideally an fsync()), so that all data is written to disk without further delay? I would also be interested in whether the signal handler invoked when you do a "shutdown connected client" has any logic in it to ensure that checkpoints in progress are complete before killing the work unit. Thanks |
mikey Joined: 5 Jan 06 Posts: 1898 Credit: 12,723,752 RAC: 682 |
"As I have mentioned before I seem to have issues when bringing up a system and restarting BOINC/Rosetta from checkpoints." You might ask this over on the BOINC developers email list; they are the writers of BOINC, and the projects just run it. Paul Buck might know the answer too. |
Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
|
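
For anyone wondering what a crash-safe checkpoint write looks like in practice, here is a minimal sketch of the usual pattern on Linux: write to a temporary file, flush and fsync it, then rename it over the old checkpoint. This is only an illustration of the general technique raised in the first post, not what BOINC or Rosetta actually does (the thread does not establish that either way). The function name `write_checkpoint_atomically` and the file names are hypothetical, and the directory fsync assumes a POSIX system.

```python
import os
import tempfile


def write_checkpoint_atomically(path: str, data: bytes) -> None:
    """Hypothetical helper: replace `path` so a crash leaves the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Keep the temporary file in the same directory so the rename
    # never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # user-space buffer -> kernel page cache
            os.fsync(tmp.fileno())  # ask the kernel to commit the data to the device
        os.replace(tmp_path, path)  # atomic rename over the previous checkpoint
    except BaseException:
        # On any failure, remove the partial temporary file if it still exists.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory so the rename itself survives a power loss (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that flush() alone only hands the data to the kernel; the fsync() call is what requests the write to disk. With the rename step, a power cut during the write should leave the previous checkpoint untouched instead of a half-written one.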
©2025 University of Washington
https://www.bakerlab.org