linux script - stopatcheckpoint

Message boards : Number crunching : linux script - stopatcheckpoint

Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 37612 - Posted: 8 Mar 2007, 11:16:40 UTC
Last modified: 8 Mar 2007, 11:29:34 UTC

hi,

This is a bash script to stop BOINC soon after the running task checkpoints. One reason for writing it was to demonstrate the usefulness of the boinc_cmd utility that comes with the Linux implementation of BOINC. The other motivation is described under Background, below.

Background

As you probably know, Rosetta checkpoints its results rarely compared with many other BOINC projects. This is a nuisance because when you stop BOINC for any reason, work done since the last checkpoint is lost. BOINC and Rosetta between them recover when BOINC is restarted, but the lost work has to be recalculated.

Rosetta and CPDN are the two projects I run where this is an issue. On CPDN it is a particular issue, as the project encourages users to back up their work every week or so, and doing that involves stopping BOINC. On both Rosetta and CPDN, it may be wise to stop BOINC before doing computationally heavy work, as this might cause problems for the BOINC application. You may also want to shut down the machine to go on holiday, install a new DVD drive, etc.

This script waits for the next checkpoint and exits. One use would be to reboot or shutdown after the next checkpoint:

$ ~boinc/stopatcheckpoint; reboot

$ ~boinc/stopatcheckpoint; halt

Another use would be to run some other work and then restart BOINC, as in the following command line:

$ ~boinc/stopatcheckpoint; /pi/timpi 22 22; /etc/init.d/boinc start

(Note that I've got a Debian-style start command for BOINC here; users of other systems will need to work out their own version of this.)

The script

Copy and paste the following into a file named stopatcheckpoint in the BOINC directory. The script needs to be run from there so that the boinc_cmd utility picks up its password. If you need to run it from elsewhere, either copy the password file across, or build the password into the script.
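A minimal sketch of the "build the password in" option follows. It is not part of the original script. It assumes that your boinc_cmd accepts a --passwd option and that the password lives in a file called gui_rpc_auth.cfg in the BOINC directory; check your version's usage message before relying on either. The names BOINC_DIR and boinc_rpc, and the default path, are made up for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of the original stopatcheckpoint script.
# BOINC_DIR is an assumed variable naming the directory that holds
# boinc_cmd and the password file; /home/boinc is only a placeholder default.
: "${BOINC_DIR:=/home/boinc}"

boinc_rpc() {
    # Read the password from the file and pass it explicitly, so the
    # caller's working directory no longer matters.
    # Assumption: boinc_cmd supports --passwd and the file is gui_rpc_auth.cfg.
    "$BOINC_DIR/boinc_cmd" --passwd "$(cat "$BOINC_DIR/gui_rpc_auth.cfg")" "$@"
}
```

With something like this in place, boinc_rpc --get_results and boinc_rpc --quit could stand in for the ./boinc_cmd calls. Bear in mind that a password passed on the command line is visible to other users in the process list while the command runs.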

After pasting the script into your file system, you need to make it executable, i.e. some variation of

chmod 755 stopatcheckpoint

#!/bin/sh
#
# stopatcheckpoint
#
# Author River 2007
#
# Copyright but may be distributed under the GPL
# - see http://www.gnu.org/licenses/gpl.txt
#
# River asserts his moral right to be identified as the Author

echo "wait for BOINC checkpoint..."
# boinc_cmd picks up its password from the BOINC directory
cd ~boinc || exit 1

# remember the current non-zero checkpoint line
prv=`./boinc_cmd --get_results|grep -v 0.0000|grep checkpoint`

# loop for as long as that exact line still appears in the results
while (./boinc_cmd --get_results|grep -F "$prv">/dev/null) ; do
    clear
    # -E is needed so that | means "or" rather than a literal bar
    ./boinc_cmd --get_results|grep -v 0.0000|grep -E "time|----"
    echo "waiting for change to:$prv   ..."
    sleep 3
done

echo
echo "*** new checkpoint ***"
echo
./boinc_cmd --get_results|grep -v 0.0000|grep -E "time|----"
echo
echo "stopping BOINC..."
echo

./boinc_cmd --quit



Sample output

While waiting the screen displays some info on every WU on the machine:

1) -----------
   checkpoint CPU time: 25081.809985
   current CPU time: 26803.275281
   estimated CPU time remaining: 57799.145096
2) -----------
   estimated CPU time remaining: 78446.605958
waiting for change to:   checkpoint CPU time: 25081.809985   ...


Here, WU 1 is running and WU 2 is waiting to run. If a WU has finished, its final CPU time is shown. If there is more than one checkpoint time in the list, there may be problems, as described below.

The current CPU time and time remaining should be changing every few seconds.

After the checkpoint is reached the screen is left with a display like:

1) -----------
   checkpoint CPU time: 25081.809985
   current CPU time: 26878.500846
   estimated CPU time remaining: 57852.131426
2) -----------
   estimated CPU time remaining: 78446.605958
waiting for change to:   checkpoint CPU time: 25081.809985   ...

*** new checkpoint ***

1) -----------
   checkpoint CPU time: 26879.487696
   current CPU time: 26881.467395
   estimated CPU time remaining: 55478.468342
2) -----------
   estimated CPU time remaining: 78446.605958

stopping BOINC...



Where this script does and doesn't work

This script works well on a single-CPU machine with only one project, and when there is only one task that has started. There can be any number of completed tasks waiting for upload, and any number of unstarted tasks waiting to run.

This script gets confused if there is more than one running / started task, for example:

- on a multi-CPU box; and

- if BOINC starts one task without finishing another, as is normal on a multi-project box, and as may happen on a single-project box since BOINC v 5.8 introduced memory management.
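One way to notice these cases before waiting is to count the tasks that report a non-zero checkpoint time. The following is only a sketch, not part of the original script: the function name count_checkpointed is made up, and it assumes the --get_results output has the same layout as the sample output earlier in this post.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original stopatcheckpoint script.
# Reads --get_results style text on stdin and prints how many tasks
# report a checkpoint CPU time greater than zero.
count_checkpointed() {
    awk '/checkpoint CPU time:/ && $NF + 0 > 0 { n++ } END { print n + 0 }'
}

# Demonstration on the sample output shown in this post.
sample='1) -----------
   checkpoint CPU time: 25081.809985
   current CPU time: 26803.275281
   estimated CPU time remaining: 57799.145096
2) -----------
   estimated CPU time remaining: 78446.605958'

n=$(printf '%s\n' "$sample" | count_checkpointed)
if [ "$n" -gt 1 ]; then
    echo "stopatcheckpoint: $n started tasks, expected at most one" >&2
fi
```

In real use the function would be fed from ./boinc_cmd --get_results rather than from a sample string, and the script could refuse to continue when the count is above one.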

Hope this is useful. The script is in Bash (Unix command line), so it will not port to Windows without more effort than I am going to put in.

River~~
Profile River~~
Message 37618 - Posted: 8 Mar 2007, 14:24:19 UTC

An additional limitation of this script is that it does not wait for a task that has never checkpointed at all. That is, it only waits for tasks that had already passed their first checkpoint when the script was called.
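A task in that state could be spotted by looking for CPU time that has been used while the checkpoint time is still zero. This is a sketch under assumptions, not part of the original script: it assumes the unfiltered --get_results output lists a "checkpoint CPU time" and a "current CPU time" line under each numbered task header, as in the sample output, and the function name count_uncheckpointed is made up.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original stopatcheckpoint script.
# Reads unfiltered --get_results style text on stdin and prints the number
# of tasks that have used CPU time but still show a zero checkpoint time.
count_uncheckpointed() {
    awk '
        # Close off the task read so far, counting it if it has run
        # without checkpointing, then reset for the next task.
        function tally() { if (cur > 0 && chk == 0) n++; cur = 0; chk = 0 }
        /^ *[0-9]+\) -+/        { tally() }          # task header, e.g. "1) ----"
        /checkpoint CPU time:/  { chk = $NF + 0 }
        /current CPU time:/     { cur = $NF + 0 }
        END                     { tally(); print n + 0 }
    '
}
```

A caller could keep polling while this prints a non-zero count, and only then go on to the normal wait for a change in the checkpoint line.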

R~~
Profile River~~
Message 39243 - Posted: 10 Apr 2007, 19:35:19 UTC - in response to Message 37612.  

...The script is in Bash (Unix command line) so will not port to windows without more effort than I am going to put in...


But there is always someone willing to take an idea further. Cristophe has posted an .exe file for Windows users that does roughly the same thing. Nice one C!

See this post

R~~
Kenneth Larsen
Joined: 17 Sep 05
Posts: 3
Credit: 112,217
RAC: 0
Message 39342 - Posted: 13 Apr 2007, 12:00:16 UTC






©2024 University of Washington
https://www.bakerlab.org