Issue with checkpointing.

Message boards : Number crunching : Issue with checkpointing.

Aegis Maelstrom

Joined: 29 Oct 08
Posts: 61
Credit: 2,137,555
RAC: 0
Message 63640 - Posted: 9 Oct 2009, 20:35:59 UTC

Hi there,

I am writing this report because I have seen a problem with checkpointing, unfortunately not for the first time.

Work Unit: lr5_combine_smooth_torsion_it06_A_rlbd_1cg5_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_15145_49

It was computed on a portable version of good old BOINC 5.10.45 prepared by my team, BOINC@Poland (sorry, I can't use a non-portable version, but this software has been heavily used before).

The WU has obvious problems with checkpointing.

It was computed on one computer, which finished 8 models in almost 3 hours. The progress was something like 4x.xx%.

After a restart on another computer, the graphics app showed me Model 0, Step 0. Suddenly the progress dropped to around 25%, and now Model 0, Step 25 is being computed.
It looks like all the work has been wasted.

The stderr.txt file shows logs of two runs of this Work Unit - one in the morning and one right now (in the evening). See:


[2009-10- 9 6:29:47:] :: BOINC:: Initializing ... ok.
[2009-10- 9 6:29:47:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev32257.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/yfsong_lr5_combine_smooth_torsion_it06_A.zip
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/lr5_1cg5.out.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Fullatom mode ..
# cpu_run_time_pref: 21600
Fullatom mode ..
Fullatom mode ..
Fullatom mode ..
Fullatom mode ..
Fullatom mode ..
Fullatom mode ..
Fullatom mode ..
Fullatom mode ..
Fullatom mode ..
[2009-10- 9 22:16: 1:] :: BOINC:: Initializing ... ok.
[2009-10- 9 22:16: 1:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev32257.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/yfsong_lr5_combine_smooth_torsion_it06_A.zip
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/lr5_1cg5.out.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Fullatom mode ..
# cpu_run_time_pref: 21600
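
For anyone who wants to check their own stderr.txt the same way, here is a minimal sketch in Python that splits the log into runs at each "BOINC:: Initializing" banner and reports when each run started. The function names are made up for illustration. It also counts the "Fullatom mode .." lines per run, but only as a rough activity indicator: whether each such line corresponds to one model is an assumption, not something the log states.

```python
import re

# Matches the banner that opens each run in the log above, e.g.
# "[2009-10- 9 22:16: 1:] :: BOINC:: Initializing ... ok."
RUN_START = re.compile(r"^\[(?P<stamp>[^\]]+)\]\s*::\s*BOINC::\s*Initializing")


def split_runs(log_text):
    """Return one dict per run: its start stamp and the lines that follow."""
    runs = []
    for line in log_text.splitlines():
        match = RUN_START.match(line.strip())
        if match:
            runs.append({"start": match.group("stamp").strip(), "lines": []})
        elif runs:
            runs[-1]["lines"].append(line.strip())
    return runs


def summarize(log_text):
    """Yield (start stamp, number of 'Fullatom mode' lines) for each run."""
    for run in split_runs(log_text):
        fullatom = sum(
            1 for entry in run["lines"] if entry.startswith("Fullatom mode")
        )
        yield run["start"], fullatom
```

Typical use would be something like `for start, count in summarize(open("stderr.txt").read()): print(start, count)`. More than one run in the output means the task was restarted at least once.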


I am pretty sure I have seen this bug before, so it is probably not specific to this particular WU.

Can anyone confirm this issue and deliver a solution?

In a few days I will see the results of this WU, i.e. how many models were crunched and how many headers with the number of results appear (see the known bug with multiple headers in the result file).

Have a nice weekend and keep rocking.

Best from Warsaw. :)
a.m.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 63643 - Posted: 10 Oct 2009, 15:38:04 UTC

Checkpoints are taken at the end of each model at the very minimum. So, after initializing, your graphic should have shown the 8 models. Is it possible the client was unable to write the checkpoint to disk for any reason? What is your setting for "write to disk at most... seconds"?
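
To illustrate the idea of a checkpoint at the end of each model, and why a failed disk write would lose the progress, here is a rough sketch in Python. It is not how minirosetta or BOINC actually implements checkpointing; the checkpoint file, its JSON format, and all function names are hypothetical. The state is written to a temporary file and then renamed over the old checkpoint, so a restart sees either the previous checkpoint or the new one, never a half-written file.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    Raises OSError if the disk write fails, so the caller finds out that
    no checkpoint was actually taken.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # The rename is atomic on POSIX when both paths share a filesystem.
        os.replace(tmp_path, path)
    except OSError:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no usable checkpoint exists."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except (OSError, ValueError):
        return {"models_done": 0}


def run_models(path, total_models, compute_model):
    """Resume from the last checkpoint and save again after every model."""
    state = load_checkpoint(path)
    for index in range(state["models_done"], total_models):
        compute_model(index)
        state["models_done"] = index + 1
        save_checkpoint(path, state)
    return state
```

In this sketch, a task that had finished 8 models would resume at model 8 after a restart. If it comes back at model 0 instead, the checkpoint file was either never written or could not be read from where the task was restarted.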

The double header in outfile issue was different from what you are describing here. Yes, if you restart a task, you will see some of the "starting up" type of messages more than once. The other issue was where the actual result summary showing number of models etc. appeared more than once.
Rosetta Moderator: Mod.Sense

©2024 University of Washington
https://www.bakerlab.org