Workunits getting stuck and aborting

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 36615 - Posted: 12 Feb 2007, 17:54:02 UTC - in response to Message 36588.  
Last modified: 12 Feb 2007, 18:04:49 UTC

===================
From Thomas Posted 11 Feb 2007 7:07:24 UTC
Original post was deleted due to it throwing out the thread's margins.
===================

Please post any relevant observations on your side. Thank you for your help.


What specifically are you looking for? I just found one more system that is stuck on one of the DOC_* workunits.

~> ps -ef | egrep '(rosetta|boinc)'
user 6012 5469 0 Jan08 pts/1 00:00:18 ./boinc
user 5671 6012 6 Feb09 pts/1 03:00:40 rosetta_5.45_i686-pc-linux-gnu
dd 1UGH 1 -s 1UGH.rppk -dock -pose -dock_mcm -randomize1 -randomize2
-unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf
-norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all
-nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800
-watchdog -constant_seed -jran 2083576
user 5672 5671 0 Feb09 pts/1 00:00:00 rosetta_5.45_i686-pc-linux-gnu
dd 1UGH 1 -s 1UGH.rppk -dock -pose -dock_mcm -randomize1 -randomize2
-unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf
-norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all
-nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800
-watchdog -constant_seed -jran 2083576
user 5673 5672 0 Feb09 pts/1 00:00:00 rosetta_5.45_i686-pc-linux-gnu
dd 1UGH 1 -s 1UGH.rppk -dock -pose -dock_mcm -randomize1 -randomize2
-unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf
-norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all
-nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800
-watchdog -constant_seed -jran 2083576
user 5674 5672 0 Feb09 pts/1 00:00:00 rosetta_5.45_i686-pc-linux-gnu
dd 1UGH 1 -s 1UGH.rppk -dock -pose -dock_mcm -randomize1 -randomize2
-unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf
-norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all
-nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800
-watchdog -constant_seed -jran 2083576
user 11672 11647 0 22:34 pts/7 00:00:00 ./boincmgr
user 11681 11645 0 22:36 pts/6 00:00:00 /bin/grep -E (rosetta|boinc)


~/boinc/BOINC/slots/0> ls -rlast
total 966
0 drwxr-xr-x 5 user users 120 2006-09-13 19:43 ..
4 -rw-r--r-- 1 user users 96 2007-02-09 01:42 rosetta_5.45_i686-pc-linux-gnu
4 -rw-r--r-- 1 user users 82 2007-02-09 01:42 Fij_bp_AM.dat.gz
4 -rw-r--r-- 1 user users 79 2007-02-09 01:42 Fij_AM.dat.gz
4 -rw-r--r-- 1 user users 91 2007-02-09 01:42 Equil_bp_AM.stddev.dat.gz
4 -rw-r--r-- 1 user users 89 2007-02-09 01:42 Equil_bp_AM.mean.dat.gz
4 -rw-r--r-- 1 user users 88 2007-02-09 01:42 Equil_AM.stddev.dat.gz
4 -rw-r--r-- 1 user users 86 2007-02-09 01:42 Equil_AM.mean.dat.gz
4 -rw-r--r-- 1 user users 92 2007-02-09 01:42 DunbrackBBDepRots12.dat.gz
4 -rw-r--r-- 1 user users 85 2007-02-09 01:42 disulf_jumps.dat.gz
4 -rw-r--r-- 1 user users 75 2007-02-09 01:42 bb_hbW.gz
4 -rw-r--r-- 1 user users 88 2007-02-09 01:42 bbdep02.May.sortlib.gz
4 -rw-r--r-- 1 user users 82 2007-02-09 01:42 avgE_from_pdb.gz
4 -rw-r--r-- 1 user users 75 2007-02-09 01:42 sc_hbW.gz
4 -rw-r--r-- 1 user users 86 2007-02-09 01:42 sasa_prob_cdf.txt.gz
4 -rw-r--r-- 1 user users 85 2007-02-09 01:42 sasa_offsets.txt.gz
4 -rw-r--r-- 1 user users 83 2007-02-09 01:42 SASA-masks.dat.gz
4 -rw-r--r-- 1 user users 84 2007-02-09 01:42 SASA-angles.dat.gz
4 -rw-r--r-- 1 user users 95 2007-02-09 01:42 Rama_smooth_dyn.dat_ss_6.4.gz
4 -rw-r--r-- 1 user users 94 2007-02-09 01:42 plane_data_table_1015.dat.gz
4 -rw-r--r-- 1 user users 93 2007-02-09 01:42 phi.theta.36.SS.resmooth.gz
4 -rw-r--r-- 1 user users 93 2007-02-09 01:42 phi.theta.36.HS.resmooth.gz
4 -rw-r--r-- 1 user users 86 2007-02-09 01:42 pdbpairstats_fine.gz
4 -rw-r--r-- 1 user users 80 2007-02-09 01:42 paircutoffs.gz
4 -rw-r--r-- 1 user users 75 2007-02-09 01:42 Paa_pp.gz
4 -rw-r--r-- 1 user users 74 2007-02-09 01:42 Paa_n.gz
4 -rw-r--r-- 1 user users 72 2007-02-09 01:42 Paa.gz
4 -rw-r--r-- 1 user users 90 2007-02-09 01:42 jump_templates_v2.dat.gz
4 -rw-r--r-- 1 user users 87 2007-02-09 01:42 jump_templates.dat.gz
4 -rw-r--r-- 1 user users 122 2007-02-09 01:42 dd1UGH.out.gz
4 -rw-r--r-- 1 user users 85 2007-02-09 01:42 1UGH.unbound.pdb.gz
4 -rw-r--r-- 1 user users 82 2007-02-09 01:42 1UGH.rppk.pdb.gz
4 -rw-r--r-- 1 user users 77 2007-02-09 01:42 1UGH.pdb.gz
4 -rw-r--r-- 1 user users 77 2007-02-09 01:42 1UGH.map.gz
4 -rw-r--r-- 1 user users 3 2007-02-09 01:42 rosetta_init_cnt.txt
92 -rw-r--r-- 1 user users 93521 2007-02-09 01:45 dd1UGH.out.bonds
264 -rw-r--r-- 1 user users 269250 2007-02-09 01:45 dd1UGH.out.rot_templates
4 -rw-r--r-- 1 user users 940 2007-02-09 04:42 rosetta_random.txt
4 -rw-r--r-- 1 user users 7 2007-02-09 04:42 rosetta_decoy_cnt.txt
0 -rw-r--r-- 1 user users 0 2007-02-09 04:42 dd1UGH.rppk_0079.pdb.in_progress
4 -rw-r--r-- 1 user users 7 2007-02-09 04:42 dd1UGH.last_pdb
452 -rw-r--r-- 1 user users 460768 2007-02-09 04:42 stdout.txt
4 -rw-r--r-- 1 user users 3648 2007-02-09 04:42 init_data.xml
0 -rw-r--r-- 1 user users 0 2007-02-09 04:42 boinc_finish_called
2 drwxr-xr-x 2 user users 1712 2007-02-09 04:42 .
4 -rw-r--r-- 1 user users 899 2007-02-09 04:42 stderr.txt


~/boinc/BOINC/slots/0> cat stderr.txt
Graphics are disabled due to configuration...
# random seed: 2083576
# cpu_run_time_pref: 28800
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score 0 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: ./dd1UGH.out
SIGABRT: abort called
Stack trace (16 frames):
[0x8ae59f7]
[0x8b018bc]
[0xffffe500]
[0x8b67e94]
[0x8b7cd7f]
[0x8b81ef6]
[0x8b820c3]
[0x8b531c1]
[0x8b54be9]
[0x837b53b]
[0x8b683df]
[0x8af6d5f]
[0x8a91587]
[0x8a92326]
[0x8b02b25]
[0x8b9448a]
SIGSEGV: segmentation violation

Exiting...
Stack trace (17 frames):
[0x8ae59f7]
[0x8b018bc]
[0xffffe500]
[0x893bc67]
[0x86e61ea]
[0x86e73f0]
[0x8065119]
[0x8766ad7]
[0x876b433]
[0x876effa]
[0x87702da]
[0x8302b7d]
[0x84e3d1b]
[0x85faa8b]
[0x85fab34]
[0x8b60dd4]
[0x8048111]

Exiting...


Boinc Manager shows in the Tasks tab:
Project: rosetta@home
Application: rosetta 5.45
Name: DOC_1UGH_R070207_pose_u_global_search_fixbb_1549_740_0
CPU time: 03:00:40
Progress: 100.000%
To completion: ---
Report deadline: Sun 18 Feb 2007 11:53:00 PM PST
Status: Running


Apparently the watchdog 'terminated' the workunit yesterday morning, but the Rosetta client process is still hanging around, which lets the BOINC client think it is still running.
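The listing above shows a boinc_finish_called file in the slot directory that was written days before the ps output was taken. A minimal sketch for spotting this situation on other machines follows. It assumes that a slot which still holds this marker long after it was written points to a task that never exited; that is an inference from this one report, not documented BOINC behaviour. The function name and the one-hour threshold are hypothetical.

```python
import os
import time


def stale_finished_slots(slots_root, max_age_seconds=3600, now=None):
    """Return (slot_path, age_seconds) for slots with an old finish marker.

    Assumption: a 'boinc_finish_called' file that is still present long after
    it was written suggests the task process did not exit cleanly.
    """
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(slots_root)):
        marker = os.path.join(slots_root, name, "boinc_finish_called")
        if os.path.isfile(marker):
            # Age of the marker, based on its last modification time.
            age = now - os.path.getmtime(marker)
            if age > max_age_seconds:
                stale.append((os.path.join(slots_root, name), age))
    return stale
```

On the system described here it could be called as stale_finished_slots(os.path.expanduser("~/boinc/BOINC/slots")). Any slot it reports still needs a manual check with ps before the task is aborted.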

Can someone explain all the time limits involved? I'm running Rosetta with a preference of 8 hours per workunit, which corresponds to the 28800 seconds shown in stderr.txt for the parameter cpu_run_time_pref. The command-line arguments of the Rosetta client processes, however, show -cpu_run_time 10800, which would be only 3 hours. The watchdog uses yet another time, 3600 seconds, which is 1 hour.
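For reference, the arithmetic behind the three limits can be checked with a short sketch. The values come from the stderr.txt and ps output in this post; the helper names are hypothetical, and the code only converts units without saying which limit Rosetta actually applies.

```python
def seconds_to_hours(seconds):
    """Convert a duration in seconds to hours."""
    return seconds / 3600


def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string, as shown by ps, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values as reported in this post.
limits = {
    "cpu_run_time_pref (stderr.txt)": 28800,
    "-cpu_run_time (command line)": 10800,
    "watchdog stuck-score limit (stderr.txt)": 3600,
}

for name, seconds in limits.items():
    print(f"{name}: {seconds} s = {seconds_to_hours(seconds):g} h")

# CPU time reported by ps and BOINC Manager for the stuck task.
cpu_used = hms_to_seconds("03:00:40")
print(f"CPU time used: {cpu_used} s, {cpu_used - 10800} s past -cpu_run_time")
```

The reported CPU time of 03:00:40 is 10840 seconds, which is 40 seconds more than the -cpu_run_time value of 10800 and well short of the 28800-second preference.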



Rosetta Moderator: Mod.Sense
Thomas Leibold

Joined: 30 Jul 06
Posts: 55
Credit: 19,627,164
RAC: 0
Message 37172 - Posted: 25 Feb 2007, 4:01:43 UTC - in response to Message 36596.  

I will share your findings and thoughts with other project developers tomorrow to see what this can bring to us.

Has there been any news on this issue? I know that there are now DOC_* workunits that no longer cause problems, but what about the watchdog timer hang on Linux?

Does either the new Boinc 5.8.11 client or the new Rosetta 5.46 client address that issue, or are watchdog hangs still possible on Linux?
Team Helix
Mod.Sense
Volunteer moderator

Message 37173 - Posted: 25 Feb 2007, 5:36:16 UTC
Last modified: 25 Feb 2007, 18:32:45 UTC

Yes Thomas, the Project Team has been working on these watchdog terminations. The watchdog is not a BOINC thing; it is part of Rosetta. It was created to improve ease of use by catching things that don't look right to the watchdog (the "trained eye", if you will) and terminating the task when things don't seem to be progressing normally. This lets your computer get on with other work units, which can be more fruitful, especially when problems specific to a class of work units are found and the work is pulled off the server to correct it.

The short story is that with the DOC work units, what passes for "normal" is not as simple to assess as it used to be. These tasks often spend considerable time in calculations without specific visible signs of progress. You've probably read posts from concerned users that have aborted tasks because they were "hung". It is difficult to assess without specific details of each case, but do keep in mind that the watchdog was created to make that determination FOR you, and to abort the task when the watchdog feels it is appropriate. So, if the watchdog is functioning correctly, aborting a "hung" task should not be necessary.

The watchdog has been "in training" and is learning that he really does not need to bark at the mailman every day (the mailman being a normal event that does not require special alarm). The next edition of Rosetta should include some changes for a smarter watchdog.

So having the watchdog end a run will always be possible. But recently it has been ending runs that are not in fact hung. And the changes to correct this issue should be rolled out soon.

I believe we're also seeing some reports of the watchdog NOT ending runs that it should have. That issue is under review as well.
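To illustrate the idea, here is a minimal sketch of a "stuck score" check. It is not the actual Rosetta watchdog code, which is not shown in this thread. The class name, its methods, and the injectable clock are hypothetical; only the 3600-second limit comes from the stderr.txt message quoted earlier ("Stuck at score 0 for 3600 seconds").

```python
import time


class ScoreWatchdog:
    """Flags a run whose reported score has not changed for too long."""

    def __init__(self, stuck_limit_seconds=3600, clock=time.monotonic):
        self.stuck_limit_seconds = stuck_limit_seconds
        self.clock = clock  # injectable so the logic can be tested
        self.last_score = None
        self.last_change = clock()

    def report(self, score):
        """Record the latest score; restart the timer only if it changed."""
        if score != self.last_score:
            self.last_score = score
            self.last_change = self.clock()

    def is_stuck(self):
        """True once the score has been unchanged for the full limit."""
        return self.clock() - self.last_change >= self.stuck_limit_seconds
```

A fixed limit like this cannot tell a hung task from one that is legitimately busy in a long calculation without updating its score, which is the "barking at the mailman" case described above.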
Rosetta Moderator: Mod.Sense
j2satx

Joined: 17 Sep 05
Posts: 97
Credit: 3,670,592
RAC: 0
Message 37177 - Posted: 25 Feb 2007, 15:21:37 UTC - in response to Message 37173.  

Why is this experimenting being done in Rosetta, instead of Ralph?
Mod.Sense
Volunteer moderator

Message 37179 - Posted: 25 Feb 2007, 18:46:49 UTC - in response to Message 37177.  

Why is this experimenting being done in Rosetta, instead of Ralph?

By "experimenting" I presume you are referring to the fact that the watchdog is not working perfectly for all situations. The purpose of Ralph is to test new tasks and new Rosetta releases prior to their release on Rosetta. This includes all the new science being worked on, all the changes made to the screensaver, the watchdog, etc. And as further changes are made to the watchdog, they will be tested first on Ralph, as was the last round of changes.

It is not uncommon for a few software problems to go unnoticed during testing. The idea is to catch as many as you can. There are only so many ways you can test something. When changes are then released to 70,000 machines on Rosetta, there are certainly user environments that present unique situations.

The other factor to consider is whether your changes improved things. If your testing shows that a change improves things, do you wait another couple of weeks to release it because you know it is still not perfect? The last round of watchdog changes was an improvement over what was running on Rosetta prior to the release. It is still not perfect, but testing on Ralph found it to be better than its predecessor.
Rosetta Moderator: Mod.Sense



©2024 University of Washington
https://www.bakerlab.org