Workunits getting stuck and aborting

Message boards : Number crunching : Workunits getting stuck and aborting

MattDavis
Joined: 22 Sep 05
Posts: 206
Credit: 1,377,748
RAC: 0
Message 36309 - Posted: 8 Feb 2007, 21:44:20 UTC

I've had several of these in the past day, and I see other people making isolated comments about the same issue, so maybe if we collect them all here in one place the scientists can track down the error.

These are mine thus far:

https://boinc.bakerlab.org/rosetta/result.php?resultid=61646685
https://boinc.bakerlab.org/rosetta/result.php?resultid=61635395
https://boinc.bakerlab.org/rosetta/result.php?resultid=61598016
https://boinc.bakerlab.org/rosetta/result.php?resultid=61597212
https://boinc.bakerlab.org/rosetta/result.php?resultid=61589791
ID: 36309 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 3,918,634
RAC: 1,122
Message 36313 - Posted: 9 Feb 2007, 1:23:48 UTC

Looks like these are all DOC work units, and all were ended by the watchdog anywhere from 0.5 to 3 hours after starting.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 36313 · Rating: 0
netwraith
Joined: 3 Sep 06
Posts: 80
Credit: 13,483,227
RAC: 0
Message 36366 - Posted: 9 Feb 2007, 5:52:12 UTC - in response to Message 36309.  
Last modified: 9 Feb 2007, 6:23:23 UTC

I have a whole bunch of these (all of my crunchers are getting them). If the IDs are needed, I will list them; let me know.

I have aborted these WUs en masse on my single-thread machines. On the big machines, I will sort them out as they have problems.

Looking for a team ??? Join BoincSynergy!!


ID: 36366 · Rating: 0
Thomas Leibold
Joined: 30 Jul 06
Posts: 55
Credit: 19,627,164
RAC: 0
Message 36367 - Posted: 9 Feb 2007, 6:38:44 UTC

I'm getting plenty of those DOC_* workunits as well. At some point between a few minutes and several hours into processing, the boinc manager shows the cpu time no longer progressing and the task at 100% completed, but the status remains "running". A 'ps' shows the rosetta tasks still exist but are no longer consuming any cpu time.

I keep aborting them, since they do not appear to time out.

These systems run the 5.4.9 and 5.4.11 linux boinc clients.
Team Helix
ID: 36367 · Rating: 0
netwraith
Joined: 3 Sep 06
Posts: 80
Credit: 13,483,227
RAC: 0
Message 36370 - Posted: 9 Feb 2007, 7:03:53 UTC - in response to Message 36366.  
Last modified: 9 Feb 2007, 7:55:16 UTC

I think I have the problem on this one. I tailed a running stdout.txt...
The 3600-second watchdog timeout is too short for these. The task is still running properly when it gets interrupted by the watchdog. Maybe a 7200-second timeout should be set for these.

**Edit** Yep, a 7200 timeout is getting it through so far on a 1.5GHz P4. Cycle time is 4640 seconds, which is why the watchdog was killing it. It's not done yet (I run them each for 12 hours on that platform), but I suspect it will emerge correctly. I also edited an unstarted WU with the extended timeout. We will see how it works.



Looking for a team ??? Join BoincSynergy!!


ID: 36370 · Rating: 0
Chu
Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 36376 - Posted: 9 Feb 2007, 8:21:52 UTC - in response to Message 36309.  

Thanks for reporting. We will look into it, and meanwhile those WUs have been temporarily removed from the queue. I am wondering if it is because of the high memory requirement set for those "DOC" WUs. The score is ZERO when the run was stuck, indicating it was not stuck in the middle of the run. From what you have described here, it looks like the worker finished, but for some reason it did not signal the other thread properly, so the watchdog thinks it was stuck.

To further help us track down the problem, could you please report what kind of platform your host is? It is definitely happening on linux (from Thomas here and Conan at ralph); what about the rest of you?

ID: 36376 · Rating: 0
netwraith
Joined: 3 Sep 06
Posts: 80
Credit: 13,483,227
RAC: 0
Message 36377 - Posted: 9 Feb 2007, 8:30:52 UTC - in response to Message 36376.  

Mine are all linux systems, and editing client_state.xml to add -watchdog_time 7200 to any of these workunits gets them through (at least all of mine to this point).
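
As a rough illustration of that edit (an assumption about the file layout, not official guidance), the script below appends the flag to every <command_line> element it finds. The element name and the helper `extend_watchdog` are illustrative; verify against your own client_state.xml, stop the BOINC client before editing, and keep a backup.

```python
import re

def extend_watchdog(xml_text, flag="-watchdog_time 7200"):
    """Append `flag` to each <command_line> element in an XML snippet.

    Applies to every <command_line> it finds (filter to DOC workunits
    yourself if needed) and is idempotent: already-patched command
    lines are left untouched.
    """
    def patch(match):
        cmdline = match.group(2)
        if flag in cmdline:              # already patched: leave as-is
            return match.group(0)
        return f"{match.group(1)}{cmdline} {flag}{match.group(3)}"
    return re.sub(r"(<command_line>)(.*?)(</command_line>)",
                  patch, xml_text, flags=re.S)
```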



Looking for a team ??? Join BoincSynergy!!


ID: 36377 · Rating: 0
anders n
Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 36382 - Posted: 9 Feb 2007, 10:29:15 UTC - in response to Message 36376.  
Last modified: 9 Feb 2007, 10:38:22 UTC

If you look at https://boinc.bakerlab.org/rosetta/result.php?resultid=61635399

and https://boinc.bakerlab.org/rosetta/result.php?resultid=61635398

One is a Mac and one is Windows, so it seems to affect all systems.

Anders n

ID: 36382 · Rating: 0
BitSpit
Joined: 5 Nov 05
Posts: 33
Credit: 4,147,344
RAC: 0
Message 36385 - Posted: 9 Feb 2007, 12:21:19 UTC

A DOC WU hung last night and I had to abort it. Stuck at 100%:

https://boinc.bakerlab.org/rosetta/result.php?resultid=61387581

# random seed: 2214937
# cpu_run_time_pref: 14400
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score 14.2268 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: ./dd1MLC.out
SIGSEGV: segmentation violation
Stack trace (26 frames):
[0x8ae59f7]
[0x8b018bc]
[0xffffe420]
[0x8b83c29]
[0x8b528d7]
[0x8b54cc1]
[0x80b7cf8]
[0x89671a9]
[0x896f069]
[0x86480f2]
[0x8649421]
[0x89718c2]
[0x8975e94]
[0x897940f]
[0x89a8549]
[0x89a9f75]
[0x804d236]
[0x876b46f]
[0x876effa]
[0x87702da]
[0x8302b7d]
[0x84e3d1b]
[0x85faa8b]
[0x85fab34]
[0x8b60dd4]
[0x8048111]

Exiting...
FILE_LOCK::unlock(): close failed.: Bad file descriptor
ID: 36385 · Rating: 0
netwraith
Joined: 3 Sep 06
Posts: 80
Credit: 13,483,227
RAC: 0
Message 36386 - Posted: 9 Feb 2007, 12:47:34 UTC - in response to Message 36377.  
Last modified: 9 Feb 2007, 12:48:36 UTC

UPDATE: Nope, I was on the wrong track. They last longer but still hang later in the process.


Looking for a team ??? Join BoincSynergy!!


ID: 36386 · Rating: 0
Chu
Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 36397 - Posted: 9 Feb 2007, 17:50:04 UTC

This morning I also checked our local windows and mac platforms. Consistent with what has been reported here, I saw several "Watchdog ending stuck runs" for "DOC" WUs. However, those stuck WUs were terminated by the watchdog thread properly (returned as success) and none of them hung in the boinc manager (which would have to be aborted manually). So my speculation is:

1. The "DOC" WUs have some problem that makes their trajectories get stuck more frequently than the Rosetta average. We will look into this issue and come up with a fix.

2. When a stuck WU is terminated by the watchdog thread, there is some problem completely removing it from the task list on the linux platform (but not on windows and mac?), so it needs to be aborted by users. This speculation needs more user feedback to be confirmed.

Please post any relevant observations on your side. Thank you for your help.
ID: 36397 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4016
Credit: 0
RAC: 0
Message 36474 - Posted: 11 Feb 2007, 16:16:40 UTC

The watchdog is looking out for problems so they can be terminated. If a work unit runs for an hour and remains stuck on the same Rosetta score, the watchdog will end the task (at least when it's working properly; there seem to be some quirks with that at this point). So, regardless of your runtime preference, a score not moving for an hour is one of the things the watchdog looks out for.

The other times should be based on your runtime preference, found in your Rosetta preferences settings.

3hrs is the default runtime preference if none is selected. If that is what you are seeing in stderr, that is how the work is running. Why that differs from your 8hr preference may be that you've just recently changed the preference, or that the preference you are comparing to is for a different location.
Rosetta Moderator: Mod.Sense
ID: 36474 · Rating: 0
Chu
Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 36569 - Posted: 11 Feb 2007, 19:18:37 UTC

The "watchdog" error for recent "DOC" workunits has been tracked down to a bug in the Rosetta code that was introduced in the past month. The worker thread worked properly, but it left some gaps during the simulation in which the "score" is not updated (to make it even worse, sometimes it is reset to ZERO). The watchdog thread works by periodically checking the "score" and comparing it against the previously recorded value. If they are the same, it concludes the current trajectory is stuck and terminates the whole process. For "DOC" workunits, the gaps can be relatively long, so the chance of this happening turns out to be high. We have fixed this problem and will test it in the next update on Ralph (very soon).

As mentioned in my previous post, there seem to be two isolated problems. The first is why those "DOC" WUs get stuck, and we have found the cause. The second is why the watchdog thread did not terminate the process properly. This problem seems to be specific to linux platforms. As we queried our database on the problematic batch of DOC workunits, the "watchdog ending runs" message was seen across all platforms, but so far I have not seen one case on windows or mac where the result was not returned as success. On the other hand, when this happened on linux, I saw mostly "aborted by user" outcomes, which indicates that even when the watchdog thread found the run stuck, it could not terminate the process properly, so the WU hangs in the system until manually killed by the user. I am not sure whether this is also true for watchdog terminations of non-DOC workunits, and we will continue to look into that.

Again, the "false watchdog terminations" should go away with the new fix, but there might be other problems that can cause a genuinely stuck trajectory. If that happens, please report back to us here. Thank you very much for the help!
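
The check described above can be sketched as follows. This is a simplified model of the described behavior, not Rosetta's actual source; `check_stuck` and the sampled score/timestamp lists are illustrative.

```python
def check_stuck(scores, timestamps, timeout=3600):
    """Return the time at which a score-based watchdog would fire, or None.

    scores[i] is the score value sampled at timestamps[i] (seconds).
    The run is declared stuck once the score has stayed identical for
    `timeout` seconds -- so a worker that simply stops updating the
    score between models (the gap described above) is indistinguishable
    from a genuinely stuck trajectory.
    """
    last_score, last_change = scores[0], timestamps[0]
    for score, t in zip(scores[1:], timestamps[1:]):
        if score != last_score:
            last_score, last_change = score, t
        elif t - last_change >= timeout:
            return t          # score unchanged for a full timeout period
    return None
```

With netwraith's numbers (a model cycle of about 4640 seconds during which the score never moves), a 3600-second timeout fires mid-cycle, while 7200 lets the cycle complete, matching what he observed before the later hangs.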
ID: 36569 · Rating: 0
Thomas Leibold
Joined: 30 Jul 06
Posts: 55
Credit: 19,627,164
RAC: 0
Message 36571 - Posted: 11 Feb 2007, 20:13:25 UTC - in response to Message 36569.  

I still have that one system sitting with the stuck workunit. Is there any way to provide you with additional information that would help track down why the watchdog-terminated process isn't really dead? (strace/gdb)

I haven't tried it on this particular instance, but a hung DOC_* workunit that is suspended and later resumed will remain hung. I'm pretty sure that also applies when terminating Boinc altogether and starting it back up.

The only time I have seen this symptom of 'running' rosetta processes that consume no cpu cycles and make no progress, outside of these DOC_* workunits, was when I ran boinc without the preference setting to keep tasks in memory. So yes, there do appear to be other situations in which workunits can get stuck in this way.
Team Helix
ID: 36571 · Rating: 0
Thomas Leibold
Joined: 30 Jul 06
Posts: 55
Credit: 19,627,164
RAC: 0
Message 36588 - Posted: 12 Feb 2007, 4:49:55 UTC - in response to Message 36571.  

This is from another system, but also linux.
After the same Rosetta workunit hung a second time, I restarted boinc with
strace: strace -ff -tt -o boinc_rosetta ./boinc

user 23795 6196 0 21:25 pts/2 00:00:01 strace -ff -tt -o /xen2/boinc_rosetta ./boinc
user 23796 23795 0 21:25 pts/2 00:00:00 ./boinc
user 23797 23796 97 21:25 pts/2 00:03:21 rosetta_5.45_i686-pc-linux-gnu dd 1IAI 1 -s 1IAI.rppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800 -watchdog -constant_seed -jran 2083828
user 23798 23797 0 21:25 pts/2 00:00:00 rosetta_5.45_i686-pc-linux-gnu dd 1IAI 1 -s 1IAI.rppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800 -watchdog -constant_seed -jran 2083828
user 23799 23798 0 21:25 pts/2 00:00:00 rosetta_5.45_i686-pc-linux-gnu dd 1IAI 1 -s 1IAI.rppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800 -watchdog -constant_seed -jran 2083828
user 23800 23798 0 21:25 pts/2 00:00:00 rosetta_5.45_i686-pc-linux-gnu dd 1IAI 1 -s 1IAI.rppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800 -watchdog -constant_seed -jran 2083828

PID 23795 is strace
PID 23796 is the boinc client started by strace
PID 23797 is the rosetta client started by boinc (this does all the computation)
PID 23798 is another rosetta task (2nd) started by the first one
PID 23799 is another rosetta task (3rd) started by the second one
PID 23800 is another rosetta task (4th) started by the second one (watchdog?)


21:25:52.938929 PID 23796, boinc is being executed (started by strace)
21:25:53.136532 PID 23796 forks (clone system call) and creates PID 23797
21:25:53.224175 PID 23797 rosetta is being executed (started by boinc PID 23796)
21:25:53.719005 PID 23797 creates file "boinc_lockfile"
21:25:53.724098 PID 23797 forks (clone system call) and creates PID 23798
21:25:53.725109 PID 23797 waits for signal (sigsuspend)
21:25:53.726537 PID 23798 forks (clone system call) and creates PID 23799
21:25:53.726825 PID 23798 sends signal SIGRTMIN to PID 23797 (kill)
21:25:53.726951 PID 23797 receives signal SIGRTMIN and continues
21:25:53.731726 PID 23799 starts (but never does anything interesting)
(PID 23797 writes lots of stuff to stdout.txt)
21:25:55.768784 PID 23797 waits for signal (sigsuspend)
21:25:55.769412 PID 23798 forks (clone system call) and creates PID 23800
21:25:55.769752 PID 23798 sends signal SIGRTMIN to PID 23797 (kill)
21:25:55.769875 PID 23797 receives signal SIGRTMIN and continues
21:25:55.772258 PID 23800 starts
22:26:42.220181 PID 23800 checks file "init_data.xml"
22:26:42.225455 PID 23800 writes "Rosetta score is stuck" to stdout.txt
22:26:42.226143 PID 23800 writes "Rosetta score is stuck" to stderr.txt
22:26:42.231475 PID 23800 writes "watchdog_failure: Stuck at score" to dd1IAI.out
22:26:45.470033 PID 23800 creates file "boinc_finish_called"
22:26:45.472508 PID 23800 removes file "boinc_lockfile"
22:26:45.490173 PID 23800 sends signal SIGRTMIN to PID 23797 (kill)
22:26:45.490560 PID 23797 receives signal SIGRTMIN
22:26:45.490459 PID 23800 waits for signal (sigsuspend)
22:26:45.491002 PID 23797 sends signal SIGRTMIN to PID 23800 (kill)
22:26:45.491108 PID 23800 receives signal SIGRTMIN and continues
22:26:45.502802 PID 23800 Segmentation Violation occurs!
The SIGSEGV happens just after several munmap (memory unmap) calls, so
possibly there was a reference to unmapped memory ?
22:26:45.503104 PID 23800 writes "SIGSEGV: segmentation violation" to stderr.txt
22:26:45.507844 PID 23797 waits for signal (sigsuspend)
22:26:45.509013 PID 23800 writes stack trace to stderr.txt
22:26:45.511959 PID 23800 writes "Exiting..." to stderr.txt (but it's a lie!)
22:26:45.512252 PID 23800 waits for signal (sigsuspend) that doesn't come!
22:26:45.821360 PID 23797 receives SIGALRM (timer expired ?)
22:26:45.822021 PID 23797 waits for signal (sigsuspend)
The last two lines keep repeating with PID 23797 waiting for a signal (perhaps
another SIGRTMIN from PID 23800 ?) and getting SIGALRM (timeout) instead.
The normal watchdog termination procedure seems to have been thrown off track
by the watchdog itself crashing in the process.

Left out in the sequence above is some communication between 23797 and 23798
through a pipe. I'm assuming 23797 and 23800 are communicating with shared
memory (besides signalling with SIGRTMIN), but that would not be visible in
the strace output.

Full strace logs are available to anybody who is interested.
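
The failure mode in the timeline above, with each process parked in sigsuspend() waiting for a SIGRTMIN that was sent before the wait began (or never sent because the sender crashed), is a classic lost-wakeup race. Below is a minimal sketch of the safe pattern, using plain SIGUSR1 between a parent and a forked child rather than Rosetta's actual signalling code: block the signal first so an early delivery stays pending, then consume it atomically with sigwait().

```python
import os
import signal
import time

def safe_handshake():
    """Demonstrate a signal handshake that cannot lose an early wakeup.

    If a process checks "has the signal arrived?" and only then calls
    sigsuspend(), a signal delivered in between is lost and both sides
    wait forever. Blocking the signal first keeps an early delivery
    pending, and sigwait() then consumes it atomically.
    """
    # Block SIGUSR1 so a delivery before we wait stays pending.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
    pid = os.fork()
    if pid == 0:
        # Child ("watchdog" side): signal the parent immediately,
        # possibly long before the parent actually starts waiting.
        os.kill(os.getppid(), signal.SIGUSR1)
        os._exit(0)
    time.sleep(0.2)                         # parent busy; signal arrives now
    got = signal.sigwait({signal.SIGUSR1})  # atomically consume pending signal
    os.waitpid(pid, 0)
    return got == signal.SIGUSR1

print(safe_handshake())  # expected: True -- the early signal was not lost
```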
Team Helix
ID: 36588 · Rating: 0
MattDavis
Joined: 22 Sep 05
Posts: 206
Credit: 1,377,748
RAC: 0
Message 36589 - Posted: 12 Feb 2007, 5:18:22 UTC

Great, now this thread is impossible to read.
ID: 36589 · Rating: -0.99999999999999
Chu
Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 36596 - Posted: 12 Feb 2007, 7:10:10 UTC - in response to Message 36588.  

Thomas, thanks for helping debug this problem and posting such detailed log output. I have never used strace before and do not have much knowledge of how processes work and communicate in linux. I will share your findings and thoughts with the other project developers tomorrow to see where this leads us.

I have run some problematic DOC workunits on our linux computers in stand-alone mode (without the boinc manager), and it seemed that all the watchdog terminations exited properly. In particular, I do not remember seeing any segmentation violations (I will double-check this tomorrow). So I guess this will also help us narrow down whether the problem is within Rosetta or between Rosetta and the boinc manager.
ID: 36596 · Rating: 0
Chu
Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 36597 - Posted: 12 Feb 2007, 7:11:53 UTC - in response to Message 36589.  

If you mean the widdddddth of the board, I am wondering that too...
ID: 36597 · Rating: 0
Thomas Leibold
Joined: 30 Jul 06
Posts: 55
Credit: 19,627,164
RAC: 0
Message 36609 - Posted: 12 Feb 2007, 16:51:34 UTC - in response to Message 36597.  

I'm sorry about blowing out the margins. I used the PRE tag to preserve the formatting of the output in my earlier post, and the rosetta command line is very, very long :(
Unfortunately there isn't any preview option, so I didn't know what would happen until it was too late.

It won't let me edit that post either; perhaps a moderator can remove the PRE/PRE tags?
Team Helix
ID: 36609 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4016
Credit: 0
RAC: 0
Message 36613 - Posted: 12 Feb 2007, 17:48:55 UTC
Last modified: 12 Feb 2007, 18:02:18 UTC

A moderator can only delete the post (both, now that it is incorporated into a reply). You can only "preview" by posting, and then you have up to an hour to make edits.

I went after the wrong post first :) But it looks like I got normal margins back. Hope I have improved the thread more than disrupted it.
Rosetta Moderator: Mod.Sense
ID: 36613 · Rating: 0



©2020 University of Washington
https://www.bakerlab.org