BOINC Dying and orphaning Rosetta - Possible cause

Message boards : Number crunching : BOINC Dying and orphaning Rosetta - Possible cause

David Ball

Message 31753 - Posted: 28 Nov 2006, 8:51:08 UTC
Last modified: 28 Nov 2006, 9:05:27 UTC

I keep finding that BOINC on my Linux RHEL3 machine has died and orphaned Rosetta. I have to kill Rosetta manually and restart BOINC. BOINC runs as a service. The machine has libsafe on it.

While going through the logs looking for an error with Docking@Home, I might have found the reason BOINC is dying. It looks like sometimes the Rosetta command line might be too long.

This is from stdoutdae.txt. Lines ending in $ were cut short by nano.

2006-11-22 03:41:02 [Docking@Home] Unrecoverable error for result 1tng_mod0001_9218_83020_5 (process exited with code 1 (0x1$
2006-11-22 03:41:02 [Docking@Home] Deferring scheduler requests for 1 minutes and 0 seconds
2006-11-22 03:41:02 [---] Rescheduling CPU: application exited
2006-11-22 03:41:02 [Docking@Home] Computation for task 1tng_mod0001_9218_83020_5 finished
2006-11-22 03:41:02 [---] Resuming round-robin CPU scheduling.
2006-11-22 03:41:02 [rosetta@home] Resuming task DOC_1MLC_R061114_pose_u_global_search_1402_736_0 using rosetta version 540
2006-11-22 04:14:59 [---] Resuming network activity
2006-11-22 04:14:59 [---] Allowing work fetch again.
.........Skipped some attempted work fetches and upload of the failed docking workunit.
2006-11-22 04:15:09 [rosetta@home] Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
2006-11-22 04:15:09 [rosetta@home] Reason: To fetch work
2006-11-22 04:15:09 [rosetta@home] Requesting 21600 seconds of new work, and reporting 1 completed tasks
2006-11-22 04:15:12 [Docking@Home] Finished upload of file 1tng_mod0001_9218_83020_5_2
2006-11-22 04:15:12 [Docking@Home] Throughput 51542 bytes/sec
2006-11-22 04:15:12 [Docking@Home] Finished upload of file 1tng_mod0001_9218_83020_5_3
2006-11-22 04:15:12 [Docking@Home] Throughput 598603 bytes/sec
2006-11-22 04:15:14 [rosetta@home] Scheduler request succeeded
2006-11-22 04:15:15 [rosetta@home] Started download of file hom018_s014_.fasta.gz
2006-11-22 04:15:15 [rosetta@home] Started download of file hom018_s014_.psipred_ss2.gz


At this point I discovered the BOINC service was dead, and had to kill Rosetta manually and restart the BOINC service. The NEXT line in stdoutdae.txt is:

2006-11-22 04:17:38 [---] Starting BOINC client version 5.4.9 for i686-pc-linux-gnu
2006-11-22 04:17:38 [---] libcurl/7.15.3 OpenSSL/0.9.8a zlib/1.2.3
2006-11-22 04:17:38 [---] Executing as a daemon
2006-11-22 04:17:38 [---] Data directory: /home/BOINC
2006-11-22 04:17:38 [rosetta@home] State file error: result DOC_1MLC_R061114_pose_u_global_search_1402_736_0 is in wrong sta$
2006-11-22 04:17:38 [---] Processor: 1 GenuineIntel Intel(R) Celeron(R) CPU 2.40GHz
2006-11-22 04:17:38 [---] Memory: 1.95 GB physical, 1.95 GB virtual
2006-11-22 04:17:38 [---] Disk: 16.02 GB total, 11.62 GB free
2006-11-22 04:17:38 [Docking@Home] URL: http://docking.utep.edu/; Computer ID: 223; location: work; project prefs: default
2006-11-22 04:17:38 [SETI@home] URL: http://setiathome.berkeley.edu/; Computer ID: 2185126; location: work; project prefs: d$
2006-11-22 04:17:38 [rosetta@home] URL: https://boinc.bakerlab.org/rosetta/; Computer ID: 211470; location: work; project pre$
2006-11-22 04:17:38 [lhcathome] URL: http://lhcathome.cern.ch/lhcathome/; Computer ID: 2363079; location: work; project pref$
2006-11-22 04:17:38 [---] General prefs: from Docking@Home (last modified 2006-11-22 03:03:43)
2006-11-22 04:17:38 [---] General prefs: using separate prefs for work
2006-11-22 04:17:38 [---] Local control only allowed
2006-11-22 04:17:38 [---] Listening on port 31416
2006-11-22 04:17:38 [SETI@home] Deferring task 10jn03ab.7548.30496.284650.3.57_1
2006-11-22 04:17:38 [SETI@home] Restarting task 10jn03ab.7548.30496.284650.3.57_1 using setiathome_enhanced version 512
2006-11-22 04:17:39 [rosetta@home] Started download of file hom018_s014_.fasta.gz
2006-11-22 04:17:39 [rosetta@home] Started download of file hom018_s014_.psipred_ss2.gz
2006-11-22 04:17:42 [rosetta@home] Finished download of file hom018_s014_.fasta.gz
2006-11-22 04:17:42 [rosetta@home] Throughput 1149 bytes/sec
2006-11-22 04:17:42 [rosetta@home] Finished download of file hom018_s014_.psipred_ss2.gz
2006-11-22 04:17:42 [rosetta@home] Throughput 7188 bytes/sec
2006-11-22 04:17:42 [rosetta@home] Started download of file boinc_hom018_aas014_03_05.200_v1_3.gz
2006-11-22 04:17:42 [rosetta@home] Started download of file boinc_hom018_aas014_09_05.200_v1_3.gz
2006-11-22 04:17:44 [rosetta@home] Finished download of file boinc_hom018_aas014_09_05.200_v1_3.gz
2006-11-22 04:17:44 [rosetta@home] Throughput 360810 bytes/sec
2006-11-22 04:17:44 [rosetta@home] Started download of file sg_target_description.txt
2006-11-22 04:17:45 [rosetta@home] Finished download of file boinc_hom018_aas014_03_05.200_v1_3.gz
2006-11-22 04:17:45 [rosetta@home] Throughput 687255 bytes/sec
2006-11-22 04:17:45 [rosetta@home] Finished download of file sg_target_description.txt
2006-11-22 04:17:45 [rosetta@home] Throughput 943 bytes/sec
2006-11-22 04:17:46 [---] Rescheduling CPU: files downloaded
2006-11-22 04:17:46 [---] Using earliest-deadline-first scheduling because computer is overcommitted.
2006-11-22 04:17:46 [SETI@home] Pausing task 10jn03ab.7548.30496.284650.3.57_1 (left in memory)
2006-11-22 04:17:46 [rosetta@home] Starting task s014__BOINC_ABRELAX_SAVE_ALL_OUT_hom018__1406_4371_0 using rosetta version $
2006-11-22 04:17:49 [---] Suspending work fetch because computer is overcommitted.
2006-11-22 08:17:51 [---] Allowing work fetch again.


Now, from the stderrdae.txt file


2006-11-22 03:41:02 [Docking@Home] Unrecoverable error for result 1tng_mod0001_9218_83020_5 (process exited with code 1 (0x1$
2006-11-22 04:15:04 [Docking@Home] Message from server: No work sent
2006-11-22 04:15:04 [Docking@Home] Message from server: (reached daily quota of 1 results)
2006-11-22 04:15:04 [Docking@Home] No work from project
SIGSEGV: segmentation violation
Stack trace (16 frames):
/home/BOINC/boinc[0x8089dc2]
/lib/libpthread.so.0[0x40174619]
/lib/libc.so.6[0x400482b8]
/lib/libc.so.6(vsprintf+0x5b)[0x4007da5b]
/home/BOINC/boinc[0x808bc52]
/home/BOINC/boinc[0x808c01b]
/home/BOINC/boinc[0x80515c7]
/home/BOINC/boinc[0x8051d2a]
/home/BOINC/boinc[0x80718a9]
/home/BOINC/boinc[0x80715eb]
/home/BOINC/boinc[0x8071a99]
/home/BOINC/boinc[0x8059c15]
/home/BOINC/boinc[0x807d189]
/home/BOINC/boinc[0x807d2b7]
/lib/libc.so.6(__libc_start_main+0x8d)[0x40036bd1]
/home/BOINC/boinc(__fxstat64+0x99)[0x804c1e1]
Exiting...
2006-11-22 04:17:38 [rosetta@home] State file error: result DOC_1MLC_R061114_pose_u_global_search_1402_736_0 is in wrong sta$
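For anyone reading the stack trace above: most frames carry only a raw address, so the frames that include a symbol name are the ones that say something directly. The following is a minimal sketch, assuming the frames follow the `module(symbol+offset)[address]` layout seen in the trace; the helper names `parse_frames` and `named_frames` are hypothetical and not part of BOINC.

```python
import re

# Matches frames such as "/lib/libc.so.6(vsprintf+0x5b)[0x4007da5b]"
# and "/home/BOINC/boinc[0x8089dc2]" (no symbol available).
FRAME_RE = re.compile(
    r"^(?P<module>[^(\[\s]+)"
    r"(?:\((?P<symbol>[^+)]*)(?:\+(?P<offset>0x[0-9a-fA-F]+))?\))?"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def parse_frames(trace_text):
    """Return (module, symbol_or_None, address) for each frame line."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(
                (match.group("module"),
                 match.group("symbol") or None,
                 match.group("address"))
            )
    return frames


def named_frames(frames):
    """Keep only the frames that carry a symbol name."""
    return [(module, symbol) for module, symbol, _ in frames if symbol]


# First five frames copied from the stderrdae.txt excerpt above.
SAMPLE = """\
/home/BOINC/boinc[0x8089dc2]
/lib/libpthread.so.0[0x40174619]
/lib/libc.so.6[0x400482b8]
/lib/libc.so.6(vsprintf+0x5b)[0x4007da5b]
/home/BOINC/boinc[0x808bc52]
"""

if __name__ == "__main__":
    for module, symbol in named_frames(parse_frames(SAMPLE)):
        print(module, symbol)
```

Run on those frames, the only named entry is `vsprintf` in `/lib/libc.so.6`, called from an unnamed address inside the boinc binary. What the unnamed frames above it are doing cannot be told from the trace alone without symbols.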

I'm not sure if this is related to Docking or Rosetta, but I have noticed that any time you stop the BOINC service on this machine, Rosetta keeps running, though in a sleeping state, so it doesn't use any CPU. You have to kill Rosetta from top with a SIGTERM. This had already happened before the log above, when I stopped BOINC to change something for another try at getting Docking to work on this machine, IIRC.

BTW, a couple of times I have noticed it wasn't reporting results and found that rosetta had been sleeping for 2+ days and boinc was nowhere to be found.
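A rough way to spot this situation without watching top is to look for Rosetta processes whose parent is no longer the BOINC client. The sketch below assumes that an orphaned process is re-parented to init (PID 1), which is the traditional behaviour but may not hold on every setup, and that `ps` accepts `-ww -eo pid,ppid,stat,args` as the procps version does. The function names are hypothetical.

```python
import subprocess


def find_orphans(ps_text, name="rosetta"):
    """Return (pid, stat, args) for matching processes whose parent is PID 1.

    ps_text is the output of `ps -ww -eo pid,ppid,stat,args`.
    """
    orphans = []
    for line in ps_text.splitlines():
        parts = line.strip().split(None, 3)
        if len(parts) < 4:
            continue
        pid, ppid, stat, args = parts
        # The header line (PID PPID STAT COMMAND) is skipped here.
        if not (pid.isdigit() and ppid.isdigit()):
            continue
        # Assumption: an orphan has been adopted by init, so PPID is 1.
        if int(ppid) == 1 and name in args:
            orphans.append((int(pid), stat, args))
    return orphans


def list_processes():
    """Run ps; 'ww' asks for unlimited width so args are not clipped."""
    result = subprocess.run(
        ["ps", "-ww", "-eo", "pid,ppid,stat,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `find_orphans(list_processes())` from a cron job would give a list of candidate PIDs to send SIGTERM to. Where threads show up as separate processes, as in the listings in this thread, some of them may have a parent other than PID 1, so the result is a starting point rather than a complete list.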

This is the standard BOINC 5.4.9 client on a text-only machine (both console and ssh are text-only), running as a service. They really need to release a command-line-only Linux BOINC client again. I'm having to use the boinc_cmd from BOINC 5.2.13 to control it.

That error might have to do with the Rosetta command line being too long. I just ran a "ps axu" and here are the boinc processes as of now.

boinc 28900 0.0 0.0 4724 2020 ? S Nov25 0:00 /home/BOINC/boinc -redirectio daemon

boinc 31274 0.0 1.3 39288 26796 ? SN Nov26 0:00 setiathome-5.12.i686-pc-linux-gnu

boinc 31275 0.0 1.3 39288 26796 ? RN Nov26 0:00 setiathome-5.12.i686-pc-linux-gnu

boinc 31276 11.4 1.3 39288 26796 ? SN Nov26 111:25 setiathome-5.12.i686-pc-linux-gnu

boinc 31277 0.0 1.3 39288 26796 ? SN Nov26 0:00 setiathome-5.12.i686-pc-linux-gnu

boinc 9907 88.7 3.6 111684 73940 ? RN 01:45 385:40 rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -

boinc 9908 0.0 3.6 111684 73940 ? RN 01:45 0:00 rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -

boinc 9909 0.0 3.6 111684 73940 ? SN 01:45 0:00 rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -

boinc 9921 0.0 3.6 111684 73940 ? SN 01:45 0:00 rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -

Sorry for any weird formatting. I piped the output of "ps axu" into "nano -v" and did a cut-paste from the screen in nano. It looks like "ps axu" clipped the command lines for rosetta. Again, I don't know if it was docking or rosetta that killed boinc.

Just went into /proc/9907 and got the command line from there. The spaces between options didn't show so I'm guessing at that part.

[/proc/9907]# cat cmdline
rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800 -watchdog -constant_seed -jran 2638533
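The missing spaces have a simple explanation: in /proc/<pid>/cmdline the arguments are separated by NUL bytes rather than spaces, so a plain `cat` runs them together. Here is a small sketch that recovers the exact argument boundaries instead of guessing them; `parse_cmdline` and `read_cmdline` are hypothetical helper names.

```python
def parse_cmdline(raw: bytes) -> list:
    """Split the raw contents of /proc/<pid>/cmdline into its arguments."""
    # Each argument ends with a NUL byte, so splitting on NUL leaves one
    # empty piece after the final terminator; drop it.
    parts = raw.split(b"\0")
    if parts and parts[-1] == b"":
        parts.pop()
    return [part.decode("utf-8", errors="replace") for part in parts]


def read_cmdline(pid: int) -> list:
    """Read and split the command line of a running process (Linux only)."""
    with open(f"/proc/{pid}/cmdline", "rb") as handle:
        return parse_cmdline(handle.read())
```

With the process from this post, `read_cmdline(9907)` would return the arguments as a list, and `len(" ".join(read_cmdline(9907)))` would give the length of the full command line, which is the number that matters if a fixed-size buffer is the suspect.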

I could be totally wrong but could the problem with BOINC controlling Rosetta be because the command line is too long and it's killing the BOINC client?

BTW, I'm running the stock boinc client and applications.

-- David
EDIT: The only thing I was changing for Docking@Home was to increase the allowed stack limit to unlimited. I keep finding more places where config files drop it back to the default unless you're root. Even if I have Docking suspended, I sometimes find the BOINC client dead with Rosetta running in sleep mode. Since the crash appears to be in a vsprintf call (the trace puts it in libc.so.6, with a libpthread frame above it), I thought the BOINC client might be erroring out when it tried to control Rosetta.
Have you read a good Science Fiction book lately?

Marky-UK

Message 31762 - Posted: 28 Nov 2006, 12:42:09 UTC

This looks very similar to the problems listed in this thread - the BOINC client crashes just after downloading files from Rosetta.

I'd been wondering if the problem could be down to the very long names that Rosetta uses, but of course it hasn't crashed since I started looking harder.

River~~

Message 31776 - Posted: 28 Nov 2006, 20:35:44 UTC - in response to Message 31753.  

I could be totally wrong but could the problem with BOINC controlling Rosetta be because the command line is too long and it's killing the BOINC client?



Rosetta has always had these horrendously long command lines(*), and has not always failed like this. It is possible that the latest BOINC has a shorter buffer than earlier versions, in which case the long line might be an issue, though my guess is that this is unlikely.

Certainly Linux has no problem dealing with very long command lines. If Windows had such a problem, again I'd be puzzled why it has not surfaced before.

I'd agree it is a possibility that needs to be 'eliminated from enquiries' as detectives say in bad crime fiction, but my guess is that this is not the smoking gnu.

River~~

(*) linux users can see the command line of the current rosetta task with this command from a terminal window / shell:

ps ax|grep rosetta


