21)
Message boards :
Number crunching :
BOINC Dying and orphaning Rosetta - Possible cause
(Message 31753)
Posted 28 Nov 2006 by David Ball
Post:
I keep finding that my Linux RHEL3 machine has died and orphaned Rosetta. I have to kill Rosetta manually and restart BOINC. BOINC runs as a service. The machine has libsafe on it.

While going through the logs looking for an error with Docking@Home, I might have found the reason BOINC is dying. It looks like sometimes the Rosetta command line might be too long.

This is from stdoutdae.txt. Lines ending in $ were cut short by nano.

2006-11-22 03:41:02 [Docking@Home] Unrecoverable error for result 1tng_mod0001_9218_83020_5 (process exited with code 1 (0x1$
2006-11-22 03:41:02 [Docking@Home] Deferring scheduler requests for 1 minutes and 0 seconds
2006-11-22 03:41:02 [---] Rescheduling CPU: application exited
2006-11-22 03:41:02 [Docking@Home] Computation for task 1tng_mod0001_9218_83020_5 finished
2006-11-22 03:41:02 [---] Resuming round-robin CPU scheduling.
2006-11-22 03:41:02 [rosetta@home] Resuming task DOC_1MLC_R061114_pose_u_global_search_1402_736_0 using rosetta version 540
2006-11-22 04:14:59 [---] Resuming network activity
2006-11-22 04:14:59 [---] Allowing work fetch again.

.........Skipped some attempted work fetches and upload of the failed docking workunit.

2006-11-22 04:15:09 [rosetta@home] Sending scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi
2006-11-22 04:15:09 [rosetta@home] Reason: To fetch work
2006-11-22 04:15:09 [rosetta@home] Requesting 21600 seconds of new work, and reporting 1 completed tasks
2006-11-22 04:15:12 [Docking@Home] Finished upload of file 1tng_mod0001_9218_83020_5_2
2006-11-22 04:15:12 [Docking@Home] Throughput 51542 bytes/sec
2006-11-22 04:15:12 [Docking@Home] Finished upload of file 1tng_mod0001_9218_83020_5_3
2006-11-22 04:15:12 [Docking@Home] Throughput 598603 bytes/sec
2006-11-22 04:15:14 [rosetta@home] Scheduler request succeeded
2006-11-22 04:15:15 [rosetta@home] Started download of file hom018_s014_.fasta.gz
2006-11-22 04:15:15 [rosetta@home] Started download of file hom018_s014_.psipred_ss2.gz

At this point I discovered the BOINC service was dead, and had to kill Rosetta manually and restart the BOINC service. The NEXT line in stdoutdae.txt is:

2006-11-22 04:17:38 [---] Starting BOINC client version 5.4.9 for i686-pc-linux-gnu
2006-11-22 04:17:38 [---] libcurl/7.15.3 OpenSSL/0.9.8a zlib/1.2.3
2006-11-22 04:17:38 [---] Executing as a daemon
2006-11-22 04:17:38 [---] Data directory: /home/BOINC
2006-11-22 04:17:38 [rosetta@home] State file error: result DOC_1MLC_R061114_pose_u_global_search_1402_736_0 is in wrong sta$
2006-11-22 04:17:38 [---] Processor: 1 GenuineIntel Intel(R) Celeron(R) CPU 2.40GHz
2006-11-22 04:17:38 [---] Memory: 1.95 GB physical, 1.95 GB virtual
2006-11-22 04:17:38 [---] Disk: 16.02 GB total, 11.62 GB free
2006-11-22 04:17:38 [Docking@Home] URL: http://docking.utep.edu/; Computer ID: 223; location: work; project prefs: default
2006-11-22 04:17:38 [SETI@home] URL: http://setiathome.berkeley.edu/; Computer ID: 2185126; location: work; project prefs: d$
2006-11-22 04:17:38 [rosetta@home] URL: http://boinc.bakerlab.org/rosetta/; Computer ID: 211470; location: work; project pre$
2006-11-22 04:17:38 [lhcathome] URL: http://lhcathome.cern.ch/lhcathome/; Computer ID: 2363079; location: work; project pref$
2006-11-22 04:17:38 [---] General prefs: from Docking@Home (last modified 2006-11-22 03:03:43)
2006-11-22 04:17:38 [---] General prefs: using separate prefs for work
2006-11-22 04:17:38 [---] Local control only allowed
2006-11-22 04:17:38 [---] Listening on port 31416
2006-11-22 04:17:38 [SETI@home] Deferring task 10jn03ab.7548.30496.284650.3.57_1
2006-11-22 04:17:38 [SETI@home] Restarting task 10jn03ab.7548.30496.284650.3.57_1 using setiathome_enhanced version 512
2006-11-22 04:17:39 [rosetta@home] Started download of file hom018_s014_.fasta.gz
2006-11-22 04:17:39 [rosetta@home] Started download of file hom018_s014_.psipred_ss2.gz
2006-11-22 04:17:42 [rosetta@home] Finished download of file hom018_s014_.fasta.gz
2006-11-22 04:17:42 [rosetta@home] Throughput 1149 bytes/sec
2006-11-22 04:17:42 [rosetta@home] Finished download of file hom018_s014_.psipred_ss2.gz
2006-11-22 04:17:42 [rosetta@home] Throughput 7188 bytes/sec
2006-11-22 04:17:42 [rosetta@home] Started download of file boinc_hom018_aas014_03_05.200_v1_3.gz
2006-11-22 04:17:42 [rosetta@home] Started download of file boinc_hom018_aas014_09_05.200_v1_3.gz
2006-11-22 04:17:44 [rosetta@home] Finished download of file boinc_hom018_aas014_09_05.200_v1_3.gz
2006-11-22 04:17:44 [rosetta@home] Throughput 360810 bytes/sec
2006-11-22 04:17:44 [rosetta@home] Started download of file sg_target_description.txt
2006-11-22 04:17:45 [rosetta@home] Finished download of file boinc_hom018_aas014_03_05.200_v1_3.gz
2006-11-22 04:17:45 [rosetta@home] Throughput 687255 bytes/sec
2006-11-22 04:17:45 [rosetta@home] Finished download of file sg_target_description.txt
2006-11-22 04:17:45 [rosetta@home] Throughput 943 bytes/sec
2006-11-22 04:17:46 [---] Rescheduling CPU: files downloaded
2006-11-22 04:17:46 [---] Using earliest-deadline-first scheduling because computer is overcommitted.
2006-11-22 04:17:46 [SETI@home] Pausing task 10jn03ab.7548.30496.284650.3.57_1 (left in memory)
2006-11-22 04:17:46 [rosetta@home] Starting task s014__BOINC_ABRELAX_SAVE_ALL_OUT_hom018__1406_4371_0 using rosetta version $
2006-11-22 04:17:49 [---] Suspending work fetch because computer is overcommitted.
2006-11-22 08:17:51 [---] Allowing work fetch again.

Now, from the stderrdae.txt file:

2006-11-22 03:41:02 [Docking@Home] Unrecoverable error for result 1tng_mod0001_9218_83020_5 (process exited with code 1 (0x1$
2006-11-22 04:15:04 [Docking@Home] Message from server: No work sent
2006-11-22 04:15:04 [Docking@Home] Message from server: (reached daily quota of 1 results)
2006-11-22 04:15:04 [Docking@Home] No work from project
SIGSEGV: segmentation violation
Stack trace (16 frames):
/home/BOINC/boinc[0x8089dc2]
/lib/libpthread.so.0[0x40174619]
/lib/libc.so.6[0x400482b8]
/lib/libc.so.6(vsprintf+0x5b)[0x4007da5b]
/home/BOINC/boinc[0x808bc52]
/home/BOINC/boinc[0x808c01b]
/home/BOINC/boinc[0x80515c7]
/home/BOINC/boinc[0x8051d2a]
/home/BOINC/boinc[0x80718a9]
/home/BOINC/boinc[0x80715eb]
/home/BOINC/boinc[0x8071a99]
/home/BOINC/boinc[0x8059c15]
/home/BOINC/boinc[0x807d189]
/home/BOINC/boinc[0x807d2b7]
/lib/libc.so.6(__libc_start_main+0x8d)[0x40036bd1]
/home/BOINC/boinc(__fxstat64+0x99)[0x804c1e1]
Exiting...
2006-11-22 04:17:38 [rosetta@home] State file error: result DOC_1MLC_R061114_pose_u_global_search_1402_736_0 is in wrong sta$

I'm not sure if this is related to Docking or Rosetta, but I have noticed that any time you stop the BOINC service on this machine, Rosetta keeps running, but in sleeping mode so it doesn't use any CPU. You have to kill Rosetta from top with a SIGTERM. This had happened prior to the log above when I stopped boinc to change something for another try at getting Docking to work on this machine, IIRC.

BTW, a couple of times I have noticed it wasn't reporting results and found that rosetta had been sleeping for 2+ days and boinc was nowhere to be found.

This is the standard boinc 5.4.9 client on a text-only machine (both console and ssh are text only), running as a service. They really need to release a command-line-only Linux boinc client version again. I'm having to use the boinc_cmd from boinc 5.2.13 to control it.

That error might have to do with the Rosetta command line being too long. I just ran a "ps axu" and here are the boinc processes as of now.

boinc 28900 0.0 0.0 4724 2020 ? S Nov25 0:00 /home/BOINC/boinc -redirectio daemon
boinc 31274 0.0 1.3 39288 26796 ? SN Nov26 0:00 setiathome-5.12.i686-pc-linux-gnu
boinc 31275 0.0 1.3 39288 26796 ? RN Nov26 0:00 setiathome-5.12.i686-pc-linux-gnu
boinc 31276 11.4 1.3 39288 26796 ? SN Nov26 111:25 setiathome-5.12.i686-pc-linux-gnu
boinc 31277 0.0 1.3 39288 26796 ? SN Nov26 0:00 setiathome-5.12.i686-pc-linux-gnu
boinc 9907 88.7 3.6 111684 73940 ? RN 01:45 385:40 rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -
boinc 9908 0.0 3.6 111684 73940 ? RN 01:45 0:00 rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -
boinc 9909 0.0 3.6 111684 73940 ? SN 01:45 0:00 rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -
boinc 9921 0.0 3.6 111684 73940 ? SN 01:45 0:00 rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -

Sorry for any weird formatting. I piped the output of "ps axu" into "nano -v" and did a cut-paste from the screen in nano. It looks like "ps axu" clipped the command lines for rosetta. Again, I don't know if it was docking or rosetta that killed boinc.

Just went into /proc/9907 and got the command line from there. The spaces between options didn't show so I'm guessing at that part.

[/proc/9907]# cat cmdline
rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800 -watchdog -constant_seed -jran 2638533

I could be totally wrong, but could the problem with BOINC controlling Rosetta be that the command line is too long and it's killing the BOINC client? BTW, I'm running the stock boinc client and applications.

-- David

EDIT: The only thing I was changing for Docking@Home was to increase the allowed stack limit to unlimited. I keep finding more places where config files drop it back to the default unless you're root. Even if I have Docking suspended, I sometimes find the boinc client dead with Rosetta running in sleep mode. Since this error appears to be in a vsprintf call (the trace shows it in libc, with a libpthread frame near the top of the trace), I thought the BOINC client might be erroring out when it tried to control Rosetta.
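For anyone repeating the /proc check above: the kernel separates the arguments in /proc/<pid>/cmdline with NUL bytes rather than spaces, which is why the options ran together under cat. Here is a minimal Python sketch, assuming a Linux /proc filesystem, that recovers the argument boundaries and measures the total length. The function names (split_cmdline, cmdline_bytes, read_cmdline) are made up for illustration, and the sample is a shortened version of the Rosetta command line from the post, not the full one.

```python
from typing import List


def split_cmdline(raw: bytes) -> List[str]:
    """Split the raw contents of /proc/<pid>/cmdline into arguments."""
    # Arguments are stored back to back, each terminated by a NUL byte.
    parts = raw.split(b"\x00")
    # The trailing NUL leaves one empty piece at the end; drop it.
    if parts and parts[-1] == b"":
        parts.pop()
    return [p.decode("utf-8", errors="replace") for p in parts]


def cmdline_bytes(args: List[str]) -> int:
    """Total size of the argument list, counting one NUL per argument."""
    return sum(len(a.encode("utf-8")) + 1 for a in args)


def read_cmdline(pid: int) -> List[str]:
    """Read and split the command line of a live process (Linux only)."""
    with open(f"/proc/{pid}/cmdline", "rb") as fh:
        return split_cmdline(fh.read())


# Shortened sample built from the first few arguments quoted in the post.
SAMPLE_ARGS = ["rosetta_5.40_i686-pc-linux-gnu", "dd", "1TAB", "1", "-s", "1TAB.uppk"]
SAMPLE_RAW = b"\x00".join(a.encode("utf-8") for a in SAMPLE_ARGS) + b"\x00"

if __name__ == "__main__":
    args = split_cmdline(SAMPLE_RAW)
    print(args)
    print("bytes:", cmdline_bytes(args))
```

On a live system, read_cmdline(9907) would return the real arguments for that PID, and "ps axuww" should print the command lines without clipping. Knowing the true length only helps test the "too long" guess; by itself it doesn't show that the command length is what crashed the client.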
22)
Message boards :
Number crunching :
WU will not finish upload of 883 KB result
(Message 29830)
Posted 22 Oct 2006 by David Ball
Post:
The jumping is normal. The client starts off assuming the server does not have any of the file (i.e. each transfer is a whole new experience at your end), but the server comes back with a message to say "I've already got nn Mb" (so the receiving end remembers the outcome of the previous attempts.)

It's been doing retries for more than 24 hours. On the transfer tab, the cumulative total time it's spent trying to upload this workunit is about 2.5 hours. Since the backoffs were reaching the 3+ hour range, I've been hitting the retry button about every thirty minutes when I'm in the room. It's now up to about 680KB of 883KB. The next Rosetta WU is almost done and it's going to be interesting to see if it has the same problem or just uploads quickly like they have in the past. Usually, either the server is down or the upload goes in one try. This one must be in the 30 - 60 tries range.

BTW, when I hit the update button for Rosetta on the manager to update the points total, it updates within a few seconds. I haven't noticed any problems with Rosetta on my Windows machine on the same link, but I don't recall if it's tried to upload a WU this weekend.

That particular machine is also past the 40 year milestone for the current CPDN 160 year simulation. I think it's been close to a month since it uploaded the 40 year milestone. It's done several small trickle-ups since then.

BTW, the next WU just finished and it took about 5 retries to upload its 133KB result. The 883KB result is still getting a few more KB each time it tries. I'm retrying it manually about every 20 - 30 minutes since it has reached the point of trying 13 hour backoffs. The downloads for the two WUs to replace these were very fast and included some files over 1 MB in size. The downloads experienced no retries. It's just the uploads that are having a problem; they show "http error" when they fail and back off for another try.

Oh well, it's got another day before the deadline so it should eventually finish uploading.

BTW, I'm in the central time zone in the US.

Thanks,

-- David
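The resume behaviour described at the top of this post can be sketched as a small simulation. This is only an illustration of the idea (each attempt starts from whatever the server reports it already holds), not BOINC's actual upload code. The function names and the per-attempt sizes are hypothetical; only the 527 KB, 680 KB and 883 KB figures come from this thread.

```python
from typing import List, Tuple


def simulate_resumed_upload(total_kb: int, kb_per_attempt: List[int]) -> List[Tuple[int, int]]:
    """Return (start_offset, end_offset) in KB for each upload attempt.

    kb_per_attempt is how much each attempt manages to send before it
    fails; these values are invented for the illustration.
    """
    server_has = 0  # what the server reports it already holds
    history = []
    for sent in kb_per_attempt:
        if server_has >= total_kb:
            break  # upload already complete
        start = server_has  # resume where the last attempt stopped
        end = min(total_kb, start + sent)
        history.append((start, end))
        server_has = end
    return history


def percent_done(done_kb: float, total_kb: float) -> float:
    """Progress as a percentage of the whole file."""
    return 100.0 * done_kb / total_kb


if __name__ == "__main__":
    for start, end in simulate_resumed_upload(883, [527, 153, 203]):
        print(f"resume at {percent_done(start, 883):.1f}%, "
              f"stop at {percent_done(end, 883):.1f}%")
```

Seen this way, the progress bar jumping from 0% straight to the previous percentage is just the client learning the server's offset at the start of each attempt.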
23)
Message boards :
Number crunching :
WU will not finish upload of 883 KB result
(Message 29803)
Posted 22 Oct 2006 by David Ball
Post:
I have a workunit on a Linux BOINC 5.4.9 box that has been trying to upload a large Rosetta result for a couple of days. Each time, it starts at zero percent, jumps to the percent where it timed out before, maybe uploads a few more KB and then goes into waiting to retry communications. Right now, it's just over 527KB of the 883KB result. Each try, it seems to jump from 0% to 0.29% (sometimes the 0.29% is skipped or likely happens so fast it doesn't display) to the percentage where it left off on the last upload attempt, and gets a few more KB through.

Does the server have a problem with results over a certain size? I don't think I've seen one this large before for Rosetta. The machine is set to run a WU for 24 hours since we're getting some complex WUs that take a long time per model. No other projects on that machine are having problems.

DOC_1QFU_pose_u_pert_with_bbmin_1282_1198_0
ResultID 42657928

The result seems to have finished OK, with a runtime of 23:49:13, progress of 100%, and status of uploading. It's the large upload that is having problems. The machine is also running CPDN, Einstein, SETI, and Docking@Home Alpha with no problems uploading or downloading on the others. It spends just over 44% of the time on Rosetta.

The machine is a Socket A Sempron 2500+ with 1 GB RAM and 1 GB swap space, running FC3, and is set to "Leave applications in memory while suspended". It runs 24 hours a day and rarely runs anything but BOINC since I haven't started the project I plan to develop on it. It's running the standard BOINC Linux client 5.4.9 and has been uploading and downloading to other projects fine during the 2 days it's been trying to upload the Rosetta result.

On Rosetta, that machine has a total credit of 9,868.54 and a RAC of 68.99, so it's been running Rosetta, and the other projects, for several months. That particular machine is also past the 40 year milestone for the current CPDN 160 year simulation.

Thanks,

-- David
24)
Message boards :
Number crunching :
Take the pledge:
(Message 28596)
Posted 27 Sep 2006 by David Ball
Post:
As to the collective "You will never know", don't be that sure of it. The hounds searching for the moderators' identities have a very refined sense of smell, if not minds. :)

*grin* Maybe someone should start a BOINC project to analyze the text posted on the message boards and predict who the mods are. We could call it moderator@home :-)

Seriously: One of these days, I might get enough free time to read enough of the message boards to find out what the arguments are about. I donate computer time to several BOINC projects and the things I read in their message boards all run together anyway. I know there are arguments about credits on several projects, but whether the project looks like it will produce valuable science is the primary determining factor for me. I usually just look for posts where the people running the projects are updating the status of things. Rosetta seems to be great at doing that and I like the valuable science it does, so it has the largest CPU share on my computers.

-- David
25)
Questions and Answers :
Web site :
Workunit web page confuses claimed and granted credit
(Message 26941)
Posted 16 Sep 2006 by David Ball
Post:
Please look at WU 32706783 for an example. At the top (just under canonical result 37379749) it says "granted credit 99.80". In the table at the bottom it has claimed credit as 99.80 and granted credit as 109.87.

I'm guessing that the web page hasn't been updated to reflect that granted credit is now different from claimed credit. I can see how it would be easy to miss a spot with all the changes.

Regards,

-- David Ball