Posts by David Ball

21) Message boards : Number crunching : BOINC Dying and orphaning Rosetta - Possible cause (Message 31753)
Posted 28 Nov 2006 by David Ball
Post:
I keep finding that BOINC on my Linux RHEL3 machine has died and orphaned Rosetta. I have to kill Rosetta manually and restart BOINC. BOINC runs as a service. The machine has libsafe on it.

While going through the logs looking for an error with Docking@Home, I might have found the reason BOINC is dying. It looks like sometimes the Rosetta command line might be too long.

This is from stdoutdae.txt. Lines ending in $ were cut short by nano.

2006-11-22 03:41:02 [Docking@Home] Unrecoverable error for result 1tng_mod0001_9218_83020_5 (process exited with code 1 (0x1$
2006-11-22 03:41:02 [Docking@Home] Deferring scheduler requests for 1 minutes and 0 seconds
2006-11-22 03:41:02 [---] Rescheduling CPU: application exited
2006-11-22 03:41:02 [Docking@Home] Computation for task 1tng_mod0001_9218_83020_5 finished
2006-11-22 03:41:02 [---] Resuming round-robin CPU scheduling.
2006-11-22 03:41:02 [rosetta@home] Resuming task DOC_1MLC_R061114_pose_u_global_search_1402_736_0 using rosetta version 540
2006-11-22 04:14:59 [---] Resuming network activity
2006-11-22 04:14:59 [---] Allowing work fetch again.
.........Skipped some attempted work fetches and upload of the failed docking workunit.
2006-11-22 04:15:09 [rosetta@home] Sending scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi
2006-11-22 04:15:09 [rosetta@home] Reason: To fetch work
2006-11-22 04:15:09 [rosetta@home] Requesting 21600 seconds of new work, and reporting 1 completed tasks
2006-11-22 04:15:12 [Docking@Home] Finished upload of file 1tng_mod0001_9218_83020_5_2
2006-11-22 04:15:12 [Docking@Home] Throughput 51542 bytes/sec
2006-11-22 04:15:12 [Docking@Home] Finished upload of file 1tng_mod0001_9218_83020_5_3
2006-11-22 04:15:12 [Docking@Home] Throughput 598603 bytes/sec
2006-11-22 04:15:14 [rosetta@home] Scheduler request succeeded
2006-11-22 04:15:15 [rosetta@home] Started download of file hom018_s014_.fasta.gz
2006-11-22 04:15:15 [rosetta@home] Started download of file hom018_s014_.psipred_ss2.gz


At this point I discovered the BOINC service was dead, had to kill Rosetta manually, and restarted the BOINC service. The next line in stdoutdae.txt is:

2006-11-22 04:17:38 [---] Starting BOINC client version 5.4.9 for i686-pc-linux-gnu
2006-11-22 04:17:38 [---] libcurl/7.15.3 OpenSSL/0.9.8a zlib/1.2.3
2006-11-22 04:17:38 [---] Executing as a daemon
2006-11-22 04:17:38 [---] Data directory: /home/BOINC
2006-11-22 04:17:38 [rosetta@home] State file error: result DOC_1MLC_R061114_pose_u_global_search_1402_736_0 is in wrong sta$
2006-11-22 04:17:38 [---] Processor: 1 GenuineIntel Intel(R) Celeron(R) CPU 2.40GHz
2006-11-22 04:17:38 [---] Memory: 1.95 GB physical, 1.95 GB virtual
2006-11-22 04:17:38 [---] Disk: 16.02 GB total, 11.62 GB free
2006-11-22 04:17:38 [Docking@Home] URL: http://docking.utep.edu/; Computer ID: 223; location: work; project prefs: default
2006-11-22 04:17:38 [SETI@home] URL: http://setiathome.berkeley.edu/; Computer ID: 2185126; location: work; project prefs: d$
2006-11-22 04:17:38 [rosetta@home] URL: http://boinc.bakerlab.org/rosetta/; Computer ID: 211470; location: work; project pre$
2006-11-22 04:17:38 [lhcathome] URL: http://lhcathome.cern.ch/lhcathome/; Computer ID: 2363079; location: work; project pref$
2006-11-22 04:17:38 [---] General prefs: from Docking@Home (last modified 2006-11-22 03:03:43)
2006-11-22 04:17:38 [---] General prefs: using separate prefs for work
2006-11-22 04:17:38 [---] Local control only allowed
2006-11-22 04:17:38 [---] Listening on port 31416
2006-11-22 04:17:38 [SETI@home] Deferring task 10jn03ab.7548.30496.284650.3.57_1
2006-11-22 04:17:38 [SETI@home] Restarting task 10jn03ab.7548.30496.284650.3.57_1 using setiathome_enhanced version 512
2006-11-22 04:17:39 [rosetta@home] Started download of file hom018_s014_.fasta.gz
2006-11-22 04:17:39 [rosetta@home] Started download of file hom018_s014_.psipred_ss2.gz
2006-11-22 04:17:42 [rosetta@home] Finished download of file hom018_s014_.fasta.gz
2006-11-22 04:17:42 [rosetta@home] Throughput 1149 bytes/sec
2006-11-22 04:17:42 [rosetta@home] Finished download of file hom018_s014_.psipred_ss2.gz
2006-11-22 04:17:42 [rosetta@home] Throughput 7188 bytes/sec
2006-11-22 04:17:42 [rosetta@home] Started download of file boinc_hom018_aas014_03_05.200_v1_3.gz
2006-11-22 04:17:42 [rosetta@home] Started download of file boinc_hom018_aas014_09_05.200_v1_3.gz
2006-11-22 04:17:44 [rosetta@home] Finished download of file boinc_hom018_aas014_09_05.200_v1_3.gz
2006-11-22 04:17:44 [rosetta@home] Throughput 360810 bytes/sec
2006-11-22 04:17:44 [rosetta@home] Started download of file sg_target_description.txt
2006-11-22 04:17:45 [rosetta@home] Finished download of file boinc_hom018_aas014_03_05.200_v1_3.gz
2006-11-22 04:17:45 [rosetta@home] Throughput 687255 bytes/sec
2006-11-22 04:17:45 [rosetta@home] Finished download of file sg_target_description.txt
2006-11-22 04:17:45 [rosetta@home] Throughput 943 bytes/sec
2006-11-22 04:17:46 [---] Rescheduling CPU: files downloaded
2006-11-22 04:17:46 [---] Using earliest-deadline-first scheduling because computer is overcommitted.
2006-11-22 04:17:46 [SETI@home] Pausing task 10jn03ab.7548.30496.284650.3.57_1 (left in memory)
2006-11-22 04:17:46 [rosetta@home] Starting task s014__BOINC_ABRELAX_SAVE_ALL_OUT_hom018__1406_4371_0 using rosetta version $
2006-11-22 04:17:49 [---] Suspending work fetch because computer is overcommitted.
2006-11-22 08:17:51 [---] Allowing work fetch again.
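To see how long the client was down, the gap between the last entry before each "Starting BOINC client" line and that line itself can be pulled out of stdoutdae.txt with a few lines of script. This is only a sketch: the function names are made up for illustration, and it assumes every log entry starts with a "YYYY-MM-DD HH:MM:SS" timestamp as in the excerpt above.

```python
from datetime import datetime


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    try:
        return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None


def find_restarts(lines):
    """Return (last_entry_time, restart_time) for each client start-up line."""
    restarts = []
    last = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # e.g. stack-trace lines without a timestamp
        if "Starting BOINC client" in line and last is not None:
            restarts.append((last, ts))
        last = ts
    return restarts


# Two lines copied from the log above, on either side of the crash.
sample = [
    "2006-11-22 04:15:15 [rosetta@home] Started download of file hom018_s014_.psipred_ss2.gz",
    "2006-11-22 04:17:38 [---] Starting BOINC client version 5.4.9 for i686-pc-linux-gnu",
]

for before, after in find_restarts(sample):
    print(f"last entry {before}, restarted {after}, gap {after - before}")
```

On a real log the same function can be fed open("stdoutdae.txt") directly. A long gap only shows how late the crash was noticed, not when the client actually died.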


Now, from the stderrdae.txt file


2006-11-22 03:41:02 [Docking@Home] Unrecoverable error for result 1tng_mod0001_9218_83020_5 (process exited with code 1 (0x1$
2006-11-22 04:15:04 [Docking@Home] Message from server: No work sent
2006-11-22 04:15:04 [Docking@Home] Message from server: (reached daily quota of 1 results)
2006-11-22 04:15:04 [Docking@Home] No work from project
SIGSEGV: segmentation violationStack trace (16 frames):
/home/BOINC/boinc[0x8089dc2]
/lib/libpthread.so.0[0x40174619]
/lib/libc.so.6[0x400482b8]
/lib/libc.so.6(vsprintf+0x5b)[0x4007da5b]
/home/BOINC/boinc[0x808bc52]
/home/BOINC/boinc[0x808c01b]
/home/BOINC/boinc[0x80515c7]
/home/BOINC/boinc[0x8051d2a]
/home/BOINC/boinc[0x80718a9]
/home/BOINC/boinc[0x80715eb]
/home/BOINC/boinc[0x8071a99]
/home/BOINC/boinc[0x8059c15]
/home/BOINC/boinc[0x807d189]
/home/BOINC/boinc[0x807d2b7]
/lib/libc.so.6(__libc_start_main+0x8d)[0x40036bd1]
/home/BOINC/boinc(__fxstat64+0x99)[0x804c1e1]
Exiting...
2006-11-22 04:17:38 [rosetta@home] State file error: result DOC_1MLC_R061114_pose_u_global_search_1402_736_0 is in wrong sta$

I'm not sure if this is related to Docking or Rosetta, but I have noticed that anytime you stop the BOINC service on this machine, Rosetta keeps running, but in sleeping mode so it doesn't use any CPU. You have to kill Rosetta from top with a SIGTERM. This had happened prior to the log above when I stopped boinc to change something for another try at getting docking to work on this machine, IIRC.

BTW, a couple of times I have noticed it wasn't reporting results and found that rosetta had been sleeping for 2+ days and boinc was nowhere to be found.
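A rough way to spot this situation automatically is to look for Rosetta processes whose parent has become init (PID 1), which is what normally happens to a child once its parent has died. The sketch below reads /proc/<pid>/stat directly. The function names and the "rosetta" prefix are my own choices, and the ppid == 1 test is an assumption that may not hold on systems where something other than init adopts orphaned processes.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state, ppid) from the text of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the LAST closing parenthesis.
    left, right = stat_text.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    fields = right.split()
    return int(pid_str), comm, fields[0], int(fields[1])


def find_orphans(name_prefix="rosetta"):
    """List (pid, comm, state) for matching processes whose parent is PID 1."""
    orphans = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                pid, comm, state, ppid = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited while we were looking, or odd format
        if comm.startswith(name_prefix) and ppid == 1:
            orphans.append((pid, comm, state))
    return orphans


if __name__ == "__main__":
    for pid, comm, state in find_orphans():
        print(f"orphaned: pid={pid} comm={comm} state={state}")
```

Run from cron, something like this would have flagged the 2+ days of a sleeping Rosetta much sooner. Note that the kernel truncates the name in the stat file, so match on a short prefix rather than the full executable name.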

This is the standard boinc 5.4.9 client on a text only machine (both console and ssh are text only), running as a service. They really need to release a command line only Linux boinc client version again. I'm having to use the boinc_cmd from boinc 5.2.13 to control it.

That error might have to do with the Rosetta command line being too long. I just ran a "ps axu" and here are the boinc processes as of now.

boinc 28900 0.0 0.0 4724 2020 ? S Nov25 0:00 /home/BOINC/boinc -redirectio daemon

boinc 31274 0.0 1.3 39288 26796 ? SN Nov26 0:00 setiathome-5.12.i686-pc-linux-gnu

boinc 31275 0.0 1.3 39288 26796 ? RN Nov26 0:00 setiathome-5.12.i686-pc-linux-gnu

boinc 31276 11.4 1.3 39288 26796 ? SN Nov26 111:25 setiathome-5.12.i686-pc-linux-gnu

boinc 31277 0.0 1.3 39288 26796 ? SN Nov26 0:00 setiathome-5.12.i686-pc-linux-gnu

boinc 9907 88.7 3.6 111684 73940 ? RN 01:45 385:40 rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -

boinc 9908 0.0 3.6 111684 73940 ? RN 01:45 0:00 rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -

boinc 9909 0.0 3.6 111684 73940 ? SN 01:45 0:00 rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -

boinc 9921 0.0 3.6 111684 73940 ? SN 01:45 0:00 rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -

Sorry for any weird formatting. I piped the output of "ps axu" into "nano -v" and did a cut-paste from the screen in nano. It looks like "ps axu" clipped the command lines for rosetta. Again, I don't know if it was docking or rosetta that killed boinc.

Just went into /proc/9907 and got the command line from there. The spaces between options didn't show so I'm guessing at that part.

[/proc/9907]# cat cmdline
rosetta_5.40_i686-pc-linux-gnu dd 1TAB 1 -s 1TAB.uppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800 -watchdog -constant_seed -jran 2638533
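The spaces don't show because /proc/<pid>/cmdline separates the arguments with NUL bytes rather than spaces, so a plain cat runs them together on the terminal. A small script can split on the NULs and recover the exact argument list without guessing. This is a minimal sketch; split_cmdline and read_cmdline are names I made up, and the sample below contains only the first few arguments from the line above.

```python
def split_cmdline(raw: bytes) -> list:
    """Split the raw contents of /proc/<pid>/cmdline into its arguments."""
    # Arguments are separated, and normally terminated, by NUL bytes.
    parts = raw.split(b"\0")
    if parts and parts[-1] == b"":
        parts.pop()  # drop the empty piece after the trailing NUL
    return [p.decode("utf-8", errors="replace") for p in parts]


def read_cmdline(pid: int) -> list:
    """Read and split the command line of a running process (Linux only)."""
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        return split_cmdline(f.read())


# Build a NUL-separated sample from the first few arguments shown above.
args = ["rosetta_5.40_i686-pc-linux-gnu", "dd", "1TAB", "1", "-s", "1TAB.uppk", "-dock"]
sample = b"\0".join(a.encode() for a in args) + b"\0"

recovered = split_cmdline(sample)
print(len(recovered), "arguments,", len(" ".join(recovered)), "characters when joined")
```

With the real process, read_cmdline(9907) would give the true argument boundaries, and the joined length gives a number to compare against whatever buffer size the client might be using.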

I could be totally wrong but could the problem with BOINC controlling Rosetta be because the command line is too long and it's killing the BOINC client?

BTW, I'm running the stock boinc client and applications.

-- David
EDIT: The only thing I was changing for Docking@Home was to increase the allowed stack limit to unlimited. I keep finding more places where config files drop it back to the default unless you're root. Even if I have Docking suspended, I sometimes find the boinc client dead with Rosetta running in sleep mode. Since this error appears to be in a vsprintf call (libc.so.6 in the stack trace, just below the libpthread frame), I thought the BOINC client might be erroring out when it tried to control Rosetta.
22) Message boards : Number crunching : WU will not finish upload of 883 KB result (Message 29830)
Posted 22 Oct 2006 by David Ball
Post:
The jumping is normal. The client starts off assuming the server does not have any of the file (ie each transfer is a whole new experience at your end), but the server comes back with a message to say "I've already got nn Mb" (so the receiving end remembers the outcome of the previous attempts.)

Don't know for sure on the 0.29%, but obviously this would be a bigger number if the WU was smaller, and different for every upload, so the effect may well be commonplace. One guess is that the client sends the first block out before it is told the server already has it (does 0.29% correspond to a later jump in the size?). This would make sense as most transfers will be starting from zero, and it would save a round trip delay between you and bakerlab.

I would hope that if there is a max size of file / max transfer that the transfer algorithm would stop it immediately on retry, whereas it clearly is responding with the file size.

Have you left it trying for 24hrs on its automatic backoff? From here (UK) the net to bakerlab is often very slow - especially 1200-1800 UT when the US is starting the day. Best time to try to eliminate net issues seems to be 0600 - 1000 UT. You may find it just whizzes out eventually, probably just after you leave the room.


It's been doing retries for more than 24 hours. On the transfer tab, the cumulative total time it's spent trying to upload this workunit is about 2.5 hours. Since the backoffs were reaching the 3+ hour range, I've been hitting the retry button about every thirty minutes when I'm in the room. It's now up to about 680KB of 883KB. The next Rosetta WU is almost done and it's going to be interesting to see if it has the same problem or just uploads quickly like they have in the past. Usually, either the server is down or the upload goes in one try. This one must be in the 30 - 60 tries range. BTW, when I hit the update button for Rosetta on the manager to update the points total, it updates within a few seconds. I haven't noticed any problems with Rosetta on my windows machine on the same link, but I don't recall if it's tried to upload a WU this weekend.

That particular machine is also past the 40 year milestone for the current CPDN 160 year simulation.


past as in just past (and might still be uploading gigabytes to CPDN)?

Or past as in well past and therefore we know your client can push out big files? Do you know how this Rosetta file compares with the bigger CPDN files you have uploaded?

River~~

I think it's been close to a month since it uploaded the 40 year milestone. It's done several small trickle ups since then.

BTW, the next WU just finished and it took about 5 retries to upload its 133KB result. The 883KB result is still getting a few more KB each time it tries. I'm retrying it manually about every 20 - 30 minutes since it has reached the point of trying 13 hour backoffs.

The downloads for the two WU to replace these were very fast and included some files over 1 MB in size. The downloads experienced no retries. It's just the uploads that are having a problem and they show "http error" when they fail and back off for another try.

Oh well, it's got another day before the deadline so it should eventually finish uploading. BTW, I'm in the central time zone in the US.

Thanks,

-- David
23) Message boards : Number crunching : WU will not finish upload of 883 KB result (Message 29803)
Posted 22 Oct 2006 by David Ball
Post:

I have a workunit on a linux BOINC 5.4.9 box that has been trying to upload a large Rosetta result for a couple of days. Each time, it starts at zero percent, jumps to the percent where it timed out before, maybe uploads a few more KB and then goes into waiting to retry communications. Right now, it's just over 527KB of the 883 KB result. Each try, it seems to jump from 0% to 0.29% (sometimes the 0.29% is skipped or likely happens so fast it doesn't display) to the percentage where it left off last upload attempt and gets a few more KB through.
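The jump from zero to the previous percentage is what a resumable upload would look like: each attempt starts from whatever offset the server reports it already holds. The toy model below only illustrates that idea. It is not the BOINC client's actual upload code, the function name is hypothetical, and the per-attempt amounts are invented numbers chosen to pass through the 527 KB figure mentioned above.

```python
def simulate_resumed_upload(total_kb, kb_per_attempt):
    """Toy model: every attempt resumes from the server-reported offset.

    kb_per_attempt lists how many KB each attempt manages to send before
    it times out. Returns (attempts_used, kb_held_by_server).
    """
    server_has = 0
    attempts = 0
    for sent in kb_per_attempt:
        attempts += 1
        offset = server_has  # the client learns how much the server already has
        server_has = min(total_kb, offset + sent)
        pct = 100.0 * server_has / total_kb
        print(f"attempt {attempts}: resumed at {offset} KB, now {server_has} KB ({pct:.1f}%)")
        if server_has == total_kb:
            break
    return attempts, server_has


# Invented per-attempt amounts for an 883 KB result.
simulate_resumed_upload(883, [527, 153, 203])
```

In this model nothing is lost between attempts, so even a link that only passes a few KB per try would eventually finish; the cost is purely the waiting time between retries.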

Does the server have a problem with results over a certain size? I don't think I've seen one this large before for Rosetta. The machine is set to run a WU for 24 hours since we're getting some complex WU's that take a long time per model. No other projects on that machine are having problems.

DOC_1QFU_pose_u_pert_with_bbmin_1282_1198_0

ResultID 42657928

The result seems to have finished OK, with a runtime of 23:49:13 , progress of 100% , and status of uploading. It's the large upload that is having problems. The machine is also running CPDN, Einstein, Seti, and Docking@home Alpha with no problems uploading or downloading on the others. It spends just over 44% of the time on Rosetta. The machine is a Socket A sempron 2500+ with 1 GB Ram and 1 GB swap space, running FC3, and is set to "Leave applications in memory while suspended". It runs 24 hours a day and rarely runs anything but BOINC since I haven't started the project I plan to develop on it. It's running standard BOINC Linux client 5.4.9 and has been uploading and downloading to other projects fine during the 2 days it's been trying to upload the Rosetta result. On Rosetta, that machine has a total credit of 9,868.54 and a RAC of 68.99 , so it's been running Rosetta, and the other projects, for several months. That particular machine is also past the 40 year milestone for the current CPDN 160 year simulation.

Thanks,

-- David
24) Message boards : Number crunching : Take the pledge: (Message 28596)
Posted 27 Sep 2006 by David Ball
Post:
As to the collective "You will never know", don't be that sure of it. The hounds searching for the moderator's identities have very refined sense of smell if not minds . :)


*grin* Maybe someone should start a BOINC project to analyze the text posted on the message boards and predict who the mods are. We could call it moderator@home :-)

Seriously: One of these days, I might get enough free time to read enough in the message boards to find out what the arguments are about. I donate computer time to several BOINC projects and the things I read in their message boards all run together anyway. I know there are arguments about credits on several projects, but whether the project looks like it will produce valuable science is the primary determining factor for me. I usually just look for posts where the people running the projects are updating the status of things. Rosetta seems to be great at doing that and I like the valuable science it does, so it has the largest CPU share on my computers.

-- David
25) Questions and Answers : Web site : Workunit web page confuses claimed and granted credit (Message 26941)
Posted 16 Sep 2006 by David Ball
Post:
Please look at WU 32706783 for an example.

At the top (just under canonical result 37379749) it says "granted credit 99.80"

In the table at the bottom it has claimed credit as 99.80 and granted credit as 109.87.

I'm guessing that the web page hasn't been updated to reflect that granted credit is now different from claimed credit. I can see how it would be easy to miss a spot with all the changes.

Regards,

-- David Ball




