Message boards : Number crunching : Report stuck & aborted WU here please
Author | Message |
---|---|
casio7131 Send message Joined: 10 Oct 05 Posts: 35 Credit: 149,748 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=10302640 PRODUCTION_ABINITIO_CENTROID_PACKING_2ci2I_301_2380_0 was stuck at 1% after ~30 hours. I restarted BOINC, and it's now at 20% after 21 min. The computer is a dual P3 933. |
Rebel Alliance Send message Joined: 4 Nov 05 Posts: 50 Credit: 3,579,531 RAC: 0 |
2/13/2006 9:59:01 AM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/1ac/BARCODE_30_1acf__299_23614_0_0 2182 bytes != offset 0 bytes
2/13/2006 9:59:01 AM|rosetta@home|Temporarily failed upload of BARCODE_30_1acf__299_23614_0_0: transient upload error
2/13/2006 9:59:01 AM|rosetta@home|Backing off 3 hours, 57 minutes, and 6 seconds on upload of file BARCODE_30_1acf__299_23614_0_0
2/13/2006 9:59:07 AM|rosetta@home|Started upload of BARCODE_30_1tig__299_23625_0_0
2/13/2006 9:59:10 AM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/371/BARCODE_30_1tig__299_23625_0_0 1948 bytes != offset 0 bytes
2/13/2006 9:59:10 AM|rosetta@home|Temporarily failed upload of BARCODE_30_1tig__299_23625_0_0: transient upload error
2/13/2006 9:59:10 AM|rosetta@home|Backing off 3 hours, 17 minutes, and 10 seconds on upload of file BARCODE_30_1tig__299_23625_0_0
2/13/2006 9:59:18 AM|rosetta@home|Started upload of BARCODE_30_1bm8__299_23283_2_0
2/13/2006 9:59:21 AM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/1b4/BARCODE_30_1bm8__299_23283_2_0 722 bytes != offset 0 bytes
2/13/2006 9:59:21 AM|rosetta@home|Temporarily failed upload of BARCODE_30_1bm8__299_23283_2_0: transient upload error
2/13/2006 9:59:21 AM|rosetta@home|Backing off 39 minutes and 49 seconds on upload of file BARCODE_30_1bm8__299_23283_2_0
2/13/2006 9:59:28 AM|rosetta@home|Started upload of BARCODE_30_1tig__299_26551_0_0
2/13/2006 9:59:31 AM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/15b/BARCODE_30_1tig__299_26551_0_0 488 bytes != offset 0 bytes
2/13/2006 9:59:31 AM|rosetta@home|Temporarily failed upload of BARCODE_30_1tig__299_26551_0_0: transient upload error
2/13/2006 9:59:31 AM|rosetta@home|Backing off 2 hours, 1 minutes, and 35 seconds on upload of file BARCODE_30_1tig__299_26551_0_0
2/13/2006 9:59:38 AM|rosetta@home|Started upload of BARCODE_30_4ubpA_299_26658_0_0
2/13/2006 9:59:41 AM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/1ac/BARCODE_30_4ubpA_299_26658_0_0 722 bytes != offset 0 bytes
2/13/2006 9:59:41 AM|rosetta@home|Temporarily failed upload of BARCODE_30_4ubpA_299_26658_0_0: transient upload error
2/13/2006 9:59:41 AM|rosetta@home|Backing off 51 minutes and 42 seconds on upload of file BARCODE_30_4ubpA_299_26658_0_0
2/13/2006 9:59:48 AM|rosetta@home|Started upload of BARCODE_30_1iibA_299_26685_0_0
2/13/2006 9:59:50 AM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/4f/BARCODE_30_1iibA_299_26685_0_0 722 bytes != offset 0 bytes
2/13/2006 9:59:50 AM|rosetta@home|Temporarily failed upload of BARCODE_30_1iibA_299_26685_0_0: transient upload error
2/13/2006 9:59:50 AM|rosetta@home|Backing off 3 hours, 31 minutes, and 31 seconds on upload of file BARCODE_30_1iibA_299_26685_0_0 |
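For anyone wanting to summarise a log like the one above, here is a minimal Python sketch that pulls out the "Backing off ..." lines and converts each delay to seconds. The regular expression and the function name `parse_backoffs` are assumptions based only on the sample lines quoted in this post; this is not a BOINC tool, and other client versions may word these messages differently.

```python
import re

# Assumed pattern, derived only from the "Backing off ..." lines quoted above.
# The hours and minutes parts are optional; seconds are always present there.
BACKOFF_RE = re.compile(
    r"Backing off (?:(\d+) hours?, )?(?:(\d+) minutes?,? and )?(\d+) seconds? "
    r"on upload of file (\S+)"
)


def parse_backoffs(log_text):
    """Map each uploaded file name to its most recent backoff delay in seconds."""
    delays = {}
    for match in BACKOFF_RE.finditer(log_text):
        hours, minutes, seconds, file_name = match.groups()
        total = int(hours or 0) * 3600 + int(minutes or 0) * 60 + int(seconds)
        delays[file_name] = total
    return delays


if __name__ == "__main__":
    line = (
        "2/13/2006 9:59:01 AM|rosetta@home|Backing off 3 hours, 57 minutes, "
        "and 6 seconds on upload of file BARCODE_30_1acf__299_23614_0_0"
    )
    print(parse_backoffs(line))  # {'BARCODE_30_1acf__299_23614_0_0': 14226}
```

Because the function scans with `finditer`, it works whether the log has one entry per line or has been pasted as a single run of text.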
arklms Send message Joined: 17 Dec 05 Posts: 7 Credit: 177,488 RAC: 0 |
FAST_ABINITIO_DEFAULT_256bA_306_1050 1 1%, 9 hours. |
stonnee Send message Joined: 3 Dec 05 Posts: 4 Credit: 31,283 RAC: 0 |
PRODUCTION_ABINITIO_1dhn__250_1151_1, WU 5694061: I noticed it was at around 14.5 hours and 97.5%, and then it had a client error. The 3 other computers running this WU all had errors. I don't know if it was stuck at 1%. |
Carlos_Pfitzner Send message Joined: 22 Dec 05 Posts: 71 Credit: 138,867 RAC: 0 |
Errors on my PCs for yesterday, 14 Feb 2006:
11370234 9223435 14 Feb 2006 21:34:40 UTC 14 Feb 2006 22:54:44 UTC Over Client error Downloading 0.00 0.00
11323177 9113761 14 Feb 2006 16:47:14 UTC 15 Feb 2006 0:51:59 UTC Over Client error Computing 1,218.44 2.90
11271660 9138765 14 Feb 2006 11:32:16 UTC 14 Feb 2006 11:42:49 UTC Over Client error Downloading 0.00 0.00
---
Details for error computing 11323177
Name: FAST_ABINITIO_DEFAULT_1fkb__306_3546_1
Workunit: 9113761
Created: 14 Feb 2006 8:52:21 UTC
Sent: 14 Feb 2006 16:47:14 UTC
Received: 15 Feb 2006 0:51:59 UTC
Server state: Over
Outcome: Client error
Client state: Computing
Exit status: -1073741819 (0xc0000005)
Computer ID: 118809
Report deadline: 21 Feb 2006 16:47:14 UTC
CPU time: 1218.4375
stderr out:
<core_client_version>5.3.2</core_client_version>
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x00739840 write attempt to address 0x06DF3010
Exiting...
No heartbeat from core client for 31 sec - exiting
***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x005005D1 read attempt to address 0x106E7154
Exiting...
</stderr_txt>
Validate state: Invalid
Claimed credit: 2.9009510878005
Granted credit: 0
Application version: 4.81 |
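The exit status in this report appears twice, as -1073741819 and as 0xc0000005. These are the same 32-bit value read as signed and unsigned, and 0xC0000005 is the Windows access-violation status code named in the stderr text. A minimal Python sketch of the conversion (the helper name `to_unsigned32` is only illustrative):

```python
def to_unsigned32(status: int) -> int:
    """Reinterpret a signed 32-bit exit status as an unsigned 32-bit value."""
    return status & 0xFFFFFFFF


if __name__ == "__main__":
    # Exit status copied from the result details above.
    print(hex(to_unsigned32(-1073741819)))  # 0xc0000005
```

Masking with 0xFFFFFFFF adds 2^32 to a negative value: 4294967296 - 1073741819 = 3221225477, which is 0xC0000005.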
Steve Shedroff Send message Joined: 7 Nov 05 Posts: 11 Credit: 250,657 RAC: 0 |
I have had a large number of downloads freeze and keep data from flowing, so work has stopped. Most have the "fasta" designation in their name. I just aborted about 20 downloads; each took two or more aborts to actually kill. I was getting Error 500 and Error 505 messages from BOINC. Any idea what I may have set wrong that might be causing this? I saved a portion of the message log if anyone wants to see the communication thread. The work is on a laptop that moves from connection to connection, some with a proxy and some without; I manually change the proxy setting to fit the location. I have been running BOINC for some time now, 10,451 WUs on this computer so far. This started happening this week. |
Marky-UK Send message Joined: 1 Nov 05 Posts: 73 Credit: 1,689,495 RAC: 0 |
This WU 9284726 was stuck at 1% after 10 hours. |
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=10041959 ?????????????????????? |
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
|
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
WU 8428792 stuck for a couple of days, under Linux, until I noticed and killed the task: |
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
|
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
PRODUCTION_ABINITIO_DBFLAGS_BARCODE10_2vik__308_1421_0 stuck at 23.08% for over a day. Aborting it. rosetta 4.79 on Mac OS X 10.3.9 |
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
Another possible cause is when the CPDN controlling process hadsm3_* is killed, leaving the worker process hadsm3um_* running. The Science Application (a.k.a. "worker") can only be killed using Task Manager or by a reboot. Not running the BOINC screensaver? Hmm, then it seems likely that some part of Rosetta isn't being killed when switching, and that is causing the error. I wonder if this is part of the problems Ralph is looking to find? I don't know much about Rosetta's processes/app. Sorry. tony |
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
Another possible cause is when the CPDN controlling process hadsm3_* is killed, leaving the worker process hadsm3um_* running. The Science Application (a.k.a. "worker") can only be killed using Task Manager or by a reboot. No switching, only running R@H 24/7. |
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
Another WU apparently crashed and got stuck under Linux (2.4.27, Debian Sarge stable): 9469195. This machine has "leave in memory" = Yes. It has been shared between 6 other BOINC projects for >1 month. Only Rosetta 4.80 has problems with getting stuck; the prior v4.2 (HPF/WCG) never had a problem. |
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
A quick update on WU 9469195, mentioned in my prior message. I killed the Rosetta 4.80 task ($ kill <pid>) and BOINC re-ran the same WU, this time successfully, to completion. Probably the only change was the random seed. The stderr.txt shown in the resultid contains the contents of the previous, unsuccessful and eventually hung run attempt (with the previous random seed), which I had copied here in my previous post. Btw, should I take the time to report this stuff? Is anyone looking at this? |
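Several reports in this thread were only noticed after hours or days. As a rough sketch, and assuming that a hung task stops writing to its output file (which this thread does not confirm; one later report says the content kept changing), a stale modification time could be used as a hint that a task deserves a look before resorting to `kill <pid>`. The function name `is_stale` and the file path in the usage line are hypothetical, not part of BOINC or Rosetta.

```python
import os
import time


def is_stale(path, max_idle_seconds, now=None):
    """Return True if `path` was last modified more than max_idle_seconds ago."""
    # `now` can be passed in for testing; by default the current time is used.
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) > max_idle_seconds


if __name__ == "__main__":
    # Hypothetical path; substitute the output file of the task being watched.
    candidate = "slots/0/stdout.txt"
    if os.path.exists(candidate) and is_stale(candidate, 2 * 3600):
        print(candidate, "has not changed for over 2 hours")
```

This only flags a candidate; whether to kill the task is still a judgment call, since a long-running model can also be quiet for a while.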
cloaked_chaos Send message Joined: 9 Nov 05 Posts: 14 Credit: 80,818 RAC: 0 |
This WU took 165 hours before it finally decided that it was running for too long. I would really like to receive credit for this since it is 2,175.86 credit. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=8103040 |
Rebirther Send message Joined: 17 Sep 05 Posts: 116 Credit: 41,315 RAC: 0 |
PRODUCTION_ABINITIO_INCREASECYCLES50_1ten__312_1196_0: 1% after 4 h, 2 million steps again and again. Last entry in stdout:
Starting score3 moves...
kk,score3,low_score,rms_err,low_rms,rms_min,naccept
0 -61.034 -61.034 11.848 11.848 8.788 15290
converged 2.07775331 108316
converged 2.71214509 112159
converged 2.55168295 125540
converged 2.40547872 129158
converged 1.95232618 132867
converged 2.9595387 137826
converged 2.75581789 140668
converged 2.2488966 144434
converged 2.80967975 158799
converged 2.50006342 162126
converged 2.39554954 169710
converged 2.04850674 183329
converged 1.99719334 187299
1 -40.606 -78.377 11.668 12.506 8.788 20134
converged 2.95864868 138902
converged 2.02392745 173508
converged 2.22490144 324940
2 -12.235 -78.377 8.579 12.506 6.717 26539
converged 2.71321511 126774
converged 1.66896379 159099
The timestamp of the stdout file is no longer updated, but its content is, and it is still at model 1! Restarting BOINC didn't solve the problem, only a new random seed. What can I do? |
Team TMR Send message Joined: 2 Nov 05 Posts: 21 Credit: 1,583,679 RAC: 0 |
Another one, 9442770 Over 8 hours in and still stuck on 1%. It's running rosetta 4.82 too, so I guess that didn't fix the 1% problem then. Max CPU setting is 2 hours. |
Team TMR Send message Joined: 2 Nov 05 Posts: 21 Credit: 1,583,679 RAC: 0 |
Well it finished eventually, at 8hr 39mins. But it never did get off 1% as far as I could see. |
©2025 University of Washington
https://www.bakerlab.org