Report stuck & aborted WU here please

Rossmor35

Joined: 24 Sep 05
Posts: 4
Credit: 84,870
RAC: 0
Message 10674 - Posted: 11 Feb 2006, 20:41:15 UTC
Last modified: 11 Feb 2006, 20:49:05 UTC

This WU was stuck at 20% for 5 hrs. Aborted, as neither the graphics nor the step count was moving.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=8130938




ID: 10674 · Rating: 0
Honza

Joined: 18 Sep 05
Posts: 48
Credit: 173,517
RAC: 0
Message 10696 - Posted: 12 Feb 2006, 17:03:20 UTC

Mercifully killed WU after ~70 hours on a Pentium D; the other result ID (also Pentium D) went fine.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=8363123
ID: 10696 · Rating: 0
casio7131

Joined: 10 Oct 05
Posts: 35
Credit: 149,748
RAC: 0
Message 10726 - Posted: 13 Feb 2006, 12:38:18 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=10302640
PRODUCTION_ABINITIO_CENTROID_PACKING_2ci2I_301_2380_0

Was stuck at 1% after ~30 hours. I restarted BOINC, and it's now at 20% after 21 min. Computer is a dual P3 933.
ID: 10726 · Rating: 0
Rebel Alliance

Joined: 4 Nov 05
Posts: 50
Credit: 3,579,531
RAC: 0
Message 10735 - Posted: 13 Feb 2006, 16:01:52 UTC

2/13/2006 9:59:01 AM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/1ac/BARCODE_30_1acf__299_23614_0_0 2182 bytes != offset 0 bytes
2/13/2006 9:59:01 AM|rosetta@home|Temporarily failed upload of BARCODE_30_1acf__299_23614_0_0: transient upload error
2/13/2006 9:59:01 AM|rosetta@home|Backing off 3 hours, 57 minutes, and 6 seconds on upload of file BARCODE_30_1acf__299_23614_0_0
2/13/2006 9:59:07 AM|rosetta@home|Started upload of BARCODE_30_1tig__299_23625_0_0
2/13/2006 9:59:10 AM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/371/BARCODE_30_1tig__299_23625_0_0 1948 bytes != offset 0 bytes
2/13/2006 9:59:10 AM|rosetta@home|Temporarily failed upload of BARCODE_30_1tig__299_23625_0_0: transient upload error
2/13/2006 9:59:10 AM|rosetta@home|Backing off 3 hours, 17 minutes, and 10 seconds on upload of file BARCODE_30_1tig__299_23625_0_0
2/13/2006 9:59:18 AM|rosetta@home|Started upload of BARCODE_30_1bm8__299_23283_2_0
2/13/2006 9:59:21 AM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/1b4/BARCODE_30_1bm8__299_23283_2_0 722 bytes != offset 0 bytes
2/13/2006 9:59:21 AM|rosetta@home|Temporarily failed upload of BARCODE_30_1bm8__299_23283_2_0: transient upload error
2/13/2006 9:59:21 AM|rosetta@home|Backing off 39 minutes and 49 seconds on upload of file BARCODE_30_1bm8__299_23283_2_0
2/13/2006 9:59:28 AM|rosetta@home|Started upload of BARCODE_30_1tig__299_26551_0_0
2/13/2006 9:59:31 AM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/15b/BARCODE_30_1tig__299_26551_0_0 488 bytes != offset 0 bytes
2/13/2006 9:59:31 AM|rosetta@home|Temporarily failed upload of BARCODE_30_1tig__299_26551_0_0: transient upload error
2/13/2006 9:59:31 AM|rosetta@home|Backing off 2 hours, 1 minutes, and 35 seconds on upload of file BARCODE_30_1tig__299_26551_0_0
2/13/2006 9:59:38 AM|rosetta@home|Started upload of BARCODE_30_4ubpA_299_26658_0_0
2/13/2006 9:59:41 AM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/1ac/BARCODE_30_4ubpA_299_26658_0_0 722 bytes != offset 0 bytes
2/13/2006 9:59:41 AM|rosetta@home|Temporarily failed upload of BARCODE_30_4ubpA_299_26658_0_0: transient upload error
2/13/2006 9:59:41 AM|rosetta@home|Backing off 51 minutes and 42 seconds on upload of file BARCODE_30_4ubpA_299_26658_0_0
2/13/2006 9:59:48 AM|rosetta@home|Started upload of BARCODE_30_1iibA_299_26685_0_0
2/13/2006 9:59:50 AM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/4f/BARCODE_30_1iibA_299_26685_0_0 722 bytes != offset 0 bytes
2/13/2006 9:59:50 AM|rosetta@home|Temporarily failed upload of BARCODE_30_1iibA_299_26685_0_0: transient upload error
2/13/2006 9:59:50 AM|rosetta@home|Backing off 3 hours, 31 minutes, and 31 seconds on upload of file BARCODE_30_1iibA_299_26685_0_0
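For anyone wanting to summarise a log like the one above, here is a minimal Python sketch that pulls the backoff time per file out of the "Backing off ... on upload of file ..." lines. It assumes the line layout is exactly as pasted in this post; the function names `duration_to_seconds` and `upload_backoffs` are made up for illustration and are not part of BOINC.

```python
import re

# Pattern based only on the log lines quoted in this post (an assumption
# about the general format, not a documented BOINC log specification).
BACKOFF_RE = re.compile(
    r"Backing off (?P<duration>.+?) on upload of file (?P<name>\S+)"
)
UNIT_SECONDS = {"hour": 3600, "minute": 60, "second": 1}


def duration_to_seconds(text):
    """Convert e.g. '3 hours, 57 minutes, and 6 seconds' to seconds."""
    total = 0
    for amount, unit in re.findall(r"(\d+) (hour|minute|second)s?", text):
        total += int(amount) * UNIT_SECONDS[unit]
    return total


def upload_backoffs(log_lines):
    """Return {file name: backoff in seconds} for every backoff line."""
    result = {}
    for line in log_lines:
        match = BACKOFF_RE.search(line)
        if match:
            result[match.group("name")] = duration_to_seconds(
                match.group("duration")
            )
    return result


# Two lines copied from the log above.
sample = [
    "2/13/2006 9:59:01 AM|rosetta@home|Backing off 3 hours, 57 minutes, "
    "and 6 seconds on upload of file BARCODE_30_1acf__299_23614_0_0",
    "2/13/2006 9:59:21 AM|rosetta@home|Backing off 39 minutes and 49 "
    "seconds on upload of file BARCODE_30_1bm8__299_23283_2_0",
]
backoffs = upload_backoffs(sample)
for name, seconds in backoffs.items():
    print(f"{name}: retry in {seconds} s")
```

This only reads the log text; it does not change how or when BOINC retries the uploads.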

ID: 10735 · Rating: 0
arklms

Joined: 17 Dec 05
Posts: 7
Credit: 177,488
RAC: 0
Message 10757 - Posted: 14 Feb 2006, 19:45:53 UTC

FAST_ABINITIO_DEFAULT_256bA_306_1050 1
1%, 9 hours.
ID: 10757 · Rating: 0
stonnee

Joined: 3 Dec 05
Posts: 4
Credit: 31,283
RAC: 0
Message 10773 - Posted: 15 Feb 2006, 14:17:36 UTC


PRODUCTION_ABINITIO_1dhn__250_1151_1

WU 5694061: I noticed it was at 97.5% after around 14.5 hours, and then it had a client error.

The 3 other computers running this WU all had errors.

I don't know if it was stuck at 1%.

ID: 10773 · Rating: 0
Carlos_Pfitzner
Joined: 22 Dec 05
Posts: 71
Credit: 138,867
RAC: 0
Message 10774 - Posted: 15 Feb 2006, 14:28:48 UTC

Errors on my PCs for yesterday, 14 Feb 2006:

Result ID  Workunit  Sent                      Reported                  Server state  Outcome       Client state  CPU time (s)  Claimed credit
11370234   9223435   14 Feb 2006 21:34:40 UTC  14 Feb 2006 22:54:44 UTC  Over          Client error  Downloading   0.00          0.00
11323177   9113761   14 Feb 2006 16:47:14 UTC  15 Feb 2006 0:51:59 UTC   Over          Client error  Computing     1,218.44      2.90
11271660   9138765   14 Feb 2006 11:32:16 UTC  14 Feb 2006 11:42:49 UTC  Over          Client error  Downloading   0.00          0.00 ---

Details for error computing
11323177
Name FAST_ABINITIO_DEFAULT_1fkb__306_3546_1
Workunit 9113761
Created 14 Feb 2006 8:52:21 UTC
Sent 14 Feb 2006 16:47:14 UTC
Received 15 Feb 2006 0:51:59 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -1073741819 (0xc0000005)
Computer ID 118809
Report deadline 21 Feb 2006 16:47:14 UTC
CPU time 1218.4375
stderr out <core_client_version>5.3.2</core_client_version>
<message> - exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x00739840 write attempt to address 0x06DF3010

Exiting...
No heartbeat from core client for 31 sec - exiting

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x005005D1 read attempt to address 0x106E7154

Exiting...

</stderr_txt>


Validate state Invalid
Claimed credit 2.9009510878005
Granted credit 0
application version 4.81
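The exit status above is shown twice, as -1073741819 and as 0xc0000005. These are the same 32-bit value, once read as a signed integer and once as unsigned hex. A small Python sketch of the conversion (the helper name `to_unsigned32` is just illustrative):

```python
def to_unsigned32(code):
    """Reinterpret a signed 32-bit integer as its unsigned value."""
    return code & 0xFFFFFFFF


exit_status = -1073741819  # value from the result details above
hex_code = f"0x{to_unsigned32(exit_status):08x}"
print(hex_code)  # 0xc0000005
```

0xc0000005 is the code the stderr output itself labels as "Access Violation", so the two numbers in the report describe one and the same failure.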


ID: 10774 · Rating: 0
Steve Shedroff

Joined: 7 Nov 05
Posts: 11
Credit: 250,657
RAC: 0
Message 10775 - Posted: 15 Feb 2006, 14:58:00 UTC

I have had a large number of downloads freeze and keep data from flowing, so work has stopped. Most have the "fasta" designation in their name. I just aborted about 20 downloads. Each took two or more aborts to actually kill. I was getting Error 500 and Error 505 messages from BOINC. Any idea what I may have set wrong that might be causing this? I saved a portion of the message log if anyone wants to see the communication thread.

Work is on a laptop that moves from connection to connection, some with a proxy and some without. I manually change the proxy setting to fit the location. I have been running BOINC for some time now, 10,451 WU on this computer so far. This started happening this week.
ID: 10775 · Rating: 0
Marky-UK

Joined: 1 Nov 05
Posts: 73
Credit: 1,689,495
RAC: 0
Message 10814 - Posted: 16 Feb 2006, 16:43:18 UTC
Last modified: 16 Feb 2006, 16:45:18 UTC

This WU 9284726 was stuck at 1% after 10 hours.
ID: 10814 · Rating: 0
Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 10826 - Posted: 16 Feb 2006, 22:40:35 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=10041959

??????????????????????
ID: 10826 · Rating: 0
Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 10834 - Posted: 17 Feb 2006, 3:31:06 UTC - in response to Message 10826.  

https://boinc.bakerlab.org/rosetta/result.php?resultid=10041959

??????????????????????

Well, it seems you have found a new error message (at least to me). I will report this to the project. If they have an answer I will post it here.
Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 10834 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 10839 - Posted: 17 Feb 2006, 6:20:57 UTC
Last modified: 17 Feb 2006, 6:23:44 UTC




ID: 10839 · Rating: 0
Dimitris Hatzopoulos

Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 10895 - Posted: 18 Feb 2006, 17:29:01 UTC

WU 8428792 stuck for a couple of days, under Linux, until I noticed and killed the task:


$ cat stderr.txt
[0x87074eb]
[0x871f4bc]
[0x8785188]
[0x879fb6c]
[0x87a143d]
[0x8770107]
[0x8771ba1]
[0x807b75c]
[0x83fc9ce]
[0x83fd773]
[0x840b53d]
[0x840d412]
[0x86a4bfa]
[0x85b2c98]
[0x85b45f4]
[0x83ca2af]
[0x83cc2cf]
[0x877e534]
[0x8048121]

$ tail stdout.txt
2 42.168 25.526 13.542 13.658 13.009 1295
3 29.464 20.977 13.430 14.201 13.009 1464
smooth trials: 8000 accepts: 356 %: 4.45
standard trials: 4000 accepts: 178 %: 4.45
-----------------------------------------------------
-----------------------------------------------------
CYCLES::number is 1 x total_residue: 103
initializing full atom coordinates
starting score 2966.06836 rms 14.2007523
starting full atom minimization

$ tail boinc.log
2006-02-18 19:15:01 [rosetta@home] Result BARCODE_30_1iibA_299_21962_1 exited with zero status but no 'finished' file
2006-02-18 19:15:01 [rosetta@home] If this happens repeatedly you may need to reset the project.
2006-02-18 19:15:01 [---] request_reschedule_cpus: process exited
2006-02-18 19:15:01 [rosetta@home] Restarting result BARCODE_30_1iibA_299_21962_1 using rosetta version 480
2006-02-18 19:15:05 [rosetta@home] Unrecoverable error for result BARCODE_30_1iibA_299_21962_1 (process exited with code 131 (0x83))
2006-02-18 19:15:05 [rosetta@home] Unrecoverable error for result BARCODE_30_1iibA_299_21962_1 (process exited with code 131 (0x83))
2006-02-18 19:15:05 [---] request_reschedule_cpus: process exited
2006-02-18 19:15:05 [rosetta@home] Computation for result BARCODE_30_1iibA_299_21962_1 finished
2006-02-18 19:15:05 [rosetta@home] Starting result PRODUCTION_ABINITIO_QUADRUPLELONGRANGEANTIPARALLEL_5croA_311_83_0 using rosetta version 480

Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 10895 · Rating: 0
Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 10903 - Posted: 18 Feb 2006, 20:36:05 UTC - in response to Message 10839.  


ID: 10903 · Rating: 0
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 11,892,963
RAC: 4,851
Message 10905 - Posted: 18 Feb 2006, 21:23:08 UTC

PRODUCTION_ABINITIO_DBFLAGS_BARCODE10_2vik__308_1421_0 stuck at 23.08% for over a day. Aborting it.

rosetta 4.79 on Mac OS X 10.3.9
ID: 10905 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 10906 - Posted: 18 Feb 2006, 21:24:31 UTC - in response to Message 10903.  
Last modified: 18 Feb 2006, 21:25:18 UTC

Another possible cause is when the CPDN controlling process hadsm3_* is killed, leaving the worker process hadsm3um_* running. The Science Application (a.k.a. "worker") can only be killed using task manager or by a reboot.

I'm not using graphics at all running R@H.
And the other thing doesn't ring a bell to me.

Not running the BOINC screensaver? Hmm, then it seems likely that some part of Rosetta isn't being killed when switching, and that is causing the error. I wonder if this is part of the problems Ralph is looking to find? I don't know much about Rosetta's processes/app.

sorry

tony
ID: 10906 · Rating: 0
Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 10914 - Posted: 18 Feb 2006, 23:03:52 UTC - in response to Message 10906.  

Another possible cause is when the CPDN controlling process hadsm3_* is killed, leaving the worker process hadsm3um_* running. The Science Application (a.k.a. "worker") can only be killed using task manager or by a reboot.

I'm not using graphics at all running R@H.
And the other thing doesn't ring a bell to me.

Not running the BOINC screensaver? Hmm, then it seems likely that some part of Rosetta isn't being killed when switching, and that is causing the error. I wonder if this is part of the problems Ralph is looking to find? I don't know much about Rosetta's processes/app.

sorry

tony

No switching, only running R@H 24/7.

ID: 10914 · Rating: 0
Dimitris Hatzopoulos

Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 10915 - Posted: 19 Feb 2006, 2:29:17 UTC

Another WU apparently crashed & stuck under Linux (2.4.27 Debian Sarge Stable), 9469195

This machine has "leave in memory"=Yes. It has been shared between 6 other BOINC projects for >1 month. Only Rosetta 4.80 has problems with getting stuck; the prior v4.2 (HPF/WCG) never had a problem.


$ cat stderr.txt
[0x87074eb]
[0x871f4bc]
[0x8785188]
[0x86fed46]
[0x8658186]
[0x8659ab1]
[0x865df17]
[0x860c8d4]
[0x86a4e17]
[0x8276df5]
[0x83ca4ac]
[0x83cc2cf]
[0x877e534]
[0x8048121]
*** glibc detected *** corrupted double-linked list: 0x088ff700 ***
SIGSEGV: segmentation violationStack trace (14 frames):

Exiting...
[0x87074eb]
[0x871f4bc]
[0x8785188]
[0x8785674]
[0x879a4c6]
[0x879ee8d]
[0x879f463]
[0x879f8bf]
[0x8770365]
[0x87700d1]
[0x84baeb1]
[0x8785b7f]
[0x8707536]
[0x871f4bc]
[0x8785188]
[0x86fed46]
[0x8658186]
[0x8659ab1]
[0x865df17]
[0x860c8d4]
[0x86a4e17]
[0x8276df5]
[0x83ca4ac]
[0x83cc2cf]
[0x877e534]
[0x8048121]

$ tail stdout.txt
Starting score3 moves...
kk,score3,low_score,rms_err,low_rms,rms_min,naccept
0 -20.144 -20.144 17.448 17.448 11.055 4370
1 -40.180 -57.881 13.942 14.983 11.055 4965

cmd
command executed: rosetta_4.80_i686-pc-linux-gnu xx 1tit _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -antiparallel_weight 4.0 -nstruct 19

random seed: 1572401
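The stack trace here and the one reported for WU 8428792 earlier in this thread end with some of the same addresses. A rough Python sketch for checking that kind of overlap between two stderr.txt dumps follows. The frame format "[0x...]" is taken from the pastes in this thread; the function names are hypothetical, and without symbols the addresses only show that both runs went through the same places, not what those places are.

```python
import re

# One bracketed hex address per line, as in the stderr.txt pastes.
FRAME_RE = re.compile(r"^\[(0x[0-9a-fA-F]+)\]$")


def extract_frames(stderr_text):
    """Return the addresses from '[0x...]' lines, in file order."""
    frames = []
    for line in stderr_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(int(match.group(1), 16))
    return frames


def common_trailing_frames(trace_a, trace_b):
    """Return the frames both traces share at their end, in order."""
    shared = []
    for a, b in zip(reversed(trace_a), reversed(trace_b)):
        if a != b:
            break
        shared.append(a)
    return list(reversed(shared))


# Last four frames of the first trace from each of the two posts.
earlier_trace = extract_frames("[0x83ca2af]\n[0x83cc2cf]\n[0x877e534]\n[0x8048121]")
this_trace = extract_frames("[0x83ca4ac]\n[0x83cc2cf]\n[0x877e534]\n[0x8048121]")

shared = common_trailing_frames(earlier_trace, this_trace)
print([hex(frame) for frame in shared])
```

Lines that are not a bare bracketed address (such as the glibc and SIGSEGV messages) are skipped by `extract_frames`.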

ID: 10915 · Rating: 0
Dimitris Hatzopoulos

Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 10949 - Posted: 19 Feb 2006, 17:55:09 UTC

A quick update on WU 9469195 mentioned in prior message.

I killed the Rosetta 4.80 task ($ kill <pid>) and BOINC re-ran the same WU, this time successfully, to completion. The only change was probably the random seed.

The stderr.txt shown for the result ID contains the output of the previous, unsuccessful and eventually hung run attempt (with the previous random seed), which I had copied here in my previous post.
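Since hung tasks like this one can go unnoticed for days, one possible way to spot them is to check whether the task's stdout.txt has stopped changing. The Python sketch below is only that, a sketch: it assumes stdout.txt is written to regularly while a task is healthy (not confirmed anywhere in this thread), and the function name and thresholds are made up for illustration.

```python
import os
import tempfile
import time


def is_stalled(path, max_idle_seconds, now=None):
    """True if `path` was last modified more than max_idle_seconds ago."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) > max_idle_seconds


# Demo on a temporary file whose modification time is set 6 hours back,
# standing in for a stdout.txt that has gone quiet.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name
six_hours_ago = time.time() - 6 * 3600
os.utime(demo_path, (six_hours_ago, six_hours_ago))

stalled = is_stalled(demo_path, max_idle_seconds=4 * 3600)  # 4 h limit
fresh = is_stalled(demo_path, max_idle_seconds=12 * 3600)   # 12 h limit
os.remove(demo_path)
print(stalled, fresh)
```

A check like this only flags a candidate; deciding to kill the process is still a manual step, as described above.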

Btw, should I take the time to report this stuff? Is anyone looking at this?
ID: 10949 · Rating: 0
cloaked_chaos

Joined: 9 Nov 05
Posts: 14
Credit: 80,818
RAC: 0
Message 10964 - Posted: 19 Feb 2006, 20:26:44 UTC
Last modified: 19 Feb 2006, 20:27:35 UTC

This WU took 165 hours before it finally decided that it had been running for too long. I would really like to receive credit for this, since it is 2,175.86 credits.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=8103040
ID: 10964 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org