Report stuck & aborted WU here please

Message boards : Number crunching : Report stuck & aborted WU here please

Moderator8
Volunteer moderator
Project administrator
Joined: 10 Jan 06
Posts: 16
Credit: 0
RAC: 0
Message 8741 - Posted: 10 Jan 2006, 22:46:51 UTC
Last modified: 10 Jan 2006, 22:56:56 UTC

This thread replaces the previous "stuck WU" and "please abort" threads, which were getting a little long and unwieldy.

Existing postings have been left in the original threads; please make direct replies to those postings there.

Thanks to everyone for the continued reports of bugs and suspicious events.

bruce boytler
Joined: 17 Sep 05
Posts: 68
Credit: 3,565,442
RAC: 0
Message 8748 - Posted: 11 Jan 2006, 0:42:09 UTC

This workunit was stuck at 1 percent for over 3 hours of CPU time. It was also recently created.


NO_BARCODE_FRAGS_2reb_227_9206

Marc Miller
Joined: 30 Nov 05
Posts: 2
Credit: 18,163
RAC: 0
Message 8749 - Posted: 11 Jan 2006, 0:42:31 UTC

Must "Leave applications in memory while preempted?" still be set to yes? My corporate desktop is on 24/7 and could be useful for projects like this, but it is short on memory.

BOINC 5.2.13
Windows XP
Rosetta 4.81

Marc Miller
Joined: 30 Nov 05
Posts: 2
Credit: 18,163
RAC: 0
Message 8750 - Posted: 11 Jan 2006, 0:48:51 UTC - in response to Message 8749.  

[quote]Must "Leave applications in memory while preempted?" still be set to yes? My corporate desktop is on 24/7 and could be useful for projects like this, but it is short on memory.

BOINC 5.2.13
Windows XP
Rosetta 4.81[/quote]


...and my failed work units ("Unrecoverable error - exit code -1073741819 (0xc0000005)") include:
NO_SIM_ANNEAL_1ogw_228_9701_0
MORE_FRAGS_W_BARCODE_1ogw_231_7263_0
NO_BARCODE_FRAGS_1b72_227_9863_0
NO_BARCODE_FRAGS_1dtj_227_9288_1
NO_RAND_WTS_1b72_230_7636_0
NO_RANDOM_WTS_OR_FRAGS_1b72_223_9475_1
MORE_FRAGS_W_BARCODE_1ogw_231_8193_0
NO_RAND_WTS_1mky_230_8461_0
INCREASE_CYCLES_10_1di2_226_8118_2
BARCODE_FRAG_30_1di2_234_119_0
MORE_FRAGS_W_BARCODE_1dtj_231_9106_0
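
A side note on the exit code itself: the BOINC messages quoted in this thread show it as a signed 32-bit number next to its hex form, for example -1073741819 (0xc0000005) and -164 (0xffffff5c). The two are the same value read two ways. A minimal Python sketch of the conversion follows; the function names are made up for illustration and are not part of BOINC.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit exit code."""
    value &= 0xFFFFFFFF
    # If the top bit is set, the value is negative in two's complement.
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


# Exit codes that appear in reports in this thread.
for code in (-1073741819, -164, 1):
    print(f"{code} -> 0x{to_unsigned32(code):08x}")
```

So a report that lists the hex form and one that lists the negative decimal form are describing the same failure.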

godpiou
Joined: 22 Dec 05
Posts: 7
Credit: 1,373
RAC: 0
Message 8763 - Posted: 11 Jan 2006, 7:42:07 UTC

OK... This is my most recent unit that showed an error:

BARCODE_FRAG_30_1di2_234_471_0

Hope this helps!

Godpiou

godpiou
Joined: 22 Dec 05
Posts: 7
Credit: 1,373
RAC: 0
Message 8766 - Posted: 11 Jan 2006, 11:05:54 UTC

It's me again...

Another unit aborted computation with a fatal error. The unit in question is:

BARCODE_FRAG_30_2tif_234_838_0

Again, hope this helps...

Godpiou

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 8772 - Posted: 11 Jan 2006, 13:58:28 UTC - in response to Message 8749.  

[quote]Must "Leave applications in memory while preempted?" still be set to yes?[/quote]

Yes, though if Rosetta is your sole project and the box is on 24/7 you may get away with a 'no' setting. You would still risk losing a WU every time the benchmarks run.

The best advice is to try it with the setting at 'yes', and if that slows the box down, switch to running only when the machine is not in use.

I also had some success running with a 'yes' setting and with the maximum number of CPUs set to 1 on an HT box. It got 75% of the throughput of allowing two 'CPUs' to be used, but left the box as responsive to other tasks as when BOINC is not running.

If none of these work, I think the best advice at present is to go to another project until Rosetta gets this sorted. Sorry!

River~~

Steve Shedroff
Joined: 7 Nov 05
Posts: 11
Credit: 250,657
RAC: 0
Message 8773 - Posted: 11 Jan 2006, 13:58:52 UTC

I have a Work Unit that is at 1% processed after 21 hours, 24 minutes and counting. ID: NO_BARCODE_FRAGS_2reb_227_9692_0. The other Work Units I have seem to be running fine in parallel on this multi-CPU computer. Should I abort or let it fail on its own?

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 8774 - Posted: 11 Jan 2006, 14:04:13 UTC - in response to Message 8773.  

[quote]I have a Work Unit that is at 1% processed after 21 hours, 24 minutes and counting. ID: NO_BARCODE_FRAGS_2reb_227_9692_0. The other Work Units I have seem to be running fine in parallel on this multi-CPU computer. Should I abort or let it fail on its own?[/quote]

My rule of thumb is to abort if a WU sticks at the same progress for more than half the time a full-length Rosetta WU takes on the same box.

Sometimes stopping BOINC and restarting can save such a WU, but with a multi-CPU box you lose some crunch from each of the other CPUs, as each of the other Rosetta WUs will revert to its last checkpoint.

So unless Rosetta WU typically take 42 hours on your box, it is time to abort in my opinion.
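
For anyone who wants the rule of thumb above spelled out, here is a minimal Python sketch. The function name and the example run times (4 and 48 hours for a typical WU) are assumptions for illustration only; plug in whatever a full-length WU normally takes on your own box.

```python
def should_abort(stuck_hours: float, typical_wu_hours: float) -> bool:
    """Rule of thumb: abort when a WU has shown no progress for more than
    half the time a full-length WU normally takes on the same machine."""
    return stuck_hours > typical_wu_hours / 2


# 21 hours 24 minutes with no progress, as in the report above.
stuck = 21 + 24 / 60

# Hypothetical typical run times for a full-length WU on this box.
print(should_abort(stuck, typical_wu_hours=4.0))
print(should_abort(stuck, typical_wu_hours=48.0))
```

With a typical run time of a few hours the stuck WU is well past the threshold; only a box where WUs normally run for roughly two days would justify waiting longer.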

bruce boytler
Joined: 17 Sep 05
Posts: 68
Credit: 3,565,442
RAC: 0
Message 8784 - Posted: 11 Jan 2006, 15:19:04 UTC

INCREASE_CYCLES_10_1ogw_226_9787


This workunit errored out after 19000 seconds; that's over 5 hours wasted!

I have been noticing my computer will successfully compute others' failed units, and my failed units are sometimes successfully run on other computers.

Question to the project scientists: what is going on here?

I really don't have the time to babysit any of my computers. I mainly check in from time to time randomly throughout the day for a couple of minutes.

Cheers!! Hope I can get a reasonable answer from a project scientist as to why the problems of a few weeks ago still rear their ugly heads from time to time.

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 8786 - Posted: 11 Jan 2006, 15:30:12 UTC - in response to Message 8784.  

[quote]INCREASE_CYCLES_10_1ogw_226_9787

This workunit errored out after 19000 seconds; that's over 5 hours wasted!

I have been noticing my computer will successfully compute others' failed units, and my failed units are sometimes successfully run on other computers.

Question to the project scientists: what is going on here?

I really don't have the time to babysit any of my computers. I mainly check in from time to time randomly throughout the day for a couple of minutes.

Cheers!! Hope I can get a reasonable answer from a project scientist as to why the problems of a few weeks ago still rear their ugly heads from time to time.[/quote]

I don't know exactly what is going on. For each work unit, we now have close to the targeted 10,000 successful completions, so there are clearly no systematic errors affecting all instances of a WU. I would love to know how many failures of the sort you had there have been. It is possible that for certain random number seeds very rare Rosetta bugs are encountered. This would have to be at less than 1 in 100, since we don't see them in our in-house tests. So, a question: what fraction of your WUs have this problem?

We can search for Rosetta bugs by starting runs in house with the random number seed and command line from your run. We are doing this now.


carl.h
Joined: 28 Dec 05
Posts: 555
Credit: 183,449
RAC: 0
Message 8790 - Posted: 11 Jan 2006, 17:24:13 UTC

1/11/2006 13:43:17|rosetta@home|Unrecoverable error for result NO_BARCODE_FRAGS_1di2_227_8993_0 ( - exit code -1073741819 (0xc0000005))



1/11/2006 17:00:24|rosetta@home|Unrecoverable error for result DEFAULT_1n0u_218_633_9 (Incorrect function. (0x1) - exit code 1 (0x1))
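
If you are collecting several of these message lines for a report, a small script can pull out the result name and exit code. This is only a sketch: the pattern is inferred from the two lines above (timestamp, project, then the "Unrecoverable error" text, separated by "|"), and the function and field names are hypothetical, not anything BOINC provides.

```python
import re
from typing import Optional

# Pattern inferred from the two log lines quoted above; other BOINC
# versions may format these messages differently.
LINE_RE = re.compile(
    r"^(?P<timestamp>[^|]*)\|(?P<project>[^|]*)\|"
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\((?P<detail>.*?) ?- exit code (?P<code>-?\d+) "
    r"\((?P<hex>0x[0-9a-fA-F]+)\)\)$"
)


def parse_error_line(line: str) -> Optional[dict]:
    """Return the fields of an 'Unrecoverable error' line, or None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "timestamp": match.group("timestamp"),
        "project": match.group("project"),
        "result": match.group("result"),
        "detail": match.group("detail").strip(),
        "exit_code": int(match.group("code")),
        "exit_hex": match.group("hex"),
    }


lines = [
    "1/11/2006 13:43:17|rosetta@home|Unrecoverable error for result "
    "NO_BARCODE_FRAGS_1di2_227_8993_0 ( - exit code -1073741819 (0xc0000005))",
    "1/11/2006 17:00:24|rosetta@home|Unrecoverable error for result "
    "DEFAULT_1n0u_218_633_9 (Incorrect function. (0x1) - exit code 1 (0x1))",
]
for line in lines:
    print(parse_error_line(line))
```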

Not all Czech`s bounce but I`d like to try with Barbar ;-)

Make no mistake This IS the TEDDIES TEAM.

kb7rzf
Joined: 7 Oct 05
Posts: 16
Credit: 35,427
RAC: 0
Message 8823 - Posted: 12 Jan 2006, 4:45:13 UTC
Last modified: 12 Jan 2006, 4:46:04 UTC

Got one that errored out.

1/11/2006 4:18:28 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1r69_240_504_0 ( - exit code -1073741819 (0xc0000005))

[edit]

Here's the info on the WU:

stderr out <core_client_version>5.2.13</core_client_version>
<message> - exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
No heartbeat from core client for 31 sec - exiting

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x7C911E58 read attempt to address 0xBF005BE0

Exiting...

</stderr_txt>


Validate state Invalid
Claimed credit 16.5597049633772
Granted credit 0
application version 4.81
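
For comparing several such stderr dumps, the addresses in the "Access Violation" line can be extracted with a short script. A rough sketch only: the pattern matches the wording shown in the stderr above, the "write" alternative is an assumption (only "read" appears in this thread), and the function name is hypothetical.

```python
import re
from typing import Optional

# Matches the wording of the stderr line quoted above. The "write"
# alternative is assumed; only "read" appears in this thread.
AV_RE = re.compile(
    r"Access Violation \((0x[0-9a-fA-F]+)\) at address (0x[0-9a-fA-F]+) "
    r"(read|write) attempt to address (0x[0-9a-fA-F]+)"
)


def parse_access_violation(stderr_text: str) -> Optional[dict]:
    """Extract the fault details from a stderr dump, or return None."""
    match = AV_RE.search(stderr_text)
    if match is None:
        return None
    code, fault_address, access, target_address = match.groups()
    return {
        "code": int(code, 16),
        "fault_address": int(fault_address, 16),
        "access": access,
        "target_address": int(target_address, 16),
    }


stderr_text = (
    "***UNHANDLED EXCEPTION****\n"
    "Reason: Access Violation (0xc0000005) at address 0x7C911E58 "
    "read attempt to address 0xBF005BE0\n"
)
print(parse_access_violation(stderr_text))
```

Grouping reports by fault address may help show whether different WUs are crashing in the same place.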

Thanks

Jeremy

bruce boytler
Joined: 17 Sep 05
Posts: 68
Credit: 3,565,442
RAC: 0
Message 8824 - Posted: 12 Jan 2006, 6:23:12 UTC - in response to Message 8786.  




http://boinc.bakerlab.org/rosetta/workunit.php?wuid=5020591
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=4952011
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=4964562
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=4964562
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=4704097


Hi David,

Thank you for responding to my question.

Out of 185 results I have yet to compute 15, so I have 170 results computed.

Out of the 170, 21 were the ones that error out in 20 seconds.

There were 5 workunits that had a large compute time and errored out after anywhere from 2 hours to five hours plus.

Out of these five, the big one has errored out on another computer after an hour and is currently on a third computer.

The one that errored out after three hours failed on another computer after one hour. Then it was successfully completed by a third computer after four hours.

Of the two that errored after an hour and a half, one was successfully computed and the other is on another computer.

These time-consuming workunits represent 3 percent of my 170 completions since Dec 21, 2005.

I have included the five workunits addresses at the top of this post.
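
For reference, the fractions behind the figures above can be worked out from the counts given in this post (170 computed results, 21 quick errors, 5 long-running errors). A minimal Python sketch; the variable and function names are just for illustration.

```python
def failure_rate(failures: int, total: int) -> float:
    """Fraction of results that failed."""
    return failures / total


completed = 170    # results computed so far
quick_errors = 21  # errored out within about 20 seconds
long_errors = 5    # errored out after hours of compute time

print(f"long-running failures: {failure_rate(long_errors, completed):.1%}")
print(f"quick failures:        {failure_rate(quick_errors, completed):.1%}")
print(f"all failures:          {failure_rate(quick_errors + long_errors, completed):.1%}")
```

The long-running failures come to about 2.9 percent, which rounds to the 3 percent quoted, and that is above the 1 in 100 rate mentioned in the question.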


Have a Great day....despite the rain ..........Cheers!!!!!!!!!!!


[quote]It is possible that for certain random number seeds very rare Rosetta bugs are encountered. This would have to be at less than 1 in 100, since we don't see them in our in-house tests. So, a question: what fraction of your WUs have this problem?

We can search for Rosetta bugs by starting runs in house with the random number seed and command line from your run. We are doing this now.[/quote]

Andrew
Joined: 19 Sep 05
Posts: 162
Credit: 105,512
RAC: 0
Message 8887 - Posted: 13 Jan 2006, 0:21:15 UTC

If you get a stuck WU, specifically a 1% stuck WU, and want to help diagnose the problem, follow the instructions in this thread:

Help us solve the 1% bug!

godpiou
Joined: 22 Dec 05
Posts: 7
Credit: 1,373
RAC: 0
Message 8914 - Posted: 13 Jan 2006, 6:11:29 UTC

Hi !

Sorry, but... another WU aborted. Here are the details:

rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1npsA_239_837_0 ( - exit code -164 (0xffffff5c))

Again, hope this helps, and have a good day,

Godpiou

Marie Lucie
Joined: 9 Dec 05
Posts: 5
Credit: 40,616
RAC: 0
Message 8951 - Posted: 13 Jan 2006, 16:58:14 UTC

Hello, for the first time I just had a problem with a wu. Here are the messages :

13/01/2006 17:25:02|rosetta@home|Unrecoverable error for result INCREASE_CYCLES_10_1mky_208_40_8 ( - exit code -1073741819 (0xc0000005))
13/01/2006 17:25:02||request_reschedule_cpus: process exited
13/01/2006 17:25:02|rosetta@home|Computation for result INCREASE_CYCLES_10_1mky_208_40_8 finished

Hope it helps

Golden Turtle
Joined: 23 Sep 05
Posts: 34
Credit: 22,941
RAC: 0
Message 8975 - Posted: 13 Jan 2006, 23:22:20 UTC
Last modified: 13 Jan 2006, 23:24:05 UTC

ROSETTA 5.2.13. Windows XP 2Pro.
WU:- No Barcode Frags - 1di2 227 9845 0.
CPU Time = 08.29.21 Progress = 0% Time to completion = 7.44.04
Message: aborted via GUI RPC Unhandled exception Reason Access Violation [0xc0000005]
at address 0x7c910f29 read attempt to address 0x3f8f5c2d exiting.
Hope this is of use!

kevint
Joined: 8 Oct 05
Posts: 84
Credit: 2,530,451
RAC: 0
Message 8987 - Posted: 14 Jan 2006, 3:59:58 UTC

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=3537724

I assume this is a 1% bug. I have just recently started crunching Rosetta, so I am unfamiliar with much of what has gone on in the past.

This PC is a P4 Hyperthread (4 virtual CPUs). It normally averages about 2 hours per CPU. I noticed this one stuck today at just over 3 hours and still sitting around 1%. I was unaware of a bug and thought it might be something wrong with my PC, so I just aborted it.
SETI.USA


Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,400,906
RAC: 0
Message 9024 - Posted: 14 Jan 2006, 17:35:49 UTC
Last modified: 14 Jan 2006, 17:48:01 UTC

This WU here hung for over 4 hours at 70% complete. I stopped BOINC, restarted, and it completed successfully. There are some error notations and debug info in the reported result at the above link.

WU name - NO_RAND_WTS_1ogw_230_7724_0

Mac 1.4GHz Dual G4
Mac OS 10.4.3
BOINC 5.2.13

Regards
Phil

We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.



©2020 University of Washington
http://www.bakerlab.org