Report stuck & aborted WU here please

Stwato
Joined: 11 Jan 06
Posts: 150
Credit: 655,634
RAC: 0
Message 12064 - Posted: 15 Mar 2006, 19:03:21 UTC
Last modified: 15 Mar 2006, 19:08:23 UTC

I had a work unit stuck on 1% for over 3 hours. It was FA_RLXbg_hom004_1bgf_359_376_0

If another WU gets stuck should I abort it (I aborted this one, sorry if that was wrong)?

[Edit: wrote down wrong work unit, oops]

Steve Shedroff
Joined: 7 Nov 05
Posts: 11
Credit: 250,657
RAC: 0
Message 12079 - Posted: 16 Mar 2006, 1:41:03 UTC

I aborted the following WU; it appeared stuck at 1% after 44 hours, with 59 hours to go. It did not appear to be progressing.

3/15/2006 6:54:16 PM|rosetta@home|Unrecoverable error for result FA_RLXai_hom022_1ail__359_179_0 (aborted via GUI RPC)
Regards,
Steve
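
For anyone collecting these reports, the BOINC message-log lines quoted in this thread are pipe-delimited (timestamp|project|message). Below is a minimal Python sketch for pulling the result name and the parenthesized reason out of an "Unrecoverable error" line. The function name and the regular expression are illustrative assumptions based only on the lines quoted here, not a documented BOINC log format.

```python
import re

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not an official BOINC format): "Unrecoverable error for result NAME (REASON)"
ERROR_RE = re.compile(r"Unrecoverable error for result (\S+) \((.*)\)\s*$")

def parse_error_line(line):
    """Split a 'timestamp|project|message' line and extract the failed result.

    Returns a dict, or None if the line is not an unrecoverable-error line.
    """
    parts = line.strip().split("|", 2)
    if len(parts) != 3:
        return None
    timestamp, project, message = parts
    match = ERROR_RE.search(message)
    if match is None:
        return None
    return {
        "timestamp": timestamp,
        "project": project,
        "result": match.group(1),
        # Drop the leading " - " that some lines carry inside the parentheses.
        "reason": match.group(2).strip(" -"),
    }

line = ("3/15/2006 6:54:16 PM|rosetta@home|Unrecoverable error for result "
        "FA_RLXai_hom022_1ail__359_179_0 (aborted via GUI RPC)")
info = parse_error_line(line)
print(info["result"], "->", info["reason"])
```

Collecting the `result` field this way makes it easier to see whether the same work unit names keep turning up across different machines.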

Ib Rasmussen
Joined: 27 Sep 05
Posts: 16
Credit: 211,416
RAC: 0
Message 12088 - Posted: 16 Mar 2006, 8:00:13 UTC
Last modified: 16 Mar 2006, 8:04:36 UTC

This morning my machine ID 5309, which only runs rosetta, showed an empty BOINC Manager: no work, no projects.

stderrae.txt says

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x0038F114 read attempt to address 0x00000008

1: 03/15/06 22:02:49
1: SymGetLineFromAddr(): GetLastError = 126

stdoutae.txt says (with some editing)

2006-03-15 20:08:46 [rosetta@home] Starting result FA_RLXb3_hom003_1b3aA_359_477_0 using rosetta version 482
...
2006-03-15 20:36:39 [rosetta@home] Computation for result FA_RLXbk_hom026_1bk2__359_477_0 finished
2006-03-15 20:36:40 [rosetta@home] Starting result FA_RLXac_hom001_1acf__359_478_0 using rosetta version 482
...
2006-03-15 21:01:42 [---] request_reschedule_cpus: files downloaded
2006-03-15 22:02:41 [rosetta@home] Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
2006-03-15 22:02:41 [rosetta@home] Reason: To fetch work
2006-03-15 22:02:41 [rosetta@home] Requesting 1470 seconds of new work
2006-03-15 22:02:46 [rosetta@home] Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
2006-03-15 22:02:48 [rosetta@home] Started download of hom027_1ig5A.fasta.gz
2006-03-15 22:02:48 [rosetta@home] Started download of hom027_1ig5A.psipred_ss2.gz
2006-03-15 22:02:50 [rosetta@home] Finished download of hom027_1ig5A.psipred_ss2.gz
2006-03-15 22:02:50 [rosetta@home] Throughput 1687 bytes/sec
2006-03-15 22:02:50 [rosetta@home] Started download of hom027_aa1ig5A03_05.200_v1_3.gz


After restarting the manager the two wus picked up at 87% and 71%.

The last wu in the queue: FA_RLXig_hom027_1ig5a_360_73_0

/Ib

Team TMR
Joined: 2 Nov 05
Posts: 21
Credit: 1,583,679
RAC: 0
Message 12089 - Posted: 16 Mar 2006, 8:12:38 UTC

Another one: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11167215

Stuck on 1% after 12 hours.

UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,299,279
RAC: 4
Message 12113 - Posted: 16 Mar 2006, 23:42:08 UTC - in response to Message 12089.  

This WU was stuck at 1% for a day, then started "on its own", got to about 70% done, and then BOINC switched over to one of the other projects I'm running (BBC CCE). Immediately I got a "computation error" and the percentage went to 100%.

16/03/2006 23:19:42|rosetta@home|Unrecoverable error for result FA_RLXdh_hom025_1dhn__360_62_0 ( - exit code -164 (0xffffff5c))
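
The two numbers in that log line are the same exit code written two ways: -164 as a signed integer, and 0xffffff5c as the same 32 bits read as unsigned. A short Python sketch of the conversion follows; the helper names are made up for illustration.

```python
def to_unsigned_hex(code, bits=32):
    """Render a signed exit code as the two's-complement hex of the given width."""
    return hex(code & ((1 << bits) - 1))

def to_signed(value, bits=32):
    """Interpret an unsigned value of the given width as a signed integer."""
    return value - (1 << bits) if value >= 1 << (bits - 1) else value

print(to_unsigned_hex(-164))   # 0xffffff5c
print(to_signed(0xC0000005))   # -1073741819
```

The same relationship explains why the Windows access violation 0xc0000005 reported elsewhere in this thread also shows up as -1073741819.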


So, that's more wasted CPU cycles.

This project used to be bullet-proof - what's changed....?

regards,

Tim

PS - Our team with 430+ overall members (and at least 130 already joined up to Rosetta) were going to be concentrating on Rosetta for a "Crunching Weekend" on 25th-26th March.

see here: http://www.ukboincteam.org.uk/uk-boinc-team.html

If this project doesn't get sorted REAL QUICK, we'll be forced to switch our attentions to a different project....!


m.mitch
Joined: 10 Feb 06
Posts: 34
Credit: 1,928,904
RAC: 0
Message 12117 - Posted: 17 Mar 2006, 1:09:27 UTC

I have a work unit that has been stuck at 1% for 10:53:49. I just noticed it.

Does anyone want to have a look or should I just restart it?


xrobert
Joined: 28 Oct 05
Posts: 3
Credit: 168,865
RAC: 0
Message 12131 - Posted: 17 Mar 2006, 7:08:54 UTC

The WU I have has got stuck on 1%.

FA_RLXai_hom023_1aiu__359_454_0

I've now also aborted another unit which begins with ai_hom, to be safe.


David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12133 - Posted: 17 Mar 2006, 7:28:06 UTC - in response to Message 12113.  

This WU was stuck at 1% for a day, then started "on its own", got to about 70% done, and then BOINC switched over to one of the other projects I'm running (BBC CCE). Immediately I got a "computation error" and the percentage went to 100%.

16/03/2006 23:19:42|rosetta@home|Unrecoverable error for result FA_RLXdh_hom025_1dhn__360_62_0 ( - exit code -164 (0xffffff5c))


So, that's more wasted CPU cycles.

This project used to be bullet-proof - what's changed....?

regards,

Tim

PS - Our team with 430+ overall members (and at least 130 already joined up to Rosetta) were going to be concentrating on Rosetta for a "Crunching Weekend" on 25th-26th March.

see here: http://www.ukboincteam.org.uk/uk-boinc-team.html

If this project doesn't get sorted REAL QUICK, we'll be forced to switch our attentions to a different project....!




believe me, we are trying!


Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 12150 - Posted: 17 Mar 2006, 14:35:33 UTC

WU 11139882 stuck for a couple of days since 15-Mar


ps u -U boinc
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
boinc 27870 0.0 1.3 6800 2944 ? S Jan18 2:07 ./boinc
boinc 4280 81.3 0.1 141408 284 ? RN Mar15 2324:55 rosetta_4.81_i68
boinc 4281 0.0 0.1 141408 284 ? SN Mar15 0:00 rosetta_4.81_i686
boinc 4282 0.0 0.1 141408 284 ? SN Mar15 0:00 rosetta_4.81_i686

fgrep pct_ stdout.txt
BOINC :: [2006-03-15 16:43:09] :: mode: abrelax :: nstartnm: 1 :: number_of_output: 10 :: num_decoys: 0 :: pct_complete: 0.01
BOINC :: [2006-03-15 16:57:00] :: num_decoys: 1 :: number_of_output: 35 :: pct_complete: 0.0283208
BOINC :: [2006-03-15 17:09:48] :: num_decoys: 2 :: number_of_output: 36 :: pct_complete: 0.0546174


I've killed the task, let's see how it goes...
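
The `fgrep pct_` output above suggests a simple way to spot a stalled task: find the timestamp of the last `pct_complete` update and compare it with the current time. The following Python sketch does that for lines in the format shown in this post. The function names and the six-hour threshold are illustrative assumptions, not part of BOINC or Rosetta.

```python
import re
from datetime import datetime, timedelta

# Matches lines like the ones quoted above:
# BOINC :: [2006-03-15 17:09:48] :: ... :: pct_complete: 0.0546174
PCT_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\].*pct_complete:\s*([0-9.]+)"
)

def last_progress(lines):
    """Return (timestamp, fraction) of the last pct_complete line, or None."""
    latest = None
    for line in lines:
        match = PCT_RE.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            latest = (stamp, float(match.group(2)))
    return latest

def looks_stalled(lines, now, limit=timedelta(hours=6)):
    """True if no pct_complete update was logged within `limit` before `now`."""
    progress = last_progress(lines)
    return progress is not None and now - progress[0] > limit

log = [
    "BOINC :: [2006-03-15 16:43:09] :: mode: abrelax :: nstartnm: 1 :: number_of_output: 10 :: num_decoys: 0 :: pct_complete: 0.01",
    "BOINC :: [2006-03-15 16:57:00] :: num_decoys: 1 :: number_of_output: 35 :: pct_complete: 0.0283208",
    "BOINC :: [2006-03-15 17:09:48] :: num_decoys: 2 :: number_of_output: 36 :: pct_complete: 0.0546174",
]
print(last_progress(log))
print(looks_stalled(log, now=datetime(2006, 3, 17, 14, 35)))
```

A check like this only looks at logged progress, so a task that is still burning CPU (as in the `ps` listing above) would still be flagged once its last update is older than the threshold.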

Ib Rasmussen
Joined: 27 Sep 05
Posts: 16
Credit: 211,416
RAC: 0
Message 12151 - Posted: 17 Mar 2006, 14:38:27 UTC

HB_BARCODE_30_1elwA_351_8105_0 stuck for two hours. Still stuck after restart. Aborted.

Computer ID 177498.

/Ib

UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,299,279
RAC: 4
Message 12154 - Posted: 17 Mar 2006, 16:01:19 UTC - in response to Message 12133.  

This project used to be bullet-proof - what's changed....?

believe me, we are trying!



GREAT NEWS - Perhaps it might be an idea to let people know there is a problem and to stop making work available until it's fixed - that'll take the pressure off you guys.

What sort of percentage of the work returned to you is being trashed by this bug?

Would imagine it's fairly high - although, if it was an epidemic failure, would assume you would have stopped sending out work before now.

But surely you must be going into damage limitation mode by now. Can you afford to lose lots of crunchers?

regards and good luck.

Tim

anders n
Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 12155 - Posted: 17 Mar 2006, 17:30:15 UTC - in response to Message 12154.  


Divide Overflow
Joined: 17 Sep 05
Posts: 82
Credit: 921,382
RAC: 0
Message 12156 - Posted: 17 Mar 2006, 17:40:38 UTC
Last modified: 17 Mar 2006, 17:43:12 UTC

I'm getting a 1% freeze at step 22183 on this WU:

FA_RLXbq_hom030_1bq9A_359_467_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=13795176

Stopping and re-starting BOINC will cause the WU to begin again from step 0 but it always halts at the same place. Aborting unit.

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12159 - Posted: 17 Mar 2006, 19:23:32 UTC - in response to Message 12154.  
Last modified: 17 Mar 2006, 19:24:27 UTC

This project used to be bullet-proof - what's changed....?

believe me, we are trying!



GREAT NEWS - Perhaps it might be an idea to let people know there is a problem and to stop making work available until it's fixed - that'll take the pressure off you guys.

What sort of percentage of the work returned to you is being trashed by this bug?

Would imagine it's fairly high - although, if it was an epidemic failure, would assume you would have stopped sending out work before now.





The error rate is still very machine specific--many machines appear to have virtually no errors. Aside from a few faulty work units, which we've stopped, we don't think there are new errors or bugs--just the same old problems.

The exciting news is that the BOINC consultant we have hired, Rom, has made an improvement in how the rosetta process terminates that seems to really have made a difference on Ralph. The problem seems to have been not a bug in the rosetta code, but a problem in how the rosetta process shuts itself down when the processor starts doing something else (hence the "leave in memory" bug, etc.).

here is his latest email:

Public Project Failure Rate by Type (Top 3):
Exit code                                                   Count   Share
-164 (0xffffff5c) ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED    4052  39.25%
-1073741819 (0xc0000005) Unknown error number                2293  22.21%
1 Unknown error number                                       1284  12.44%

Ralph Failure Rate By Type (Top 3):
Exit code                                                   Count   Share
-186 (0xffffffffffffff46) ERR_RESULT_DOWNLOAD                  35  31.25%
1 Unknown error number                                         20  17.86%
-529697949 (0xffffffffe06d7363) Unknown error number           18  16.07%

The percentages above signify the percentage of times that exit code was given relative to the total number of non-zero exit codes given.

As you can see, the 0xc0000005's and the NESTED_UNHANDLED_EXCEPTIONS have dropped off of the top three biggest offenders.



I believe we are making progress.

----- Rom
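
Rom's percentages above can be reproduced from the raw counts. The Ralph figures are consistent with 112 non-zero exit codes in total (35/112 = 31.25%); that total is inferred from the numbers in the email, not stated in it. In this small Python sketch, the "other" bucket is a placeholder for every exit code the email does not list.

```python
from collections import Counter

# Ralph counts from the email. "other" lumps together every exit code that
# is not listed there. The total of 112 is inferred, not stated in the email.
ralph_failures = Counter({
    "-186 ERR_RESULT_DOWNLOAD": 35,
    "1": 20,
    "-529697949": 18,
    "other": 112 - (35 + 20 + 18),
})

def failure_shares(counts):
    """Percentage of each exit code relative to all non-zero exit codes."""
    total = sum(counts.values())
    return {code: round(100.0 * n / total, 2) for code, n in counts.items()}

for code, share in failure_shares(ralph_failures).items():
    print(f"{code:28s} {share:6.2f}%")
```

Because each share is relative to failures only, these figures say nothing about what fraction of all returned results fail.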








But surely you must be going into damage limitation mode by now. Can you afford to lose lots of crunchers?
NO! That is why we are investing everything in fixing the problems now.


regards and good luck.

Tim


hob.
Joined: 4 Nov 05
Posts: 64
Credit: 250,683
RAC: 0
Message 12161 - Posted: 17 Mar 2006, 20:00:03 UTC - in response to Message 12025.  

3/4/2006 6:51:19 AM|rosetta@home|Starting result ABINITli_hom010_1lis__322_14_1 using rosetta version 482


This job has been running for over 10½ days now. It's been on 84.89% for at least 24 hrs, maybe a lot longer. It's still using CPU power, so I assume it's doing something. Runtime is listed as 256½ hours so far, and still counting up.

any advice as to what to do with it would be welcome.


Restarted BOINC: the unit reset to 13 hours, then stuck again for several hours at the same point. Aborted now. IMHO not worth sending to anyone else.

46 years dc so far

BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 12163 - Posted: 17 Mar 2006, 20:14:29 UTC


Tallguy-13088
Joined: 14 Dec 05
Posts: 9
Credit: 843,378
RAC: 0
Message 12174 - Posted: 18 Mar 2006, 1:47:30 UTC
Last modified: 18 Mar 2006, 1:54:57 UTC

Hello,

I just aborted WU: FA_RLXac_hom017_1acf_359_113_0 after two BOINC Mgr. restarts and about 50+ (combined) hours of processing. All three attempts stalled at STEP: 23171. The last ACCEPTED RMSD was 14.75 and the last ACCEPTED ENERGY was 15.47414. Each attempt showed activity for about 1 minute and hung at 1% complete with the ELAPSED TIME continuing to increment. Rosetta version was 4.82 under Win2K. Just an FYI in case someone wants to debug the workunit.

UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,299,279
RAC: 4
Message 12207 - Posted: 18 Mar 2006, 21:46:39 UTC - in response to Message 12159.  
Last modified: 18 Mar 2006, 22:07:36 UTC

The exciting news is that the BOINC consultant we have hired, Rom, has made an improvement in how the rosetta process terminates that seems to really have made a difference on Ralph. The problem seems to have been not a bug in the rosetta code, but a problem in how the rosetta process shuts itself down when the processor starts doing something else (hence the "leave in memory" bug, etc.).


Hi David,

Well that ties in with an observation I can make, which I've seen once or twice.

I have noticed that Rosetta work units "fail" when my 3GHz P4/HT switches from one project to another, so there seem to be issues when the Rosetta process is suspended by BOINC as it switches over to another project. (I'm running BBC CCE as a second simultaneous BOINC project on the same PC; this "switch-over problem" tends to occur when one "CPU" stops working on a Rosetta WU and switches over to the CCE WU.)

Maybe this is a help - but seems Rom is on the right trail.


regards

Tim

(edit) typo

Rom Walton (BOINC)
Volunteer moderator
Project developer
Joined: 17 Sep 05
Posts: 18
Credit: 40,071
RAC: 0
Message 12225 - Posted: 18 Mar 2006, 23:43:18 UTC - in response to Message 12163.  
Last modified: 18 Mar 2006, 23:43:53 UTC


So the top 3 errors on Rosetta aren't the top 3 errors on Ralph. Great news. How far down on the Rosetta list are the top 3 errors that Ralph is having?


The ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED no longer appears on the list at all, and the 0xC0000005 only accounts for 6 of the 49 errors reported in the last 24 hours.

If the data from Ralph is any indication of how the application is going to behave on the public project, it should result in a 60%-70% reduction in the error rate for the public project.

----- Rom
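
One way to read that estimate, offered as a back-of-the-envelope check and not as Rom's actual calculation: the two exit codes he mentions made up 39.25% and 22.21% of failures on the public project, so removing both entirely would cut failures by about 61%. A short Python sketch of that arithmetic, under the stated assumption that nothing else changes:

```python
# Shares of public-project failures, taken from the email quoted earlier
# in the thread.
public_share = {
    "-164 ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED": 39.25,
    "0xc0000005": 22.21,
}

# Assumption for illustration: both classes disappear completely and the
# number of failures from every other cause stays the same.
best_case_reduction = sum(public_share.values())
print(f"Best-case reduction: {best_case_reduction:.2f}%")

# On Ralph, 0xC0000005 still accounted for 6 of the 49 errors reported
# in the last 24 hours.
ralph_c5_share = 100.0 * 6 / 49
print(f"0xC0000005 share on Ralph: {ralph_c5_share:.1f}%")
```

Since 0xC0000005 has not vanished on Ralph, the real figure could land on either side of this simple sum.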

Angus
Joined: 17 Sep 05
Posts: 412
Credit: 321,053
RAC: 0
Message 12226 - Posted: 18 Mar 2006, 23:53:12 UTC
Last modified: 18 Mar 2006, 23:54:05 UTC

Isn't the biggest failure mode still the "stuck at 1%" issue? Do those even get reported as errors, outside of the forum? All the ones I have had have needed to be aborted, after failing to run. Doesn't that just show as "aborted by user", effectively hiding the scope of the problem?
Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :)



"You can't fix stupid" (Ron White)


