Report stuck & aborted WU here please

Message boards : Number crunching : Report stuck & aborted WU here please

Rom Walton (BOINC)
Volunteer moderator
Project developer

Joined: 17 Sep 05
Posts: 18
Credit: 40,071
RAC: 0
Message 12228 - Posted: 19 Mar 2006, 0:26:26 UTC - in response to Message 12226.  
Last modified: 19 Mar 2006, 0:28:42 UTC


Isn't the biggest failure mode still the "stuck at 1%" issue? Do those even get reported as errors, outside of the forum? All the ones I have had have needed to be aborted after failing to run. Doesn't that just show as "aborted by user", effectively hiding the scope of the problem?


Actually, the “1% bug” accounts for only about 5% of the overall failure cases reported per day. It is by far the biggest failure case from the community’s perspective, though, because it requires manual intervention.

0xC0000005 and the ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED errors together accounted for 60% of the reported errors per day. I tackled these first because, at the time, they seemed to be manifestations of the same fundamental problem, and they accounted for the biggest piece of the pie.

The next heavy hitter is exit code 1, which is a program-defined error. This just required that the project change its error logging from stdout to stderr so that it’ll show up in the result log reported back to the server. That work item will be finished in the next few days.

Next after that one is 0xC000000D, which has a recurring theme: stackwalker failed to initialize during a stack dump. I’ve added some extra messages to the BOINC API to try and track this one down.

Now we get to the ERR_ABORTED_VIA_GUI error; this 1% error is really nasty. Unfortunately, the pdb file was not deployed with the 4.82 release, so getting stack traces from the community while the application is stuck in its loop isn’t really doable. I have started the investigation with members of the Ralph community to try and track this down, since they have access to the pdb file for 4.93. You can track the progress being made here.

I hope this clears up some stuff for the community.
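
As a rough aid for reading the hex codes mentioned above, here is a minimal Python sketch that turns a Windows exit code into a readable label. The two names in the table are the standard NTSTATUS names for those values; the helper functions and the table itself are hypothetical and are not part of BOINC or Rosetta. Result pages sometimes show these codes as negative 32-bit numbers, so the sketch normalizes them first.

```python
# Hypothetical lookup table: only the two hex codes named in this thread.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}


def to_unsigned32(code: int) -> int:
    """Map a possibly negative 32-bit exit code to its unsigned form."""
    return code & 0xFFFFFFFF


def describe_exit_code(code: int) -> str:
    """Return the code in hex plus a name, or 'unknown' if it is not in the table."""
    unsigned = to_unsigned32(code)
    name = NTSTATUS_NAMES.get(unsigned, "unknown")
    return f"0x{unsigned:08X} ({name})"


# The same access-violation code, written as a signed 32-bit value.
print(describe_exit_code(-1073741819))
print(describe_exit_code(0xC000000D))
```

Anything not in the table, such as the program-defined exit code 1, simply comes back as "unknown".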

----- Rom
My Blog
ID: 12228 · Rating: 0
STE\/E

Joined: 17 Sep 05
Posts: 125
Credit: 4,103,208
RAC: 5
Message 12229 - Posted: 19 Mar 2006, 0:38:48 UTC

ID: 12229 · Rating: 0
Rom Walton (BOINC)
Volunteer moderator
Project developer

Joined: 17 Sep 05
Posts: 18
Credit: 40,071
RAC: 0
Message 12230 - Posted: 19 Mar 2006, 0:58:02 UTC - in response to Message 12229.  
Last modified: 19 Mar 2006, 0:59:27 UTC

Actually the “1% bug” only accounts for roughly 5% of the overall failure cases reported per day.
==========

Like you said, Rom, that's what's reported. If a lot of people are like me, we quit reporting them a long time ago. I didn't see any point in reporting them any more because it's the same thing over & over. I know I aborted at least 5 or 6 stuck 1% WUs today alone & it's like that every day ... :/


That is to say, what is reported to the server. When somebody aborts a workunit, it gets reported to the server as ERR_ABORTED_VIA_GUI.

If the workunit eventually exceeds its allocated CPU time, it is reported as ERR_RSC_LIMIT_EXCEEDED.

So unless you are resetting the project every time, I get to see it. :)
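
The reporting rules described above can be sketched as a small Python function. This is only an illustration of the three cases in this post (user abort, CPU limit exceeded, project reset); the `TaskOutcome` class, its field names, and `reported_error` are hypothetical and do not reflect how the BOINC client is actually written.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskOutcome:
    """Hypothetical summary of how a workunit ended on a volunteer's machine."""
    aborted_by_user: bool
    cpu_seconds: float
    cpu_limit_seconds: float
    project_was_reset: bool = False


def reported_error(outcome: TaskOutcome) -> Optional[str]:
    """Return the error name the server would see, or None if nothing is reported."""
    if outcome.project_was_reset:
        # Assumption taken from the post: a project reset means the server never sees it.
        return None
    if outcome.aborted_by_user:
        return "ERR_ABORTED_VIA_GUI"
    if outcome.cpu_seconds > outcome.cpu_limit_seconds:
        return "ERR_RSC_LIMIT_EXCEEDED"
    return None


print(reported_error(TaskOutcome(aborted_by_user=True, cpu_seconds=36000, cpu_limit_seconds=100000)))
print(reported_error(TaskOutcome(aborted_by_user=False, cpu_seconds=200000, cpu_limit_seconds=100000)))
```

The CPU limit values here are made up; the real limit is set per workunit by the project.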


----- Rom
My Blog
ID: 12230 · Rating: 0
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 12231 - Posted: 19 Mar 2006, 1:20:04 UTC - in response to Message 12229.  


ID: 12231 · Rating: 0
STE\/E

Joined: 17 Sep 05
Posts: 125
Credit: 4,103,208
RAC: 5
Message 12234 - Posted: 19 Mar 2006, 2:02:49 UTC
Last modified: 19 Mar 2006, 2:03:30 UTC

If you've got a system that consistently has problems with being stuck at 1% then please join Ralph and help them identify the cause.
==========

Good Idea, I'll do that as soon as I get some free time ... ;)

ID: 12234 · Rating: 0
Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 12236 - Posted: 19 Mar 2006, 3:34:02 UTC
Last modified: 19 Mar 2006, 3:49:50 UTC

This is the third stuck-at-1% WU in two days (that I know of; I happened to be on that machine at the moment) that I've aborted. It only shows 10 hours, but BOINC showed 59 hours...
stderr out:
<core_client_version>5.2.13</core_client_version>
<message>aborted via GUI RPC
</message>
<stderr_txt>
# random seed: 2739161
# cpu_run_time_pref: 36000
# cpu_run_time_pref: 36000
# random seed: 2739161

</stderr_txt>
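
For anyone who wants to pull the useful bits out of a log like the one above, here is a minimal Python sketch. The sample text is copied from this post; the `parse_result_log` function and the returned dictionary layout are hypothetical, and the regular expressions only assume the simple tag and `# key: value` layout visible here.

```python
import re

# Sample copied from the stderr output quoted in this post.
SAMPLE = """<core_client_version>5.2.13</core_client_version>
<message>aborted via GUI RPC
</message>
<stderr_txt>
# random seed: 2739161
# cpu_run_time_pref: 36000
# cpu_run_time_pref: 36000
# random seed: 2739161

</stderr_txt>
"""


def parse_result_log(text):
    """Extract the client version, the message, and '# key: value' lines from stderr_txt."""
    def tag(name):
        match = re.search(rf"<{name}>(.*?)</{name}>", text, re.DOTALL)
        return match.group(1).strip() if match else None

    fields = {}
    for line in (tag("stderr_txt") or "").splitlines():
        match = re.match(r"#\s*([^:]+):\s*(\S+)", line)
        if match:
            # Keep the first occurrence; this log repeats each line.
            fields.setdefault(match.group(1).strip(), match.group(2))
    return {
        "client_version": tag("core_client_version"),
        "message": tag("message"),
        "fields": fields,
    }


log = parse_result_log(SAMPLE)
hours = int(log["fields"]["cpu_run_time_pref"]) / 3600
print(log["client_version"], "|", log["message"], "|", hours, "hours")
```

The `cpu_run_time_pref` value of 36000 seconds works out to 10 hours.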



Join the Teddies@WCG
ID: 12236 · Rating: 0
mgabriel

Joined: 18 Sep 05
Posts: 5
Credit: 96,494
RAC: 0
Message 12257 - Posted: 19 Mar 2006, 11:34:32 UTC

Umm, how about this one: FA_RLXbq_hom019_1bq9A_359_191_0 has been running for 11 hours and is 45.13% done; the time to complete is running backwards and is now at 6:39 hours.
Also, I'm getting many computation errors on this system.
ID: 12257 · Rating: 0
vavega
Joined: 2 Nov 05
Posts: 82
Credit: 519,981
RAC: 0
Message 12259 - Posted: 19 Mar 2006, 13:50:47 UTC - in response to Message 12230.  


ID: 12259 · Rating: 0
Team TMR

Joined: 2 Nov 05
Posts: 21
Credit: 1,583,679
RAC: 0
Message 12322 - Posted: 20 Mar 2006, 10:16:41 UTC
Last modified: 20 Mar 2006, 10:19:21 UTC

ID: 12322 · Rating: 0
Team TMR

Joined: 2 Nov 05
Posts: 21
Credit: 1,583,679
RAC: 0
Message 12332 - Posted: 20 Mar 2006, 12:37:28 UTC
Last modified: 20 Mar 2006, 12:38:44 UTC

And another.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11627337

This and the 4 I mentioned below were all stuck on 1%.
ID: 12332 · Rating: 0
Larry256

Joined: 11 Nov 05
Posts: 2
Credit: 4,205,117
RAC: 2,182
Message 12335 - Posted: 20 Mar 2006, 14:06:23 UTC

Look at the errors on this one

https://boinc.bakerlab.org/rosetta/result.php?resultid=13858838


ID: 12335 · Rating: 0
sharder8
Joined: 2 Feb 06
Posts: 7
Credit: 15,648,378
RAC: 0
Message 12346 - Posted: 20 Mar 2006, 19:31:00 UTC

Someone may want to take a look at the results on this one as well, there's plenty of them and it isn't the 1% "stuck" problem.

https://boinc.bakerlab.org/rosetta/results.php?hostid=181476

I've stopped Rosetta on this machine, as it would run through a ton of jobs and client error them until it gave the message "daily quota met".

Harder
ID: 12346 · Rating: 0
dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 12353 - Posted: 20 Mar 2006, 21:59:34 UTC - in response to Message 8786.  


I don't know exactly what is going on. For each work unit, we now have close to the targeted 10,000 successful completions, so there are clearly no systematic errors affecting all instances of a WU. I would love to know how many failures of the sort you had there have been. It is possible that for certain random number seeds very rare Rosetta bugs are encountered; this would have to be at less than 1 in 100, since we don't see them in our in-house tests. So, question: what fraction of your WUs have this problem?

We can search for Rosetta bugs by starting runs in-house with the random number seed and command line from your run. We are doing this now.
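
The idea of replaying a run from its random number seed can be illustrated with a short Python sketch. This uses Python's own generator purely as a stand-in; Rosetta's random number generator and command line are not shown here, and `simulate_run` is a hypothetical placeholder for a real run. The point is only that the same seed reproduces the same sequence of random draws, which is what makes a rare failure repeatable in-house.

```python
import random


def simulate_run(seed, steps=5):
    """Stand-in for a seeded run: return the first few random draws for this seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(steps)]


# Seed value taken from a stderr log posted elsewhere in this thread.
seed_from_log = 2739161

first = simulate_run(seed_from_log)
replay = simulate_run(seed_from_log)

# Same seed, same draws: a failure tied to this seed should recur on replay.
print(first == replay)
```

A different seed gives a different sequence, which would fit a failure that shows up only on some workunits.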



David,

Many moons ago, in a different thread, you gave instructions for manually restarting a "1% stuck" WU. I've got one on one of my systems. Do you want me to restart it, and is there anything else that I can do to help identify the problem? Things like taking a snapshot of the rosetta and slots/0 folders, zipping it up and making it available for you to download, or anything else that might help.
ID: 12353 · Rating: 0
Team_Elteor_Borislavj~Intelligence
Joined: 7 Dec 05
Posts: 14
Credit: 56,027
RAC: 0
Message 12358 - Posted: 20 Mar 2006, 23:14:47 UTC

HB_BARCODE_30_1bk2__351_7729_0 is stuck! After 5 hours of crunching still at 1% :(

ID: 12358 · Rating: 0
dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 12360 - Posted: 20 Mar 2006, 23:16:27 UTC

These HB_BARCODE_30 were stuck in a slot for ~24 hours without progressing. Some had ~25min CPU time; at least one had ~55min CPU.

It's really annoying that I lost around 4 CPU-days of work because of these four.

dag

https://boinc.bakerlab.org/rosetta/result.php?resultid=14185477
https://boinc.bakerlab.org/rosetta/result.php?resultid=14184543
https://boinc.bakerlab.org/rosetta/result.php?resultid=14100307
https://boinc.bakerlab.org/rosetta/result.php?resultid=14099752
dag
--Finding aliens is cool, but understanding the structure of proteins is useful.
ID: 12360 · Rating: 1
Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 12468 - Posted: 21 Mar 2006, 22:12:55 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=14068243

Rather strange: I did not, repeat, did not abort it myself.
Didn't touch the machine; it runs on its own.
Had more of these and still don't know what happens.

ID: 12468 · Rating: 0
Team TMR

Joined: 2 Nov 05
Posts: 21
Credit: 1,583,679
RAC: 0
Message 12497 - Posted: 22 Mar 2006, 8:47:15 UTC
Last modified: 22 Mar 2006, 8:51:25 UTC

Another one, aborted after 16 hours stuck on 1%

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11717839

This one was a bit different - it was stuck on 30.19% after 8 hours. After restarting BOINC, it reset back to 38 mins CPU time and 30.19% and got stuck again.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11743501
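
A stall like the 30.19% one above would not be caught by a rule that only looks for 1%. A more general check is to see whether the progress figure has stopped changing while CPU time keeps growing. Below is a minimal Python sketch of that idea; the `stalled` function, the sample format, and the one-hour threshold are all hypothetical, and the sample numbers are only loosely modelled on this report (38 minutes is 2280 seconds, 8 hours is 28800 seconds).

```python
def stalled(samples, min_stall_seconds):
    """samples: list of (cpu_seconds, fraction_done) from one run, oldest first.

    Returns True if fraction_done has been unchanged for at least
    min_stall_seconds of CPU time at the end of the list.
    """
    if not samples:
        return False
    last_cpu, last_fraction = samples[-1]
    stall_start = last_cpu
    # Walk backwards to find when the current progress value first appeared.
    for cpu, fraction in reversed(samples):
        if fraction != last_fraction:
            break
        stall_start = cpu
    return last_cpu - stall_start >= min_stall_seconds


# Progress reaches 30.19% at 38 minutes of CPU time and never moves again.
stuck_run = [(0, 0.0), (2280, 0.3019), (10000, 0.3019), (28800, 0.3019)]
healthy_run = [(0, 0.0), (3600, 0.10), (7200, 0.20)]

print(stalled(stuck_run, min_stall_seconds=3600))
print(stalled(healthy_run, min_stall_seconds=3600))
```

Since a restart resets the CPU time to the last checkpoint, the samples would need to come from a single uninterrupted run.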

It's getting increasingly frustrating having to babysit this project all the time. Fingers crossed for those working on a fix.
ID: 12497 · Rating: 0
Team_Elteor_Borislavj~Intelligence
Joined: 7 Dec 05
Posts: 14
Credit: 56,027
RAC: 0
Message 12502 - Posted: 22 Mar 2006, 10:13:45 UTC

I'm experiencing a lot of stuck WUs with FA_RLX****
I'm now at the point where, if a WU is at 1 percent after 1 hour, I manually abort it... I want credit for my CPU time :(
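
The rule of thumb above (still at 1 percent after 1 hour, so abort) can be written down as a tiny Python check. This is only a sketch of that personal rule; the `looks_stuck` function and its default thresholds are hypothetical and are not something the BOINC client does on its own.

```python
def looks_stuck(fraction_done, cpu_seconds, max_fraction=0.01, min_cpu_seconds=3600):
    """Return True if progress is still at or below max_fraction after min_cpu_seconds."""
    return cpu_seconds >= min_cpu_seconds and fraction_done <= max_fraction


# 1% after 5 hours of crunching: flagged.
print(looks_stuck(0.01, 5 * 3600))
# 1% after only 10 minutes: too early to tell.
print(looks_stuck(0.01, 600))
# 45.13% after 11 hours: not flagged by this rule.
print(looks_stuck(0.4513, 11 * 3600))
```

This rule is deliberately narrow: a workunit frozen at some higher percentage would not be flagged by it.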
ID: 12502 · Rating: 0
bruce boytler
Joined: 17 Sep 05
Posts: 68
Credit: 3,565,442
RAC: 0
Message 12522 - Posted: 22 Mar 2006, 17:44:29 UTC

This unit stuck for 9 hours:

FA_RLXpt_hom003_1ptq__361_234

Brought up the graphics screen (I don't run any graphics) and it was all frozen, except that the CPU clock was still counting.

Resetting BOINC did no good. Ended up aborting it.

Out of 183 results I have 4 errors of the frozen or 1 to 15 percent type.

Cheers all!!!
ID: 12522 · Rating: 0
mr.kjellen

Joined: 5 Dec 05
Posts: 3
Credit: 1,226,674
RAC: 0
Message 12525 - Posted: 22 Mar 2006, 19:51:52 UTC

this one stuck at one percent.

HBLR_1.0_1dtj_332_2576

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=9848871

Had it crunching for about 350,000 sec / 1,500 credits :( Aborted it. Seems someone did crunch it eventually. No luck for me, though.
/anton
ID: 12525 · Rating: 0



©2025 University of Washington
https://www.bakerlab.org