Message boards : Number crunching : Report stuck & aborted WU here please
Author | Message |
---|---|
Rom Walton (BOINC) Volunteer moderator Project developer Send message Joined: 17 Sep 05 Posts: 18 Credit: 40,071 RAC: 0 |
Actually the “1% bug” only accounts for roughly 5% of the overall failure cases reported per day. It is by far the biggest failure case from the community's perspective, though, as it requires manual intervention. The 0xC0000005 and ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED errors together accounted for 60% of the reported errors per day. I tackled these first because, at the time, they seemed to be manifestations of the same fundamental problem and they accounted for the biggest piece of the pie. The next heaviest hitter is exit code 1, which is a program-defined error. Fixing it just required that the project change its error logging from stdout to stderr so that the message shows up in the result log reported back to the server. That work item will be finished in the next few days. After that comes 0xC000000D, whose recurring theme is that stackwalker failed to initialize during a stack dump. I’ve added some extra messages to the BOINC API to try to track this one down. Now we get to the ERR_ABORTED_VIA_GUI error; this 1% error is really nasty. Unfortunately the pdb file was not deployed with the 4.82 release, so getting stack traces from the community while the application is stuck in its loop isn’t really doable. I have started the investigation with members of the Ralph community to try to track this down, since they have access to the pdb file for 4.93. You can track the progress being made here. I hope this clears up some stuff for the community. ----- Rom My Blog |
STE\/E Send message Joined: 17 Sep 05 Posts: 125 Credit: 4,103,208 RAC: 5 |
|
Rom Walton (BOINC) Volunteer moderator Project developer Send message Joined: 17 Sep 05 Posts: 18 Credit: 40,071 RAC: 0 |
"Actually the 1% bug only accounts for roughly 5% of the overall failure cases reported per day." That is to say, 5% of what is reported to the server. When somebody aborts a workunit, it gets reported to the server as ERR_ABORTED_VIA_GUI. If the workunit eventually exceeds its allocated CPU time, it is reported as ERR_RSC_LIMIT_EXCEEDED. So unless you are resetting the project every time, I get to see it. :) ----- Rom My Blog |
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
|
STE\/E Send message Joined: 17 Sep 05 Posts: 125 Credit: 4,103,208 RAC: 5 |
If you've got a system that consistently has problems with being stuck at 1% then please join Ralph and help them identify the cause. ========== Good Idea, I'll do that as soon as I get some free time ... ;) |
Nite Owl Send message Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0 |
This is the third stuck-at-1% WU in two days that I've aborted (that I know of; I happened to be on that machine at the moment). It only shows 10 hours, but BOINC showed 59 hours... stderr out <core_client_version>5.2.13</core_client_version> Join the Teddies@WCG |
mgabriel Send message Joined: 18 Sep 05 Posts: 5 Credit: 96,494 RAC: 0 |
Umm, how about this one: FA_RLXbq_hom019_1bq9A_359_191_0. It has been running 11 hours and is 45.13% done, but the time to complete is running backwards and is now 6:39 hours. Also, I'm getting many computation errors on this system. |
vavega Send message Joined: 2 Nov 05 Posts: 82 Credit: 519,981 RAC: 0 |
|
Team TMR Send message Joined: 2 Nov 05 Posts: 21 Credit: 1,583,679 RAC: 0 |
Just aborted 4 more. Really hope this gets fixed soon, we've just wasted over 5 days of CPU time! Good luck Rom. 8.8 hours: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11584596; 18.2 hours: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11551106; 41.4 hours: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11460182; 71.0 hours: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11330309 |
Team TMR Send message Joined: 2 Nov 05 Posts: 21 Credit: 1,583,679 RAC: 0 |
And another. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11627337 This and the 4 I mentioned below were all stuck on 1%. |
Larry256 Send message Joined: 11 Nov 05 Posts: 2 Credit: 4,205,117 RAC: 2,182 |
Look at the errors on this one https://boinc.bakerlab.org/rosetta/result.php?resultid=13858838 |
sharder8 Send message Joined: 2 Feb 06 Posts: 7 Credit: 15,648,378 RAC: 0 |
Someone may want to take a look at the results on this one as well, there's plenty of them and it isn't the 1% "stuck" problem. https://boinc.bakerlab.org/rosetta/results.php?hostid=181476 I've stopped Rosetta on this machine, as it would run through a ton of jobs and client error them until it gave the message "daily quota met". Harder |
dgnuff Send message Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
David, Many moons ago, in a different thread, you gave instructions for manually restarting a "1% stuck" WU. I've got one on one of my systems. Do you want me to restart it, and is there anything else that I can do to help identify the problem? Things like taking a snapshot of the rosetta and slots/0 folders, zipping it up and making it available for you to download, or anything else that might help. |
Team_Elteor_Borislavj~Intelligence Send message Joined: 7 Dec 05 Posts: 14 Credit: 56,027 RAC: 0 |
HB_BARCODE_30_1bk2__351_7729_0 is stuck! After 5 hours of crunching still at 1% :( |
dag Send message Joined: 16 Dec 05 Posts: 106 Credit: 1,000,020 RAC: 0 |
These HB_BARCODE_30 were stuck in a slot for ~24 hours without progressing. Some had ~25min CPU time; at least one had ~55min CPU. It's really annoying that I lost around 4-cpu-days of work because of these four. dag https://boinc.bakerlab.org/rosetta/result.php?resultid=14185477 https://boinc.bakerlab.org/rosetta/result.php?resultid=14184543 https://boinc.bakerlab.org/rosetta/result.php?resultid=14100307 https://boinc.bakerlab.org/rosetta/result.php?resultid=14099752 dag --Finding aliens is cool, but understanding the structure of proteins is useful. |
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=14068243 Rather strange: I did not, repeat, did not abort it myself. Didn't touch the machine, it runs on its own. Had more of these and still don't know what happens. |
Team TMR Send message Joined: 2 Nov 05 Posts: 21 Credit: 1,583,679 RAC: 0 |
Another one, aborted after 16 hours stuck on 1% https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11717839 This one was a bit different - it was stuck on 30.19% after 8 hours. After restarting BOINC, it reset back to 38 mins CPU time and 30.19% and got stuck again. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11743501 It's getting increasingly frustrating having to babysit this project all the time. Fingers crossed for those working on a fix. |
Team_Elteor_Borislavj~Intelligence Send message Joined: 7 Dec 05 Posts: 14 Credit: 56,027 RAC: 0 |
I'm experiencing a lot of stuck WUs with FA_RLX****. I'm now at the point where, if a WU is at 1 percent after 1 hour, I'm manually aborting it... I want credit for my CPU time :( |
bruce boytler Send message Joined: 17 Sep 05 Posts: 68 Credit: 3,565,442 RAC: 0 |
This unit was stuck for 9 hours: FA_RLXpt_hom003_1ptq__361_234. I brought up the graphics screen (I don't normally run any graphics) and it was all frozen, except that the CPU clock was still counting. Resetting BOINC did no good. Ended up aborting it. Out of 183 results I have 4 errors of the frozen or 1 to 15 percent type. Cheers all!!! |
mr.kjellen Send message Joined: 5 Dec 05 Posts: 3 Credit: 1,226,674 RAC: 0 |
This one stuck at one percent: HBLR_1.0_1dtj_332_2576 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=9848871 I had it crunching for about 350,000 sec / 1,500 credits :( Aborted it. Seems someone did crunch it eventually. No luck for me tho. /anton |
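
Several posts above describe the same rule of thumb: if a work unit's progress has not moved for a long time (1% after an hour in one post, 30.19% after 8 hours in another), treat it as stuck and abort it. The following is a minimal sketch of that check in Python, not anything BOINC or Rosetta provides. `ProgressSample`, `stalled_seconds`, and `looks_stuck` are hypothetical names, the samples are assumed to be readings a volunteer notes down from the BOINC manager, and the one-hour default is only the threshold one poster mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProgressSample:
    """One manual reading of a task (hypothetical type, not part of BOINC)."""
    elapsed_seconds: float  # wall-clock time since the task started
    fraction_done: float    # reported progress, e.g. 0.01 for "1%"


def stalled_seconds(samples, tolerance=1e-6):
    """Return how long the most recent progress value has gone unchanged."""
    if not samples:
        return 0.0
    ordered = sorted(samples, key=lambda s: s.elapsed_seconds)
    last = ordered[-1]
    run_start = last
    # Walk backwards while the reported progress matches the latest reading.
    for sample in reversed(ordered):
        if abs(sample.fraction_done - last.fraction_done) > tolerance:
            break
        run_start = sample
    return last.elapsed_seconds - run_start.elapsed_seconds


def looks_stuck(samples, min_stall_seconds=3600.0):
    """Assumed rule of thumb: no progress change for at least an hour."""
    return stalled_seconds(samples) >= min_stall_seconds
```

For example, readings of 0% at 0 s, 1% at 600 s, and 1% at 4200 s give a stall of 3600 s, so `looks_stuck` returns `True`. Because the check looks at how long the latest value has persisted rather than at the value itself, it also covers the task that hung at 30.19%. It cannot tell a genuinely hung task from a slow one, so the threshold remains a judgment call.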