Report stuck & aborted WU here please

Author	Message
Divide Overflow Joined: 17 Sep 05 Posts: 82 Credit: 921,382 RAC: 0	Message 12156 - Posted: 17 Mar 2006, 17:40:38 UTC Last modified: 17 Mar 2006, 17:43:12 UTC I'm getting a 1% freeze at step 22183 on this WU: FA_RLXbq_hom030_1bq9A_359_467_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=13795176 Stopping and re-starting BOINC will cause the WU to begin again from step 0 but it always halts at the same place. Aborting unit. ID: 12156 · Rating: 0

David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0	Message 12159 - Posted: 17 Mar 2006, 19:23:32 UTC - in response to Message 12154. Last modified: 17 Mar 2006, 19:24:27 UTC This project used to be bullet-proof - what's changed....? believe me, we are trying! GREAT NEWS - Perhaps it might be an idea to let people know there is a problem and to stop making work available until it's fixed - that'll take the pressure off you guys. What sort of percentage of the work returned to you is being trashed by this bug? Would imagine it's fairly high - although, if it was an epidemic failure, would assume you would have stopped sending out work before now. The error rate is still very machine specific--many machines appear to have virtually no errors. Aside from a few faulty work units, which we've stopped, we don't think there are new errors or bugs--just the same old problems. The exciting news is that the Boinc consultant we have hired, Rom, has made an improvement in how the rosetta process terminates that seems to really have made a difference on Ralph. the problem seems to have been not any bug in the rosetta code, but a problem in how the rosetta process shuts itself down when the processor starts doing something else (hence the leave in memory bug, etc.). here is his latest email:
Public Project Failure Rate by Type (Top 3):
-164 (0xffffff5c) ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED 4052 39.25%
-1073741819 (0xc0000005) Unknown error number 2293 22.21%
1 Unknown error number 1284 12.44%
Ralph Failure Rate By Type (Top 3):
-186 (0xffffffffffffff46) ERR_RESULT_DOWNLOAD 35 31.25%
1 Unknown error number 20 17.86%
-529697949 (0xffffffffe06d7363) Unknown error number 18 16.07%
The percentages above signify the percentage of times that exit code was given relative to the total number of non-zero exit codes given.
As you can see, the 0xc0000005's and the NESTED_UNHANDLED_EXCEPTIONS have dropped off of the top three biggest offenders. I believe we are making progress. ----- Rom
But surely you must be going into damage limitation mode by now. Can you afford to lose lots of crunchers? NO! that is why we are investing everything in fixing the problems now regards and good luck. Tim ID: 12159 · Rating: 0
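The exit codes in Rom's email are listed both as signed decimals and as hexadecimal values. A minimal sketch of how the two forms relate, assuming the hex forms are simply the two's-complement view of the signed code at 32 or 64 bits (the function names here are illustrative and are not part of BOINC):

```python
def to_unsigned_hex(code: int, bits: int = 32) -> str:
    """Show a signed exit code as the hex value of its two's-complement form."""
    mask = (1 << bits) - 1
    return hex(code & mask)


def to_signed(value: int, bits: int = 32) -> int:
    """Convert an unsigned hex-style value back to the signed exit code."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        return value - (1 << bits)
    return value


# Codes quoted in the email: the public project list looks 32-bit wide,
# the Ralph list looks 64-bit wide.
print(to_unsigned_hex(-164))            # 0xffffff5c
print(to_unsigned_hex(-1073741819))     # 0xc0000005
print(to_unsigned_hex(-186, bits=64))   # 0xffffffffffffff46
print(to_signed(0xC0000005))            # -1073741819
```

Read this way, -529697949 and 0xffffffffe06d7363 are the same code written at different widths, which makes it easier to match entries between the two lists.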

hob. Joined: 4 Nov 05 Posts: 64 Credit: 250,683 RAC: 0	Message 12161 - Posted: 17 Mar 2006, 20:00:03 UTC - in response to Message 12025. 3/4/2006 6:51:19 AM|rosetta@home|Starting result ABINITli_hom010_1lis__322_14_1 using rosetta version 482 this job has been running for over 10 1/2 days now... it's been on 84.89% for at least 24 hrs now... maybe a lot longer ?? it's still using cpu power so i assume it's doing something ?? runtime is listed as 256 1/2 hours so far... and still counting up. any advice as to what to do with it would be welcome. restarted BOINC... unit reset to 13 hours then stuck again for several hours at the same point... aborted now... imho not worth sending to anyone else 46 years dc so far join team FaDbeens join us ID: 12161 · Rating: 0

BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0	Message 12163 - Posted: 17 Mar 2006, 20:14:29 UTC ID: 12163 · Rating: 0

Tallguy-13088 Joined: 14 Dec 05 Posts: 9 Credit: 843,378 RAC: 0	Message 12174 - Posted: 18 Mar 2006, 1:47:30 UTC Last modified: 18 Mar 2006, 1:54:57 UTC Hello, I just aborted WU: FA_RLXac_hom017_1acf_359_113_0 after two BOINC Mgr. restarts and about 50+ (combined) hours of processing. All three attempts stalled at STEP: 23171. The last ACCEPTED RMSD was 14.75 and the last ACCEPTED ENERGY was 15.47414. Each attempt showed activity for about 1 minute and hung at 1% complete with the ELAPSED TIME continuing to increment. Rosetta version was 4.82 under Win2K. Just an FYI in case someone wants to debug the workunit. ID: 12174 · Rating: 0

UBT - Timbo Joined: 25 Sep 05 Posts: 20 Credit: 2,299,279 RAC: 357	Message 12207 - Posted: 18 Mar 2006, 21:46:39 UTC - in response to Message 12159. Last modified: 18 Mar 2006, 22:07:36 UTC The exciting news is that the Boinc consultant we have hired, Rom, has made an improvement in how the rosetta process terminates that seems to really have made a difference on Ralph. the problem seems to have been not any bug in the rosetta code, but a problem in how the [b]rosetta process shuts itself down when the processor starts doing something else[/b] (hence the leave in memory bug, etc.). Hi David, Well that ties in with an observation I can make, which I've seen once or twice. I have noticed that Rosetta work units "fail" when my 3GHz P4/HT switches from one project to another - so there seem to be issues when the Rosetta process is "suspended" by BOINC as it then switches over to another project - (I'm running BBC CCE as a second simultaneous BOINC project on the same PC - this "switch over problem" tends to occur when one "CPU" switches out of working on a Rosetta WU and then switches over to the CCE WU). Maybe this is a help - but it seems Rom is on the right trail. regards Tim (edit) typo ID: 12207 · Rating: 0

Rom Walton (BOINC) Volunteer moderator Project developer Joined: 17 Sep 05 Posts: 18 Credit: 40,071 RAC: 0	Message 12225 - Posted: 18 Mar 2006, 23:43:18 UTC - in response to Message 12163. Last modified: 18 Mar 2006, 23:43:53 UTC So the top 3 errors on Rosetta aren't the top 3 errors on Ralph. Great news. How far down on the Rosetta list are the top 3 errors that Ralph is having? The ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED no longer appears on the list at all, and the 0xC0000005 only accounts for 6 of the 49 errors reported in the last 24 hours. If the data from Ralph is any indication of how the application is going to behave on the public project, it should result in a 60%-70% reduction in the error rate for the public project. ----- Rom ----- Rom My Blog ID: 12225 · Rating: 0
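The figures in this post and in the earlier email can be checked with a little arithmetic. The sketch below assumes each percentage is count divided by the total number of non-zero exit codes, as the email states; the totals themselves are not given in the thread and are inferred here from one count/percentage pair, so treat them as estimates.

```python
def share(count: int, total: int) -> float:
    """Percentage of all non-zero exit codes that `count` represents."""
    return 100.0 * count / total


def infer_total(count: int, percent: float) -> int:
    """Estimate the total number of failures from one count and its share."""
    return round(count / (percent / 100.0))


# Inferred (not stated in the thread) totals behind the two top-3 lists.
public_total = infer_total(4052, 39.25)   # 10324
ralph_total = infer_total(35, 31.25)      # 112

# Cross-check the other rows against the inferred totals.
print(round(share(2293, public_total), 2))  # 22.21
print(round(share(20, ralph_total), 2))     # 17.86

# 0xC0000005 on Ralph in the last 24 hours: 6 of 49 errors.
print(round(share(6, 49), 2))               # 12.24

# Combined share of the two errors targeted first on the public project.
print(round(share(4052 + 2293, public_total), 2))
```

On these numbers the two targeted errors together make up a little over 60% of the public project's reported failures, which is in line with the size of reduction estimated above.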

Angus Joined: 17 Sep 05 Posts: 412 Credit: 321,053 RAC: 0	Message 12226 - Posted: 18 Mar 2006, 23:53:12 UTC Last modified: 18 Mar 2006, 23:54:05 UTC Isn't the biggest failure mode still the "stuck at 1%" issue? Do those even get reported as errors, outside of the forum? All the ones I have had have needed to be aborted, after failing to run. Doesn't that just show as "aborted by user", effectively hiding the scope of the problem? Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :) "You can't fix stupid" (Ron White) ID: 12226 · Rating: -1

Rom Walton (BOINC) Volunteer moderator Project developer Joined: 17 Sep 05 Posts: 18 Credit: 40,071 RAC: 0	Message 12228 - Posted: 19 Mar 2006, 0:26:26 UTC - in response to Message 12226. Last modified: 19 Mar 2006, 0:28:42 UTC Isn't the biggest failure mode still the "stuck at 1%" issue? Do those even get reported as errors, outside of the forum? All the ones I have had have needed to be aborted, after failing to run. Doesn't that just show as "aborted by user", effectively hiding the scope of the problem? Actually the "1% bug" only accounts for roughly 5% of the overall failure cases reported per day. It is by far the biggest failure case from the community perspective though, as it requires manual intervention. 0xC0000005 and the ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED errors together were accounting for 60% of the reported errors per day. I tackled these first as they seemed at the time to be manifestations of the same fundamental problem and they accounted for the biggest piece of the pie. The next biggest heavy hitter is exit code 1; this is a program defined error. This just required that the project change its error logging from stdout to stderr so that it'll show up in the result log reported back to the server. That work item will be finished in the next few days. Next after that one is 0xC000000D, which seems to have a recurring theme that stackwalker failed to initialize during a stack dump. I've added some extra messages to the BOINC API to try and track this one down. Now we get to the ERR_ABORTED_VIA_GUI error; this 1% error is really nasty. Unfortunately the pdb file was not deployed with the 4.82 release, so trying to get stack traces from the community while it is stuck in the loop it is in isn't really doable. I have started the investigation with members of the Ralph community to try and track this down since they have access to the pdb file for 4.93. You can track the progress being made here.
I hope this clears up some stuff for the community. ----- Rom My Blog ID: 12228 · Rating: 0
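The stdout-to-stderr change described for exit code 1 can be illustrated with a small sketch. This is not the Rosetta code (which is not shown in the thread); it is a hypothetical Python stand-in, and `report_failure` is an invented name. The only point it makes is that a message written to stderr ends up in the stream the result log captures, while the exit code alone says nothing about the cause.

```python
import sys


def report_failure(message: str, stream=None) -> int:
    """Write a program-defined error to stderr and return exit code 1.

    `stream` defaults to sys.stderr; it is a parameter only so the
    behaviour can be checked without touching the real stderr.
    """
    stream = sys.stderr if stream is None else stream
    print(f"ERROR: {message}", file=stream)
    stream.flush()  # make sure the text is written before the process exits
    return 1


if __name__ == "__main__":
    # Hypothetical failure message; the real application's messages are unknown.
    sys.exit(report_failure("input file could not be read"))
```

With logging on stdout, the same run would still exit with code 1 but the server-side log would show no reason, which matches the "1 Unknown error number" rows in the earlier email.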

STE\/E Joined: 17 Sep 05 Posts: 125 Credit: 4,101,065 RAC: 144	Message 12229 - Posted: 19 Mar 2006, 0:38:48 UTC ID: 12229 · Rating: 0

Rom Walton (BOINC) Volunteer moderator Project developer Joined: 17 Sep 05 Posts: 18 Credit: 40,071 RAC: 0	Message 12230 - Posted: 19 Mar 2006, 0:58:02 UTC - in response to Message 12229. Last modified: 19 Mar 2006, 0:59:27 UTC Actually the "1% bug" only accounts for roughly 5% of the overall failure cases reported per day. ========== Like you said, what's reported Rom, if a lot of people are like me we quit reporting them a long time ago. I didn't see any point in reporting them any more because it's the same thing over & over. I know I aborted at least 5 or 6 Stuck 1% WU's today alone & it's like that every day ... :/ That is to say, what is reported to the server. When somebody aborts a workunit, it gets reported to the server as ERR_ABORTED_VIA_GUI. If the workunit eventually exceeds its allocated CPU time it is reported as ERR_RSC_LIMIT_EXCEEDED. So unless you are resetting the project every time, I get to see it. :) ----- Rom My Blog ID: 12230 · Rating: 0

BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0	Message 12231 - Posted: 19 Mar 2006, 1:20:04 UTC - in response to Message 12229. ID: 12231 · Rating: 0

STE\/E Joined: 17 Sep 05 Posts: 125 Credit: 4,101,065 RAC: 144	Message 12234 - Posted: 19 Mar 2006, 2:02:49 UTC Last modified: 19 Mar 2006, 2:03:30 UTC If you've got a system that consistently has problems with being stuck at 1% then please join Ralph and help them identify the cause. ========== Good Idea, I'll do that as soon as I get some free time ... ;) ID: 12234 · Rating: 0

Nite Owl Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0	Message 12236 - Posted: 19 Mar 2006, 3:34:02 UTC Last modified: 19 Mar 2006, 3:49:50 UTC This is the third stuck at 1% WU in two days (that I know of, I happened to be on that Machine ATM) that I've aborted. It only shows 10 hours but BOINC showed 59 hours...
stderr out
<core_client_version>5.2.13</core_client_version>
<message>aborted via GUI RPC </message>
<stderr_txt>
# random seed: 2739161
# cpu_run_time_pref: 36000
# cpu_run_time_pref: 36000
# random seed: 2739161
</stderr_txt>
Join the Teddies@WCG ID: 12236 · Rating: 0
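For anyone comparing several of these aborted results, the stderr out block above can be picked apart with a few regular expressions. This is a rough sketch that treats the block as plain text; the helper names are made up for illustration, and it assumes `cpu_run_time_pref` is given in seconds (36000 would then be the 10 hours mentioned in the post).

```python
import re

# The stderr out text as posted above.
STDERR_OUT = """<core_client_version>5.2.13</core_client_version>
<message>aborted via GUI RPC </message>
<stderr_txt>
# random seed: 2739161
# cpu_run_time_pref: 36000
# cpu_run_time_pref: 36000
# random seed: 2739161
</stderr_txt>"""


def tag_text(name: str, text: str):
    """Return the stripped contents of the first <name>...</name> pair, or None."""
    match = re.search(rf"<{name}>(.*?)</{name}>", text, re.DOTALL)
    return match.group(1).strip() if match else None


def summarize(text: str) -> dict:
    """Pull out the client version, abort message and run-time preference."""
    body = tag_text("stderr_txt", text) or ""
    seeds = re.findall(r"# random seed: (\d+)", body)
    prefs = re.findall(r"# cpu_run_time_pref: (\d+)", body)
    return {
        "client": tag_text("core_client_version", text),
        "message": tag_text("message", text),
        "seed_lines": len(seeds),
        # Assumption: the preference is in seconds.
        "run_time_pref_hours": int(prefs[0]) / 3600 if prefs else None,
    }


print(summarize(STDERR_OUT))
```

The repeated random seed line suggests the application printed its start-up banner more than once with the same seed, but what that means for this work unit is not stated in the thread.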

mgabriel Joined: 18 Sep 05 Posts: 5 Credit: 96,494 RAC: 0	Message 12257 - Posted: 19 Mar 2006, 11:34:32 UTC umm, how bout this one, FA_RLXbq_hom019_1bq9A_359_191_0 running 11 hours, 45.13% done, time to complete is running backwards, now 6:39 hours. also im getting many computation errors on this system ID: 12257 · Rating: 0

vavega Joined: 2 Nov 05 Posts: 82 Credit: 519,981 RAC: 0	Message 12259 - Posted: 19 Mar 2006, 13:50:47 UTC - in response to Message 12230. ID: 12259 · Rating: 0

Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0	Message 12271 - Posted: 19 Mar 2006, 16:40:41 UTC Last modified: 19 Mar 2006, 22:28:51 UTC For those who may be interested, Rom has posted information about Rosetta Work Unit errors and the status of the ongoing work to fix the bugs in Rosetta on his "Blog". Moderator9 ROSETTA@home FAQ Moderator Contact ID: 12271 · Rating: 0

Team TMR Joined: 2 Nov 05 Posts: 21 Credit: 1,583,679 RAC: 0	Message 12322 - Posted: 20 Mar 2006, 10:16:41 UTC Last modified: 20 Mar 2006, 10:19:21 UTC Just aborted 4 more. Really hope this gets fixed soon, we've just wasted over 5 days of CPU time! Good luck Rom.
8.8 hours https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11584596
18.2 hours https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11551106
41.4 hours https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11460182
71.0 hours https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11330309
ID: 12322 · Rating: 0

Team TMR Joined: 2 Nov 05 Posts: 21 Credit: 1,583,679 RAC: 0	Message 12332 - Posted: 20 Mar 2006, 12:37:28 UTC Last modified: 20 Mar 2006, 12:38:44 UTC And another. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11627337 This and the 4 I mentioned earlier were all stuck on 1%. ID: 12332 · Rating: 0

Larry256 Joined: 11 Nov 05 Posts: 2 Credit: 4,021,708 RAC: 5,787	Message 12335 - Posted: 20 Mar 2006, 14:06:23 UTC Look at the errors on this one https://boinc.bakerlab.org/rosetta/result.php?resultid=13858838 ID: 12335 · Rating: 0