Report stuck & aborted WU here please

Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0
Message 13802 - Posted: 15 Apr 2006, 5:44:59 UTC - in response to Message 13801.
Thanks for the report. Timeouts should happen on all machines, but in BOINC they are unfortunately tied to the number of floating-point operations, not wall-clock time. Because the Windows app appears to be a little faster than the Linux app, timeouts will generally take longer on the Linux app. We completely understand that you can't check up on the servers all the time; we're being extra vigilant to make sure this sort of problem doesn't happen again.
Quoted message: https://boinc.bakerlab.org/rosetta/result.php?resultid=17102919 Manually aborted at 1% after 52 hrs; it was a dodgy truncate unit I missed when checking systems. Thing is, though, it did run for 52 hours. Does that mean the workunit timeouts only work on Windows systems, not Linux? It would be good to know, as I won't always be able to check up as often as I have been on these servers. Thanks
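Editor's note: the reply above says the timeout is tied to a floating-point operation count rather than wall-clock time. A minimal sketch of why that makes the limit differ between hosts, assuming the limit is enforced as (operation bound) / (benchmarked speed). The bound value and the function names are hypothetical and not taken from the thread; only the 1417.88 million ops/sec benchmark appears in it (Message 13806).

```python
def timeout_seconds(fpops_bound: float, benchmark_flops: float) -> float:
    """CPU-time limit implied by an operation bound (assumed model)."""
    if benchmark_flops <= 0:
        raise ValueError("benchmark_flops must be positive")
    return fpops_bound / benchmark_flops


def to_hours(seconds: float) -> float:
    """Convert seconds to hours."""
    return seconds / 3600.0


# Hypothetical operation bound for one work unit (not from the thread).
FPOPS_BOUND = 9.0e13

# Benchmark reported in Message 13806: 1417.88 million ops/sec.
fast_host = 1417.88e6
# Hypothetical host that benchmarks at half that speed.
slow_host = fast_host / 2

for label, flops in (("fast host", fast_host), ("slow host", slow_host)):
    limit = timeout_seconds(FPOPS_BOUND, flops)
    print(f"{label}: limit of about {to_hours(limit):.1f} hours of CPU time")
```

Under this assumed model, halving the benchmarked speed doubles the time before the limit is reached, which is consistent with the same work unit running much longer on one machine than on another before being cut off.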

[B@H] Ray Joined: 20 Sep 05 Posts: 118 Credit: 100,251 RAC: 0
Message 13805 - Posted: 15 Apr 2006, 6:48:01 UTC
I just aborted WU TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_20_2, as it was still at 1.04% after close to 10 hours (CPU time = 35,419.17 seconds). When I checked the graphics a few times in the last hour, nothing was happening there either.
Ray
Pizza@Home Rays Place Rays place Forums

Carlos_Pfitzner Joined: 22 Dec 05 Posts: 71 Credit: 138,867 RAC: 0
Message 13806 - Posted: 15 Apr 2006, 6:53:25 UTC Last modified: 15 Apr 2006, 6:57:04 UTC
Preferred run time: 2 hours
Actual CPU time: 5 hours 50 minutes
Measured floating point speed: 1417.88 million ops/sec
Measured integer speed: 3114.3 million ops/sec
Done: 1.48%
Windows XP, Rosetta 4.98
<message>aborted by user
https://boinc.bakerlab.org/rosetta/result.php?resultid=17217056

anders n Joined: 19 Sep 05 Posts: 403 Credit: 537,991 RAC: 0
Message 13807 - Posted: 15 Apr 2006, 6:59:29 UTC - in response to Message 13806.
Quoted message: Preferred run time: 2 hours. Actual CPU time: 5 hours 50 minutes. Measured floating point speed: 1417.88 million ops/sec. Measured integer speed: 3114.3 million ops/sec. Done: 1.48%. Windows XP, Rosetta 4.98. <message>aborted by user https://boinc.bakerlab.org/rosetta/result.php?resultid=17217056
Hi Carlos, I had one of those on a P4 2.8. It took 3+ hours to complete it.
Anders n

[DPC]Division_Brabant~OldButNotSoWise Joined: 23 Jan 06 Posts: 42 Credit: 371,797 RAC: 0
Message 13832 - Posted: 15 Apr 2006, 13:52:23 UTC
https://boinc.bakerlab.org/rosetta/result.php?resultid=17162086 Aborted at 1.8% after 5 hours or so, because the graphics also stopped.

Bin Qian Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0
Message 13840 - Posted: 15 Apr 2006, 16:25:29 UTC - in response to Message 13832. Last modified: 15 Apr 2006, 16:28:34 UTC
Hi, thanks for reporting this. We are working on maintaining a relatively stable run time on these WUs. Also see my reply to this thread.
Bin
Quoted message: https://boinc.bakerlab.org/rosetta/result.php?resultid=17162086 Aborted at 1.8% after 5 hours or so, because the graphics also stopped.

Cedomir Igaly Joined: 5 Dec 05 Posts: 2 Credit: 66,345 RAC: 0
Message 13841 - Posted: 15 Apr 2006, 17:08:30 UTC Last modified: 15 Apr 2006, 17:30:43 UTC
7521_largescale_large_fullatom_relax_dec7521_1_02_9.pdb_437_133_0
7486_largescale_large_fullatom_relax_dec7486_1_10_4.pdb_435_177_0
Both stuck (and aborted) at ~1%.

keitaisamurai Joined: 21 Mar 06 Posts: 2 Credit: 55,037 RAC: 0
Message 13843 - Posted: 15 Apr 2006, 17:25:17 UTC
Work unit stuck at 1.04% with more than 175 HOURS of processing time (I really should check this computer more often). https://boinc.bakerlab.org/rosetta/result.php?resultid=15389604 Needless to say, I think I'll abort it...

Sybr_E-N Joined: 26 Nov 05 Posts: 2 Credit: 164,851 RAC: 0
Message 13854 - Posted: 15 Apr 2006, 20:33:31 UTC
50% error rate today...
https://boinc.bakerlab.org/rosetta/result.php?resultid=17261491
https://boinc.bakerlab.org/rosetta/result.php?resultid=17261490
https://boinc.bakerlab.org/rosetta/result.php?resultid=17261489
https://boinc.bakerlab.org/rosetta/result.php?resultid=17261456
All with the same error:
<core_client_version>5.2.13</core_client_version>
<message>Onjuiste functie. (0x1) - exit code 1 (0x1) </message>
<stderr_txt>
# random seed: 3316837
# random seed: 3316837
# cpu_run_time_pref: 7200
# Exception caught in nstruct loop ii=1 i=2
# num_decoys:1 attempts:2 cpu_run_time:5760.47
ERROR:: Exit at: .nblist.cc line:541
</stderr_txt>
"Onjuiste functie" is Dutch ( :) ) for "wrong function"

Angus Joined: 17 Sep 05 Posts: 412 Credit: 321,053 RAC: 0
Message 13858 - Posted: 15 Apr 2006, 21:15:28 UTC Last modified: 15 Apr 2006, 21:21:25 UTC
4/15/2006 12:26:24 PM|rosetta@home|Unrecoverable error for result ALL_TOPOLOGY_CODES_1shfA_434_201_0 (aborted by user)
aborted result
This seemed to freeze at 2.42% (I think; I didn't take a screenshot), and the CPU time in BOINC Manager had been stopped for almost an hour. I couldn't get into "show graphics" to look at progress. See this thread for the no-graphics problem.
Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :) "You can't fix stupid" (Ron White)

Gretchen Joined: 1 Nov 05 Posts: 1 Credit: 21,277 RAC: 0
Message 13874 - Posted: 16 Apr 2006, 2:19:07 UTC - in response to Message 13331. Last modified: 16 Apr 2006, 2:20:39 UTC
This one was up to 8 hours and 1.04% done, with 16 hours to go. So I aborted it.
WU created on: 11 Apr 2006 17:46:10 UTC
Name: TRUNCATE_TERMINI_FULLRELAX_2tif__433_637
Computer: AuthenticAMD AMD Athlon(tm) Processor
Operating system: Microsoft Windows XP Home Edition, Service Pack 2, (05.01.2600.00)
Memory: 639.42 MB

TCU Computer Science Joined: 7 Dec 05 Posts: 28 Credit: 12,861,977 RAC: 0
Message 13881 - Posted: 16 Apr 2006, 6:23:01 UTC - in response to Message 13331. Last modified: 16 Apr 2006, 6:32:19 UTC
I aborted nine WUs today.
These four showed 20-50 hours of accumulated time:
17028012 TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_678
17051917 TRUNCATE_TERMINI_FULLRELAX_1ptq__433_905
17050886 TRUNCATE_TERMINI_FULLRELAX_1enh__433_896
16238549 FA_RLXpt_hom006_1ptq__361_440
The following five showed little or no accumulated time but had been running for 4-11 days:
17016383 TRUNCATE_TERMINI_FULLRELAX_1ptq__433_569
16970141 TRUNCATE_TERMINI_FULLRELAX_2tif__433_104
16995174 TRUNCATE_TERMINI_FULLRELAX_2tif__433_369
16196147 FA_RLXpt_hom002_1ptq__361_379
16227211 FARELAX_NOFILTERS_1bm8__417_637

XS_DDT's_Cattle_Prods Joined: 24 Mar 06 Posts: 12 Credit: 1,180,072 RAC: 0
Message 13901 - Posted: 16 Apr 2006, 17:14:27 UTC
17049214 stuck url. Just found it; almost 17 or so hours wasted. Do we get any kind of credit for these things?

Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0
Message 13905 - Posted: 16 Apr 2006, 20:28:21 UTC - in response to Message 13901.
These work units were problematic, and getting your reports has helped us fix an important bug in Rosetta. So it wasn't a total waste of CPU time! And about once a week, we're running a script to grant credit to work units that claimed credit.
Quoted message: 17049214 stuck url just found it, almost 17 or so hours wasted, do we get any kind of credit for these things?

keitaisamurai Joined: 21 Mar 06 Posts: 2 Credit: 55,037 RAC: 0
Message 13923 - Posted: 17 Apr 2006, 2:24:45 UTC
Found stuck at 1.02% after 16 hours of processing time. 16784959

Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0
Message 13932 - Posted: 17 Apr 2006, 3:57:28 UTC - in response to Message 13799.
Quoted message: But what about the removal of bad WUs from your servers? You must set up a way to stop the resending of the BAD WUs. Letting the system purge itself is not right. You have the capability to do auto upgrades; you should have the capability to auto-abort bad WUs on the client side. To let bad WUs run on your system or ours is a BAD THING.
Quoted reply: The bad WU's are removed from our servers, but we can't remove them from your machines. Hopefully there will be no more bad WU's at all so this won't be a problem anymore.
Well David, I let this one, FA_RLXpt_hom002_1ptq__361_380_3, run through 48 hours. It did not self-abort and send itself in, as you said it would. IT JUST RESTARTED, and it was most likely the 3rd time it restarted. That is 142 hours of wasted CPU time, and I will not get credit for it as you said we would, because your timeout did not work. This is why I said you must come up with a way to auto-abort on our clients: a script to do a project reset or something. To expect us to clean up after you send out bad WUs is not right, and you know it. You must come up with a better plan than doing nothing.
If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

rbpeake Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0
Message 13951 - Posted: 17 Apr 2006, 14:30:50 UTC
Here is a work unit that got stuck at a little over 1%: 17203004
Regards, Bob P.

ebahapo Joined: 17 Sep 05 Posts: 29 Credit: 413,302 RAC: 0
Message 13953 - Posted: 17 Apr 2006, 14:53:17 UTC
I aborted result #16087136 after it got stuck at about 75% with no progress. HTH

[DPC]FOKschaap~devzero Joined: 9 Dec 05 Posts: 1 Credit: 2,785,811 RAC: 0
Message 13957 - Posted: 17 Apr 2006, 16:06:40 UTC Last modified: 17 Apr 2006, 16:14:22 UTC
I aborted TRUNCATE_TERMINI_FULLRELAX_2tif__433_736 after it got stuck at 1%.

Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0
Message 13964 - Posted: 17 Apr 2006, 18:25:07 UTC - in response to Message 13932.
Can you post a link to the result? I think you will actually get credit for this. The result is reported to us as a large amount of "claimed credit"; we are going through on a weekly basis and granting credit for these jobs that caused big problems and returned "invalid" results. The results and your posting are still useful to us. In this case, the postings on this work unit helped us track down a pretty esoteric bug in Rosetta. Let me know what the result number is. If it's not flagged to get credit, I can see why.
Quoted message: But what about the removal of bad WUs from your servers? You must set up a way to stop the resending of the BAD WUs. Letting the system purge itself is not right. You have the capability to do auto upgrades; you should have the capability to auto-abort bad WUs on the client side. To let bad WUs run on your system or ours is a BAD THING.
Quoted reply: The bad WU's are removed from our servers, but we can't remove them from your machines. Hopefully there will be no more bad WU's at all so this won't be a problem anymore.
Quoted message: Well David, I let this one, FA_RLXpt_hom002_1ptq__361_380_3, run through 48 hours. It did not self-abort and send itself in, as you said it would. IT JUST RESTARTED, and it was most likely the 3rd time it restarted. That is 142 hours of wasted CPU time, and I will not get credit for it as you said we would, because your timeout did not work. This is why I said you must come up with a way to auto-abort on our clients: a script to do a project reset or something. To expect us to clean up after you send out bad WUs is not right, and you know it. You must come up with a better plan than doing nothing.

Report stuck & aborted WU here please - II