Report stuck & aborted WU here please

Author	Message
biodoc Joined: 19 Feb 06 Posts: 14 Credit: 30,720,744 RAC: 0	Message 13478 - Posted: 11 Apr 2006, 22:10:19 UTC
I've got a stuck work unit at 1.042% complete (4h 50min) with a 2-hour runtime preference. No activity in graphics mode. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13936782 Please advise; should I terminate?

keyboards Joined: 3 Mar 06 Posts: 36 Credit: 74,787 RAC: 0	Message 13481 - Posted: 11 Apr 2006, 23:04:09 UTC
Aborting 7485_largescale_large_fullatom_relax_dec7485_1_47_1.pdb_432_95. Completed 1.76% after 2 hours with no further advance. Set for 2 hours.
*!!Stupidity should be PAINFUL!!*

biodoc Joined: 19 Feb 06 Posts: 14 Credit: 30,720,744 RAC: 0	Message 13485 - Posted: 11 Apr 2006, 23:24:25 UTC - in response to Message 13478.
Quoted (Message 13478): I've got a stuck work unit at 1.042% complete (4h 50min) with a 2-hour runtime preference. No activity in graphics mode. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13936782 Please advise; should I terminate?
I've aborted this work unit after six hours. https://boinc.bakerlab.org/rosetta/result.php?resultid=17002591

Purple Rabbit Joined: 24 Sep 05 Posts: 28 Credit: 4,536,152 RAC: 0	Message 13509 - Posted: 12 Apr 2006, 1:31:38 UTC Last modified: 12 Apr 2006, 1:33:57 UTC
This one ran for 6 hours stuck at 1.04%. I restarted BOINC and the WU began again at zero. It quickly ran up to 1.04%, but seemed to have hung again according to the graphics display. I aborted the WU after 14 minutes (the second time).
TRUNCATE_TERMINI_FULLRELAX_1enh__433_53_0 using Rosetta version 4.98
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13904970

Christian Barrett Joined: 17 Sep 05 Posts: 11 Credit: 14,933 RAC: 0	Message 13515 - Posted: 12 Apr 2006, 3:59:27 UTC Last modified: 12 Apr 2006, 4:00:55 UTC
Here is one that cost me dearly: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13792169
Sent: 10 Apr 2006 22:50:35 UTC; reported: 12 Apr 2006 3:54:52 UTC; server state: Over; outcome: Client error; client state: Done; CPU time (sec): 70,199.00; claimed credit: 105.31; granted credit: ---

JT.Ault Joined: 9 Dec 05 Posts: 1 Credit: 829,315 RAC: 0	Message 13518 - Posted: 12 Apr 2006, 4:57:56 UTC - in response to Message 13331.
home1 rosetta@home 4/11/2006 9:48:33 PM Unrecoverable error for result TRUNCATE_TERMINI_FULLRELAX_2tif__433_106_0 (aborted via GUI RPC)
https://boinc.bakerlab.org/rosetta/result.php?resultid=16970267
Exit status -197 (0xffffff3b), application version 4.98. Stuck at 1.04%.

Team_Elteor_Borislavj~Intelligence Joined: 7 Dec 05 Posts: 14 Credit: 56,027 RAC: 0	Message 13528 - Posted: 12 Apr 2006, 8:11:21 UTC
One of my clients hasn't contacted bakerlab since 23 March. I will investigate this evening why it died...

[DPC]Charley Joined: 18 Mar 06 Posts: 9 Credit: 295,915 RAC: 0	Message 13532 - Posted: 12 Apr 2006, 9:20:04 UTC
Got another two units stuck at 1% and aborted them: TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_355 after 6 hours and TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_479_0 after 10 hours (seriously stuck; no counters increased except the time).

Christian Hagen Joined: 26 Sep 05 Posts: 5 Credit: 46,795 RAC: 0	Message 13535 - Posted: 12 Apr 2006, 10:17:37 UTC Last modified: 12 Apr 2006, 10:18:55 UTC
I also got a WU stuck at 1% and aborted it after 2.5 hours: TRUNCATE_TERMINI_FULLRELAX_1enh__433_691_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=17029338

Jose Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0	Message 13542 - Posted: 12 Apr 2006, 12:40:44 UTC
ARGGGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHH
Another failed project:
Result ID: 17027763; work unit ID: 12417632; sent: 11 Apr 2006 20:35:20 UTC; reported: 12 Apr 2006 12:28:07 UTC; server state: Over; outcome: Client error; client state: Computing; CPU time (sec): 44,064.20; claimed credit: 136.62
This makes at least 5 projects with crashes and more than 5 CPU days wasted in total. What the hell is happening? To say I am frustrated is an understatement.
"This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato

Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0	Message 13548 - Posted: 12 Apr 2006, 14:23:27 UTC - in response to Message 13542.
Quoted (Message 13542): ARGGGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHH Another failed project: Result ID: 17027763; work unit ID: 12417632; sent: 11 Apr 2006 20:35:20 UTC; reported: 12 Apr 2006 12:28:07 UTC; server state: Over; outcome: Client error; client state: Computing; CPU time (sec): 44,064.20; claimed credit: 136.62. This makes at least 5 projects with crashes and more than 5 CPU days wasted in total. What the hell is happening? To say I am frustrated is an understatement.
Well, don't feel too bad, Jose. I seem to have to abort 60 to 100 hours of wasted CPU time every DAY. Just today I aborted 7 WUs STUCK at 1.04%, for a total of 80 hours. DAVID, what are you going to do about solving this problem? Any end in sight? Babysitting your client consumes a lot of my time.
If You Want The Best You Must forget The Rest ---------------And Join Free-DC----------------

Jimi@0wned.org.uk Joined: 10 Mar 06 Posts: 29 Credit: 335,252 RAC: 0	Message 13550 - Posted: 12 Apr 2006, 15:29:25 UTC Last modified: 12 Apr 2006, 15:56:09 UTC
2 WUs here stuck at 1.04%:
TRUNCATE_TERMINI_FULLRELAX_1ptq_433_485_0
TRUNCATE_TERMINI_FULLRELAX_1enh_433_558_0
There are two more in this series to come; I'll abort the stuck ones and see what happens.
Edit: the subsequent WUs seem to be running OK, although one of them had already been aborted elsewhere. Anyway, they're both past 8%, so fingers crossed.
NB: my default is 4 hours and the two units above are the first to have stuck.

David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0	Message 13551 - Posted: 12 Apr 2006, 15:31:15 UTC - in response to Message 13548.
Quoted (Message 13542): ARGGGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHH Another failed project: Result ID: 17027763; work unit ID: 12417632; sent: 11 Apr 2006 20:35:20 UTC; reported: 12 Apr 2006 12:28:07 UTC; server state: Over; outcome: Client error; client state: Computing; CPU time (sec): 44,064.20; claimed credit: 136.62. This makes at least 5 projects with crashes and more than 5 CPU days wasted in total. What the hell is happening? To say I am frustrated is an understatement.
Quoted (Message 13548): Well, don't feel too bad, Jose. I seem to have to abort 60 to 100 hours of wasted CPU time every DAY. Just today I aborted 7 WUs STUCK at 1.04%, for a total of 80 hours. DAVID, what are you going to do about solving this problem? Any end in sight? Babysitting your client consumes a lot of my time.
Sounds to me like things are worse than they were a week ago; is this correct? The only change is that we increased the default run time from 2 hours to 4 hours, which reduces network traffic at the cost of an increased chance of work unit errors (because they are longer). We can set the default back to two hours and see if it helps. Anyway, the main question: are people seeing more stuck work units now than 7-10 days ago?

rbpeake Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0	Message 13552 - Posted: 12 Apr 2006, 16:16:02 UTC - in response to Message 13551. Last modified: 12 Apr 2006, 16:17:14 UTC
Quoted (Message 13551): Anyway, the main question: are people seeing more stuck work units now than 7-10 days ago?
Rom (or someone) should probably do an analysis to see what (if any) common factors there are for the errored units, and the overall frequency. Knock on wood (although with limited sampling), I have kept my run time at 8 hours and have not had any problems with 4.98.
Regards, Bob P.

arminius Joined: 23 Sep 05 Posts: 8 Credit: 883,822 RAC: 0	Message 13553 - Posted: 12 Apr 2006, 16:29:10 UTC Last modified: 12 Apr 2006, 16:34:10 UTC
My first (Linux box), stuck at 1.04%: TRUNCATE_TERMINI_FULLRELAX_1enh__433_38_0
a.

Robert Everly Joined: 8 Oct 05 Posts: 27 Credit: 665,094 RAC: 0	Message 13557 - Posted: 12 Apr 2006, 17:40:59 UTC
Just got my first stuck WU. Yay me :( Anyway, it's: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13923483
It's currently at 8+49 CPU time, stuck at 1.042%. It has exceeded both the default run time and my run time setting. I have suspended the WU. BOINC 5.2.13. Please advise as to what to do with this WU.

jomebrew Joined: 31 Mar 06 Posts: 2 Credit: 25,914,516 RAC: 0	Message 13559 - Posted: 12 Apr 2006, 18:02:57 UTC
I have a couple of these on my Linux system. I would appreciate some help on a clean way to abort these on Linux. I have been hacking client_state.xml and deleting files in the slots directory. There has to be a better way.
Warning! PRODUCTION_ABINITIO_CENTROID_PACKING_1ctf__429_247_0 was started at 2006-04-09 20:52:34 but has not finished!
Warning! HBLR_1.0_2reb_426_994_0 was started at 2006-04-09 23:07:09 but has not finished!
Warning! 7449_largescale_large_fullatom_relax_dec7449_1_05_6.pdb_431_53_0 was started at 2006-04-09 20:58:18 but has not finished!
Warning! PRODUCTION_ABINITIO_CENTROID_PACKING_1vls__428_262_0 was started at 2006-04-09 21:19:28 but has not finished!
Warning! 7485_largescale_large_fullatom_relax_dec7485_1_05_8.pdb_432_129_0 was started at 2006-04-09 21:50:01 but has not finished!
Warning! TRUNCATE_TERMINI_FULLRELAX_1ptq__433_587_0 was started at 2006-04-11 17:55:43 but has not finished!
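Editor's sketch, not part of the original post: the "Warning! ... was started at ... but has not finished!" lines in message 13559 follow a regular pattern, so a short script can pull out the task names and start times and flag the ones that have been running longer than some threshold. The post does not name the tool that prints these warnings, so the line format is assumed from the quoted text alone. The function names `unfinished_tasks` and `running_longer_than` are hypothetical. The script only reads text; it does not abort anything or touch client_state.xml.

```python
import re
from datetime import datetime

# Pattern of the warning lines as quoted in message 13559. The tool that
# emits them is not named in the post, so this format is an assumption.
WARNING_RE = re.compile(
    r"Warning! (?P<name>\S+) was started at "
    r"(?P<started>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) but has not finished!"
)


def unfinished_tasks(log_text):
    """Return (task_name, start_time) pairs for every warning in log_text."""
    tasks = []
    for match in WARNING_RE.finditer(log_text):
        started = datetime.strptime(match.group("started"), "%Y-%m-%d %H:%M:%S")
        tasks.append((match.group("name"), started))
    return tasks


def running_longer_than(tasks, now, max_hours):
    """Return names of tasks that started more than max_hours before now."""
    limit_seconds = max_hours * 3600
    return [
        name
        for name, started in tasks
        if (now - started).total_seconds() > limit_seconds
    ]


if __name__ == "__main__":
    sample = (
        "Warning! HBLR_1.0_2reb_426_994_0 was started at "
        "2006-04-09 23:07:09 but has not finished!\n"
        "Warning! TRUNCATE_TERMINI_FULLRELAX_1ptq__433_587_0 was started at "
        "2006-04-11 17:55:43 but has not finished!\n"
    )
    found = unfinished_tasks(sample)
    # Roughly the time the message was posted; tasks older than 48 hours
    # are listed as candidates for a closer look.
    for task_name in running_longer_than(found, datetime(2006, 4, 12, 18, 0, 0), 48):
        print(task_name)
```

A long wall-clock time alone does not prove a task is stuck (the client may simply have been idle), so the output is best treated as a list of candidates to check against the progress percentage before aborting anything.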

n7zfi Joined: 7 Apr 06 Posts: 1 Credit: 4,623,875 RAC: 0	Message 13563 - Posted: 12 Apr 2006, 18:19:18 UTC
Running on Windows XP Pro, I have a WU stuck at 1.04%. The graphics appear to be locked up; nothing is moving even though the CPU utilization clock keeps ticking. The WU in question is: TRUNCATE_TERMINI_FULLRELAX_1ptq_433_906_0
I have suspended it after 1:34:22 of run time. The other WUs progress past that point in a few minutes.

snoekbaars Joined: 16 Mar 06 Posts: 2 Credit: 12,136 RAC: 0	Message 13565 - Posted: 12 Apr 2006, 18:42:22 UTC
Work unit aborted at 48% after about 24 hours of CPU time. The time needed to completion was only going up. Nothing moved in the graphics.
WU Name "FA_RLXpt_hom003_1ptq__361_156_3" - Application "rosetta 4.98"
Workunit = 11684527; Result ID = 16802748; System = Intel P4 3.0GHz, Win-XP SP 2
The workunit still reports "in progress" at the time of writing this message. The workunit was aborted manually ("Aborted via GUI RPC").

Grutte Pier [Wa Oars]~MAB The Frisian Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0	Message 13569 - Posted: 12 Apr 2006, 19:17:41 UTC
Just had another WU that was running for more than 6 hours at 1.17%. When I checked again, another one had started, which has now been running for 45 minutes at 1.06%, but I cannot find the first WU in my results. Better to test-drive a project like this more thoroughly before letting so many people waste their money. If I go on this month, it will be the last anyway. Rather fed up with it. No fun at all anymore.

Report stuck & aborted WU here please - II