Report stuck & aborted WU here please - II

Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 13459 - Posted: 11 Apr 2006, 14:26:05 UTC
Last modified: 11 Apr 2006, 14:30:00 UTC

I just aborted Result ID 16972487 (TRUNCATE_TERMINI_FULLRELAX_1ptq__433_139_0) at 1.04% after more than two hours (CPU run time preference = 1 hour).



ID: 13459 · Rating: 0

Mikkie
Joined: 1 Apr 06
Posts: 9
Credit: 5,700
RAC: 0
Message 13477 - Posted: 11 Apr 2006, 22:04:23 UTC
Last modified: 11 Apr 2006, 22:05:45 UTC


ID: 13477 · Rating: 0

biodoc
Joined: 19 Feb 06
Posts: 14
Credit: 30,006,457
RAC: 10,261
Message 13478 - Posted: 11 Apr 2006, 22:10:19 UTC

I've got a stuck work unit at 1.042% complete (4 h 50 min) with a 2-hour run time preference. No activity in graphics mode.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13936782

Please advise; should I terminate?
ID: 13478 · Rating: 0

keyboards
Joined: 3 Mar 06
Posts: 36
Credit: 74,787
RAC: 0
Message 13481 - Posted: 11 Apr 2006, 23:04:09 UTC

Aborting 7485_largescale_large_fullatom_relax_dec7485_1_47_1.pdb_432_95. It reached 1.76% after 2 hours with no further progress. My run time preference is set to 2 hours.
!!Stupidity should be PAINFUL!!

ID: 13481 · Rating: 0

biodoc
Joined: 19 Feb 06
Posts: 14
Credit: 30,006,457
RAC: 10,261
Message 13485 - Posted: 11 Apr 2006, 23:24:25 UTC - in response to Message 13478.  

I've got a stuck work unit at 1.042% complete (4 h 50 min) with a 2-hour run time preference. No activity in graphics mode.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13936782

Please advise; should I terminate?


I've aborted this work unit after six hours.
https://boinc.bakerlab.org/rosetta/result.php?resultid=17002591
ID: 13485 · Rating: 0

Purple Rabbit
Joined: 24 Sep 05
Posts: 28
Credit: 3,858,933
RAC: 2,553
Message 13509 - Posted: 12 Apr 2006, 1:31:38 UTC
Last modified: 12 Apr 2006, 1:33:57 UTC

This one ran for 6 hours stuck at 1.04%. I restarted BOINC and the WU began again at zero. It quickly ran up to 1.04%, but seemed to have hung again according to the graphics display. I aborted the WU after 14 minutes (the second time).

TRUNCATE_TERMINI_FULLRELAX_1enh__433_53_0 using Rosetta version 4.98
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13904970
ID: 13509 · Rating: 0

Delk
Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 13513 - Posted: 12 Apr 2006, 2:13:26 UTC

The fun continues,

https://boinc.bakerlab.org/rosetta/result.php?resultid=16988523
https://boinc.bakerlab.org/rosetta/result.php?resultid=17002662

Both aborted via the CLI after sticking at 1%; 12 hours lost.
ID: 13513 · Rating: 0

Christian Barrett
Joined: 17 Sep 05
Posts: 11
Credit: 14,933
RAC: 0
Message 13515 - Posted: 12 Apr 2006, 3:59:27 UTC
Last modified: 12 Apr 2006, 4:00:55 UTC

Here is one that cost me dearly:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13792169

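Sent: 10 Apr 2006 22:50:35 UTC · Reported: 12 Apr 2006 3:54:52 UTC · Server state: Over · Outcome: Client error · Client state: Done · CPU time: 70,199.00 s · Claimed credit: 105.31 · Granted credit: ---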
ID: 13515 · Rating: 0

Delk
Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 13517 - Posted: 12 Apr 2006, 4:14:36 UTC

Another:

https://boinc.bakerlab.org/rosetta/result.php?resultid=16998699

9.5 hours, stuck at 1%, manually killed.
ID: 13517 · Rating: 0

JT.Ault
Joined: 9 Dec 05
Posts: 1
Credit: 829,315
RAC: 0
Message 13518 - Posted: 12 Apr 2006, 4:57:56 UTC - in response to Message 13331.  

home1 rosetta@home 4/11/2006 9:48:33 PM Unrecoverable error for result TRUNCATE_TERMINI_FULLRELAX_2tif__433_106_0 (aborted via GUI RPC)

https://boinc.bakerlab.org/rosetta/result.php?resultid=16970267
Exit status -197 (0xffffff3b)
application version 4.98
Stuck at 1.04%
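
For anyone comparing logs: the two numbers on the exit status line above are the same value written two ways. A minimal sketch, assuming the hex form is simply the signed status shown as a 32-bit two's-complement word (the helper names `to_unsigned32_hex` and `to_signed32` are made up for illustration and are not part of BOINC):

```python
def to_unsigned32_hex(status: int) -> str:
    """Format a signed exit status as a 32-bit two's-complement hex string."""
    return f"0x{status & 0xFFFFFFFF:08x}"


def to_signed32(raw: int) -> int:
    """Interpret a 32-bit unsigned value as a signed integer."""
    return raw - 0x100000000 if raw & 0x80000000 else raw


print(to_unsigned32_hex(-197))   # 0xffffff3b
print(to_signed32(0xFFFFFF3B))   # -197
```

So a negative status in one log and a large 0xffffff.. value in another can refer to the same abort.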
ID: 13518 · Rating: 0

Team_Elteor_Borislavj~Intelligence
Joined: 7 Dec 05
Posts: 14
Credit: 56,027
RAC: 0
Message 13528 - Posted: 12 Apr 2006, 8:11:21 UTC

One of my clients hasn't contacted bakerlab since 23 March; I will investigate this evening why it died...
ID: 13528 · Rating: 0

Delk
Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 13529 - Posted: 12 Apr 2006, 8:25:02 UTC

This is so not my day, maybe I'll hit the record for most lost work in a 24 hour period.

https://boinc.bakerlab.org/rosetta/result.php?resultid=17029579

21,256.53 seconds, still at 1%; manually aborted.

ID: 13529 · Rating: 0

[DPC]Charley
Joined: 18 Mar 06
Posts: 9
Credit: 295,915
RAC: 0
Message 13532 - Posted: 12 Apr 2006, 9:20:04 UTC

Got another two units stuck at 1%; aborted them:
TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_355 after 6 hours
and
TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_479_0 after 10 hours (seriously stuck; no counters increase except the time)

ID: 13532 · Rating: 0

Delk
Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 13534 - Posted: 12 Apr 2006, 10:17:26 UTC

And another one, 4.5 hours stuck at 1%:

https://boinc.bakerlab.org/rosetta/result.php?resultid=17043276

All these stuck units are from different systems, a mix of Linux and Windows.
ID: 13534 · Rating: 0

Christian Hagen
Joined: 26 Sep 05
Posts: 5
Credit: 46,795
RAC: 0
Message 13535 - Posted: 12 Apr 2006, 10:17:37 UTC
Last modified: 12 Apr 2006, 10:18:55 UTC

I also got a WU stuck at 1% and aborted it:

https://boinc.bakerlab.org/rosetta/result.php?resultid=17029338

TRUNCATE_TERMINI_FULLRELAX_1enh__433_691_0 after 2.5 hours
ID: 13535 · Rating: 0

Delk
Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 13539 - Posted: 12 Apr 2006, 11:13:17 UTC
Last modified: 12 Apr 2006, 11:48:43 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=17060746
https://boinc.bakerlab.org/rosetta/result.php?resultid=17044331
and
https://boinc.bakerlab.org/rosetta/result.php?resultid=17051524

OK, what's the word on these work units? This is really annoying.
ID: 13539 · Rating: 0

Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 13542 - Posted: 12 Apr 2006, 12:40:44 UTC

ARGGGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHH

Another failed project 17027763 12417632 11 Apr 2006 20:35:20 UTC 12 Apr 2006 12:28:07 UTC Over Client error Computing 44,064.20 136.62

This makes at least 5 projects with crashes and more than 5 cpu days wasted in total.

What the hell is happening? To say I am frustrated is an understatement.
“This and no other is the root from which a Tyrant springs; when he first appears he is a protector.”
Plato
ID: 13542 · Rating: 0

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 13548 - Posted: 12 Apr 2006, 14:23:27 UTC - in response to Message 13542.  

ARGGGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHH

Another failed project 17027763 12417632 11 Apr 2006 20:35:20 UTC 12 Apr 2006 12:28:07 UTC Over Client error Computing 44,064.20 136.62

This makes at least 5 projects with crashes and more than 5 cpu days wasted in total.

What the hell is happening? To say I am frustrated is an understatement.


Well, don't feel too bad, Jose. I seem to have to abort 60 to 100 hours of wasted CPU time every DAY. Just today I aborted 7 WUs STUCK at 1.04%, for a total of 80 hours.

DAVID, what are you going to do about solving this problem??? Any end in sight?
Babysitting your client consumes a lot of my time.


If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------
ID: 13548 · Rating: 0

Jimi@0wned.org.uk
Joined: 10 Mar 06
Posts: 29
Credit: 335,252
RAC: 0
Message 13550 - Posted: 12 Apr 2006, 15:29:25 UTC
Last modified: 12 Apr 2006, 15:56:09 UTC

2 WUs here stuck at 1.04%:

TRUNCATE_TERMINI_FULLRELAX_1ptq_433_485_0
TRUNCATE_TERMINI_FULLRELAX_1enh_433_558_0

There are two more in this series to come; I'll abort the stuck ones and see what happens.

Edit: the subsequent WUs seem to be running ok, although one of them had already been aborted elsewhere. Anyway, they're both past 8% so fingers crossed. NB: my default is 4 hours and the two units above are the first to have stuck.
ID: 13550 · Rating: 0

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 13551 - Posted: 12 Apr 2006, 15:31:15 UTC - in response to Message 13548.  

ARGGGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHH

Another failed project 17027763 12417632 11 Apr 2006 20:35:20 UTC 12 Apr 2006 12:28:07 UTC Over Client error Computing 44,064.20 136.62

This makes at least 5 projects with crashes and more than 5 cpu days wasted in total.

What the hell is happening? To say I am frustrated is an understatement.


Well, don't feel too bad, Jose. I seem to have to abort 60 to 100 hours of wasted CPU time every DAY. Just today I aborted 7 WUs STUCK at 1.04%, for a total of 80 hours.

DAVID, what are you going to do about solving this problem??? Any end in sight?
Babysitting your client consumes a lot of my time.



Sounds to me like things are worse than they were a week ago; is this correct? The only change is that we increased the default run time from 2 hours to 4 hours, which reduces network traffic at the cost of an increased chance of work unit errors (because the work units run longer). We can set the default back to two hours and see if it helps. Anyway, the main question: are people seeing more stuck work units now than 7-10 days ago?

ID: 13551 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org