Report stuck & aborted WU here please - II

biodoc
Joined: 19 Feb 06
Posts: 14
Credit: 30,720,744
RAC: 0
Message 13478 - Posted: 11 Apr 2006, 22:10:19 UTC

I've got a stuck work unit at 1.042% complete (4h 50min) with a 2-hour runtime preference. No activity in graphics mode.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13936782

Please advise; should I terminate?
ID: 13478 · Rating: 0
keyboards
Joined: 3 Mar 06
Posts: 36
Credit: 74,787
RAC: 0
Message 13481 - Posted: 11 Apr 2006, 23:04:09 UTC

Aborting 7485_largescale_large_fullatom_relax_dec7485_1_47_1.pdb_432_95. Completed 1.76% after 2 hours with no further advance. Set for 2 hours.
!!Stupidity should be PAINFUL!!

ID: 13481 · Rating: 0
biodoc
Joined: 19 Feb 06
Posts: 14
Credit: 30,720,744
RAC: 0
Message 13485 - Posted: 11 Apr 2006, 23:24:25 UTC - in response to Message 13478.  

I've aborted this work unit after six hours.
https://boinc.bakerlab.org/rosetta/result.php?resultid=17002591
ID: 13485 · Rating: 0
Purple Rabbit
Joined: 24 Sep 05
Posts: 28
Credit: 4,536,152
RAC: 0
Message 13509 - Posted: 12 Apr 2006, 1:31:38 UTC
Last modified: 12 Apr 2006, 1:33:57 UTC

This one ran for 6 hours stuck at 1.04%. I restarted BOINC and the WU began again at zero. It quickly ran up to 1.04%, but seemed to have hung again according to the graphics display. I aborted the WU after 14 minutes (the second time).

TRUNCATE_TERMINI_FULLRELAX_1enh__433_53_0 using rosetta version 498
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13904970
ID: 13509 · Rating: 0
Christian Barrett
Joined: 17 Sep 05
Posts: 11
Credit: 14,933
RAC: 0
Message 13515 - Posted: 12 Apr 2006, 3:59:27 UTC
Last modified: 12 Apr 2006, 4:00:55 UTC

Here is one that cost me dearly:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13792169

Sent 10 Apr 2006 22:50:35 UTC, reported 12 Apr 2006 3:54:52 UTC, server state Over, outcome Client error, client state Done, CPU time 70,199.00 s, claimed credit 105.31, granted credit ---
ID: 13515 · Rating: 0
JT.Ault
Joined: 9 Dec 05
Posts: 1
Credit: 829,315
RAC: 0
Message 13518 - Posted: 12 Apr 2006, 4:57:56 UTC - in response to Message 13331.  

home1 rosetta@home 4/11/2006 9:48:33 PM Unrecoverable error for result TRUNCATE_TERMINI_FULLRELAX_2tif__433_106_0 (aborted via GUI RPC)

https://boinc.bakerlab.org/rosetta/result.php?resultid=16970267
Exit status -197 (0xffffff3b)
application version 4.98
Stuck at 1.04%
ID: 13518 · Rating: 0
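The two forms of the exit status in the post above are the same number: the hex value is the 32-bit two's-complement encoding of -197. A minimal Python sketch of the conversion follows; the function names are made up for illustration and are not part of BOINC.

```python
def to_unsigned32_hex(status: int) -> str:
    """Render a signed exit status as 32-bit two's-complement hex."""
    # Masking with 0xFFFFFFFF keeps the low 32 bits, which maps a
    # negative value onto its unsigned two's-complement form.
    return f"0x{status & 0xFFFFFFFF:08x}"


def to_signed32(hex_text: str) -> int:
    """Convert a 32-bit hex string back to a signed integer."""
    value = int(hex_text, 16)
    # If the top bit is set, the value represents a negative number.
    return value - (1 << 32) if value & 0x80000000 else value


print(to_unsigned32_hex(-197))
print(to_signed32("0xffffff3b"))
```

This only shows how the two notations relate; it says nothing about what the status means to the BOINC client.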
Team_Elteor_Borislavj~Intelligence
Joined: 7 Dec 05
Posts: 14
Credit: 56,027
RAC: 0
Message 13528 - Posted: 12 Apr 2006, 8:11:21 UTC

One of my clients hasn't contacted bakerlab since 23 March; I will investigate this evening why it died...
ID: 13528 · Rating: 0
[DPC]Charley
Joined: 18 Mar 06
Posts: 9
Credit: 295,915
RAC: 0
Message 13532 - Posted: 12 Apr 2006, 9:20:04 UTC

Got another two units stuck at 1%; aborted them:
TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_355 after 6 hours
and
TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_479_0 after 10 hours (seriously stuck, no counters increase except for the time)

ID: 13532 · Rating: 0
Christian Hagen
Joined: 26 Sep 05
Posts: 5
Credit: 46,795
RAC: 0
Message 13535 - Posted: 12 Apr 2006, 10:17:37 UTC
Last modified: 12 Apr 2006, 10:18:55 UTC

I also got a WU stuck at 1% and aborted it:

https://boinc.bakerlab.org/rosetta/result.php?resultid=17029338

TRUNCATE_TERMINI_FULLRELAX_1enh__433_691_0 after 2.5 hours
ID: 13535 · Rating: 0
Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 13542 - Posted: 12 Apr 2006, 12:40:44 UTC

ARGGGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHH

Another failed project: result 17027763, work unit 12417632, sent 11 Apr 2006 20:35:20 UTC, reported 12 Apr 2006 12:28:07 UTC, server state Over, outcome Client error, client state Computing, CPU time 44,064.20 s, claimed credit 136.62.

This makes at least 5 projects with crashes and more than 5 CPU days wasted in total.

What the hell is happening? To say I am frustrated is an understatement.
“This and no other is the root from which a Tyrant springs; when he first appears he is a protector.”
Plato
ID: 13542 · Rating: 0
Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 13548 - Posted: 12 Apr 2006, 14:23:27 UTC - in response to Message 13542.  

Well, don't feel too bad, Jose. I seem to have to abort 60 to 100 hours of wasted CPU time every DAY. Just today I aborted 7 WUs STUCK at 1.04%, for a total of 80 hours.

DAVID, what are you going to do about solving this problem? Any end in sight?
Babysitting your client consumes a lot of my time.


If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------
ID: 13548 · Rating: 0
Jimi@0wned.org.uk
Joined: 10 Mar 06
Posts: 29
Credit: 335,252
RAC: 0
Message 13550 - Posted: 12 Apr 2006, 15:29:25 UTC
Last modified: 12 Apr 2006, 15:56:09 UTC

2 WUs here stuck at 1.04%

TRUNCATE_TERMINI_FULLRELAX_1ptq_433_485_0
TRUNCATE_TERMINI_FULLRELAX_1enh_433_558_0

There are two more in this series to come; I'll abort the stuck ones and see what happens.

Edit: the subsequent WUs seem to be running ok, although one of them had already been aborted elsewhere. Anyway, they're both past 8% so fingers crossed. NB: my default is 4 hours and the two units above are the first to have stuck.
ID: 13550 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 13551 - Posted: 12 Apr 2006, 15:31:15 UTC - in response to Message 13548.  

Sounds to me like things are worse than they were a week ago; is this correct? The only change is that we increased the default run time from 2 hours to 4 hours, which reduces network traffic at the cost of an increased chance of work unit errors (because they are longer). We can set the default back to two hours and see if it helps. Anyway, main question: are people seeing more stuck work units now than 7-10 days ago?

ID: 13551 · Rating: 0
rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 13552 - Posted: 12 Apr 2006, 16:16:02 UTC - in response to Message 13551.  
Last modified: 12 Apr 2006, 16:17:14 UTC

Anyway, main question: are people seeing more stuck work units now than 7-10 days ago?

Rom (or someone) should probably do an analysis to see what (if any) common factors there are for the errored units, and the overall frequency. Knock on wood (although with limited sampling), I have kept my run time at 8 hours and have not had any problems with 4.98.

Regards,
Bob P.
ID: 13552 · Rating: 0
arminius
Joined: 23 Sep 05
Posts: 8
Credit: 883,822
RAC: 0
Message 13553 - Posted: 12 Apr 2006, 16:29:10 UTC
Last modified: 12 Apr 2006, 16:34:10 UTC

My first (Linux box), stuck at 1.04%:
TRUNCATE_TERMINI_FULLRELAX_1enh__433_38_0
a.
ID: 13553 · Rating: 0
Robert Everly
Joined: 8 Oct 05
Posts: 27
Credit: 665,094
RAC: 0
Message 13557 - Posted: 12 Apr 2006, 17:40:59 UTC

Just got my first stuck WU. Yay me :(

Anyway, it's:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13923483

It's currently at 8+49 CPU time. Stuck at 1.042%

It has exceeded both the default run time and my run time setting.

I have suspended the WU. BOINC 5.2.13. Please advise as to what to do with this WU.
ID: 13557 · Rating: 0
jomebrew
Joined: 31 Mar 06
Posts: 2
Credit: 25,914,516
RAC: 0
Message 13559 - Posted: 12 Apr 2006, 18:02:57 UTC

I have a couple of these on my Linux system. I would appreciate some help on a clean way to abort these on Linux. I have been hacking client_state.xml and deleting files in the slots directory. There has to be a better way.

Warning! PRODUCTION_ABINITIO_CENTROID_PACKING_1ctf__429_247_0 was started at 2006-04-09 20:52:34 but has not finished!

Warning! HBLR_1.0_2reb_426_994_0 was started at 2006-04-09 23:07:09 but has not finished!

Warning! 7449_largescale_large_fullatom_relax_dec7449_1_05_6.pdb_431_53_0 was started at 2006-04-09 20:58:18 but has not finished!

Warning! PRODUCTION_ABINITIO_CENTROID_PACKING_1vls__428_262_0 was started at 2006-04-09 21:19:28 but has not finished!

Warning! 7485_largescale_large_fullatom_relax_dec7485_1_05_8.pdb_432_129_0 was started at 2006-04-09 21:50:01 but has not finished!

Warning! TRUNCATE_TERMINI_FULLRELAX_1ptq__433_587_0 was started at 2006-04-11 17:55:43 but has not finished!


ID: 13559 · Rating: 0
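For anyone collecting warnings like the ones above, the following Python sketch pulls out each unfinished task name and how long ago it started. It assumes the exact line format quoted in the post (the tool that prints these warnings is not named there), treats the timestamps and the chosen `now` as being in the same time zone, and uses hypothetical names (`WARNING_RE`, `unfinished_tasks`) throughout. It only reports; it does not abort anything.

```python
import re
from datetime import datetime

# Pattern for the warning lines as quoted in the post (assumed format).
WARNING_RE = re.compile(
    r"Warning! (?P<name>\S+) was started at "
    r"(?P<started>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) but has not finished!"
)


def unfinished_tasks(log_text, now):
    """Return (task_name, hours_since_start) pairs, longest-running first."""
    tasks = []
    for match in WARNING_RE.finditer(log_text):
        started = datetime.strptime(match.group("started"), "%Y-%m-%d %H:%M:%S")
        hours = (now - started).total_seconds() / 3600
        tasks.append((match.group("name"), round(hours, 1)))
    return sorted(tasks, key=lambda task: task[1], reverse=True)


sample = """\
Warning! HBLR_1.0_2reb_426_994_0 was started at 2006-04-09 23:07:09 but has not finished!
Warning! TRUNCATE_TERMINI_FULLRELAX_1ptq__433_587_0 was started at 2006-04-11 17:55:43 but has not finished!
"""

# Assumption: "now" is taken from the post time and shares the log's time zone.
now = datetime(2006, 4, 12, 18, 2, 57)
for name, hours in unfinished_tasks(sample, now):
    print(f"{name}: {hours} h since start")
```

Sorting by elapsed time puts the longest-running candidates first, which is where a stuck work unit is most likely to show up.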
n7zfi
Joined: 7 Apr 06
Posts: 1
Credit: 4,623,875
RAC: 0
Message 13563 - Posted: 12 Apr 2006, 18:19:18 UTC

Running on Windows XP Pro, I have a WU stuck at 1.04%. The graphics appear to be locked up; nothing is moving even though the CPU utilization clock keeps ticking. The WU in question is:

TRUNCATE_TERMINI_FULLRELAX_1ptq_433_906_0

I have suspended it after 1:34:22 of run time. The other WUs progress past that point in a few minutes.
ID: 13563 · Rating: 0
snoekbaars
Joined: 16 Mar 06
Posts: 2
Credit: 12,136
RAC: 0
Message 13565 - Posted: 12 Apr 2006, 18:42:22 UTC

Work unit aborted at 48% after ~24 hours of CPU time. The estimated time to completion kept going up. Nothing moved in the graphics.

WU Name "FA_RLXpt_hom003_1ptq__361_156_3" - Application "rosetta 4.98"
Workunit = 11684527; Result ID = 16802748; System = Intel P4 3.0GHz, Win-XP SP 2

The workunit still reports "in progress" at the time of writing this message.
The workunit was aborted manually ("Aborted via GUI RPC").
ID: 13565 · Rating: 0
Grutte Pier [Wa Oars]~MAB The Frisian
Send message
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 13569 - Posted: 12 Apr 2006, 19:17:41 UTC

I just had another WU that ran for more than 6 hours at 1.17%. When I checked again, another one had started, which has now been running for 45 minutes at 1.06%, but I cannot find that other WU in my results.
Better to test-drive a project like this more thoroughly before letting so many people waste their money.
If I go on this month, it will be the last anyway.
Rather fed up with it.
No fun at all anymore.

ID: 13569 · Rating: 0