Report stuck & aborted 5.01 WU here please - III

Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 14386 - Posted: 22 Apr 2006, 16:00:34 UTC - in response to Message 14385.  

... Would be nice to get points.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13422021


Please read this post

Moderator9
ROSETTA@home FAQ
Moderator Contact

Tribaal
Joined: 6 Feb 06
Posts: 80
Credit: 2,754,607
RAC: 0
Message 14397 - Posted: 22 Apr 2006, 17:41:34 UTC

22.04.2006 19:39:50|rosetta@home|Unrecoverable error for result PROD_ABINITIO_1tul__454_145_0 ( - exit code -1073741811 (0xc000000d))

Hope this helps =(

- trib'

[DPC]Division_Brabant~OldButNotSoWise
Joined: 23 Jan 06
Posts: 42
Credit: 371,797
RAC: 0
Message 14405 - Posted: 22 Apr 2006, 18:37:46 UTC

mewbysea
Joined: 29 Jan 06
Posts: 17
Credit: 15,843,832
RAC: 1,814
Message 14414 - Posted: 22 Apr 2006, 21:20:33 UTC

Aborted 2 stuck WUs:

HBLR_1.0_1di2_420_4698 at 10:11 hours and 3.6941%; see result ID 17749702 (full atom relax, model 1, step 32974)

HBLR_1.0_2tif_420_9229 at 8:59 hours and 4.996%; see result ID 17770512 (full atom relax, model 1, step 34201)

Both were re-releases from 6 April (no results returned).




Ian
Joined: 14 Apr 06
Posts: 29
Credit: 326,863
RAC: 577
Message 14422 - Posted: 22 Apr 2006, 22:47:18 UTC

Aborted this one after 25hrs, as per my other thread...

https://boinc.bakerlab.org/rosetta/result.php?resultid=17774846
Ian Cundell, St Albans, UK

Chilcotin
Joined: 5 Nov 05
Posts: 15
Credit: 16,969,500
RAC: 0
Message 14430 - Posted: 23 Apr 2006, 2:03:29 UTC

Workunit

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13401144

aborted after 27 hours. It was making progress but was only 12% complete by the time I quit.

Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 14448 - Posted: 23 Apr 2006, 4:19:38 UTC

I have moved the discussion about the new abort feature to this thread.
Moderator9
ROSETTA@home FAQ
Moderator Contact

Delk
Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 14451 - Posted: 23 Apr 2006, 4:51:54 UTC
Last modified: 23 Apr 2006, 4:52:34 UTC

This one stuck without progress after 7%:

https://boinc.bakerlab.org/rosetta/result.php?resultid=17853207

WU name: NO_TERM_STRAND_1ogw_423_2866
checkpoint CPU time: 98378.230000
current CPU time: 98951.020000
fraction done: 0.077710
estimated CPU time remaining: 115357.613121
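
To put the numbers above in more readable units, here is a minimal sketch that converts the reported CPU seconds to hours and compares the reported time remaining with a naive linear extrapolation from "fraction done". The helper names are hypothetical, and the linear formula is an assumption used only for comparison; it is not claimed to be how BOINC or Rosetta computes its estimate.

```python
# Values copied from the status report in the post above.
checkpoint_cpu = 98378.230000      # seconds
current_cpu = 98951.020000         # seconds
fraction_done = 0.077710
reported_remaining = 115357.613121 # seconds, as shown by the client


def hours(seconds: float) -> float:
    """Convert seconds to hours."""
    return seconds / 3600.0


def linear_remaining(cpu_seconds: float, fraction: float) -> float:
    """Naive estimate: assume the rest of the work proceeds at the
    average rate seen so far (an assumption, not the client's method)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return cpu_seconds * (1.0 - fraction) / fraction


if __name__ == "__main__":
    print(f"CPU time so far:       {hours(current_cpu):.2f} h")
    print(f"Since last checkpoint: {current_cpu - checkpoint_cpu:.2f} s")
    print(f"Reported remaining:    {hours(reported_remaining):.2f} h")
    print(f"Linear extrapolation:  "
          f"{hours(linear_remaining(current_cpu, fraction_done)):.2f} h")
```

With these figures the work unit has used roughly 27.5 CPU hours, and the linear extrapolation comes out at well over 300 hours, far above the reported remaining time. That gap is one simple way to spot a result that is progressing much more slowly than the client expects.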

Rebel Alliance
Joined: 4 Nov 05
Posts: 50
Credit: 3,579,531
RAC: 0
Message 14454 - Posted: 23 Apr 2006, 6:01:11 UTC

Just aborted 4 work units from 4 different machines. The longest had been running close to 10 hours and was at 5%; the shortest, 6 hours and at 1%.
#1 from 2700xp
Result ID 17772227
Name HBLR_1.0_1mky_420_9630_1
Workunit 13428053
Created 20 Apr 2006 21:42:41 UTC
Sent 21 Apr 2006 4:22:49 UTC
Received 23 Apr 2006 5:53:20 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 148992
Report deadline 5 May 2006 4:22:49 UTC
CPU time 32013.537868

#2 From 1800 xp
Result ID 17805638
Name NO_TERM_STRAND_1ogw_423_6947_2
Workunit 13496532
Created 21 Apr 2006 5:49:41 UTC
Sent 21 Apr 2006 8:05:02 UTC
Received 23 Apr 2006 5:52:38 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 105489
Report deadline 5 May 2006 8:05:02 UTC
CPU time 24477.506926

#3 from 2000 xp
Result ID 17748958
Name FACONTACTS_RECENTER_NOFILTERS_1ig5A_448_551_1
Workunit 14550587
Created 20 Apr 2006 16:34:25 UTC
Sent 20 Apr 2006 22:38:14 UTC
Received 23 Apr 2006 5:51:22 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 106748
Report deadline 4 May 2006 22:38:14 UTC
CPU time 25011.984375

#4 from 2500 Xp
Result ID 17786001
Name HBLR_1.0_1n0u_ROT_TRIALS_TRIE_449_5_0
Workunit 14630032
Created 21 Apr 2006 1:00:11 UTC
Sent 21 Apr 2006 3:09:30 UTC
Received 23 Apr 2006 5:50:36 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 107679
Report deadline 5 May 2006 3:09:30 UTC
CPU time 22721.8125

Lucky Angel~AES_koetje
Joined: 18 Mar 06
Posts: 4
Credit: 0
RAC: 0
Message 14465 - Posted: 23 Apr 2006, 11:11:36 UTC - in response to Message 14397.  

22.04.2006 19:39:50|rosetta@home|Unrecoverable error for result PROD_ABINITIO_1tul__454_145_0 ( - exit code -1073741811 (0xc000000d))

Hope this helps =(

- trib'


I have seen this error code:
exit code -1073741811 (0xc000000d)
too often. I spent over an hour searching for a useful interpretation. Does anybody know the answer?
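
On reading codes like this one: the negative number and the hex value are the same 32-bit quantity, shown signed and unsigned. On Windows, values in the 0xC0000000 range are normally NTSTATUS codes, and Microsoft's published list gives 0xC000000D as STATUS_INVALID_PARAMETER and 0xC0000005 as STATUS_ACCESS_VIOLATION. The sketch below does the conversion and looks up those two names. It assumes the client passes the Windows status through unchanged, the helper names are hypothetical, and the table is deliberately tiny. It says nothing about why the application hit the error.

```python
# Two NTSTATUS names taken from Microsoft's published NTSTATUS list.
# This is not a complete table.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}


def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return exit_code & 0xFFFFFFFF


def describe(exit_code: int) -> str:
    """Format an exit code the way the result pages show it, plus a name
    when the value is one of the two entries in the table above."""
    value = to_unsigned32(exit_code)
    name = KNOWN_NTSTATUS.get(value, "not in this table")
    return f"{exit_code} (0x{value:08x}): {name}"


if __name__ == "__main__":
    # Exit codes that appear in this thread.
    for code in (-1073741811, -1073741819, -197):
        print(describe(code))
```

The -197 (0xffffff3b) status seen in other reports here is not in the table; the stderr output posted in this thread pairs it with "aborted via GUI RPC", which suggests it comes from the BOINC client rather than from Windows.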

Tallguy-13088
Joined: 14 Dec 05
Posts: 9
Credit: 843,378
RAC: 0
Message 14478 - Posted: 23 Apr 2006, 13:44:41 UTC - in response to Message 14373.  

Just as a quick update, these two work units are up in the 30+ hour range and seem to be progressing, albeit slowly (around 12% completion). Since I have read in another post that these are "more intense" models that are using new code, I don't have a problem running them as long as they make forward progress.

What I am seeing is that the "slowdown" is definitely in the Full Atom Relax stage (Ab Initio seems to crank right along). Current estimates put both of these WUs in the 300-hour range. I guess we will all know how it works out in a couple of weeks if all continues to compute.

Folks,

Just as an FYI at this point, I have two "long runners" on two separate machines. The first is HBLR_1.0_1djt_420_4640_1 running on a Dual Xeon P4 @ 2.8 GHz. The numbers are CPU_Time: 18.49 Hrs, Complete: 7.68% and To Completion: 20.12 Hrs. The second is HBLR_1.0_1n0u_420_9492_1 running on a 3.2 GHz P4 with CPU_Time: 19.26 Hrs, Complete: 7.28% and To Completion: 20.45 Hrs. Both are version 5.01.

The graphics seem to be updating and both work units are apparently making forward progress (tons of data points). Both appear to be swapping between 2+ seconds per step and then the normal "fast" multiple steps per second. Obviously the green data points are the "fast steppers".

I plan on running these up to about 50 hours apiece before aborting them (if they appear to require significantly more time). My goal is to give you guys as much diagnostic info as possible, which translates into run-time.


universum
Joined: 22 Mar 06
Posts: 1
Credit: 42,464
RAC: 0
Message 14487 - Posted: 23 Apr 2006, 16:09:50 UTC

I have been running the same work unit for more than 17 hours now (usually one work unit takes 2-4 hours for me), and it seems to restart over and over on "model 1" and is stuck at a few percent. It was up at 3%-something when I restarted the BOINC manager; it then started from 1.00% and is now up at 1.6%. It's just not making any progress, and it doesn't abort automatically either. Something must be wrong.

de Mecquenem Pascal
Joined: 11 Oct 05
Posts: 1
Credit: 1,366,202
RAC: 0
Message 14489 - Posted: 23 Apr 2006, 17:09:10 UTC

Had to abort this work unit today. Stuck at 8.04% after 13 hours (time to completion: 13 hours). Closed and restarted BOINC: it was then stuck at 1.01%.
Graphics worked fine.

Windows XP Home Edition


Name HBLR_1.0_1n0u_420_7152_1
Workunit 13415665
Created 20 Apr 2006 19:27:24 UTC
Sent 21 Apr 2006 1:29:25 UTC
Received 23 Apr 2006 0:44:07 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 176653
Report deadline 5 May 2006 1:29:25 UTC
CPU time 51250.078125
stderr out <core_client_version>5.2.13</core_client_version>
<message>aborted via GUI RPC
</message>
<stderr_txt>
# cpu_run_time_pref: 7200
# random seed: 1543553
# cpu_run_time_pref: 7200
# random seed: 1543553

</stderr_txt>


Validate state Invalid
Claimed credit 138.782538165437
Granted credit 0
application version 5.01

The Cow Association
Joined: 15 Jan 06
Posts: 1
Credit: 145,104
RAC: 0
Message 14494 - Posted: 23 Apr 2006, 18:06:12 UTC

I have one job that has been running for 43 hours now, and it says 31 hours still remain until completion.
The job is making progress and is at 29.47%.
Do I get the normal amount of points for this, or is it better to abort the job?
It is a HBLR_1.0_1b71_ROT_TRIALS_TRIE_449_30_0 job.

Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 14500 - Posted: 23 Apr 2006, 21:37:11 UTC - in response to Message 14494.  

Go ahead and abort the jobs that have been going for more than 10 hours -- we are seeing incompatibility of certain workunits with certain machines. (We're testing the fix over on ralph now.) You'll still get credit later in the week when we grant credit for claimed credit! And you'll get some workunits that should not get stuck. Thanks for posting.


I have one job that has been running for 43 hours now, and it says 31 hours still remain until completion.
The job is making progress and is at 29.47%.
Do I get the normal amount of points for this, or is it better to abort the job?
It is a HBLR_1.0_1b71_ROT_TRIALS_TRIE_449_30_0 job.


TCU Computer Science
Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 14515 - Posted: 24 Apr 2006, 1:42:19 UTC

These 5.01 WUs were aborted today:

11.8 hrs
https://boinc.bakerlab.org/rosetta/result.php?resultid=17929192
FACONTACTS_RECENTER_NOFILTERS_1ubi__448_846

34.7 hrs
https://boinc.bakerlab.org/rosetta/result.php?resultid=17754714
HBLR_1.0_1hz6_420_5519

44.2 hrs
https://boinc.bakerlab.org/rosetta/result.php?resultid=17786665
HBLR_1.0_1di2_ROT_TRIALS_TRIE_449_49

50.7 hrs
https://boinc.bakerlab.org/rosetta/result.php?resultid=17762275
HBLR_1.0_1hz6_420_7237

49.3 hrs
https://boinc.bakerlab.org/rosetta/result.php?resultid=17773010
FACONTACTS_RECENTER_NOFILTERS_1vls__448_927

27.5 hrs
https://boinc.bakerlab.org/rosetta/result.php?resultid=17797075
NO_TERM_STRAND_1ogw_423_3285

Metal-Phantom~MetalMike
Joined: 8 Mar 06
Posts: 2
Credit: 2,052,366
RAC: 0
Message 14520 - Posted: 24 Apr 2006, 5:13:31 UTC

HBLR_1.0_1n0u_420_9804_2

After 9.05 hours and 3.6%, it killed itself on my P-M 1.6 running WinXP
<error_code>-161</error_code>

https://boinc.bakerlab.org/rosetta/result.php?resultid=17871439

biodoc
Joined: 19 Feb 06
Posts: 14
Credit: 30,717,792
RAC: 0
Message 14528 - Posted: 24 Apr 2006, 10:08:22 UTC
Last modified: 24 Apr 2006, 10:16:05 UTC

I just aborted a WU (details below). It had been running 8+ hours at 2% complete, and I have a 2 hr runtime pref. set. The behavior of this WU was similar to the FACONTACT & HBLR_1.0 WUs that I've seen posted here & experienced myself as "long runners". They seem to run normally through the "model 1" process (# of steps in the 6-figure range) &, instead of moving on to "model 2, step 1", they start over at model 1, step 1. Perhaps it could be described as a "model 1 loop" bug? Has anyone else seen this?

I'm only running Rosetta & I have "leave in memory" checked as a pref.

Could it be that the Accepted RMSD & Accepted energy parameters or "goals" are not met during the model 1 calculation, so it does not move on to a model 2 calculation & just starts the model 1 calculation over again?


Result ID 18037907
Name FARELAX_NOFILTERS_1scjB_417_302_3
Workunit 13208951

Tallguy-13088
Joined: 14 Dec 05
Posts: 9
Credit: 843,378
RAC: 0
Message 14529 - Posted: 24 Apr 2006, 10:14:15 UTC - in response to Message 14478.  
Last modified: 24 Apr 2006, 10:15:27 UTC

Finally aborted these two work units. The links are as follows:

HBLR_1.0_1dtj_420_4640
and
HBLR_1.0_1n0u_420_9429_1

Other than the large compute times, nothing really "stands out" about these two.


Just as a quick update, these two work units are up in the 30+ hour range and seem to be progressing, albeit slowly (around 12% completion). Since I have read in another post that these are "more intense" models that are using new code, I don't have a problem running them as long as they make forward progress.

What I am seeing is that the "slowdown" is definitely in the Full Atom Relax stage (Ab Initio seems to crank right along). Current estimates put both of these WUs in the 300-hour range. I guess we will all know how it works out in a couple of weeks if all continues to compute.

Folks,

Just as an FYI at this point, I have two "long runners" on two separate machines. The first is HBLR_1.0_1djt_420_4640_1 running on a Dual Xeon P4 @ 2.8 GHz. The numbers are CPU_Time: 18.49 Hrs, Complete: 7.68% and To Completion: 20.12 Hrs. The second is HBLR_1.0_1n0u_420_9492_1 running on a 3.2 GHz P4 with CPU_Time: 19.26 Hrs, Complete: 7.28% and To Completion: 20.45 Hrs. Both are version 5.01.

The graphics seem to be updating and both work units are apparently making forward progress (tons of data points). Both appear to be swapping between 2+ seconds per step and then the normal "fast" multiple steps per second. Obviously the green data points are the "fast steppers".

I plan on running these up to about 50 hours apiece before aborting them (if they appear to require significantly more time). My goal is to give you guys as much diagnostic info as possible, which translates into run-time.



Rebel Alliance
Joined: 4 Nov 05
Posts: 50
Credit: 3,579,531
RAC: 0
Message 14542 - Posted: 24 Apr 2006, 16:24:14 UTC
Last modified: 24 Apr 2006, 16:25:38 UTC

Three more aborted. The shortest had a running time of 26 hours; the longest, 36 hours. Plus one failed work unit.

#1 barton 2700
Result ID 17806384
Name FACONTACTS_RECENTER_NOFILTERS_1pgx__448_969_1
Workunit 14579018
Created 21 Apr 2006 5:58:36 UTC
Sent 21 Apr 2006 12:17:56 UTC
Received 24 Apr 2006 16:13:08 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 148992
Report deadline 5 May 2006 12:17:56 UTC
CPU time 89068.004663

#2 3 gig P4
Result ID 17783537
Name FACONTACTS_RECENTER_NOFILTERS_1enh__448_738_1
Workunit 14563296
Created 21 Apr 2006 0:23:48 UTC
Sent 21 Apr 2006 7:09:10 UTC
Received 24 Apr 2006 16:16:05 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 130403
Report deadline 5 May 2006 7:09:10 UTC
CPU time 126339.9375

and on the same machine
#3 Wasn't an aborted unit. It failed on its own.
Result ID 17796282
Name NO_TERM_STRAND_1ogw_423_2065_1
Workunit 13457432
Created 21 Apr 2006 3:19:41 UTC
Sent 21 Apr 2006 5:42:04 UTC
Received 23 Apr 2006 5:06:12 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -1073741819 (0xc0000005)
Computer ID 130403
Report deadline 5 May 2006 5:42:04 UTC
CPU time 6977.347175

#4 2500 Barton
Result ID 17809982
Name NO_TERM_STRAND_1ogw_423_8417_1
Workunit 13508136
Created 21 Apr 2006 6:57:56 UTC
Sent 21 Apr 2006 8:04:02 UTC
Received 24 Apr 2006 16:18:43 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 155638
Report deadline 5 May 2006 8:04:02 UTC
CPU time 131644.328125



©2024 University of Washington
https://www.bakerlab.org