Report stuck & aborted 5.01 WU here please - III

Message boards : Number crunching : Report stuck & aborted 5.01 WU here please - III

Ackiss
Joined: 13 Apr 06
Posts: 1
Credit: 607,960
RAC: 0
Message 14803 - Posted: 28 Apr 2006, 0:06:52 UTC

4/27/2006 7:49:18 PM|rosetta@home|Unrecoverable error for result HB_BARCODE_30_1a68__351_24311_2 (aborted via GUI RPC)


5.01 error. It had over 85 hours of CPU time and claimed to be just over 11% done. I should have been watching more closely; I wasn't aware there was a problem until I checked the news.

Ackiss
ID: 14803 · Rating: 0
mewbysea
Joined: 29 Jan 06
Posts: 17
Credit: 15,917,465
RAC: 2,183
Message 14809 - Posted: 28 Apr 2006, 0:49:13 UTC

Aborted workunit ID 12454830. Two other computers failed to return this WU (or any others), so they may *still* be crunching it! See HB_BARCODE_30_1ctf_351_39751.

Result ID 18270294

Stopped at 20:05 hours and 8.518%.

Run on an HP D530, Pentium 4 @ 2.66 GHz (stock), under WIN XP Home.

ID: 14809 · Rating: 0
Delk
Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 14811 - Posted: 28 Apr 2006, 1:23:16 UTC

I thought the FACONTACTS_RECENTER* WUs were meant to be removed from the workunit queue on your end. Why are known bad units still being sent out? This one is now back in the queue for the next lucky recipient.


name: FACONTACTS_RECENTER_NOFILTERS_1rnbA_448_803_1
WU name: FACONTACTS_RECENTER_NOFILTERS_1rnbA_448_803
project URL: https://boinc.bakerlab.org/rosetta/
report deadline: Thu May 11 13:42:38 2006
app version num: 501
checkpoint CPU time: 38443.460000
current CPU time: 40928.280000
fraction done: 0.016453
VM usage: 0.000000
resident set size: 0.000000
estimated CPU time remaining: 67723.265742
supports graphics: no


https://boinc.bakerlab.org/rosetta/result.php?resultid=18357585
ID: 14811 · Rating: 0
mhhall
Joined: 28 Mar 06
Posts: 7
Credit: 10,193,127
RAC: 0
Message 14817 - Posted: 28 Apr 2006, 3:03:54 UTC - in response to Message 14816.  

[snip]

Please do not abort your current 5.01 work units if they are running well. The science from them is still important to the project. Many of you have asked if credit will be awarded for these failed work units, and the answer is yes.
[snip]

My current work unit has been running since Monday evening and shows progress at 5.90% with a CPU time of 148:47:21. Does this meet the criteria of "running well"?

I don't mind letting this process continue to run...
I'm just worried that it's not going to finish anyway...

Mike Hall / Engineering Solutions, Inc.
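As a rough check on the question above, a naive linear extrapolation divides elapsed CPU time by the fraction done to estimate the total. This is an assumption: it treats progress as steady, which a stuck WU by definition is not, and it is not how the BOINC client computes its own estimate. A minimal Python sketch; the helper names are hypothetical:

```python
def parse_hms(text: str) -> float:
    """Convert an 'H:MM:SS' CPU-time string into hours."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours + minutes / 60 + seconds / 3600


def naive_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Linear extrapolation: total = elapsed / fraction done (assumes steady progress)."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


elapsed = parse_hms("148:47:21")           # CPU time reported in the post above
total = naive_total_hours(elapsed, 0.059)  # 5.90% done
remaining = total - elapsed
print(f"elapsed {elapsed:.1f} h, projected total {total:.0f} h "
      f"({total / 24:.0f} days), about {remaining:.0f} h still to go")
```

Under that assumption the WU would need on the order of a hundred days of CPU time, far beyond the roughly two-week report deadlines shown on the results elsewhere in this thread.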
ID: 14817 · Rating: 0
eNDo
Joined: 9 Apr 06
Posts: 9
Credit: 372,288
RAC: 0
Message 14827 - Posted: 28 Apr 2006, 5:31:03 UTC
Last modified: 28 Apr 2006, 5:46:19 UTC

I only just realized there was a thread for reporting this. I'd have to go through all my results to see which ones failed and when. I'm new to Rosetta and figured this out when everyone on my team was having the same issues with the same WUs. I checked previous results on those WUs and no one had finished them. I know it was at least 8 WUs for my server host alone. Do I need to pull up all the details for you?

-edit-
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=14041906
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=14925740
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=14560586
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=14226095
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=14226095
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13409472
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13408713
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13408400
I hope I listed them correctly. Again, this is just one user; I guess it's up to the others to report theirs. To my recollection, the aborted ones with lower CPU time were stuck at 1% for roughly an hour or more.
-edit-

Sincerely,

F. Bulson
Endonet Inc.

ID: 14827 · Rating: 0
Team_Elteor_Borislavj~Intelligence
Joined: 7 Dec 05
Posts: 14
Credit: 56,027
RAC: 0
Message 14830 - Posted: 28 Apr 2006, 7:17:28 UTC

I lost 44 hours of CPU time; I'm aborting it now. It hung at 1.04%.
HBLR_1.0_1ogw_420_6744_2


ID: 14830 · Rating: 0
anbrager
Joined: 21 Nov 05
Posts: 3
Credit: 535,292
RAC: 0
Message 14835 - Posted: 28 Apr 2006, 7:54:31 UTC

Hi,

I have aborted Result ID 18179436 "HBLR_1.0_1b72_ROT_TRIALS_TRIE_449_13_1"
after 10.5 hours at 5%.
ID: 14835 · Rating: 0
Team_Elteor_Borislavj~Intelligence
Joined: 7 Dec 05
Posts: 14
Credit: 56,027
RAC: 0
Message 14841 - Posted: 28 Apr 2006, 8:54:45 UTC - in response to Message 14830.  

I lost 44 hours of CPU time; I'm aborting it now. It hung at 1.04%.
HBLR_1.0_1ogw_420_6744_2


Can I get my credit for it?
https://boinc.bakerlab.org/rosetta/result.php?resultid=18194769
ID: 14841 · Rating: 1
Aglarond
Joined: 29 Jan 06
Posts: 26
Credit: 446,212
RAC: 0
Message 14843 - Posted: 28 Apr 2006, 9:21:58 UTC
Last modified: 28 Apr 2006, 9:25:12 UTC

Hi, I just aborted FA_RLXpt_hom003_1ptq__361_16_3. It had been running for more than 12 hours, while my run-time preference is the default (4 hours). Also, Rhiju suggested in this post that WUs with 1ptq in the name should be aborted. Why did I receive a WU that was reported as bad about 2 weeks ago?
ID: 14843 · Rating: 0
Jimi@0wned.org.uk
Joined: 10 Mar 06
Posts: 29
Credit: 335,252
RAC: 0
Message 14866 - Posted: 28 Apr 2006, 13:23:48 UTC

Going nowhere: this is the last 5.01 on this box, at 1.06% after 38 minutes, with a rising completion time:

Result ID 18432562
Name AB_CASP6_t216__456_5100_0
Workunit 15217912
Created 27 Apr 2006 16:50:33 UTC
Sent 27 Apr 2006 21:04:16 UTC
Received 28 Apr 2006 13:20:57 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 190714
Report deadline 11 May 2006 21:04:16 UTC
CPU time 2293.640625
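The "Exit status" line shows the same number twice: as a signed decimal and as its unsigned 32-bit (two's-complement) hex form. A small Python sketch confirming that correspondence for the two codes reported in this thread; it only checks the arithmetic and makes no claim about what the codes mean to BOINC:

```python
def to_hex32(status: int) -> str:
    """Format a signed exit status as an unsigned 32-bit hex string."""
    return f"0x{status & 0xFFFFFFFF:08x}"


def from_hex32(text: str) -> int:
    """Interpret a 32-bit hex string as a signed (two's-complement) integer."""
    value = int(text, 16)
    return value - (1 << 32) if value & 0x80000000 else value


for status in (-197, -177):
    print(status, to_hex32(status))
```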
ID: 14866 · Rating: 0
Rebel Alliance
Joined: 4 Nov 05
Posts: 50
Credit: 3,579,531
RAC: 0
Message 14883 - Posted: 28 Apr 2006, 16:26:01 UTC

Result ID 18229643
Name HBLR_1.0_1hz6_ROT_TRIALS_TRIE_449_43_2
Workunit 14630563
Created 25 Apr 2006 17:03:34 UTC
Sent 25 Apr 2006 17:10:07 UTC
Received 28 Apr 2006 16:24:09 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 168844
Report deadline 9 May 2006 17:10:07 UTC
CPU time 101762.765625

ID: 14883 · Rating: 0
K1100LTSE
Joined: 28 Feb 06
Posts: 7
Credit: 192,387
RAC: 0
Message 14886 - Posted: 28 Apr 2006, 16:41:06 UTC


ID: 14886 · Rating: 0
TCU Computer Science
Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 14898 - Posted: 28 Apr 2006, 17:54:56 UTC

Four more 5.01 WUs were aborted this morning:

50.1 hrs
https://boinc.bakerlab.org/rosetta/result.php?resultid=18296499
HB_BARCODE_30_5croA_351_21027

51.9 hrs
https://boinc.bakerlab.org/rosetta/result.php?resultid=18296492
HB_BARCODE_30_1a19A_351_28780_3

53.0 hrs
https://boinc.bakerlab.org/rosetta/result.php?resultid=18296362
HBLR_1.0_1dtj_ROT_TRIALS_TRIE_449_27

89.6 hrs
https://boinc.bakerlab.org/rosetta/result.php?resultid=18037119
FA_RLXfn_hom001_1fna__357_63
ID: 14898 · Rating: 0
XS_DDT's_Cattle_Prods
Joined: 24 Mar 06
Posts: 12
Credit: 1,180,072
RAC: 0
Message 14945 - Posted: 29 Apr 2006, 1:45:13 UTC

So, are all of the aborted and stuck WUs being granted 300 points, no matter the computation time?
ID: 14945 · Rating: 0
[DPC]Division_Brabant~OldButNotSoWise
Joined: 23 Jan 06
Posts: 42
Credit: 371,797
RAC: 0
Message 14974 - Posted: 29 Apr 2006, 9:20:42 UTC - in response to Message 14945.  

So, are all of the aborted and stuck WUs being granted 300 points, no matter the computation time?


I checked that against my error results.

I think that's indeed what happens; this one crunched for over 5 days.

Result ID 17773392
Name HBLR_1.0_2tif_420_9913_1
Workunit 13429467
Created 20 Apr 2006 21:56:28 UTC
Sent 21 Apr 2006 4:39:59 UTC
Received 26 Apr 2006 19:26:25 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -177 (0xffffff4f)
Computer ID 147219
Report deadline 5 May 2006 4:39:59 UTC
CPU time 369442.734375
stderr out

<core_client_version>5.3.12.tx36</core_client_version>
<message>Maximum CPU time exceeded
</message>
<stderr_txt>
# random seed: 1574792
# cpu_run_time_pref: 7200

</stderr_txt>

Validate state Invalid
Claimed credit 1792.64206533905
Granted credit 300
application version 5.01
ID: 14974 · Rating: 0
Hassan
Joined: 7 Mar 06
Posts: 4
Credit: 750,146
RAC: 0
Message 15026 - Posted: 29 Apr 2006, 17:32:21 UTC - in response to Message 14956.  

So, are all of the aborted and stuck WUs being granted 300 points, no matter the computation time?


As far as I have been able to determine, the work units are granted what they claim. If they do not claim any credit (mostly from Win98 machines), then I think they have been getting 30 credits. But I have nothing definitive from the project on this; it is from my own observations, so don't hold the project to these numbers until we hear from them directly.


Result ID 18122796
Name HBLR_1.0_1dtj_420_3452_3
Workunit 13389346
Created 24 Apr 2006 15:15:38 UTC
Sent 24 Apr 2006 20:33:06 UTC
Received 27 Apr 2006 5:24:19 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -177 (0xffffff4f)
Computer ID 175797
Report deadline 8 May 2006 20:33:06 UTC
CPU time 130784.75
stderr out

<core_client_version>5.2.13</core_client_version>
<message>Maximum CPU time exceeded
</message>
<stderr_txt>
# random seed: 1597253
# cpu_run_time_pref: 7200
# random seed: 1597253
# random seed: 1597253

</stderr_txt>


Validate state Invalid
Claimed credit 1165.54456988046
Granted credit 300
application version 5.01

ID: 15026 · Rating: 0
XS_lv_dicedealer
Joined: 3 Jan 06
Posts: 16
Credit: 1,761,309
RAC: 0
Message 15044 - Posted: 29 Apr 2006, 21:07:31 UTC

Here is another stuck 5.01 WU:

https://boinc.bakerlab.org/rosetta/result.php?resultid=18392691

I have since exorcised my farm of the 5.01s and the 5.06s; this one slipped past me, though.

Thanks for all your hard work getting these snags worked out; the R@H team deserves a pat on the back for trying to get this fixed so quickly for us crunchers.
ID: 15044 · Rating: 0
belldandy from pleiades
Joined: 2 Nov 05
Posts: 6
Credit: 102,731
RAC: 0
Message 15082 - Posted: 30 Apr 2006, 12:07:28 UTC

Two WUs that I aborted because they were taking wayyyy too much time. They didn't hang, though.

https://boinc.bakerlab.org/rosetta/result.php?resultid=17827510
FACONTACTS_NOFILTERS_1r69__441_248_1

https://boinc.bakerlab.org/rosetta/result.php?resultid=17773776
HBLR_1.0_2tif_420_9927_1
Campeones everywhere!
ID: 15082 · Rating: 0
[DPC]Alexcj
Joined: 21 Mar 06
Posts: 3
Credit: 8,374
RAC: 0
Message 15319 - Posted: 2 May 2006, 19:31:27 UTC

I also have a WU that is taking WAY too long to complete.
It is progressing, though, and I would like to see it finished.
It's HB_BARCODE_30_1bm8__351_34196_3,
although I don't think it is going to finish in time.

Is it helpful for the project to let it progress as much as possible?

ID: 15319 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 15329 - Posted: 2 May 2006, 19:57:46 UTC - in response to Message 15319.  

Is it helpful for the project to let it progress as much as possible?

Alex, in general, yes, it's helpful. In your case, your other work units are running 10,000 seconds, and it looks like you aborted that one after 443,000 seconds! 5+ days! Unless you changed your preference to 4 days, you were more than patient with that one.
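The result pages quoted in this thread list CPU time in seconds, which makes figures like these hard to compare at a glance. A trivial conversion sketch; the helper name is made up for illustration:

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400


def cpu_seconds_to_hours_days(seconds: float) -> tuple[float, float]:
    """Return the same CPU time expressed in hours and in days."""
    return seconds / SECONDS_PER_HOUR, seconds / SECONDS_PER_DAY


for label, seconds in [("typical WU", 10_000), ("aborted WU", 443_000)]:
    hours, days = cpu_seconds_to_hours_days(seconds)
    print(f"{label}: {seconds:,} s = {hours:.1f} h = {days:.2f} days")
```

443,000 s works out to about 123 hours, just over 5 days, consistent with the "5+ days" above, while 10,000 s is under 3 hours.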

I see it was crunched on release 5.01. The newer release has the "watchdog", which should detect WUs such as this and end them much sooner, saving you those days of wondering.

So, you did the right things here: you were patient and didn't end it in a sudden panic after 2 hrs and 1 minute, you reported it here, and you're now crunching more WUs. I believe you will find the current release has resolved problems like this as well, so it shouldn't happen again.

Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS or Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 15329 · Rating: 0



©2025 University of Washington
https://www.bakerlab.org