Report stuck & aborted WU here please - II

Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 13331 - Posted: 9 Apr 2006, 16:29:43 UTC
Last modified: 12 May 2006, 20:38:54 UTC

This thread is for reporting Workunits that have hung (1% error), or that have been manually aborted for some reason. Please include the type of error in your report, and a link to the RESULT in your stats page. This thread replaces part one which is located here.
Stephenish

Joined: 26 Feb 06
Posts: 3
Credit: 757,327
RAC: 0
Message 13340 - Posted: 9 Apr 2006, 17:50:41 UTC

4/8/2006 2:46:40 PM|rosetta@home|Unrecoverable error for result HBLR_1.0_1di2_426_4794_0 ( - exit code -1073741819 (0xc0000005))
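
For anyone decoding log lines like the one above: the client prints the Windows exit status as a signed 32-bit integer, followed by the same value in hex. Below is a minimal Python sketch of that conversion. The function name and the lookup table are illustrative only (they are not part of BOINC or Rosetta), and the table holds just two standard NTSTATUS names rather than a complete list.

```python
# Illustrative helper, not part of BOINC or Rosetta: reinterpret a signed
# 32-bit exit code as the unsigned NTSTATUS value shown in the log.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}


def decode_exit_code(code: int) -> tuple[str, str]:
    """Return (hex string, status name) for a signed 32-bit exit code."""
    unsigned = code & 0xFFFFFFFF  # two's-complement view of the signed value
    name = NTSTATUS_NAMES.get(unsigned, "unknown")
    return f"0x{unsigned:08x}", name


print(decode_exit_code(-1073741819))
```

An access violation only says that the application touched memory it should not have; it does not by itself identify the cause.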

CremionisD

Joined: 10 Mar 06
Posts: 9
Credit: 37,604,006
RAC: 0
Message 13342 - Posted: 9 Apr 2006, 18:14:27 UTC - in response to Message 13331.  

Work unit aborted at 1.04% - CPU time used ~16 hours 30 minutes.

WU Name "FA_RLXpt_hom004_1ptq__361_478_1" - Application "rosetta 4.83"
Workunit = 11845498; Result ID = 16262949; System = AMD AXP 2400+, Win-XP SP 2

The workunit still reports "in progress" at the time of writing this message.
The workunit was aborted manually ("Aborted via GUI RPC").
[DPC]Charley

Joined: 18 Mar 06
Posts: 9
Credit: 295,915
RAC: 0
Message 13363 - Posted: 9 Apr 2006, 21:23:41 UTC
Last modified: 9 Apr 2006, 21:24:17 UTC

Workunit: FA_RLXpt_hom007_1ptq__361_230
Reason: Stuck at 1.042% (after almost 9 hours)
Stop: Manual
Link: Workunit - Result

[DPC]Alexcj

Joined: 21 Mar 06
Posts: 3
Credit: 8,374
RAC: 0
Message 13366 - Posted: 9 Apr 2006, 21:54:33 UTC
Last modified: 9 Apr 2006, 21:57:18 UTC

Another two stuck WUs, both at 1.04%.

The two stuck units:
FARELAX_NOFILTERS_1bq9A_417_622
and
FARELAX_NOFILTERS_1cg5B_417_562

The machine they were crunched on.

Good luck in hunting the bug(s) down!
Mikkie

Joined: 1 Apr 06
Posts: 9
Credit: 5,700
RAC: 0
Message 13370 - Posted: 9 Apr 2006, 23:33:09 UTC
Last modified: 9 Apr 2006, 23:44:40 UTC

By chance I saw what, in my view, caused this error. In the graphical representation, model 4 finished after 3:15 hours at 59%, but when model 5 started, the percentage instantly dropped back to 38%. A few seconds later I got the error message below.
Using r@h 4.98

https://boinc.bakerlab.org/rosetta/result.php?resultid=16733729
2006-04-10 01:04:43 [rosetta@home] Unrecoverable error for result FARELAX_NOFILTERS_1lis__427_426_0 ( - exit code -1073741811 (0xc000000d))
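
When collecting many such reports, the result name and exit code can be pulled out of the log line mechanically. The following is a rough Python sketch; the regular expression is based only on the two log lines quoted in this thread, and the function name is made up for illustration.

```python
import re

# Pattern inferred from the log lines quoted in this thread; other client
# versions may format the message differently.
PATTERN = re.compile(
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\( - exit code (?P<code>-?\d+) \((?P<hex>0x[0-9a-fA-F]+)\)\)"
)


def parse_error_line(line: str):
    """Return result name, signed exit code and hex code, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {"result": m["result"], "exit_code": int(m["code"]), "hex": m["hex"]}


line = (
    "2006-04-10 01:04:43 [rosetta@home] Unrecoverable error for result "
    "FARELAX_NOFILTERS_1lis__427_426_0 ( - exit code -1073741811 (0xc000000d))"
)
print(parse_error_line(line))
```

The same pattern also matches the older pipe-separated log format, because it only looks at the text from "Unrecoverable error" onward.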
Snake Doctor

Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 13378 - Posted: 10 Apr 2006, 4:07:08 UTC - in response to Message 13342.  

The FA_Rlx Workunits take a long time to complete a single model, usually over 4 hours. During that time they will only show 1.xx% complete. You should not be aborting them just because they take a while to run.
Team TMR

Joined: 2 Nov 05
Posts: 21
Credit: 1,583,679
RAC: 0
Message 13381 - Posted: 10 Apr 2006, 9:03:25 UTC

WU 16856997 was aborted after 7 hours, when it was stuck on about 1.36%.

I have 3 more that seem to be stuck near 1% after an hour, but I won't abort them until they pass 2 hours or so.
Delk

Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 13384 - Posted: 10 Apr 2006, 10:36:13 UTC
Last modified: 10 Apr 2006, 11:13:07 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=16873178
https://boinc.bakerlab.org/rosetta/result.php?resultid=16860083

Both aborted at 1% after no progress.

What's with these new work units? I'm now seeing what appear to be 1% errors on Linux systems that were previously error-free. This, added to yesterday's lost work and credit from all the Windows systems, is a little frustrating.

Conan

Joined: 11 Oct 05
Posts: 150
Credit: 3,801,540
RAC: 2,804
Message 13385 - Posted: 10 Apr 2006, 10:53:34 UTC

This is a bit different from the 1% errored work units: I have just aborted 2 WUs that had been processing for 3 days.
The WUs in question are FARELAX_NOFILTERS_1c9oA_417_15_0
and FARELAX_NOFILTERS_1e6iA_417_15_0
I stopped the first one at about 95% and the second, I think, at about 56%. While they were running nothing was happening; not even the time was ticking over. I have a third unit, starting with HBLR_1.0_, that appears stuck at 92%, also after about 3 days. I have had no Rosetta output from one machine for 2 days and reduced output from my other two, due to more than a dozen Unrecoverable Errors across the machines, all since 8/9 April when the new units started to be issued.
My machines are all current-model Opterons and X2 dual cores, so they are not so slow that it takes days to process WUs.
sillytom

Joined: 13 Dec 05
Posts: 1
Credit: 38,013
RAC: 0
Message 13403 - Posted: 10 Apr 2006, 17:41:05 UTC

I aborted the work unit

FARELAX_NOFILTERS_1e6iA_413_113

after it hung for hours at 1.04% and then for a full day at 38%.

Besides this WU I have had few problems.
Cobra

Joined: 9 Nov 05
Posts: 7
Credit: 16,121,166
RAC: 3,359
Message 13433 - Posted: 11 Apr 2006, 3:40:12 UTC

I have had a work unit stuck at ~32.9% for what I think is several days (I did not note the name of the work unit at first, so I cannot be 100% sure it's the same one). CPU clock cycles are being consumed as normal (95-99%), and in the BOINC Manager, CPU time is incrementing. The problem is that "To completion" is incrementing just as fast, and the Progress is not incrementing (though it sometimes seems to fluctuate between 32.90 and 32.94%).

I have seen this work unit (if it's the same one) showing CPU time ~21:00:00 and time to completion as ~19:00:00. However, if I suspend calculation on that work unit, then resume, the times reset to 39:39 CPU time and ~1:45:00 To completion, then both proceed to count up from there again. (The same thing happens if I kill all the BOINC processes and restart them--CPU time resets to ~39:39, and To completion resets to ~1:45:00.)

The workunit in question is FA_RLXpt_hom002_1ptq__361_178_1 (workunit ID 11695526).

I will give the WU one more night before I abort it.
AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 13436 - Posted: 11 Apr 2006, 4:31:29 UTC - in response to Message 13433.  

Go ahead and abort it. If you look at the WU's creation date it's March 20. The WUs created back in March don't have the timeout enabled and they often cause trouble. WUs created in April should end after 24 hours or so of CPU, even if they are stuck.

This WU was aborted by someone else and was then sent out again.
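
To make the timeout idea above concrete, here is a rough Python sketch of a check that ends a work unit after a CPU-time limit and also flags one whose progress has stopped moving. This is only an illustration of the general idea, not Rosetta's or BOINC's actual code: the 24-hour limit is taken from the post above, while the function name, the 2-hour stall window, and the 0.05-point tolerance are assumptions chosen for the example.

```python
# Hypothetical sketch; not the real Rosetta/BOINC timeout logic.
CPU_LIMIT_S = 24 * 3600      # "24 hours or so of CPU", per the post above
STALL_WINDOW_S = 2 * 3600    # assumed: how long progress may sit still
TOLERANCE = 0.05             # assumed: ignore tiny progress fluctuations


def check_workunit(samples, cpu_limit_s=CPU_LIMIT_S,
                   stall_window_s=STALL_WINDOW_S, tolerance=TOLERANCE):
    """samples: list of (cpu_seconds, progress_percent), oldest first.

    Returns "timed_out", "stalled" or "running".
    """
    if not samples:
        return "running"
    cpu_now, progress_now = samples[-1]
    if cpu_now >= cpu_limit_s:
        return "timed_out"
    # Walk backwards to find how long progress has stayed within tolerance.
    stalled_since = cpu_now
    for cpu, progress in reversed(samples):
        if abs(progress - progress_now) > tolerance:
            break
        stalled_since = cpu
    if cpu_now - stalled_since >= stall_window_s:
        return "stalled"
    return "running"


history = [(0, 1.0), (3600, 32.90), (7200, 32.94), (10800, 32.92)]
print(check_workunit(history))
```

A fixed stall window is a blunt instrument: as noted earlier in this thread, some work units legitimately sit near 1% for several hours while the first model is computed, so any real threshold would have to depend on the work unit type.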
Robinski

Joined: 7 Mar 06
Posts: 51
Credit: 85,383
RAC: 0
Message 13442 - Posted: 11 Apr 2006, 9:48:17 UTC

I got a WU that had been running for about an hour at 1.04%.
No movement in the graphics; restarted it, same result.

This WU is broken:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13912844
result:
https://boinc.bakerlab.org/rosetta/result.php?resultid=16975198
Member of the Dutch Power Cows

Trying to get the world on IPv6, do you have it? check here: IPv6.RHarmsen.nl
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 13443 - Posted: 11 Apr 2006, 10:00:37 UTC - in response to Message 13442.  

I had a WU that was stuck at 1.03% for an hour and then jumped to 25% (target time 4 hours). I think it was this one:
https://boinc.bakerlab.org/rosetta/result.php?resultid=16891442

Perhaps waiting at least two hours should be recommended.
Jose

Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 13444 - Posted: 11 Apr 2006, 10:35:04 UTC
Last modified: 11 Apr 2006, 10:44:55 UTC

Of the 4 computing errors I reported in another thread, this one was the most frustrating:

16811046 13764140 9 Apr 2006 10:36:23 UTC 11 Apr 2006 7:01:19 UTC Over Client error Computing 12,238.19 37.94 ---

It was stuck at 1.5% for more than 25 hours, and THEN it restarted computing at 0% (it started from scratch), only to end in a computing error. The time reported with the error was the time spent in the second attempt. The work unit was a full-atom relax, so more than 30 hours of CPU time went down the proverbial toilet.
“This and no other is the root from which a Tyrant springs; when he first appears he is a protector.”
Plato
Jose

Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 13445 - Posted: 11 Apr 2006, 10:43:41 UTC

BTW, the work unit my computer is working on seems to be going the same route.

Workunit 13908198 TRUNCATE_TERMINI_FULLRELAX_1fna_433_105_O

With more than 3 hours of CPU time invested (3:16:43), it is stuck at 1.02% completion with more than 11 hours to completion, and the quirk is that the more CPU time it reports, the more time it says it needs to complete.


Jose

Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 13446 - Posted: 11 Apr 2006, 12:02:08 UTC - in response to Message 13445.  

7:59 AM AST
It is now reporting 1.02% at 4:24:56 CPU time, and it is showing a higher time to completion than before (12:10:29). I will give this one more chance. But it seems it is stuck, and in the end it will be another large chunk of time wasted. [Insert very annoyed emoticon here]
Jose

Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 13455 - Posted: 11 Apr 2006, 13:40:31 UTC - in response to Message 13446.  

9:40 AM AST
I decided to abort the unit, as it stayed stuck at 1.02% and still showed a higher time to completion.
Robinski

Joined: 7 Mar 06
Posts: 51
Credit: 85,383
RAC: 0
Message 13456 - Posted: 11 Apr 2006, 13:48:31 UTC - in response to Message 13455.  

I have one too at the moment: 1.04%, now running for 2 hours.
I'll give it another 30 minutes.

It is TRUNCATE_TERMINI_FULLRELAX_1ptq__433_291:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13922903

The result will be here: https://boinc.bakerlab.org/rosetta/result.php?resultid=16986721