Message boards : Number crunching : Report stuck & aborted WU here please - II
Author | Message |
---|---|
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
This thread is for reporting Workunits that have hung (1% error), or that have been manually aborted for some reason. Please include the type of error in your report, and a link to the RESULT in your stats page. This thread replaces part one which is located here. Moderator9 ROSETTA@home FAQ Moderator Contact |
Stephenish Send message Joined: 26 Feb 06 Posts: 3 Credit: 757,327 RAC: 0 |
4/8/2006 2:46:40 PM|rosetta@home|Unrecoverable error for result HBLR_1.0_1di2_426_4794_0 ( - exit code -1073741819 (0xc0000005)) |
CremionisD Send message Joined: 10 Mar 06 Posts: 9 Credit: 37,604,006 RAC: 0 |
Work unit aborted at 1.04% - CPU time used ~16 hours 30 minutes. WU Name "FA_RLXpt_hom004_1ptq__361_478_1" - Application "rosetta 4.83" Workunit = 11845498; Result ID = 16262949; System = AMD AXP 2400+, Win-XP SP 2 The workunit still reports "in progress" at the time of writing this message. The workunit was aborted manually ("Aborted via GUI RPC"). |
[DPC]Charley Send message Joined: 18 Mar 06 Posts: 9 Credit: 295,915 RAC: 0 |
|
[DPC]Alexcj Send message Joined: 21 Mar 06 Posts: 3 Credit: 8,374 RAC: 0 |
Another two stuck WUs, both at 1.04%. The two stuck units: FARELAX_NOFILTERS_1bq9A_417_622 and FARELAX_NOFILTERS_1cg5B_417_562 (machine where they were crunched on). Good luck in hunting the bug(s) down! |
Mikkie Send message Joined: 1 Apr 06 Posts: 9 Credit: 5,700 RAC: 0 |
By chance I saw what, in my view, caused this error. On the graphical display, model 4 finished in 3:15 hours at 59%, but when model 5 was starting the percentage instantly dropped back to 38%. A few seconds after that I got the error message below. Using r@h 4.98 https://boinc.bakerlab.org/rosetta/result.php?resultid=16733729 2006-04-10 01:04:43 [rosetta@home] Unrecoverable error for result FARELAX_NOFILTERS_1lis__427_426_0 ( - exit code -1073741811 (0xc000000d)) |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
Work unit aborted at 1.04% - CPU time used ~16 hours 30 minutes. The FA_Rlx Workunits take a long time to complete a single model, usually over 4 hours. During that time they will only show 1.xx% complete. You should not be aborting them just because they take a while to run. |
Team TMR Send message Joined: 2 Nov 05 Posts: 21 Credit: 1,583,679 RAC: 0 |
WU 16856997 was aborted after 7 hours, when it was stuck on about 1.36%. I have 3 more that seem to be stuck near 1% after an hour, but I won't abort them until they pass 2 hours or so. |
Delk Send message Joined: 20 Feb 06 Posts: 25 Credit: 995,624 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=16873178 https://boinc.bakerlab.org/rosetta/result.php?resultid=16860083 Both aborted at 1% after no progress. What's with these new work units? I'm now seeing what appear to be 1% errors on Linux systems that were previously error free. This, added to yesterday's lost work and credit from all the Windows systems, is a little frustrating. |
Conan Send message Joined: 11 Oct 05 Posts: 150 Credit: 4,208,824 RAC: 1,754 |
This is a bit different to the 1% errored Work Units. I have just aborted 2 WUs that had been processing for 3 days. The WUs in question are FARELAX_NOFILTERS_1c9oA_417_15_0 and FARELAX_NOFILTERS_1e6iA_417_15_0. I stopped the first one at about 95% and the second, I think, at about 56%. While they were running nothing was happening; not even the time was ticking over. I have a 3rd unit, starting with HBLR_1.0_, that also appears stuck at 92% after about 3 days. I have had no Rosetta output from one machine for 2 days and reduced output from my other two, due to more than a dozen Unrecoverable Errors across the machines, only since the 8th/9th of April when the new units started to be issued. My machines are all current-model Opterons and X2 dual cores, so they are not so slow that it takes days to process WUs. |
sillytom Send message Joined: 13 Dec 05 Posts: 1 Credit: 38,013 RAC: 0 |
I aborted the work unit FARELAX_NOFILTERS_1e6iA_413_113 after it hung for hours at 1.04% and then for a full day at 38%. Besides this WU I have had few problems. |
Cobra Send message Joined: 9 Nov 05 Posts: 7 Credit: 16,586,367 RAC: 2,435 |
I have had a work unit stuck at ~32.9% for what I think is several days (I did not note the name of the work unit at first, so I cannot be 100% sure it's the same one). CPU clock cycles are being consumed as normal (95-99%), and in the BOINC Manager, CPU time is incrementing. The problem is that "To completion" is incrementing just as fast, and the Progress is not incrementing (though it sometimes seems to fluctuate between 32.90% and 32.94%). I have seen this work unit (if it's the same one) showing CPU time ~21:00:00 and time to completion ~19:00:00. However, if I suspend calculation on that work unit and then resume, the times reset to 39:39 CPU time and ~1:45:00 To completion, then both proceed to count up from there again. (The same thing happens if I kill all the BOINC processes and restart them: CPU time resets to ~39:39, and To completion resets to ~1:45:00.) The workunit in question is FA_RLXpt_hom002_1ptq__361_178_1 (workunit ID 11695526). I will give the WU one more night before I abort it. |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
The workunit in question is FA_RLXpt_hom002_1ptq__361_178_1 (workunit ID 11695526). Go ahead and abort it. If you look at the WU's creation date it's March 20. The WUs created back in March don't have the timeout enabled and they often cause trouble. WUs created in April should end after 24 hours or so of CPU, even if they are stuck. This WU was aborted by someone else and was then sent out again. |
Robinski Send message Joined: 7 Mar 06 Posts: 51 Credit: 85,383 RAC: 0 |
I got a WU that had been running for about an hour at 1.04%. No movement in the graphics; I restarted it, same result. This WU is broken: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13912844 result: https://boinc.bakerlab.org/rosetta/result.php?resultid=16975198 Member of the Dutch Power Cows Trying to get the world on IPv6, do you have it? check here: IPv6.RHarmsen.nl |
tralala Send message Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
I got a WU that had been running for about an hour, with 1,04% I had a WU which was stuck at 1.03% for an hour and then jumped to 25% (target time 4 hours). I think it was this: https://boinc.bakerlab.org/rosetta/result.php?resultid=16891442 Perhaps waiting at least two hours should be recommended. |
Jose Send message Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
Of the 4 computing errors I reported in another thread, this one was the most frustrating: 16811046 13764140 9 Apr 2006 10:36:23 UTC 11 Apr 2006 7:01:19 UTC Over Client error Computing 12,238.19 37.94 --- It got stuck on 1.5% for more than 25 hours and THEN it restarted computing back at 0% (it started from scratch), just to end in a computing error. The time reported with the error was the time spent in the second attempt. The type of project was a FULL ATOM Relax. So more than 30 hours of CPU time went down the proverbial toilet. "This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato |
Jose Send message Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
BTW the work Unit my computer is working on seems to be going the same route. Workunit 13908198 TRUNCATE_TERMINI_FULLRELAX_1fna_433_105_O: with more than 3 hours of CPU time used (3:16:43), it is stuck at 1.02% completion with more than 11 hours to complete, and the quirk is that as more CPU time is reported it continues to report more time needed for completion. "This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato |
Jose Send message Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
BTW the work Unit my computer is working on seems to be going the same route. 7:59 AM AST: It is now reporting 1.02% at 4:24:56 CPU time, and it is showing a higher time to completion than before (12:10:29). I will give this one more chance. But it seems it is stuck and, in the end, it will be another large chunk of time wasted. [Insert very annoyed emoticon here] "This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato |
Jose Send message Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
BTW the work Unit my computer is working on seems to be going the same route. 9:40 AM AST: I decided to abort the unit as it stayed stuck on 1.02%, still with a higher time to completion. "This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato |
Robinski Send message Joined: 7 Mar 06 Posts: 51 Credit: 85,383 RAC: 0 |
BTW the work Unit my computer is working on seems to be going the same route. I have got one too at this moment: 1.04% after running for 2 hours. I'll give it another 30 minutes. It is the TRUNCATE_TERMINI_FULLRELAX_1ptq__433_291 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13922903 result will be here: https://boinc.bakerlab.org/rosetta/result.php?resultid=16986721 Member of the Dutch Power Cows Trying to get the world on IPv6, do you have it? check here: IPv6.RHarmsen.nl |
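
Several posts above describe the same symptom: CPU time keeps counting up while the progress percentage stays flat (or wobbles by a few hundredths of a percent), and the replies suggest waiting at least two hours, or more than four for the FA_Rlx units, before aborting. A rough sketch of that rule of thumb in Python follows. It is only an illustration: the function name, the sample format, and the default thresholds are assumptions made here, not part of BOINC or Rosetta, and the readings would have to be noted by hand from the BOINC Manager.

```python
from typing import Sequence, Tuple

# One manual reading from the BOINC Manager: (CPU seconds, progress in percent).
# This format is hypothetical; BOINC does not export it in this shape.
Sample = Tuple[float, float]


def looks_stuck(
    samples: Sequence[Sample],
    min_stalled_cpu_s: float = 4 * 3600,  # assumed default, from the "over 4 hours per model" remark
    tolerance_pct: float = 0.05,          # ignore tiny wobbles such as 32.90 to 32.94
) -> bool:
    """Return True if progress has stayed flat for at least min_stalled_cpu_s of CPU time."""
    if len(samples) < 2:
        return False
    last_cpu, last_pct = samples[-1]
    stalled_since = last_cpu
    # Walk backwards until a reading differs noticeably from the latest progress value.
    for cpu, pct in reversed(samples):
        if abs(pct - last_pct) > tolerance_pct:
            break
        stalled_since = cpu
    # If CPU time was reset by a restart, the difference can be negative and we report False.
    return (last_cpu - stalled_since) >= min_stalled_cpu_s


if __name__ == "__main__":
    readings = [(0, 1.04), (3600, 1.04), (7200, 1.04)]
    print(looks_stuck(readings))                          # default 4-hour threshold
    print(looks_stuck(readings, min_stalled_cpu_s=7200))  # stricter 2-hour threshold
```

The threshold is a parameter on purpose, because the thread does not agree on a single waiting time, and a unit that jumps from 1.03% to 25% after an hour would not be flagged.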
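
The "Unrecoverable error" log lines quoted earlier in the thread give each exit code twice, as a negative decimal and as hex, for example -1073741819 (0xc0000005). Both are the same 32-bit value read as signed and unsigned. The short Python sketch below shows the conversion. The helper names are made up for this illustration, and reading the two values as Windows NTSTATUS codes is an assumption, since the posts do not say which operating system reported them.

```python
# Names for the two codes that appear in this thread, taken from the Windows
# NTSTATUS list. Treating these exit codes as NTSTATUS values is an assumption.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}


def exit_code_to_hex(code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{code & 0xFFFFFFFF:08x}"


def describe_exit_code(code: int) -> str:
    """Return the hex form plus a name if the code is one of the two listed above."""
    unsigned = code & 0xFFFFFFFF
    name = NTSTATUS_NAMES.get(unsigned, "unknown")
    return f"{exit_code_to_hex(code)} ({name})"


if __name__ == "__main__":
    for code in (-1073741819, -1073741811):
        print(code, "->", describe_exit_code(code))
```

Including the hex form in a report makes the error easier to search for than the negative decimal.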