Message boards : Number crunching : Report stuck & aborted WU here please
Author | Message |
---|---|
Stwato Joined: 11 Jan 06 Posts: 150 Credit: 655,634 RAC: 0 |
I had a work unit stuck on 1% for over 3 hours. It was FA_RLXbg_hom004_1bgf_359_376_0. If another WU gets stuck, should I abort it (I aborted this one, sorry if that was wrong)? [Edit: wrote down wrong work unit, oops] |
Steve Shedroff Joined: 7 Nov 05 Posts: 11 Credit: 250,657 RAC: 0 |
I aborted the following WU; it appeared stuck at 1% after 44 hours, with 59 hours to go, and did not appear to be progressing. 3/15/2006 6:54:16 PM|rosetta@home|Unrecoverable error for result FA_RLXai_hom022_1ail__359_179_0 (aborted via GUI RPC) Regards, Steve |
Ib Rasmussen Joined: 27 Sep 05 Posts: 16 Credit: 211,416 RAC: 0 |
This morning my machine ID 5309, which only runs Rosetta, showed an empty BOINC Manager - no work, no projects. stderrdae.txt says: ***UNHANDLED EXCEPTION**** Reason: Access Violation (0xc0000005) at address 0x0038F114 read attempt to address 0x00000008 1: 03/15/06 22:02:49 1: SymGetLineFromAddr(): GetLastError = 126 stdoutdae.txt says (with some editing): 2006-03-15 20:08:46 [rosetta@home] Starting result FA_RLXb3_hom003_1b3aA_359_477_0 using rosetta version 482 ... 2006-03-15 20:36:39 [rosetta@home] Computation for result FA_RLXbk_hom026_1bk2__359_477_0 finished 2006-03-15 20:36:40 [rosetta@home] Starting result FA_RLXac_hom001_1acf__359_478_0 using rosetta version 482 ... 2006-03-15 21:01:42 [---] request_reschedule_cpus: files downloaded 2006-03-15 22:02:41 [rosetta@home] Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi 2006-03-15 22:02:41 [rosetta@home] Reason: To fetch work 2006-03-15 22:02:41 [rosetta@home] Requesting 1470 seconds of new work 2006-03-15 22:02:46 [rosetta@home] Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded 2006-03-15 22:02:48 [rosetta@home] Started download of hom027_1ig5A.fasta.gz 2006-03-15 22:02:48 [rosetta@home] Started download of hom027_1ig5A.psipred_ss2.gz 2006-03-15 22:02:50 [rosetta@home] Finished download of hom027_1ig5A.psipred_ss2.gz 2006-03-15 22:02:50 [rosetta@home] Throughput 1687 bytes/sec 2006-03-15 22:02:50 [rosetta@home] Started download of hom027_aa1ig5A03_05.200_v1_3.gz After restarting the manager, the two WUs picked up at 87% and 71%. The last WU in the queue: FA_RLXig_hom027_1ig5a_360_73_0 /Ib |
Team TMR Joined: 2 Nov 05 Posts: 21 Credit: 1,583,679 RAC: 0 |
Another one: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11167215 Stuck on 1% after 12 hours. |
UBT - Timbo Joined: 25 Sep 05 Posts: 20 Credit: 2,299,279 RAC: 4 |
This WU was stuck at 1% for a day - then started "on its own" - got to about 70% done and then BOINC switched over to one of the other projects I'm running (BBC CCE) and immediately I got a "computation error" and the percentage went to 100%. 16/03/2006 23:19:42|rosetta@home|Unrecoverable error for result FA_RLXdh_hom025_1dhn__360_62_0 ( - exit code -164 (0xffffff5c)) So, that's more wasted CPU cycles. This project used to be bullet-proof - what's changed....? regards, Tim PS - Our team with 430+ overall members (and at least 130 already joined up to Rosetta) were going to be concentrating on Rosetta for a "Crunching Weekend" on 25th-26th March. See here: http://www.ukboincteam.org.uk/uk-boinc-team.html If this project doesn't get sorted REAL QUICK, we'll be forced to switch our attentions to a different project....! |
m.mitch Joined: 10 Feb 06 Posts: 34 Credit: 1,928,904 RAC: 0 |
I have a work unit that has been stuck at 1% for 10:53:49. I just noticed it. Does anyone want to have a look or should I just restart it? |
xrobert Joined: 28 Oct 05 Posts: 3 Credit: 168,865 RAC: 0 |
The WU I have has got stuck on 1%. FA_RLXai_hom023_1aiu__359_454_0 I've now also aborted another unit which begins with ai_hom, to be safe. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
"This WU was stuck at 1% for a day - then started 'on its own' - got to about 70% done and then BOINC switched over to one of the other projects I'm running (BBC CCE) and immediately I got a 'computation error' and the percentage went to 100%." Believe me, we are trying! |
Dimitris Hatzopoulos Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
WU 11139882 has been stuck for a couple of days, since 15-Mar. ps u -U boinc USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND boinc 27870 0.0 1.3 6800 2944 ? S Jan18 2:07 ./boinc boinc 4280 81.3 0.1 141408 284 ? RN Mar15 2324:55 rosetta_4.81_i68 boinc 4281 0.0 0.1 141408 284 ? SN Mar15 0:00 rosetta_4.81_i686 boinc 4282 0.0 0.1 141408 284 ? SN Mar15 0:00 rosetta_4.81_i686 fgrep pct_ stdout.txt BOINC :: [2006-03-15 16:43:09] :: mode: abrelax :: nstartnm: 1 :: number_of_output: 10 :: num_decoys: 0 :: pct_complete: 0.01 BOINC :: [2006-03-15 16:57:00] :: num_decoys: 1 :: number_of_output: 35 :: pct_complete: 0.0283208 BOINC :: [2006-03-15 17:09:48] :: num_decoys: 2 :: number_of_output: 36 :: pct_complete: 0.0546174 I've killed the task, let's see how it goes... |
Ib Rasmussen Joined: 27 Sep 05 Posts: 16 Credit: 211,416 RAC: 0 |
HB_BARCODE_30_1elwA_351_8105_0 stuck for two hours. Still stuck after restart. Aborted. Computer ID 177498. /Ib |
UBT - Timbo Joined: 25 Sep 05 Posts: 20 Credit: 2,299,279 RAC: 4 |
This project used to be bullet-proof - what's changed....? GREAT NEWS - Perhaps it might be an idea to let people know there is a problem and to stop making work available until it's fixed - that'll take the pressure off you guys. What sort of percentage of the work returned to you is being trashed by this bug? Would imagine it's fairly high - although, if it was an epidemic failure, would assume you would have stopped sending out work before now. But surely you must be going into damage limitation mode by now. Can you afford to lose lots of crunchers? regards and good luck. Tim |
anders n Joined: 19 Sep 05 Posts: 403 Credit: 537,991 RAC: 0 |
|
Divide Overflow Joined: 17 Sep 05 Posts: 82 Credit: 921,382 RAC: 0 |
I'm getting a 1% freeze at step 22183 on this WU: FA_RLXbq_hom030_1bq9A_359_467_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=13795176 Stopping and re-starting BOINC will cause the WU to begin again from step 0 but it always halts at the same place. Aborting unit. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
This project used to be bullet-proof - what's changed....? |
hob. Joined: 4 Nov 05 Posts: 64 Credit: 250,683 RAC: 0 |
3/4/2006 6:51:19 AM|rosetta@home|Starting result ABINITli_hom010_1lis__322_14_1 using rosetta version 482 Restarted BOINC ... unit reset to 13 hours, then stuck again for several hours at the same point ... aborted now ... IMHO not worth sending to anyone else. |
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
|
Tallguy-13088 Joined: 14 Dec 05 Posts: 9 Credit: 843,378 RAC: 0 |
Hello, I just aborted WU: FA_RLXac_hom017_1acf_359_113_0 after two BOINC Mgr. restarts and about 50+ (combined) hours of processing. All three attempts stalled at STEP: 23171. The last ACCEPTED RMSD was 14.75 and the last ACCEPTED ENERGY was 15.47414. Each attempt showed activity for about 1 minute and hung at 1% complete with the ELAPSED TIME continuing to increment. Rosetta version was 4.82 under Win2K. Just an FYI in case someone wants to debug the workunit. |
UBT - Timbo Joined: 25 Sep 05 Posts: 20 Credit: 2,299,279 RAC: 4 |
"The exciting news is that the Boinc consultant we have hired, Rom, has made an improvement in how" Hi David, Well, that ties in with an observation I can make, which I've seen once or twice. I have noticed that Rosetta work units "fail" when my 3GHz P4/HT switches from one project to another - so there seem to be issues when the Rosetta process is "suspended" by BOINC as it switches over to another project. (I'm running BBC CCE as a second simultaneous BOINC project on the same PC - this "switch over problem" tends to occur when one "CPU" switches out of working on a Rosetta WU and over to the CCE WU.) Maybe this is a help - but it seems Rom is on the right trail. regards Tim (edit) typo |
Rom Walton (BOINC) Volunteer moderator Project developer Joined: 17 Sep 05 Posts: 18 Credit: 40,071 RAC: 0 |
The ERR_NESTED_UNHANDLED_EXCEPTION_DETECTED no longer appears on the list at all, and the 0xC0000005 only accounts for 6 of the 49 errors reported in the last 24 hours. If the data from Ralph is any indication of how the application is going to behave on the public project, it should result in a 60%-70% reduction in error rate for the public project. ----- Rom |
Angus Joined: 17 Sep 05 Posts: 412 Credit: 321,053 RAC: 0 |
Isn't the biggest failure mode still the "stuck at 1%" issue? Do those even get reported as errors, outside of the forum? All the ones I have had have needed to be aborted, after failing to run. Doesn't that just show as "aborted by user", effectively hiding the scope of the problem? |
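
For anyone who wants to spot a stuck WU without watching the manager, here is a minimal Python sketch based on the `pct_complete` lines Dimitris quoted from `stdout.txt` earlier in the thread. It is not part of Rosetta or BOINC: the function names (`parse_progress`, `last_advance`, `is_stalled`) and the 3-hour threshold are assumptions made for illustration, and the line format is assumed to match the quoted lines exactly.

```python
import re
from datetime import datetime, timedelta

# Assumed line format, copied from the lines quoted in the thread:
# BOINC :: [2006-03-15 16:57:00] :: num_decoys: 1 :: ... :: pct_complete: 0.0283208
LINE_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\].*?pct_complete:\s*([0-9.]+)"
)


def parse_progress(lines):
    """Return a list of (timestamp, pct_complete) pairs found in the lines."""
    progress = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            progress.append((stamp, float(match.group(2))))
    return progress


def last_advance(progress):
    """Return the timestamp at which pct_complete last increased, or None."""
    best_pct = None
    best_time = None
    for stamp, pct in progress:
        if best_pct is None or pct > best_pct:
            best_pct, best_time = pct, stamp
    return best_time


def is_stalled(progress, now, max_quiet=timedelta(hours=3)):
    """True if progress was logged but has not advanced within max_quiet.

    The 3-hour default is a hypothetical threshold, not a project setting.
    """
    advanced = last_advance(progress)
    if advanced is None:
        return False  # nothing logged yet, so nothing to judge
    return now - advanced > max_quiet
```

To try it on a real task, read the lines from the task's `stdout.txt` (for example `parse_progress(open("stdout.txt"))`) and pass `datetime.now()` as `now`. A check like this only flags the task; whether to abort it is still the judgment call discussed above.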
©2025 University of Washington
https://www.bakerlab.org