Message boards : Number crunching : Report stuck & aborted WU here please
Author | Message |
---|---|
Team TMR Send message Joined: 2 Nov 05 Posts: 21 Credit: 1,583,679 RAC: 0 |
I've got two more somewhere that are stuck on 1.04% after 2 hours and 1.5 hours. |
Delk Send message Joined: 20 Feb 06 Posts: 25 Credit: 995,624 RAC: 0 |
Another frozen workunit: name: FA_RLXpt_hom006_1ptq__361_80_1; WU name: FA_RLXpt_hom006_1ptq__361_80; app version num: 483; checkpoint CPU time: 2112.812500; current CPU time: 121543.171875; fraction done: 0.293420; VM usage: 0.000000; resident set size: 0.000000; estimated CPU time remaining: 89449.556235; result id: 15996230; workunit id: 11646530. It was meant to be a 2-hour unit, but obviously things went wrong; I only noticed the issue 121543 seconds later. |
Delk Send message Joined: 20 Feb 06 Posts: 25 Credit: 995,624 RAC: 0 |
In regard to my previous post, shouldn't the max unit runtime of 24 hours have aborted it automatically, or has that been removed? |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
"In regard to my previous post, shouldn't the max unit runtime of 24 hours have aborted it automatically, or has that been removed?" It should have; we will try to figure out why these weren't terminated. Also, the reports here will be very helpful in pinning down the problem. |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I think all these FA_* WUs are old. Someone else aborted them, so they were sent out again. You can check the creation date on the WUs page, and anything created in March should be aborted if it seems stuck. The 24-hour timeout is only in newer WUs, which started coming out at the end of March. |
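For anyone who wants to sanity-check their own numbers against the 24-hour cap mentioned above, here is a minimal sketch. It is not how BOINC or Rosetta actually enforces the limit; the function name `looks_stuck` and the idea of comparing current CPU time to the last checkpoint are assumptions for illustration. The sample values are the checkpoint and current CPU times from Delk's report.

```python
# Hypothetical helper, not part of BOINC or Rosetta: a rough check of
# whether a task has run past the 24-hour cap discussed in this thread.
MAX_RUNTIME_S = 24 * 3600  # 24 hours, in seconds


def looks_stuck(checkpoint_cpu_s, current_cpu_s, max_runtime_s=MAX_RUNTIME_S):
    """Return (over_cap, seconds_since_checkpoint) for one task."""
    since_checkpoint = current_cpu_s - checkpoint_cpu_s
    over_cap = current_cpu_s > max_runtime_s
    return over_cap, since_checkpoint


# Values copied from the FA_RLXpt_hom006_1ptq__361_80_1 report above.
over_cap, gap = looks_stuck(2112.8125, 121543.171875)
print("over 24h cap:", over_cap)
print("hours since last checkpoint:", round(gap / 3600, 1))
```

With those values the task is well past the cap (121543 seconds is almost 34 hours), and nearly all of that time came after the last checkpoint, which fits the "frozen" description.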
Dave Wilson Send message Joined: 8 Jan 06 Posts: 35 Credit: 379,049 RAC: 0 |
Just found https://boinc.bakerlab.org/rosetta/result.php?resultid=16005553 Sorry, I did not get the rest of the info, but it was stuck at around 17 hours and 34.--- % |
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=15702999 |
Mikkie Send message Joined: 1 Apr 06 Posts: 9 Credit: 5,700 RAC: 0 |
Somebody needs to find a solution for this series. If this goes on, it's no fun anymore. All crashed after more than an hour of crunching. 2006-04-07 12:03:15 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_424_1768_0 (aborted via GUI RPC) 2006-04-07 14:28:40 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_3457_0 (aborted via GUI RPC) 2006-04-07 18:49:47 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_6531_0 (aborted via GUI RPC) 2006-04-08 10:29:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1b72_424_2309_1 (aborted via GUI RPC) 2006-04-08 15:26:29 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_426_2934_0 (aborted via GUI RPC) 2006-04-08 15:32:46 [rosetta@home] Unrecoverable error for result HBLR_1.0_1di2_426_4357_0 ( - exit code -1073741819 (0xc0000005)) 2006-04-08 15:41:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1r69_426_325_1 ( - exit code -1073741819 (0xc0000005)) |
Mikkie Send message Joined: 1 Apr 06 Posts: 9 Credit: 5,700 RAC: 0 |
Just a few seconds after I posted this, the next one went down the drain. 2006-04-08 16:32:43 [rosetta@home] Unrecoverable error for result HBLR_1.0_1mky_426_4765_0 (aborted via GUI RPC) I'm quitting and going on hold until these problems have been solved and things are stable. I couldn't get a single result or point on the board. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
can you try setting the work unit time to 1 hour? thanks, David |
OldButNotSoWise Send message Joined: 5 Nov 05 Posts: 2 Credit: 0 RAC: 0 |
Unrecoverable error for result HBLR_1.0_1r69_426_2081_0 ( - exit code -1073741819 (0xc0000005)) |
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
"Just some seconds after I post this the next one is down the drain. 2006-04-08 16:32:43 [rosetta@home] Unrecoverable error for result HBLR_1.0_1mky_426_4765_0 (aborted via GUI RPC)" Where it says "aborted via GUI RPC", it means you aborted the work unit and it's just being reported as an error. Unless I'm missing something, you caused this on all but the last two you've listed. "2006-04-07 12:03:15 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_424_1768_0 (aborted via GUI RPC)" |
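To make the distinction above concrete, here is a rough sketch that sorts log lines like the ones posted in this thread into user aborts and genuine crashes. The regular expressions and the names `classify`, `LINE_RE` and `EXIT_RE` are assumptions made for this example, based only on the lines quoted here; they are not an official BOINC log format. The one outside fact used is that the signed exit code -1073741819 is the same 32-bit value as 0xc0000005, the Windows access violation status.

```python
import re
from collections import Counter

# Assumed pattern, derived only from the log lines quoted in this thread.
LINE_RE = re.compile(
    r"Unrecoverable error for result (?P<result>\S+) \((?P<reason>.*)\)\s*$"
)
EXIT_RE = re.compile(r"exit code (-?\d+)")


def classify(line):
    """Return (result_name, category), or None if the line is not an error line."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    reason = m.group("reason")
    if "aborted via GUI RPC" in reason:
        # The user clicked abort; the client merely reports it as an error.
        return m.group("result"), "user_abort"
    code = EXIT_RE.search(reason)
    if code:
        # Reinterpret the signed exit code as an unsigned 32-bit value.
        unsigned = int(code.group(1)) & 0xFFFFFFFF
        return m.group("result"), f"crash_0x{unsigned:08x}"
    return m.group("result"), "other"


# Sample lines copied from posts in this thread.
SAMPLE = [
    "2006-04-07 12:03:15 [rosetta@home] Unrecoverable error for result "
    "HBLR_1.0_1hz6_424_1768_0 (aborted via GUI RPC)",
    "2006-04-08 15:32:46 [rosetta@home] Unrecoverable error for result "
    "HBLR_1.0_1di2_426_4357_0 ( - exit code -1073741819 (0xc0000005))",
    "4/8/2006 11:47:26 AM||request_reschedule_cpus: process exited",
]

results = [classify(line) for line in SAMPLE]
counts = Counter(cat for r in results if r is not None for cat in [r[1]])
print(counts)
```

Counting categories this way separates the units that were aborted by hand from the ones that actually died with an access violation, which is the difference being pointed out above.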
Robert J Send message Joined: 7 Oct 05 Posts: 3 Credit: 397,467 RAC: 0 |
Got this message on a work unit a few minutes ago. 4/8/2006 11:47:26 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_2tif_425_8010_1 ( - exit code -1073741819 (0xc0000005)) 4/8/2006 11:47:26 AM||request_reschedule_cpus: process exited 4/8/2006 11:47:26 AM|rosetta@home|Computation for result HBLR_1.0_2tif_425_8010_1 finished Running Win XP SP2, P4 3.2 GHz 1.5Gb memory. Boinc set to keep in memory. Work unit run time set to 4 hours. Second time this has happened in the last 24 hours. |
[DPC]Division_Brabant~OldButNotSoWise Send message Joined: 23 Jan 06 Posts: 42 Credit: 371,797 RAC: 0 |
Unrecoverable error for result FARELAX_NOFILTERS_1c8cA_427_175_0 ( - exit code -1073741819 (0xc0000005)) |
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
With a Max CPU Time setting of 1 hour, only 2 of the last 12 of my HBLR WUs would have uploaded properly. Even with a 1-hour Max CPU Time setting, these WUs have an incredibly high failure rate. Dr. Baker: Can the HBLRs be totally removed from the system so they're not released to anyone else this weekend? Or could 4.83 be re-released as client 4.98 (if 4.83 can handle these WUs)? Or both? |
Buffalo Bill Send message Joined: 25 Mar 06 Posts: 71 Credit: 1,630,458 RAC: 0 |
I've noticed most of the posted failures are HBLR_1.0.... WU's. I did get a couple finished by shutting down BOINC and logging out/in and restarting BOINC. They then ran clean for 4 hours and finished on 2 different machines. |
Sander Send message Joined: 18 Dec 05 Posts: 1 Credit: 452,447 RAC: 0 |
After some HBLR errors, now FARELAX errors. The job was at 100%, and then I got: 08/04/2006 22:29:04|rosetta@home|Unrecoverable error for result FARELAX_NOFILTERS_1a68__427_51_0 ( - exit code -1073741819 (0xc0000005)) Using R@h v4.97 |
Mikkie Send message Joined: 1 Apr 06 Posts: 9 Credit: 5,700 RAC: 0 |
I'm switching back to the project I left. Stable as a rock. 2006-04-08 22:53:57 [rosetta@home] Unrecoverable error for result FARELAX_NOFILTERS_1fkb__427_59_0 ( - exit code -1073741819 (0xc0000005)) 2006-04-08 23:07:52 [rosetta@home] Unrecoverable error for result FARELAX_NOFILTERS_5croA_427_242_0 (aborted via GUI RPC) Using R@h 4.97 |
Walter Roberson Send message Joined: 5 Dec 05 Posts: 2 Credit: 13,937 RAC: 0 |
I've just aborted an overdue WU "stuck at 1%". Windows XP SP1, 512 MB, running under BOINC. https://boinc.bakerlab.org/rosetta/result.php?resultid=17048614 This was the first WU issued to me after the recent Rosetta upgrade. Now that I have aborted it, I will run another unit and see if the same problem occurs. Workunit: TRUNCATE_TERMINI_FULLRELAX_2tif__433_873; 1.042% complete; CPU time: 43 hr 52 min 10 sec; Walter Roberson -= Total credit: 9051.8 - RAC: 47.0815; Rosetta@home v4.98; Stage: full atom relax; Model: 1; Step: 283223; Accepted RMSD: 2.039; Accepted Energy: -73.13141 |
yoner Send message Joined: 17 Sep 05 Posts: 10 Credit: 2,581,874 RAC: 0 |
Hello, I have several units that are into the very high numbers for computing time: NO_TERM__STRAND_1ogw_423_2138_1 (v5.01) and NO_TERM__STRAND_1ogw_423_6238_1 (v5.01). Both have run for approx. 100 hours on a dual PII 233. I know that they are still processing, as the Graphics option shows the Step counter increasing. How many steps are in the work units? I have another unit, HB_BARCODE_30_1bm8__351_25694_2 (v5.01), that is at over 30 hours on a P4 3 GHz with 2 GB RAM. There is a possibility that the computation hours for this unit on this computer got fubarred by a system reboot, but it should not be that bad. Any ideas what is going on? |
©2024 University of Washington
https://www.bakerlab.org