Message boards : Number crunching : Report stuck & aborted WU here please
Author | Message |
---|---|
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
Just some seconds after I post this the next one is down the drain. 2006-04-08 16:32:43 [rosetta@home] Unrecoverable error for result HBLR_1.0_1mky_426_4765_0 (aborted via GUI RPC) Where it says "aborted via GUI RPC", it means you aborted the work unit and it's just reporting it as an error. Unless I'm missing something, you caused this on all but the last two you've listed. 2006-04-07 12:03:15 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_424_1768_0 (aborted via GUI RPC) |
Robert J Send message Joined: 7 Oct 05 Posts: 3 Credit: 397,467 RAC: 0 |
Got this message on a work unit a few minutes ago. 4/8/2006 11:47:26 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_2tif_425_8010_1 ( - exit code -1073741819 (0xc0000005)) 4/8/2006 11:47:26 AM||request_reschedule_cpus: process exited 4/8/2006 11:47:26 AM|rosetta@home|Computation for result HBLR_1.0_2tif_425_8010_1 finished Running Win XP SP2, P4 3.2 GHz 1.5Gb memory. Boinc set to keep in memory. Work unit run time set to 4 hours. Second time this has happened in the last 24 hours. |
[DPC]Division_Brabant~OldButNotSoWise Send message Joined: 23 Jan 06 Posts: 42 Credit: 371,797 RAC: 0 |
Unrecoverable error for result FARELAX_NOFILTERS_1c8cA_427_175_0 ( - exit code -1073741819 (0xc0000005)) |
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
With a Max CPU Time setting of 1 hour, only 2 of the last 12 of my HBLR WUs would have uploaded properly. Even with a 1 hour Max CPU Time setting, these WUs have an incredibly high failure rate. Dr. Baker: Can the HBLRs be totally removed from the system so they're not released to anyone else this weekend? Or could 4.83 be re-released as client 4.98 (if 4.83 can handle these WUs)? Or both? |
Buffalo Bill Send message Joined: 25 Mar 06 Posts: 71 Credit: 1,630,458 RAC: 0 |
I've noticed most of the posted failures are HBLR_1.0.... WU's. I did get a couple finished by shutting down BOINC and logging out/in and restarting BOINC. They then ran clean for 4 hours and finished on 2 different machines. |
Sander Send message Joined: 18 Dec 05 Posts: 1 Credit: 452,447 RAC: 0 |
After some HBLR errors, now FARELAX errors. The job was at 100%, and then I got: 08/04/2006 22:29:04|rosetta@home|Unrecoverable error for result FARELAX_NOFILTERS_1a68__427_51_0 ( - exit code -1073741819 (0xc0000005)) Using R@h v4.97 |
Mikkie Send message Joined: 1 Apr 06 Posts: 9 Credit: 5,700 RAC: 0 |
I'm switching back to the project I left. Stable as a rock. 2006-04-08 22:53:57 [rosetta@home] Unrecoverable error for result FARELAX_NOFILTERS_1fkb__427_59_0 ( - exit code -1073741819 (0xc0000005)) 2006-04-08 23:07:52 [rosetta@home] Unrecoverable error for result FARELAX_NOFILTERS_5croA_427_242_0 (aborted via GUI RPC) Using R@h 4.97 |
Walter Roberson Send message Joined: 5 Dec 05 Posts: 2 Credit: 13,937 RAC: 0 |
I've just aborted an overdue WU "stuck at 1%". Windows XP SP1, 512 MB, running under BOINC. https://boinc.bakerlab.org/rosetta/result.php?resultid=17048614 This was the first WU issued to me after the recent Rosetta upgrade. Now that I have aborted it, I will run another unit and see if the same problem occurs. workunit TRUNCATE_TERMINI_FULLRELAX_2tif__433_873 1.042% complete CPU time: 43 hr 52 min 10 sec Walter Roberson -= Total credit: 9051.8 - RAC: 47.0815 Rosetta@home v4.98 Stage: full atom relax Model: 1 step 283223 Accepted RMSD: 2.039 Accepted Energy: -73.13141 |
yoner Send message Joined: 17 Sep 05 Posts: 10 Credit: 2,581,874 RAC: 0 |
Hello, I have several units that are into the very high numbers for computing. NO_TERM__STRAND_1ogw_423_2138_1 (v5.01) NO_TERM__STRAND_1ogw_423_6238_1 (v5.01) Both have run for approx 100 hours on a dual PII 233, I know that they are still processing, as looking at the Graphics options shows the Step counter increasing. How many steps are in the work units? I have another unit: HB_BARCODE_30_1bm8__351_25694_2 (v 5.01) that is at over 30 hours on a P4 3GHz, 2 gig ram. There is a possibility that this unit on this computer got fubarred by a system re-boot for the hours of computation, but should not be that bad. Any ideas what is going on? |
Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Hiya: If you see a workunit going on for more than four times your preferred CPU run time (by default it has been 3 hours, so >12 hours), I'd delete the job. We had some reports of old WUs getting stuck on some machines. We've put a feature in the newest application, Rosetta@home 5.06 (a "watchdog" timer), that should automatically carry out an abort if the job has been going on too long. So hopefully this will be the last time you'll need to manually abort jobs that seem to be going on forever. Also, please note that we will grant credit for your aborted jobs even if they are reported as errors, about a week after you abort them. |
yoner Send message Joined: 17 Sep 05 Posts: 10 Credit: 2,581,874 RAC: 0 |
Thanks. As a side note, I found out exactly what was happening with the unit that was running on the P4. The unit completes model 1 and then starts over from step 1 again. I happened to catch it as it was doing that. The other two units are still counting upwards on the dual PII though, so I'm going to see what happens. |
Walter Roberson Send message Joined: 5 Dec 05 Posts: 2 Credit: 13,937 RAC: 0 |
I've just aborted an overdue WU "stuck at 1%". Windows XP SP1, 512 Mb, Clarification: the "recent Rosetta upgrade" I referred to was about April 12th, one of the 4.x improvements. When I allowed new work, 5.x was downloaded, and so far the WU have been progressing fine with that. |
Bespin Reactor Shaft Send message Joined: 29 Nov 05 Posts: 1 Credit: 100,592 RAC: 0 |
OK. Here's one: rosetta 5.01 FACONTACTS_RECENTER_NOFILTERS_1b3aA_448_266_2 CPU time: 35:52:47 Progress: 1.15% To completion: 38:23:33 Deadline: 6 May 2006 |
Winkle Send message Joined: 22 May 06 Posts: 88 Credit: 1,354,930 RAC: 0 |
I have t307__CASP7_ABRELAX_SAVE_ALL_OUT_BARCODE_hom001__714_20997_0 using Rosetta version 5.22 and it has been running now for 24 hrs. It has been stuck on 100% for at least the last hour I have been watching it. Mem usage of Rosetta was 88M and is now 94M after 30 mins. Now 97M and climbing. CPU usage doesn't change when I suspend the task from the BOINC manager. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=20861564 The show graphics screen says... 68.601% complete CPU time: 24 hr 0 min Stage: Ab initio + relax Model 116 step 0 Accepted Energy 44.55485 Nothing is changing on the screen. The protein looks like a single zig-zag line. Target CPU time is set to 8 hrs. The machine became unworkable, but is back to normal after the abort. |
Rich Send message Joined: 30 Nov 05 Posts: 5 Credit: 594,384 RAC: 0 |
Good morning. I have just submitted 2: FRA_t323_CASP7_hom001_2_IGNORE_THE_RESTt323_2_dec00_1.pdb_771_81 and FRA_t323_CASP7_hom001_2_IGNORE_THE_RESTt323_2_dec23_4.pdb_771_80. Both originally were in the 33hr range, one at 1.65% and one around 1.07%. I also noticed that my stats were not updating, so I rebooted. After an hour or so they got stuck again, this time at 19.15% and 18.51% respectively. I did another reboot, they regressed to 18.50% and 17.90% and stayed there for more than an hour. I have to run to work now but hope that this information might be useful. Take care and have a good day. Rich Rich Seyfert Eatontown, NJ SeyfertR@att.net |
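
For anyone wanting to check their own numbers against the posts above, here is a minimal Python sketch of two bits of arithmetic that come up in this thread. It is only an illustration, not how BOINC or Rosetta@home does it. The first function shows why the logs print both -1073741819 and 0xc0000005: they are the same 32-bit value, read as signed and as unsigned (on Windows, 0xC0000005 is the access violation status code). The second applies Rhiju's rule of thumb, treating a work unit as stuck once its CPU time exceeds four times the preferred run time. The function names `exit_code_to_hex` and `looks_stuck` are made up for this example.

```python
def exit_code_to_hex(code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{code & 0xFFFFFFFF:08x}"


def looks_stuck(cpu_hours: float, preferred_hours: float = 3.0,
                factor: float = 4.0) -> bool:
    """Rule of thumb from the thread: stuck if CPU time > factor * preferred run time.

    The defaults (3 hours, factor 4) are the figures quoted in Rhiju's post.
    """
    return cpu_hours > factor * preferred_hours


# Exit code reported in several of the error lines above.
print(exit_code_to_hex(-1073741819))            # 0xc0000005

# 43 hr 52 min of CPU time with the default 3 hour preference.
print(looks_stuck(43 + 52 / 60))                # True

# 24 hours of CPU time with an 8 hour target is still under 4 x 8 = 32 hours.
print(looks_stuck(24.0, preferred_hours=8.0))   # False
```

Note that the rule only looks at CPU time. A unit sitting at 100% or eating memory, like the one Winkle reported, can be misbehaving well before it crosses the threshold.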
©2025 University of Washington
https://www.bakerlab.org