Report stuck & aborted WU here please

Team TMR

Joined: 2 Nov 05
Posts: 21
Credit: 1,583,679
RAC: 0
Message 13078 - Posted: 5 Apr 2006, 10:16:38 UTC

I've got two more somewhere that are stuck on 1.04% after 2 hours and 1.5 hours.
ID: 13078 · Rating: 0
Delk

Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 13079 - Posted: 5 Apr 2006, 10:20:41 UTC

Another frozen workunit:

name: FA_RLXpt_hom006_1ptq__361_80_1
WU name: FA_RLXpt_hom006_1ptq__361_80
app version num: 483
checkpoint CPU time: 2112.812500
current CPU time: 121543.171875
fraction done: 0.293420
VM usage: 0.000000
resident set size: 0.000000
estimated CPU time remaining: 89449.556235

result id: 15996230
workunit id: 11646530

It was meant to be a 2-hour unit, but obviously things went wrong; I noticed the issue 121543 seconds later.
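
The figures in this report show how far the task ran past its last checkpoint. The sketch below is only an illustration: the function names and the idea of comparing the gap against the 24-hour limit mentioned in this thread are assumptions, not how the BOINC client or Rosetta actually decides to terminate a task.

```python
# Values copied from the report above.
CURRENT_CPU_S = 121543.171875
CHECKPOINT_CPU_S = 2112.812500

# Assumption: use the 24-hour maximum runtime discussed in this thread as the threshold.
MAX_RUNTIME_S = 24 * 3600


def seconds_since_checkpoint(current_cpu_s: float, checkpoint_cpu_s: float) -> float:
    """CPU time accumulated since the last checkpoint was written."""
    return current_cpu_s - checkpoint_cpu_s


def looks_stuck(current_cpu_s: float, checkpoint_cpu_s: float,
                limit_s: float = MAX_RUNTIME_S) -> bool:
    """Hypothetical heuristic: flag a task whose un-checkpointed CPU time exceeds the limit."""
    return seconds_since_checkpoint(current_cpu_s, checkpoint_cpu_s) > limit_s


gap = seconds_since_checkpoint(CURRENT_CPU_S, CHECKPOINT_CPU_S)
print(f"{gap:.1f} s since last checkpoint ({gap / 3600:.1f} h)")
print("looks stuck:", looks_stuck(CURRENT_CPU_S, CHECKPOINT_CPU_S))
```

With these numbers the gap is roughly 33 hours, well past a 24-hour limit, which is why the missing automatic abort stands out.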
ID: 13079 · Rating: 0
Delk

Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 13080 - Posted: 5 Apr 2006, 12:03:57 UTC

In regard to my previous post, shouldn't the max unit runtime of 24 hours have aborted it automatically, or has that been removed?
ID: 13080 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 13082 - Posted: 5 Apr 2006, 14:33:41 UTC - in response to Message 13080.  

In regard to my previous post, shouldn't the max unit runtime of 24 hours have aborted it automatically, or has that been removed?


It should have; we will try to figure out why these weren't terminated. Also, the reports here will be very helpful in pinning down the problem.
ID: 13082 · Rating: 0
AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 13101 - Posted: 5 Apr 2006, 22:35:54 UTC

I think all these FA_* WUs are old. Someone else aborted them, so they were sent out again. You can check the creation date on the WUs page, and anything created in March should be aborted if it seems stuck.

The 24-hour timeout is only in newer WUs, which started coming out at the end of March.
ID: 13101 · Rating: 0
Dave Wilson

Joined: 8 Jan 06
Posts: 35
Credit: 379,049
RAC: 0
Message 13130 - Posted: 6 Apr 2006, 20:06:14 UTC

Just found https://boinc.bakerlab.org/rosetta/result.php?resultid=16005553 (sorry, I did not get the rest of the info, but it was stuck at around 17 hours and 34.--- %).


ID: 13130 · Rating: 0
Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 13154 - Posted: 7 Apr 2006, 6:07:23 UTC
Last modified: 7 Apr 2006, 6:51:54 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=15702999

ID: 13154 · Rating: 0
Mikkie

Joined: 1 Apr 06
Posts: 9
Credit: 5,700
RAC: 0
Message 13232 - Posted: 8 Apr 2006, 14:24:15 UTC


Somebody needs to find a solution for this series. If this goes on, it's no fun anymore. All crashed after more than an hour of crunching.

2006-04-07 12:03:15 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_424_1768_0 (aborted via GUI RPC)
2006-04-07 14:28:40 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_3457_0 (aborted via GUI RPC)
2006-04-07 18:49:47 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_6531_0 (aborted via GUI RPC)
2006-04-08 10:29:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1b72_424_2309_1 (aborted via GUI RPC)
2006-04-08 15:26:29 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_426_2934_0 (aborted via GUI RPC)
2006-04-08 15:32:46 [rosetta@home] Unrecoverable error for result HBLR_1.0_1di2_426_4357_0 ( - exit code -1073741819 (0xc0000005))
2006-04-08 15:41:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1r69_426_325_1 ( - exit code -1073741819 (0xc0000005))
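
The two forms of the exit code in the last two log lines are the same 32-bit value: `-1073741819` is the signed reading and `0xc0000005` the unsigned one. On Windows, `0xC0000005` is the NTSTATUS code for an access violation. A minimal sketch of the conversion (the helper name is made up for illustration):

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as its unsigned 32-bit value."""
    return code & 0xFFFFFFFF


exit_code = -1073741819  # as printed in the BOINC log lines above
print(hex(to_unsigned32(exit_code)))
```

So the last two entries are crashes inside the application, unlike the "aborted via GUI RPC" entries.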
ID: 13232 · Rating: 0
Mikkie

Joined: 1 Apr 06
Posts: 9
Credit: 5,700
RAC: 0
Message 13236 - Posted: 8 Apr 2006, 14:47:05 UTC - in response to Message 13232.  


Somebody needs to find a solution for this series. If this goes on, it's no fun anymore. All crashed after more than an hour of crunching.

2006-04-07 12:03:15 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_424_1768_0 (aborted via GUI RPC)
2006-04-07 14:28:40 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_3457_0 (aborted via GUI RPC)
2006-04-07 18:49:47 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_6531_0 (aborted via GUI RPC)
2006-04-08 10:29:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1b72_424_2309_1 (aborted via GUI RPC)
2006-04-08 15:26:29 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_426_2934_0 (aborted via GUI RPC)
2006-04-08 15:32:46 [rosetta@home] Unrecoverable error for result HBLR_1.0_1di2_426_4357_0 ( - exit code -1073741819 (0xc0000005))
2006-04-08 15:41:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1r69_426_325_1 ( - exit code -1073741819 (0xc0000005))


Just seconds after I posted this, the next one went down the drain. 2006-04-08 16:32:43 [rosetta@home] Unrecoverable error for result HBLR_1.0_1mky_426_4765_0 (aborted via GUI RPC)
I'm quitting/rejecting/going on hold until these problems have been solved and things are stable. I couldn't get a single result/point on the board.
ID: 13236 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 13239 - Posted: 8 Apr 2006, 14:55:06 UTC - in response to Message 13232.  


Somebody needs to find a solution for this series. If this goes on, it's no fun anymore. All crashed after more than an hour of crunching.

2006-04-07 12:03:15 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_424_1768_0 (aborted via GUI RPC)
2006-04-07 14:28:40 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_3457_0 (aborted via GUI RPC)
2006-04-07 18:49:47 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_6531_0 (aborted via GUI RPC)
2006-04-08 10:29:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1b72_424_2309_1 (aborted via GUI RPC)
2006-04-08 15:26:29 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_426_2934_0 (aborted via GUI RPC)
2006-04-08 15:32:46 [rosetta@home] Unrecoverable error for result HBLR_1.0_1di2_426_4357_0 ( - exit code -1073741819 (0xc0000005))
2006-04-08 15:41:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1r69_426_325_1 ( - exit code -1073741819 (0xc0000005))


Can you try setting the work unit time to 1 hour? Thanks, David

ID: 13239 · Rating: 0
OldButNotSoWise
Joined: 5 Nov 05
Posts: 2
Credit: 0
RAC: 0
Message 13240 - Posted: 8 Apr 2006, 15:00:11 UTC

Unrecoverable error for result HBLR_1.0_1r69_426_2081_0 ( - exit code -1073741819 (0xc0000005))
ID: 13240 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 13262 - Posted: 8 Apr 2006, 19:30:19 UTC - in response to Message 13236.  
Last modified: 8 Apr 2006, 19:32:15 UTC

Just seconds after I posted this, the next one went down the drain. 2006-04-08 16:32:43 [rosetta@home] Unrecoverable error for result HBLR_1.0_1mky_426_4765_0 (aborted via GUI RPC)
I'm quitting/rejecting/going on hold until these problems have been solved and things are stable. I couldn't get a single result/point on the board.

Where it says "aborted via GUI RPC", it means you aborted the work unit and it's just being reported as an error. Unless I'm missing something, you caused this on all but the last two you've listed.

2006-04-07 12:03:15 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_424_1768_0 (aborted via GUI RPC)
2006-04-07 14:28:40 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_3457_0 (aborted via GUI RPC)
2006-04-07 18:49:47 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_6531_0 (aborted via GUI RPC)
2006-04-08 10:29:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1b72_424_2309_1 (aborted via GUI RPC)
2006-04-08 15:26:29 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_426_2934_0 (aborted via GUI RPC)
2006-04-08 15:32:46 [rosetta@home] Unrecoverable error for result HBLR_1.0_1di2_426_4357_0 ( - exit code -1073741819 (0xc0000005))
2006-04-08 15:41:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1r69_426_325_1 ( - exit code -1073741819 (0xc0000005))
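
The distinction made above can be checked mechanically. The following is a rough sketch, not a BOINC tool: the regular expression and the category labels are assumptions based only on the wording of the seven log lines listed here.

```python
import re
from collections import Counter

# The seven log lines quoted above, verbatim.
LOG_LINES = [
    "2006-04-07 12:03:15 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_424_1768_0 (aborted via GUI RPC)",
    "2006-04-07 14:28:40 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_3457_0 (aborted via GUI RPC)",
    "2006-04-07 18:49:47 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_6531_0 (aborted via GUI RPC)",
    "2006-04-08 10:29:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1b72_424_2309_1 (aborted via GUI RPC)",
    "2006-04-08 15:26:29 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_426_2934_0 (aborted via GUI RPC)",
    "2006-04-08 15:32:46 [rosetta@home] Unrecoverable error for result HBLR_1.0_1di2_426_4357_0 ( - exit code -1073741819 (0xc0000005))",
    "2006-04-08 15:41:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1r69_426_325_1 ( - exit code -1073741819 (0xc0000005))",
]

# Assumed pattern: result name, then the reason in the outermost parentheses.
ERROR_RE = re.compile(r"Unrecoverable error for result (\S+) \((.*)\)\s*$")


def classify(line: str) -> str:
    """Label a log line as 'user abort', 'crash', or 'other' from its reason text."""
    match = ERROR_RE.search(line)
    if match is None:
        return "other"
    reason = match.group(2)
    if "aborted via GUI RPC" in reason:
        return "user abort"
    if "exit code" in reason:
        return "crash"
    return "other"


counts = Counter(classify(line) for line in LOG_LINES)
for label, n in counts.items():
    print(f"{label}: {n}")
```

On this list the tally is five user aborts and two crashes, matching the reading that only the last two entries are genuine application errors.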
ID: 13262 · Rating: 0
Robert J
Joined: 7 Oct 05
Posts: 3
Credit: 397,467
RAC: 0
Message 13267 - Posted: 8 Apr 2006, 19:58:26 UTC
Last modified: 8 Apr 2006, 19:59:23 UTC

Got this message on a work unit a few minutes ago.

4/8/2006 11:47:26 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_2tif_425_8010_1 ( - exit code -1073741819 (0xc0000005))
4/8/2006 11:47:26 AM||request_reschedule_cpus: process exited
4/8/2006 11:47:26 AM|rosetta@home|Computation for result HBLR_1.0_2tif_425_8010_1 finished

Running Win XP SP2, P4 3.2 GHz, 1.5 GB memory.

BOINC set to keep in memory.

Work unit run time set to 4 hours.

Second time this has happened in the last 24 hours.


ID: 13267 · Rating: 0
[DPC]Division_Brabant~OldButNotSoWise
Joined: 23 Jan 06
Posts: 42
Credit: 371,797
RAC: 0
Message 13270 - Posted: 8 Apr 2006, 20:21:38 UTC

Unrecoverable error for result FARELAX_NOFILTERS_1c8cA_427_175_0 ( - exit code -1073741819 (0xc0000005))
ID: 13270 · Rating: 0
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 13271 - Posted: 8 Apr 2006, 20:24:32 UTC

With a Max Cpu Time setting of 1 hour, only 2 of the last 12 of my HBLR WUs would have uploaded properly. Even with a 1 hour Max CPU Time setting, these WUs have an incredibly high failure rate.

Dr. Baker: Can the HBLRs be totally removed from the system so they're not released to anyone else this weekend? Or could 4.83 be re-released as client 4.98 (if 4.83 can handle these WUs)? Or both?


ID: 13271 · Rating: 0
Buffalo Bill
Joined: 25 Mar 06
Posts: 71
Credit: 1,630,458
RAC: 0
Message 13272 - Posted: 8 Apr 2006, 20:27:48 UTC
Last modified: 8 Apr 2006, 20:36:46 UTC

I've noticed most of the posted failures are HBLR_1.0.... WUs. I did get a couple finished by shutting down BOINC, logging out and back in, and restarting BOINC. They then ran clean for 4 hours and finished on 2 different machines.
ID: 13272 · Rating: 0
Sander
Joined: 18 Dec 05
Posts: 1
Credit: 452,447
RAC: 0
Message 13274 - Posted: 8 Apr 2006, 20:49:17 UTC

After some HBLR errors, now FARELAX errors.
The job was at 100%, and then I got:
08/04/2006 22:29:04|rosetta@home|Unrecoverable error for result FARELAX_NOFILTERS_1a68__427_51_0 ( - exit code -1073741819 (0xc0000005))

Using R@h v4.97
ID: 13274 · Rating: 0
Mikkie

Joined: 1 Apr 06
Posts: 9
Credit: 5,700
RAC: 0
Message 13277 - Posted: 8 Apr 2006, 21:21:31 UTC

I'm switching back to the project I left. Stable as a rock.

2006-04-08 22:53:57 [rosetta@home] Unrecoverable error for result FARELAX_NOFILTERS_1fkb__427_59_0 ( - exit code -1073741819 (0xc0000005))
2006-04-08 23:07:52 [rosetta@home] Unrecoverable error for result FARELAX_NOFILTERS_5croA_427_242_0 (aborted via GUI RPC) using r@h 4.97
ID: 13277 · Rating: 0
Walter Roberson

Joined: 5 Dec 05
Posts: 2
Credit: 13,937
RAC: 0
Message 14706 - Posted: 27 Apr 2006, 0:08:48 UTC

I've just aborted an overdue WU "stuck at 1%". Windows XP SP1, 512 MB,
running under BOINC.

https://boinc.bakerlab.org/rosetta/result.php?resultid=17048614

This was the first WU issued to me after the recent Rosetta upgrade. Now that
I have aborted it, I will run another unit and see if the same problem occurs.

workunit TRUNCATE_TERMINI_FULLRELAX_2tif__433_873
1.042% complete
CPU time: 43 hr 52 min 10 sec
Walter Roberson -= Total credit: 9051.8 - RAC: 47.0815
Rosetta@home v4.98
Stage: full atom relax
Model: 1 step 283223
Accepted RMSD: 2.039
Accepted Energy: -73.13141
ID: 14706 · Rating: 0
yoner
Joined: 17 Sep 05
Posts: 10
Credit: 2,581,874
RAC: 0
Message 14825 - Posted: 28 Apr 2006, 5:09:57 UTC

Hello,

I have several units whose computing times have reached very high numbers.

NO_TERM__STRAND_1ogw_423_2138_1 (v5.01)
NO_TERM__STRAND_1ogw_423_6238_1 (v5.01)

Both have run for approx. 100 hours on a dual PII 233. I know that they are still processing, because the Graphics option shows the Step counter increasing. How many steps are in the work units?

I have another unit, HB_BARCODE_30_1bm8__351_25694_2 (v5.01), that is at over 30 hours on a P4 3 GHz with 2 GB RAM. It is possible that a system reboot fubarred the hours of computation for this unit on this computer, but it should not be that bad.

Any ideas what is going on?


ID: 14825 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org