Report stuck & aborted WU here please

Message boards : Number crunching : Report stuck & aborted WU here please

warlock

Joined: 9 Oct 05
Posts: 1
Credit: 3,379,414
RAC: 0
Message 12970 - Posted: 3 Apr 2006, 3:52:24 UTC

4/1/2006 9:53:00 PM|rosetta@home|Resuming result FA_RLXpt_hom006_1ptq__361_13_0 using rosetta version 482

This seems to be another WU that gets stuck.

The graphic display indicates 1.00% done after 26+ hours and:
Stage: Full Atom Relax
Model: 1 Step: 331496
Accepted RMSD: 10.84
Accepted Energy: -51.44441

My machine and relevant software are:
4/1/2006 5:24:19 PM||Starting BOINC client version 5.2.13 for windows_intelx86
4/1/2006 5:24:19 PM||libcurl/7.14.0 OpenSSL/0.9.8 zlib/1.2.3
4/1/2006 5:24:19 PM||Data directory: C:\Program Files\BOINC
4/1/2006 5:24:20 PM||Processor: 2 GenuineIntel Intel(R) Pentium(R) 4 CPU 2.80GHz
4/1/2006 5:24:20 PM||Memory: 1006.73 MB physical, 2.37 GB virtual


Kelemvor
Joined: 28 Dec 05
Posts: 7
Credit: 458,146
RAC: 0
Message 12997 - Posted: 3 Apr 2006, 18:29:15 UTC
Last modified: 3 Apr 2006, 18:33:04 UTC

Just checked a couple of my PCs that weren't reporting and found a few stuck on WUs whose names start with FA_.

One is FA_RLXpt_hom006_1ptq__361_29_0. It's been running for 95 hours.
The other is FA_RLXpt_hom003_1ptq_361_283_0. That one's been running for 227 hours!

What should I do? Let me know if you need more info.

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 13002 - Posted: 3 Apr 2006, 19:00:59 UTC - in response to Message 12997.  

Those look like old WUs from before the latest round of fixes. Just abort them.

Kelemvor
Joined: 28 Dec 05
Posts: 7
Credit: 458,146
RAC: 0
Message 13003 - Posted: 3 Apr 2006, 19:17:16 UTC - in response to Message 13002.  

There are a ton more WUs in the queue with that same general name. Should I go through and abort them all?

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 13034 - Posted: 4 Apr 2006, 5:54:45 UTC - in response to Message 13003.  

If you have WUs that call for the older client, and they are giving you problems, then go ahead and abort them all.

I believe the current version of Rosetta is 4.83 for Windows and 4.82 for Linux.

(The "Application" field of the "Work" tab of BOINC Manager gives the version that the WU is asking for.)

Jimi@0wned.org.uk
Joined: 10 Mar 06
Posts: 29
Credit: 335,252
RAC: 0
Message 13036 - Posted: 4 Apr 2006, 7:29:41 UTC

First failure on this rig, a bit odd and not one I've seen before:

7449_2_fullatom_relax_dec7449_2_09_3.pdb_415_4_0

<core_client_version>5.2.13</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# random seed: 1794757
# cpu_run_time_pref: 14400
# DONE :: 1 starting structures built 4 (nstruct) times
# This process generated 4 decoys from 4 attempts
# 0 starting pdbs were skipped
ERROR:: Exit at: .read_paths.cc line:346

</stderr_txt>

UBT - Halifax--lad
Joined: 17 Sep 05
Posts: 157
Credit: 2,687
RAC: 0
Message 13037 - Posted: 4 Apr 2006, 7:49:31 UTC

I had my first crash in a very long time. This is the first Rosetta WU I have run since the new bug-tracking code was put into the WUs, so I wonder whether that caused the crash.

https://boinc.bakerlab.org/rosetta/result.php?resultid=15959192

I don't have the messages for it; I searched high and low and could not find any. I woke up this morning and this WU was just sitting there at around 12% with nothing else happening, so I aborted it. I will see what the next WU does.

Team TMR
Joined: 2 Nov 05
Posts: 21
Credit: 1,583,679
RAC: 0
Message 13038 - Posted: 4 Apr 2006, 8:04:51 UTC
Last modified: 4 Apr 2006, 9:02:12 UTC

This WU seems to be stuck: 13107954

Over 3 hours in and it's on 1.19%. Job CPU time is set to 2 hours.

Edit: Now at 4 hours and 1.30%.
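
The two readings above (1.19% at about 3 hours, 1.30% at 4 hours) are enough for a rough extrapolation. Below is a minimal Python sketch, assuming progress stays linear between readings; the function name `hours_remaining` is made up for illustration and is not part of BOINC or Rosetta.

```python
def hours_remaining(t1_h, pct1, t2_h, pct2):
    """Linearly extrapolate hours left from two (hours, percent done) readings.

    Hypothetical helper for illustration only; assumes the progress rate
    observed between the two readings stays constant.
    """
    rate = (pct2 - pct1) / (t2_h - t1_h)  # percent per hour
    if rate <= 0:
        return float("inf")  # no measurable progress between readings
    return (100.0 - pct2) / rate


# Readings reported in the post: 1.19% after ~3 hours, 1.30% after 4 hours.
estimate = hours_remaining(3, 1.19, 4, 1.30)
print(f"about {estimate:.0f} hours left at the observed rate")
```

At roughly 0.11% per hour the estimate comes out near 900 hours, which is far beyond the 2-hour target and supports treating the WU as stuck.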

Dufva
Joined: 2 Jan 06
Posts: 1
Credit: 31,453
RAC: 0
Message 13061 - Posted: 4 Apr 2006, 22:57:38 UTC

I have hit an unrecoverable error twice in the last week when opening the graphics from the BOINC client. Both times it happened while I was running heavy parallel processes (perhaps CPU overload?).

Windows XP got stuck and switched over to "secure" 16-color mode when I tried to open the graphics, and both times this resulted in an unrecoverable error:

2006-04-05 00:11:46 [rosetta@home] Unrecoverable error for result DOUBLE_SS_WEIGHT_1vie__419_6_0 ( - exit code -1073741819 (0xc0000005))
2006-04-05 00:11:49 [---] request_reschedule_cpus: process exited
2006-04-05 00:11:49 [rosetta@home] Computation for result DOUBLE_SS_WEIGHT_1vie__419_6_0 finished
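
The two numbers in that log line are the same exit code written two ways: -1073741819 is the signed 32-bit reading of 0xc0000005, which is the Windows status code for an access violation. A small Python sketch of the conversion follows; the helper names and the one-entry lookup table are illustrative assumptions, not anything taken from BOINC.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


# Deliberately tiny lookup table; only the code seen in this thread is listed.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def describe_exit_code(code: int) -> str:
    """Format an exit code as hex plus a name when the code is in the table."""
    unsigned = to_unsigned32(code)
    name = KNOWN_STATUS.get(unsigned, "unknown")
    return f"0x{unsigned:08x} ({name})"


print(describe_exit_code(-1073741819))
```

An access violation means the process touched memory it was not allowed to, so this is a crash of the application rather than a stuck WU.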

[DPC]Alexcj
Joined: 21 Mar 06
Posts: 3
Credit: 8,374
RAC: 0
Message 13074 - Posted: 5 Apr 2006, 8:12:59 UTC
Last modified: 5 Apr 2006, 8:17:41 UTC

Hi, I have a stuck workunit too.
It's this one: FAH_RLXpt_hom007_1ptq_361_166_1, also known as 11689531.

It is stuck at 1.04% (according to the GUI) with 6:54:44 hours CPU time done and 8:50:57 to go ;-)
It's on this machine.

The other participant wasn't so lucky either.



Team TMR
Joined: 2 Nov 05
Posts: 21
Credit: 1,583,679
RAC: 0
Message 13078 - Posted: 5 Apr 2006, 10:16:38 UTC

I've got two more somewhere that are stuck on 1.04% after 2 hours and 1.5 hours.

Delk
Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 13079 - Posted: 5 Apr 2006, 10:20:41 UTC

Another frozen workunit:

name: FA_RLXpt_hom006_1ptq__361_80_1
WU name: FA_RLXpt_hom006_1ptq__361_80
app version num: 483
checkpoint CPU time: 2112.812500
current CPU time: 121543.171875
fraction done: 0.293420
VM usage: 0.000000
resident set size: 0.000000
estimated CPU time remaining: 89449.556235

result id: 15996230
workunit id: 11646530

It was meant to be a 2-hour WU, but obviously something went wrong; I only noticed the issue after 121543 seconds of CPU time.
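
One way to spot this kind of freeze from the fields listed above is to compare the current CPU time with the last checkpoint. The Python sketch below assumes the status text has been copied by hand into a string; the 4-hour threshold and the function names are arbitrary choices for illustration, not BOINC settings.

```python
# Three of the fields from the post, copied by hand.
STATUS = """\
checkpoint CPU time: 2112.812500
current CPU time: 121543.171875
fraction done: 0.293420
"""


def parse_status(text):
    """Turn 'name: value' lines into a dict of floats."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = float(value)
    return fields


def seconds_since_checkpoint(fields):
    """CPU seconds spent since the last successful checkpoint."""
    return fields["current CPU time"] - fields["checkpoint CPU time"]


def looks_stuck(fields, max_gap_s=4 * 3600):
    """Flag a task whose last checkpoint is older than max_gap_s CPU seconds.

    The 4-hour default is a hypothetical threshold chosen for this example.
    """
    return seconds_since_checkpoint(fields) > max_gap_s


fields = parse_status(STATUS)
print(f"{seconds_since_checkpoint(fields) / 3600:.1f} CPU hours since last checkpoint")
print("looks stuck:", looks_stuck(fields))
```

Here the gap is about 33 CPU hours on a WU with a 2-hour target, so the task had not written a checkpoint for nearly all of its run.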

Delk
Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 13080 - Posted: 5 Apr 2006, 12:03:57 UTC

In regard to my previous post, shouldn't the max unit runtime of 24 hours have aborted it automatically, or has that been removed?

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 13082 - Posted: 5 Apr 2006, 14:33:41 UTC - in response to Message 13080.  

It should have; we will try to figure out why these weren't terminated. Also, the reports here will be very helpful in pinning down the problem.

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 13101 - Posted: 5 Apr 2006, 22:35:54 UTC

I think all these FA_* WUs are old. Someone else aborted them, so they were sent out again. You can check the creation date on the WUs page, and anything created in March should be aborted if it seems stuck.

The 24-hour timeout is only in newer WUs, which started coming out at the end of March.

Dave Wilson
Joined: 8 Jan 06
Posts: 35
Credit: 379,049
RAC: 0
Message 13130 - Posted: 6 Apr 2006, 20:06:14 UTC

Just found https://boinc.bakerlab.org/rosetta/result.php?resultid=16005553. Sorry, I did not get the rest of the info, but it was stuck at around 17 hours and 34.--- %.



Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 13154 - Posted: 7 Apr 2006, 6:07:23 UTC
Last modified: 7 Apr 2006, 6:51:54 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=15702999


Mikkie
Joined: 1 Apr 06
Posts: 9
Credit: 5,700
RAC: 0
Message 13232 - Posted: 8 Apr 2006, 14:24:15 UTC


Somebody needs to find a solution for this series. If this goes on, it's no fun anymore. All of these crashed after more than an hour of crunching.

2006-04-07 12:03:15 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_424_1768_0 (aborted via GUI RPC)
2006-04-07 14:28:40 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_3457_0 (aborted via GUI RPC)
2006-04-07 18:49:47 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_6531_0 (aborted via GUI RPC)
2006-04-08 10:29:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1b72_424_2309_1 (aborted via GUI RPC)
2006-04-08 15:26:29 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_426_2934_0 (aborted via GUI RPC)
2006-04-08 15:32:46 [rosetta@home] Unrecoverable error for result HBLR_1.0_1di2_426_4357_0 ( - exit code -1073741819 (0xc0000005))
2006-04-08 15:41:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1r69_426_325_1 ( - exit code -1073741819 (0xc0000005))
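
The log above mixes two different outcomes, and a quick tally makes that visible. This is a minimal Python sketch that groups the "Unrecoverable error" lines by the reason in parentheses; the regular expression is written for the line format shown in this post only and is an assumption, not a BOINC log specification.

```python
import re
from collections import Counter

# The seven log lines from the post, copied verbatim.
LOG = """\
2006-04-07 12:03:15 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_424_1768_0 (aborted via GUI RPC)
2006-04-07 14:28:40 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_3457_0 (aborted via GUI RPC)
2006-04-07 18:49:47 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_6531_0 (aborted via GUI RPC)
2006-04-08 10:29:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1b72_424_2309_1 (aborted via GUI RPC)
2006-04-08 15:26:29 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_426_2934_0 (aborted via GUI RPC)
2006-04-08 15:32:46 [rosetta@home] Unrecoverable error for result HBLR_1.0_1di2_426_4357_0 ( - exit code -1073741819 (0xc0000005))
2006-04-08 15:41:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1r69_426_325_1 ( - exit code -1073741819 (0xc0000005))
"""

# Captures the result name and everything inside the outermost parentheses.
PATTERN = re.compile(r"Unrecoverable error for result (\S+) \((.*)\)$")


def tally_reasons(log):
    """Count 'Unrecoverable error' lines by their stated reason."""
    counts = Counter()
    for line in log.splitlines():
        match = PATTERN.search(line)
        if match:
            # Drop the leading " - " that precedes exit-code reasons.
            counts[match.group(2).strip(" -")] += 1
    return counts


for reason, count in tally_reasons(LOG).most_common():
    print(count, reason)
```

Five of the seven entries read "aborted via GUI RPC", which as far as we understand BOINC's logging means the task was aborted from the Manager or another control tool, while only the last two are crashes with exit code 0xc0000005.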

Mikkie
Joined: 1 Apr 06
Posts: 9
Credit: 5,700
RAC: 0
Message 13236 - Posted: 8 Apr 2006, 14:47:05 UTC - in response to Message 13232.  


Just seconds after I posted this, the next one went down the drain: 2006-04-08 16:32:43 [rosetta@home] Unrecoverable error for result HBLR_1.0_1mky_426_4765_0 (aborted via GUI RPC)
I am quitting, or at least going on hold, until these problems have been solved and things are stable. I couldn't get a single result or point on the board.

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 13239 - Posted: 8 Apr 2006, 14:55:06 UTC - in response to Message 13232.  


Can you try setting the work unit time to 1 hour? Thanks, David




©2025 University of Washington
https://www.bakerlab.org