Report stuck & aborted WU here please

Lee Carre
Joined: 6 Oct 05
Posts: 96
Credit: 79,331
RAC: 0
Message 9663 - Posted: 23 Jan 2006, 21:29:37 UTC - in response to Message 9339.  
Last modified: 23 Jan 2006, 21:31:34 UTC

I have a result that hasn't failed yet, but it has been running for about 7 hours at 0%.
Rosetta results normally finish in under 7 hours on that host. I'll leave it and see what it does, though, because it's a "PRODUCTION" WU, a type I haven't seen before.

The WU name is "PRODUCTION_ABINITIO_1urnA_250_1147", if that helps.

My WU completed successfully (it took 27.73 hours) and was valid, so please ignore my previous post.
ID: 9663 · Rating: 0

TCU Computer Science
Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 9736 - Posted: 24 Jan 2006, 21:36:35 UTC

I just noticed that on one of my computers
NEW_SOFT_CENTROID_PACKING_1di2_225_7586_0
has been running since 6 January.

boincmgr shows
CPU Time 01:10:08
Progress 20%
To completion 05:01:08

but the messages show about 120 one-hour slices spent in execution.

The "pausing" messages show that it is being left in memory.
ID: 9736 · Rating: 0

BigMike
Joined: 2 Nov 05
Posts: 1
Credit: 10,600
RAC: 0
Message 9785 - Posted: 25 Jan 2006, 7:33:03 UTC

I've been running BOINC R@H 4.81 on my WinXP Pro laptop, and I think I see a correlation between the 0xc0000005 failures and putting my laptop into hibernation. When I power it up and resume, the WU I was crunching gets an 0xc0000005, and it starts a new one.

Has anyone else seen this?

BigMike
------
What doesn't kill you still requires a co-pay.
ID: 9785 · Rating: 0

[B^S] sTrey
Joined: 25 Sep 05
Posts: 16
Credit: 15,524
RAC: 0
Message 9843 - Posted: 25 Jan 2006, 19:59:30 UTC

Not sure if you want errors reported here, but this is the first errored-out WU I've had in a long time, and I see it wasn't just my result.
6543036

2006-01-25 11:48:42 [rosetta@home] Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_RLX_NATIVE_1mky_281_29_2 (Incorrect function. (0x1) - exit code 1 (0x1))
ID: 9843 · Rating: 0

Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 9886 - Posted: 26 Jan 2006, 1:57:45 UTC - in response to Message 9736.  

This is a known R@H bug. To prevent the problem, do the following:

In your user preferences, set the time between application switches (or swaps) to something close to 2 hours (120 min). That is usually enough to keep things going, but if you want to be really certain, set the preferences so that the R@H application is kept in memory during application swaps.

There is more about this in the FAQ sticky here.
Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 9886 · Rating: 0

carl.h
Joined: 28 Dec 05
Posts: 555
Credit: 183,449
RAC: 0
Message 10023 - Posted: 27 Jan 2006, 14:42:01 UTC

1/27/2006 14:22:09|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_2ci2I_250_2374_0 ( - exit code -1073741819 (0xc0000005))


3 minutes 17 seconds
Not all Czech`s bounce but I`d like to try with Barbar ;-)

Make no mistake This IS the TEDDIES TEAM.
ID: 10023 · Rating: 0

Roland Windsor Vincent
Joined: 6 Jan 06
Posts: 12
Credit: 730
RAC: 0
Message 10100 - Posted: 28 Jan 2006, 9:37:31 UTC

I'm just jumping in here because I couldn't find any thread directly on point. I've been having WUs report a "client error" after they appeared to have run successfully. Aside from the wasted time, no credit is given. What's up?
Gravity is just a scientific theory. We should also teach the religious view that God is pushing us down.
ID: 10100 · Rating: 0

Rebel Alliance
Joined: 4 Nov 05
Posts: 50
Credit: 3,579,531
RAC: 0
Message 10112 - Posted: 28 Jan 2006, 16:01:44 UTC

NO_SIM_ANNEAL_BARCODE_30_2reb_278_2151
25 Jan 2006 4:40:09 UTC | 28 Jan 2006 5:25:43 UTC | Over | Client error | Computing | 260,224.27 | 1,133.72 | ---

and

OMEGA_WT_1.0_2reb_275_2901
24 Jan 2006 11:04:19 UTC | 28 Jan 2006 5:26:21 UTC | Over | Client error | Computing | 224,124.79 | 883.46

Two different machines. Had to abort both
ID: 10112 · Rating: 0

Insidious
Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 10153 - Posted: 28 Jan 2006, 23:47:24 UTC - in response to Message 8741.  

I had to abort one today:

https://boinc.bakerlab.org/rosetta/result.php?resultid=8293849

-Sid

(I'm also seeing several erroring out in the first 20 or 30 seconds.. too many to link them all here)
Proudly crunching with TeAm Anandtech
ID: 10153 · Rating: 0

TCU Computer Science
Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 10163 - Posted: 29 Jan 2006, 2:40:10 UTC - in response to Message 9886.  

Yes, I know about that. I have the preferences set to keep the app in memory and the "pausing" messages say that it is being kept in memory. So far, this is the only stuck WU that I have encountered.

ID: 10163 · Rating: 0

grnarrow
Joined: 27 Jan 06
Posts: 1
Credit: 138
RAC: 0
Message 10164 - Posted: 29 Jan 2006, 2:42:31 UTC

1/27/2006 8:58:29 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1hz6_286_968_0 ( - exit code -1073741819 (0xc0000005))
1/27/2006 11:00:02 PM|rosetta@home|Unrecoverable error for result 1b72_bar1821_288_1253_0 ( - exit code -1073741819 (0xc0000005))
1/28/2006 7:16:50 AM|rosetta@home|Unrecoverable error for result INCREASE_CYCLES_10_1mky_226_3068_1 ( - exit code -1073741819 (0xc0000005))
1/28/2006 9:32:34 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1b72_286_4901_0 ( - exit code -1073741819 (0xc0000005))
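
For readers puzzling over the two numbers in these messages: the negative decimal exit code and the hexadecimal value in parentheses are the same 32-bit number, and 0xC0000005 is the Windows status code for an access violation. The sketch below shows the conversion and pulls the result name and exit code out of a message line. The regular expression is an assumption based only on the lines quoted in this thread, not on any documented BOINC log format, and `to_unsigned32` and `parse_error_line` are illustrative names.

```python
import re

# Pattern inferred from the messages quoted in this thread (an assumption,
# not an official BOINC log specification).
ERROR_RE = re.compile(
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\(.*exit code (?P<code>-?\d+) \((?P<hex>0x[0-9a-fA-F]+)\)\)"
)


def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def parse_error_line(line):
    """Return (result_name, exit_code) or None if the line does not match."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    return match.group("result"), int(match.group("code"))


if __name__ == "__main__":
    line = (
        "1/27/2006 8:58:29 PM|rosetta@home|Unrecoverable error for result "
        "NO_SIM_ANNEAL_BARCODE_30_1hz6_286_968_0 "
        "( - exit code -1073741819 (0xc0000005))"
    )
    name, code = parse_error_line(line)
    print(name, hex(to_unsigned32(code)))
```

Masking with 0xFFFFFFFF is enough here because Python integers are unbounded; the mask keeps only the low 32 bits, which is how the same value is displayed in hexadecimal.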
ID: 10164 · Rating: 0

Viking69
Joined: 3 Oct 05
Posts: 20
Credit: 6,815,776
RAC: 2,337
Message 10171 - Posted: 29 Jan 2006, 6:04:06 UTC

I have a Rosetta WU that has stopped processing and will not let BOINC switch to another WU.

It is: 1/28/2006 9:55:39 PM|rosetta@home|Resuming result 17535_fullatom_ev1b0xA_.0365_0003_277_17_0 using rosetta version 481

I am sure that nothing is happening, because my CPU is 99% idle, and 4.5 hours ago it was running fine. When I suspend the WU, a WU from another BOINC project starts just fine. As soon as I resume the Rosetta WU, BOINC switches back to it and the CPU goes idle again.
Hi all you enthusiastic crunchers.....
ID: 10171 · Rating: 0

Rebel Alliance
Joined: 4 Nov 05
Posts: 50
Credit: 3,579,531
RAC: 0
Message 10218 - Posted: 30 Jan 2006, 14:14:32 UTC

Now I have a couple that refuse to upload


"1/30/2006 8:05:16 AM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/204/NO_SIM_ANNEAL_BARCODE_30_2tif_286_6287_0_0 2170 bytes != offset 0 bytes
1/30/2006 8:05:16 AM|rosetta@home|Temporarily failed upload of NO_SIM_ANNEAL_BARCODE_30_2tif_286_6287_0_0: transient upload error
1/30/2006 8:05:16 AM|rosetta@home|Backing off 1 hours, 44 minutes, and 28 seconds on upload of file NO_SIM_ANNEAL_BARCODE_30_2tif_286_6287_0_0
1/30/2006 8:06:30 AM|rosetta@home|Started upload of OMEGA_WT_1.0_1n0u_282_11548_0_0
1/30/2006 8:06:32 AM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/e3/OMEGA_WT_1.0_1n0u_282_11548_0_0 1947 bytes != offset 0 bytes
1/30/2006 8:06:32 AM|rosetta@home|Temporarily failed upload of OMEGA_WT_1.0_1n0u_282_11548_0_0: transient upload error"
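
Anyone collecting these upload failures from a long message log may find it easier to extract the numbers than to read them by eye. The sketch below is a minimal example under stated assumptions: the two patterns are inferred only from the messages quoted above (the "N bytes != offset M bytes" pair and the "Backing off ..." duration), and `backoff_seconds` and `length_mismatch` are illustrative names, not part of BOINC.

```python
import re

UNIT_SECONDS = {"hour": 3600, "minute": 60, "second": 1}


def backoff_seconds(message):
    """Return the back-off delay in seconds, or None if the message has none."""
    _, found, tail = message.partition("Backing off ")
    if not found:
        return None
    # Drop the file name so digits inside it are never counted.
    duration = tail.split(" on upload of file")[0]
    total = 0
    for amount, unit in re.findall(r"(\d+) (hour|minute|second)s?", duration):
        total += int(amount) * UNIT_SECONDS[unit]
    return total


def length_mismatch(message):
    """Return (file_length, offset) from a file-length error, or None."""
    match = re.search(r"(\d+) bytes != offset (\d+) bytes", message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    msg = ("Backing off 1 hours, 44 minutes, and 28 seconds on upload of file "
           "NO_SIM_ANNEAL_BARCODE_30_2tif_286_6287_0_0")
    print(backoff_seconds(msg))
```

Both functions return None for unrelated lines, so they can be applied to every line of a log without pre-filtering.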
ID: 10218 · Rating: 0

pisi78
Joined: 25 Oct 05
Posts: 2
Credit: 199,062
RAC: 0
Message 10224 - Posted: 30 Jan 2006, 17:12:45 UTC
Last modified: 30 Jan 2006, 17:13:38 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=6318786

30/01/2006 17.17.21|rosetta@home|Started download of t000_.fasta
30/01/2006 17.17.21|rosetta@home|Checksum or signature error for t000.pdb
30/01/2006 17.17.22|rosetta@home|Unrecoverable error for result 17535_looprlx_round1_ev1b4fA_.0163_0001_273_33_1 (WU download error: couldn't get input files:<file_xfer_error> <file_name>t000.pdb</file_name> <error_code>-200</error_code> <error_message></error_message></file_xfer_error>)
30/01/2006 17.17.22|rosetta@home|Finished download of t000_.fasta


I got this error.
ID: 10224 · Rating: 0

JDHalter
Joined: 3 Nov 05
Posts: 13
Credit: 722,679
RAC: 0
Message 10225 - Posted: 30 Jan 2006, 17:25:00 UTC
Last modified: 30 Jan 2006, 17:27:05 UTC

I had a work unit run at 1% for ~ 54 hrs over this past weekend. I aborted the work-unit this morning when I found it. The command and random seed # from the slots directory/stdout.txt file are below.

command executed: projects/boinc.bakerlab.org_rosetta/rosetta_4.81_windows_intelx86.exe aa 2reb _ -abrelax -stringent_relax -more_relax_cycles -relax_score_filter -output_chi_silent -vary_omega -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -new_centroid_packing -barcode_from_fragments_length 30 -ssblocks -barcode_mode 3 -omega_weight 1.0 -jitter_frag -jitter_variation gauss -max_frags 400 -output_silent_gz -paths frags400.txt -filter1 -90 -filter2 -115 -nstruct 10

# =====================================
# random seed: 500481
# =====================================

I'm running Win2k w/ Boinc set-up as a service (no graphics) to run Rosetta only.

Hopefully this may help to solve the issue of 1% error...???

JDHalter
ID: 10225 · Rating: 0

Marie Lucie
Joined: 9 Dec 05
Posts: 5
Credit: 40,616
RAC: 0
Message 10260 - Posted: 31 Jan 2006, 13:23:19 UTC

This work unit has been stuck at 1% for 67 hours. I will abort it now.

NO_SIM_ANNEAL_BARCODE_30_2reb_283_4989_0


ID: 10260 · Rating: 0

JDHalter
Joined: 3 Nov 05
Posts: 13
Credit: 722,679
RAC: 0
Message 10285 - Posted: 31 Jan 2006, 21:23:29 UTC
Last modified: 31 Jan 2006, 21:27:14 UTC

It appears that the 1% bug has been found on at least three separate 2reb work units!

Assuming that the issue is somewhere between Rosetta and BOINC (earlier data from Dr. Kim showed that Rosetta, run independently of BOINC, completed the command and seed of a work unit that had failed under BOINC), something unique to the 2reb work units appears to trigger the 1% bug.

I aborted a second 2reb work unit on a different machine that had been stuck at 1% for 9+ hrs, but I forgot to get its stdout.txt file before aborting it, so I didn't post it here.

That makes two 2reb work units from me that failed with the 1% bug, and one 2reb work unit from Marie Lucie.

Three scenarios could account for this: 1) there is a problem with the 2reb work unit files generated by Bakerlab; 2) a problem arises when compressing/uncompressing the 2reb work units (something that would not be found in testing at Bakerlab, as they don't have to compress/uncompress the 2reb files); or 3) something in the 2reb work unit commands (that is not in the commands of work units that don't fail) triggers the 1% error.
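
The third possibility can be checked mechanically if the commands from stdout.txt are collected for both stuck and completed work units. The sketch below is one hedged way to do that: it lists the flags that appear in every stuck command and in no completed one. The helper names are illustrative, and the two commands in the example are shortened, made-up stand-ins, not real work unit commands.

```python
import shlex


def is_number(token):
    """True for tokens such as -90 that are values, not flags."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def flags(command):
    """Return the set of dash-prefixed flags in a command line."""
    return {
        token
        for token in shlex.split(command)
        if token.startswith("-") and not is_number(token)
    }


def suspect_flags(failing, passing):
    """Flags present in every failing command and absent from all passing ones.

    `failing` must contain at least one command.
    """
    in_all_failing = set.intersection(*(flags(c) for c in failing))
    in_any_passing = set().union(*(flags(c) for c in passing))
    return in_all_failing - in_any_passing


if __name__ == "__main__":
    # Hypothetical, shortened commands for illustration only.
    stuck = ["rosetta.exe aa 2reb _ -abrelax -barcode_mode 3 -filter1 -90"]
    completed = ["rosetta.exe aa 1di2 _ -abrelax -filter1 -90"]
    print(sorted(suspect_flags(stuck, completed)))
```

An empty result would argue against the third possibility, while a non-empty one only narrows the search; it does not show that a flag causes the hang.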

JDHalter

ID: 10285 · Rating: 0

Insidious
Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 10287 - Posted: 31 Jan 2006, 22:19:49 UTC
Last modified: 31 Jan 2006, 22:20:31 UTC

I had to abort the transfer of a completed WU that would not upload. It just kept giving an error message about "file length". After several hours of retries, I aborted it.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=5714903

-Sid

(edited to fix linky)

Proudly crunching with TeAm Anandtech
ID: 10287 · Rating: 0

SafeAggie
Joined: 22 Oct 05
Posts: 3
Credit: 458,414
RAC: 0
Message 10294 - Posted: 1 Feb 2006, 2:10:01 UTC

Unrecoverable error for result INCREASE_CYCLES_10_1n0u_208_56_4 (Incorrect function. (0x1) - exit code 1 (0x1))
ID: 10294 · Rating: 0

gpcola
Joined: 31 Dec 05
Posts: 8
Credit: 361,118
RAC: 0
Message 10314 - Posted: 1 Feb 2006, 13:16:20 UTC
Last modified: 1 Feb 2006, 13:17:10 UTC

Another failed WU with 'Maximum CPU time exceeded' error - this one ran for 39hrs on a P3 before failing... :/

https://boinc.bakerlab.org/rosetta/result.php?resultid=7409788
ID: 10314 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org