Message boards : Number crunching : Report stuck & aborted WU here please
Author | Message |
---|---|
Lee Carre Joined: 6 Oct 05 Posts: 96 Credit: 79,331 RAC: 0 |
"I have a result that hasn't failed or anything yet, but has been going for about 7 hours at 0%" My WU completed successfully (it took 27.73 hours) and was valid, so please ignore my previous post. |
TCU Computer Science Joined: 7 Dec 05 Posts: 28 Credit: 12,861,977 RAC: 0 |
I just noticed that on one of my computers NEW_SOFT_CENTROID_PACKING_1di2_225_7586_0 has been running since 6 January. boincmgr shows CPU Time 01:10:08 Progress 20% To completion 05:01:08 but the messages show about 120 one-hour slices spent in execution. The "pausing" messages show that it is being left in memory. |
BigMike Joined: 2 Nov 05 Posts: 1 Credit: 10,600 RAC: 0 |
I've been running BOINC R@H 4.81 on my WinXP Pro laptop, and I think I see a correlation between the 0xc0000005 failures and putting my laptop into hibernation. When I power it up and resume, the WU I was crunching gets an 0xc0000005, and it starts a new one. Has anyone else seen this? BigMike ------ What doesn't kill you still requires a co-pay. |
[B^S] sTrey Joined: 25 Sep 05 Posts: 16 Credit: 15,524 RAC: 0 |
Not sure if you want errors reported here, but this is the first errored-out WU I've had in a long time, and I see it wasn't just my result. 6543036 2006-01-25 11:48:42 [rosetta@home] Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_RLX_NATIVE_1mky_281_29_2 (Incorrect function. (0x1) - exit code 1 (0x1)) |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
"I just noticed that on one of my computers ..." This is a known R@H bug. To prevent the problem, do the following: in your user preferences, set the time between application switches (or swaps) to something close to 2 hours (120 min). That is usually enough to keep things going, but if you want to be really certain, set the system so that it keeps the R@H application in memory during application swaps. There is more about this in the FAQ sticky here. Moderator9 ROSETTA@home FAQ Moderator Contact |
carl.h Joined: 28 Dec 05 Posts: 555 Credit: 183,449 RAC: 0 |
1/27/2006 14:22:09|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_2ci2I_250_2374_0 ( - exit code -1073741819 (0xc0000005)) 3 minutes 17 seconds Not all Czech`s bounce but I`d like to try with Barbar ;-) Make no mistake This IS the TEDDIES TEAM. |
Roland Windsor Vincent Joined: 6 Jan 06 Posts: 12 Credit: 730 RAC: 0 |
I'm just jumping in here because I couldn't find any thread directly on point. I've been having WUs report a "client error" after they appeared to have run successfully. Aside from the wasted time, no credit is given. What's up? Gravity is just a scientific theory. We should also teach the religious view that God is pushing us down. |
Rebel Alliance Joined: 4 Nov 05 Posts: 50 Credit: 3,579,531 RAC: 0 |
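NO_SIM_ANNEAL_BARCODE_30_2reb_278_2151: sent 25 Jan 2006 4:40:09 UTC; reported 28 Jan 2006 5:25:43 UTC; server state Over; outcome Client error; client state Computing; CPU time 260,224.27 sec; claimed credit 1,133.72; granted credit --- and OMEGA_WT_1.0_2reb_275_2901: sent 24 Jan 2006 11:04:19 UTC; reported 28 Jan 2006 5:26:21 UTC; server state Over; outcome Client error; client state Computing; CPU time 224,124.79 sec; claimed credit 883.46. Two different machines. Had to abort both. |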
Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
I had to abort one today: https://boinc.bakerlab.org/rosetta/result.php?resultid=8293849 -Sid (I'm also seeing several erroring out in the first 20 or 30 seconds.. too many to link them all here) Proudly crunching with TeAm Anandtech |
TCU Computer Science Joined: 7 Dec 05 Posts: 28 Credit: 12,861,977 RAC: 0 |
"I just noticed that on one of my computers ..." Yes, I know about that. I have the preferences set to keep the app in memory and the "pausing" messages say that it is being kept in memory. So far, this is the only stuck WU that I have encountered. |
grnarrow Joined: 27 Jan 06 Posts: 1 Credit: 138 RAC: 0 |
1/27/2006 8:58:29 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1hz6_286_968_0 ( - exit code -1073741819 (0xc0000005)) 1/27/2006 11:00:02 PM|rosetta@home|Unrecoverable error for result 1b72_bar1821_288_1253_0 ( - exit code -1073741819 (0xc0000005)) 1/28/2006 7:16:50 AM|rosetta@home|Unrecoverable error for result INCREASE_CYCLES_10_1mky_226_3068_1 ( - exit code -1073741819 (0xc0000005)) 1/28/2006 9:32:34 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1b72_286_4901_0 ( - exit code -1073741819 (0xc0000005)) |
Viking69 Joined: 3 Oct 05 Posts: 20 Credit: 6,815,776 RAC: 2,337 |
I have a Rosetta WU that has stopped processing and has not allowed BOINC to switch to another WU. It is 1/28/2006 9:55:39 PM|rosetta@home|Resuming result 17535_fullatom_ev1b0xA_.0365_0003_277_17_0 using rosetta version 481 I am sure that nothing is happening because my CPU is at 99% idle, and it was running fine 4.5 hours ago. When I suspend the WU, another BOINC WU from another study starts just fine. As soon as I resume the Rosetta WU, it switches back and the CPU goes idle again. Hi all you enthusiastic crunchers..... |
Rebel Alliance Joined: 4 Nov 05 Posts: 50 Credit: 3,579,531 RAC: 0 |
Now I have a couple that refuse to upload "1/30/2006 8:05:16 AM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/204/NO_SIM_ANNEAL_BARCODE_30_2tif_286_6287_0_0 2170 bytes != offset 0 bytes 1/30/2006 8:05:16 AM|rosetta@home|Temporarily failed upload of NO_SIM_ANNEAL_BARCODE_30_2tif_286_6287_0_0: transient upload error 1/30/2006 8:05:16 AM|rosetta@home|Backing off 1 hours, 44 minutes, and 28 seconds on upload of file NO_SIM_ANNEAL_BARCODE_30_2tif_286_6287_0_0 1/30/2006 8:06:30 AM|rosetta@home|Started upload of OMEGA_WT_1.0_1n0u_282_11548_0_0 1/30/2006 8:06:32 AM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/e3/OMEGA_WT_1.0_1n0u_282_11548_0_0 1947 bytes != offset 0 bytes 1/30/2006 8:06:32 AM|rosetta@home|Temporarily failed upload of OMEGA_WT_1.0_1n0u_282_11548_0_0: transient upload error" |
pisi78 Joined: 25 Oct 05 Posts: 2 Credit: 199,062 RAC: 0 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=6318786 30/01/2006 17.17.21|rosetta@home|Started download of t000_.fasta 30/01/2006 17.17.21|rosetta@home|Checksum or signature error for t000.pdb 30/01/2006 17.17.22|rosetta@home|Unrecoverable error for result 17535_looprlx_round1_ev1b4fA_.0163_0001_273_33_1 (WU download error: couldn't get input files:<file_xfer_error> <file_name>t000.pdb</file_name> <error_code>-200</error_code> <error_message></error_message></file_xfer_error>) 30/01/2006 17.17.22|rosetta@home|Finished download of t000_.fasta I have got this error. |
JDHalter Joined: 3 Nov 05 Posts: 13 Credit: 722,679 RAC: 0 |
I had a work unit run at 1% for ~ 54 hrs over this past weekend. I aborted the work-unit this morning when I found it. The command and random seed # from the slots directory/stdout.txt file are below. command executed: projects/boinc.bakerlab.org_rosetta/rosetta_4.81_windows_intelx86.exe aa 2reb _ -abrelax -stringent_relax -more_relax_cycles -relax_score_filter -output_chi_silent -vary_omega -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -new_centroid_packing -barcode_from_fragments_length 30 -ssblocks -barcode_mode 3 -omega_weight 1.0 -jitter_frag -jitter_variation gauss -max_frags 400 -output_silent_gz -paths frags400.txt -filter1 -90 -filter2 -115 -nstruct 10 # ===================================== # random seed: 500481 # ===================================== I'm running Win2k w/ Boinc set-up as a service (no graphics) to run Rosetta only. Hopefully this may help to solve the issue of 1% error...??? JDHalter |
Marie Lucie Joined: 9 Dec 05 Posts: 5 Credit: 40,616 RAC: 0 |
This work unit has been stuck at 1% for 67 hours. I will abort it now. NO_SIM_ANNEAL_BARCODE_30_2reb_283_4989_0 |
JDHalter Joined: 3 Nov 05 Posts: 13 Credit: 722,679 RAC: 0 |
It appears that the 1% bug was found on at least three separate 2reb work units! Assuming that the issue is somewhere between Rosetta and BOINC (previous data from Dr. Kim proved that Rosetta completed a BOINC-failed work unit command and seed without failing when run independently of BOINC), something that appears to be unique to the 2reb work units triggers the 1% bug. I aborted a second 2reb work unit on a different machine that had been stuck at 1% for 9+ hrs, but I forgot to get its stdout.txt file before aborting it, so I didn't post it here. That makes two 2reb work units that failed with the 1% bug from me, and one 2reb work unit from Marie Lucie that failed with the 1% bug. Three scenarios could account for this: 1) there is a problem with the 2reb work unit files generated by Bakerlab; 2) there is a problem that arises when compressing/uncompressing the 2reb work units (something not found when tested at Bakerlab, as they don't have to compress/uncompress the 2reb files); or 3) something unique in the 2reb work unit commands (that is not in other work unit commands that don't fail) triggers the 1% error. JDHalter |
Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
I had to abort the transfer of a completed WU that would not upload. It just kept giving an error message about "file length". After several hours of retries, I just aborted it. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=5714903 -Sid (edited to fix linky) Proudly crunching with TeAm Anandtech |
SafeAggie Joined: 22 Oct 05 Posts: 3 Credit: 458,414 RAC: 0 |
Unrecoverable error for result INCREASE_CYCLES_10_1n0u_208_56_4 (Incorrect function. (0x1) - exit code 1 (0x1)) |
gpcola Joined: 31 Dec 05 Posts: 8 Credit: 361,118 RAC: 0 |
Another failed WU with 'Maximum CPU time exceeded' error - this one ran for 39hrs on a P3 before failing... :/ https://boinc.bakerlab.org/rosetta/result.php?resultid=7409788 |
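
Several posts above quote client log lines ending in `exit code -1073741819 (0xc0000005)`. The two numbers are the same 32-bit value, shown once as a signed integer and once as unsigned hex; on Windows, 0xC0000005 is the status code for an access violation. As a rough aid for anyone collecting reports from this thread, the sketch below tallies "Unrecoverable error" lines by exit code. The sample lines are copied from the posts above; the names `to_unsigned_hex` and `tally_errors` and the regular expression are illustrative assumptions, not part of BOINC or Rosetta.

```python
import re
from collections import Counter

# Sample lines copied from posts in this thread.
LOG_LINES = [
    "1/27/2006 14:22:09|rosetta@home|Unrecoverable error for result "
    "PRODUCTION_ABINITIO_2ci2I_250_2374_0 ( - exit code -1073741819 (0xc0000005))",
    "1/27/2006 8:58:29 PM|rosetta@home|Unrecoverable error for result "
    "NO_SIM_ANNEAL_BARCODE_30_1hz6_286_968_0 ( - exit code -1073741819 (0xc0000005))",
    "Unrecoverable error for result INCREASE_CYCLES_10_1n0u_208_56_4 "
    "(Incorrect function. (0x1) - exit code 1 (0x1))",
    "1/28/2006 9:55:39 PM|rosetta@home|Resuming result "
    "17535_fullatom_ev1b0xA_.0365_0003_277_17_0 using rosetta version 481",
]

# Hypothetical pattern: result name, then the decimal exit code.
ERROR_RE = re.compile(r"Unrecoverable error for result (\S+) \(.*exit code (-?\d+)")


def to_unsigned_hex(exit_code: int) -> str:
    """Show a signed 32-bit exit code as unsigned hex."""
    return f"0x{exit_code & 0xFFFFFFFF:x}"


def tally_errors(lines):
    """Count 'Unrecoverable error' lines per exit code; other lines are skipped."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[to_unsigned_hex(int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    for code, count in tally_errors(LOG_LINES).most_common():
        print(code, count)
```

Matching on the message text rather than the timestamp keeps the sketch usable for lines pasted with or without the `date|project|` prefix, which vary between posts here.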
©2024 University of Washington
https://www.bakerlab.org