Report stuck & aborted WU here please

Message boards : Number crunching : Report stuck & aborted WU here please

sharder8
Joined: 2 Feb 06
Posts: 7
Credit: 15,648,378
RAC: 0
Message 12346 - Posted: 20 Mar 2006, 19:31:00 UTC

Someone may want to take a look at the results on this one as well, there's plenty of them and it isn't the 1% "stuck" problem.

https://boinc.bakerlab.org/rosetta/results.php?hostid=181476

I've stopped Rosetta on this machine, as it would run through a ton of jobs and client error them until it gave the message "daily quota met".

Harder
ID: 12346 · Rating: 0

dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 12353 - Posted: 20 Mar 2006, 21:59:34 UTC - in response to Message 8786.  


I don't know exactly what is going on. For each work unit, we now have close to the targeted 10,000 successful completions, so there are clearly no systematic errors affecting all instances of a WU. I would love to know how many failures of the sort you had there have been. It is possible that for certain random number seeds very rare Rosetta bugs are encountered; this would have to be at less than 1 in 100, since we don't see them in our in-house tests. So, question: what fraction of your WUs have this problem?

We can search for Rosetta bugs by starting runs in house with the random number seed and command line from your run. We are doing this now.



David,

Many moons ago, in a different thread, you gave instructions for manually restarting a "1% stuck" WU. I've got one on one of my systems. Do you want me to restart it, and is there anything else that I can do to help identify the problem? Things like taking a snapshot of the rosetta and slots/0 folders, zipping it up and making it available for you to download, or anything else that might help.
ID: 12353 · Rating: 0

Team_Elteor_Borislavj~Intelligence
Joined: 7 Dec 05
Posts: 14
Credit: 56,027
RAC: 0
Message 12358 - Posted: 20 Mar 2006, 23:14:47 UTC

HB_BARCODE_30_1bk2__351_7729_0 is stuck! After 5 hours of crunching still at 1% :(

ID: 12358 · Rating: 0

dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 12360 - Posted: 20 Mar 2006, 23:16:27 UTC

These HB_BARCODE_30 were stuck in a slot for ~24 hours without progressing. Some had ~25min CPU time; at least one had ~55min CPU.

It's really annoying that I lost around 4-cpu-days of work because of these four.

dag

https://boinc.bakerlab.org/rosetta/result.php?resultid=14185477
https://boinc.bakerlab.org/rosetta/result.php?resultid=14184543
https://boinc.bakerlab.org/rosetta/result.php?resultid=14100307
https://boinc.bakerlab.org/rosetta/result.php?resultid=14099752
dag
--Finding aliens is cool, but understanding the structure of proteins is useful.
ID: 12360 · Rating: 1

Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 12468 - Posted: 21 Mar 2006, 22:12:55 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=14068243

Rather strange : I did not, repeat, did not abort it myself.
Didn't touch the machine; it runs on its own.
Had more of these and still don't know what happens.

ID: 12468 · Rating: 0

Team TMR
Joined: 2 Nov 05
Posts: 21
Credit: 1,583,679
RAC: 0
Message 12497 - Posted: 22 Mar 2006, 8:47:15 UTC
Last modified: 22 Mar 2006, 8:51:25 UTC

Another one, aborted after 16 hours stuck on 1%

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11717839

This one was a bit different - it was stuck on 30.19% after 8 hours. After restarting BOINC, it reset back to 38 mins CPU time and 30.19% and got stuck again.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11743501

It's getting increasingly frustrating having to babysit this project all the time. Fingers crossed for those working on a fix.
ID: 12497 · Rating: 0

Team_Elteor_Borislavj~Intelligence
Joined: 7 Dec 05
Posts: 14
Credit: 56,027
RAC: 0
Message 12502 - Posted: 22 Mar 2006, 10:13:45 UTC

I'm experiencing a lot of stuck WUs with FA_RLX****
I'm now at the point where, if a WU is at 1 percent after 1 hour, I'm manually aborting it... I want credit for my CPU time :(
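
The rule of thumb above (abort when progress has sat at the same value for about an hour) can be sketched as a small check. This is only an illustration in Python: the `Sample` type, the `looks_stuck` name, and the one-hour and 0.01% thresholds are assumptions made up for the example, not part of BOINC or Rosetta, and the progress samples would have to be noted by hand or by a script of your own.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    wall_seconds: float  # wall-clock seconds since the task started
    progress_pct: float  # percent complete as shown in the manager


def looks_stuck(samples, min_stall_seconds=3600.0, tolerance_pct=0.01):
    """Return True if progress has been flat for at least min_stall_seconds."""
    if len(samples) < 2:
        return False
    latest = samples[-1]
    # Walk backwards to find the earliest sample with the same progress value.
    stall_start = latest.wall_seconds
    for sample in reversed(samples[:-1]):
        if abs(sample.progress_pct - latest.progress_pct) > tolerance_pct:
            break
        stall_start = sample.wall_seconds
    return latest.wall_seconds - stall_start >= min_stall_seconds


# Hypothetical readings: 1% reached after 10 minutes, still 1% an hour later.
readings = [Sample(0, 0.0), Sample(600, 1.0), Sample(4200, 1.0)]
print(looks_stuck(readings))
```

Wall-clock time is used rather than CPU time because several reports in this thread mention the CPU time counter being reset after a BOINC restart.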
ID: 12502 · Rating: 0

bruce boytler
Joined: 17 Sep 05
Posts: 68
Credit: 3,565,442
RAC: 0
Message 12522 - Posted: 22 Mar 2006, 17:44:29 UTC

This unit stuck for 9 hours:

FA_RLXpt_hom003_1ptq__361_234

Brought up the graphics screen (I don't normally run any graphics) and it was all frozen, except that the CPU clock was still counting.

Resetting BOINC did no good. Ended up aborting it.

Out of 183 results, I have 4 errors of the frozen or 1-to-15-percent type.

Cheers all!!!
ID: 12522 · Rating: 0

mr.kjellen
Joined: 5 Dec 05
Posts: 3
Credit: 1,226,674
RAC: 0
Message 12525 - Posted: 22 Mar 2006, 19:51:52 UTC

This one stuck at one percent.

HBLR_1.0_1dtj_332_2576

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=9848871

Had it crunching for about 350,000 sec / 1,500 credits :( Aborted it. Seems someone did crunch it eventually. No luck for me though.
/anton
ID: 12525 · Rating: 0

Tim Myers
Joined: 11 Jan 06
Posts: 2
Credit: 10,903,965
RAC: 0
Message 12531 - Posted: 22 Mar 2006, 22:53:14 UTC

I've got a job stuck at 40.06% running for 10 hours.

FA_RLXpt_hom002_1ptq_361_370_0

I only run SETI@home other than this, 50/50. This is the first one stuck since I started running Rosetta in January. I had a lot abort, but that stopped when I turned off the screen saver and left the jobs in memory.

So how do I kill this thing?
ID: 12531 · Rating: 0

GimpyOne
Joined: 13 Dec 05
Posts: 2
Credit: 40,123
RAC: 0
Message 12535 - Posted: 22 Mar 2006, 23:46:27 UTC

I just noticed this guy stuck for 29 hours:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11741001

I have restarted BOINC and it appears to be running again. I'll keep an eye on it. Interestingly enough, when I restarted, BOINC reset the CPU time from 29 hrs to 38 min. I used the CPU time on the graphic to judge: it wasn't moving the first time and now it is.

I'll post back in a few hours and let you know if it finished.

ID: 12535 · Rating: 0

GimpyOne
Joined: 13 Dec 05
Posts: 2
Credit: 40,123
RAC: 0
Message 12545 - Posted: 23 Mar 2006, 3:21:42 UTC - in response to Message 12535.  

I just noticed this guy stuck for 29 hours:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11741001

I have restarted BOINC and it appears to be running again. I'll keep an eye on it. Interestingly enough, when I restarted, BOINC reset the CPU time from 29 hrs to 38 min. I used the CPU time on the graphic to judge: it wasn't moving the first time and now it is.

I'll post back in a few hours and let you know if it finished.



Well, it stalled three more times and I finally aborted it. It would restart when I exited and restarted BOINC, then freeze again after 20-30%.
ID: 12545 · Rating: 0

dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 12547 - Posted: 23 Mar 2006, 4:48:31 UTC

Got one stuck here:

FA_RLXbk_hom010_1bk2__359_406_0

This is the first of these I've had in a long, long time. I'm going to restart it (reboot the machine) and I'll post here what happens.
ID: 12547 · Rating: 0

Bo-Arne
Joined: 16 Dec 05
Posts: 2
Credit: 2,850,415
RAC: 0
Message 12557 - Posted: 23 Mar 2006, 7:37:07 UTC

I had to abort FA_RLXpt_hom006_1ptq_361_427. Restarted it three times, but it got stuck at 23.73%, Model 3, Step 332516, just a few minutes after entering full atom relax.
ID: 12557 · Rating: 0

MAOJC
Joined: 19 Jan 06
Posts: 15
Credit: 2,727,567
RAC: 0
Message 12565 - Posted: 23 Mar 2006, 12:14:43 UTC
Last modified: 23 Mar 2006, 12:20:09 UTC

FA_RLXpt-hom006_1ptq_361_120_0
FA_RLXpt-hom004_1ptq_361_120_0

Both of the above WUs were stuck at 10+ hours with 1+ days to complete, at ~14-15% completion, on a Linux dual-core Opteron. Aborting the first WU with BOINC Manager caused BOINC to segfault and dump core, and left a running Rosetta process with the gui_rpc port still bound. Had to hard-kill the running Rosetta process.

Here is the killed process from a ps -ef; note the 7200 cpu_run_time parameter.

XXXXXX 23730 1 12 04:50 pts/0 00:00:30 rosetta_4.81_i686-pc-linux-gnu xx 1ptq _ -output_silent_gz -silent -increase_cycles 10 -relax_score_filter -new_centroid_packing -abrelax -output_chi_silent -stringent_relax -vary_omega -omega_weight 0.5 -farlx -ex1 -ex2 -short_range_hb_weight 0.50 -long_range_hb_weight 1.0 -no_filters -nstruct 10 -protein_name_prefix hom008_ -frags_name_prefix hom008_ -filter1 -45 -filter2 -55 -termini -cpu_run_time 7200 -constant_seed -jran 2482751
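
For anyone who wants to pull the PID and the flags out of a ps -ef line like the one above, here is a rough Python sketch. The function names are made up for this example, and the flag parsing is only a heuristic: it treats the next token as a flag's value when that token does not start with "-" or when it is a number such as -45. It has no knowledge of which Rosetta flags really take a value.

```python
def parse_ps_line(line):
    """Split one `ps -ef` line into (pid, cpu_time, argv).

    Assumes the usual eight columns: UID PID PPID C STIME TTY TIME CMD.
    """
    uid, pid, ppid, c, stime, tty, cpu_time, cmd = line.split(None, 7)
    return int(pid), cpu_time, cmd.split()


def _is_number(token):
    try:
        float(token)
        return True
    except ValueError:
        return False


def flag_values(argv):
    """Map each '-flag' to the value that follows it, or None if it has none."""
    flags = {}
    i = 0
    while i < len(argv):
        token = argv[i]
        if token.startswith("-") and not _is_number(token):
            nxt = argv[i + 1] if i + 1 < len(argv) else None
            if nxt is not None and (not nxt.startswith("-") or _is_number(nxt)):
                flags[token] = nxt
                i += 2
                continue
            flags[token] = None
        i += 1
    return flags


# Shortened copy of the line posted above.
line = ("XXXXXX 23730 1 12 04:50 pts/0 00:00:30 rosetta_4.81_i686-pc-linux-gnu "
        "xx 1ptq _ -silent -increase_cycles 10 -filter1 -45 "
        "-cpu_run_time 7200 -constant_seed -jran 2482751")
pid, cpu_time, argv = parse_ps_line(line)
flags = flag_values(argv)
print(pid, cpu_time, flags["-cpu_run_time"], flags["-jran"])
```

The -jran value is the random number seed, which is what the project team said they need, together with the command line, to rerun a failing case in house.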

ID: 12565 · Rating: 0

bulrush
Joined: 14 Mar 06
Posts: 3
Credit: 186,848
RAC: 0
Message 12566 - Posted: 23 Mar 2006, 13:53:47 UTC
Last modified: 23 Mar 2006, 13:54:43 UTC

I am also getting an unrecoverable error from work units which begin with FA_RLX. Here is the error:
3/23/2006 8:00:40 AM|rosetta@home|Unrecoverable error for result FA_RLXvl_hom007_1vls__362_326_0 ( - exit code -1073741811 (0xc000000d))
3/23/2006 8:00:42 AM||request_reschedule_cpus: process exited
3/23/2006 8:00:42 AM|rosetta@home|Computation for result FA_RLXvl_hom007_1vls__362_326_0 finished
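
A note on the exit code in the log above: the decimal number and the hex value in brackets are the same 32 bits, read once as signed and once as unsigned. On Windows, 0xC000000D is the NTSTATUS value STATUS_INVALID_PARAMETER. That names the kind of failure but says nothing about where in Rosetta it came from. A minimal Python sketch of the conversion follows; the function names and the two-entry lookup table are illustrative only.

```python
# Names taken from the Windows ntstatus.h header. The table is deliberately
# tiny and only covers two well-known values.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}


def to_unsigned32(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return exit_code & 0xFFFFFFFF


def describe_exit_code(exit_code):
    value = to_unsigned32(exit_code)
    name = KNOWN_NTSTATUS.get(value, "unknown")
    return f"0x{value:08x} ({name})"


print(describe_exit_code(-1073741811))
```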


Computer is brand new, set up 2 weeks ago. Here is my config as reported by Editq, a text editor:

====================================================
3/23/2006 (System Information from Editquick)
OS name: WinXP
OS version: 5.1.2600 WinXP, full 32-bit Service Pack 2

Free RAM: 546mb
Total RAM: 1,023mb

Disk info
=========
Sectors Per Cluster:8
Bytes Per Sector: 512
Free Clusters: 57,217,680
Total Clusters: 61,046,992
Free mBytes: 223,506
Total mBytes: 238,464

CPU Vendor: AuthenticAMD
CPU Mfr: AMD
CPU Speed (mhz): 2395/2400
CPU type: 586
OEMID: 0
Number of CPUs: 1 (wrong, there are actually 2 CPUs)
ProcessorType: 586 (15)
ProcessorRevision: 9473

BIOS name:
BIOS date: 9/16/2005
BIOS copyright:
Bios Extended info:
IP Address: 192.168.1.129

DirectX version: 4.09.00.0904
Bits per pixel: 32
Display resolution: 1152 x 864

Registry info
=============
Video driver desc: NVIDIA Quadro FX 1400
Video Driver date: 11-4-2005
Video Driver version: 8.1.6.7
System bios date: 09/16/05
System bios version: HP - 20050916
Video bios date: 05/10/05
Video bios version: Version 5.41.02.43.03
Video driver file: nv4_disp.dll
====================================================

When I first set up Rosetta last week it worked fine with no errors. As soon as I started getting FA_RLX work units, I started getting errors. At this point about 40% of the WUs which start with FA_RLX get an error and abort.

ID: 12566 · Rating: 0

bulrush
Joined: 14 Mar 06
Posts: 3
Credit: 186,848
RAC: 0
Message 12567 - Posted: 23 Mar 2006, 14:04:12 UTC

Here are the 3 errors I found in my log. Perhaps I was a bit hasty in saying I had a 40% error rate for FA_RLX. All 3 errors appear to have the same exit code. It's probably around 10%.

3/21/2006 1:24:16 PM|rosetta@home|Unrecoverable error for result FA_RLXpg_hom016_1pgx__361_292_0 ( - exit code -1073741811 (0xc000000d))
3/21/2006 1:24:18 PM||request_reschedule_cpus: process exited
3/21/2006 1:24:18 PM|rosetta@home|Computation for result FA_RLXpg_hom016_1pgx__361_292_0 finished


3/22/2006 7:57:24 AM|rosetta@home|Unrecoverable error for result FA_RLXti_hom005_1tit__362_317_0 ( - exit code -1073741811 (0xc000000d))
3/22/2006 7:57:26 AM||request_reschedule_cpus: process exited
3/22/2006 7:57:26 AM|rosetta@home|Computation for result FA_RLXti_hom005_1tit__362_317_0 finished


3/23/2006 8:00:40 AM|rosetta@home|Unrecoverable error for result FA_RLXvl_hom007_1vls__362_326_0 ( - exit code -1073741811 (0xc000000d))
3/23/2006 8:00:42 AM||request_reschedule_cpus: process exited
3/23/2006 8:00:42 AM|rosetta@home|Computation for result FA_RLXvl_hom007_1vls__362_326_0 finished
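
To get an actual error fraction instead of an estimate, the log lines can be tallied. The sketch below is in Python and assumes the "timestamp|project|message" layout and the exact message wording seen in the lines above; the `summarize` function and the regular expressions are written for this example and are not part of BOINC.

```python
import re
from collections import Counter

ERROR_RE = re.compile(
    r"Unrecoverable error for result (\S+) "
    r"\( - exit code (-?\d+) \((0x[0-9a-fA-F]+)\)\)"
)
FINISHED_RE = re.compile(r"Computation for result (\S+) finished")


def summarize(lines):
    """Return (finished result names, {failed name: exit code}, Counter of codes)."""
    finished = set()
    failed = {}
    for line in lines:
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue  # not a "timestamp|project|message" line
        message = parts[2]
        match = ERROR_RE.search(message)
        if match:
            failed[match.group(1)] = int(match.group(2))
            continue
        match = FINISHED_RE.search(message)
        if match:
            finished.add(match.group(1))
    return finished, failed, Counter(failed.values())


def error_fraction(finished, failed):
    """Failed results as a fraction of all results that finished."""
    return len(failed) / len(finished) if finished else 0.0
```

Feeding the whole message log through `summarize` and then calling `error_fraction` gives the share of finished results that ended with an unrecoverable error, and the counter shows whether they all share one exit code.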


====================================================
3/23/2006 (System Information from Editquick)
OS name: WinXP
OS version: 5.1.2600 WinNT, full 32-bit Service Pack 2

Free RAM: 548mb
Total RAM: 1,023mb

Disk info
=========
Sectors Per Cluster:8
Bytes Per Sector: 512
Free Clusters: 56,948,933
Total Clusters: 61,046,992
Free mBytes: 222,456
Total mBytes: 238,464

CPU Vendor: AuthenticAMD
CPU Mfr: AMD
CPU Speed (mhz): 2395/2400
CPU type: 586
OEMID: 0
Number of CPUs: 1 (wrong, actually 2 CPUs)
ProcessorType: 586 (15)
ProcessorRevision: 9473

BIOS name:
BIOS date: 9/16/2005
BIOS copyright:
Bios Extended info:
IP Address: 192.168.1.129

DirectX version: 4.09.00.0904
Bits per pixel: 32
Display resolution: 1152 x 864

Registry info
=============
Video driver desc: NVIDIA Quadro FX 1400
Video Driver date: 11-4-2005
Video Driver version: 8.1.6.7
System bios date: 09/16/05
System bios version: HP - 20050916
Video bios date: 05/10/05
Video bios version: Version 5.41.02.43.03
Video driver file: nv4_disp.dll
====================================================

ID: 12567 · Rating: 0

bulrush
Joined: 14 Mar 06
Posts: 3
Credit: 186,848
RAC: 0
Message 12568 - Posted: 23 Mar 2006, 14:06:06 UTC

Everyone having WUs stuck or their PC crash on a WU that begins with FA_RLX, please report your PC configuration, including number of CPUs. I wonder if this is a problem with dual CPU AMD chips.

ID: 12568 · Rating: 0

MarkL.
Joined: 3 Dec 05
Posts: 3
Credit: 2,920
RAC: 0
Message 12585 - Posted: 23 Mar 2006, 21:31:13 UTC - in response to Message 12568.  


Mark L.

ID: 12585 · Rating: 0

Mike Gelvin
Send message
Joined: 7 Oct 05
Posts: 65
Credit: 10,612,039
RAC: 0
Message 12586 - Posted: 23 Mar 2006, 22:29:16 UTC - in response to Message 12568.  

Everyone having WUs stuck or their PC crash on a WU that begins with FA_RLX, please report your PC configuration, including number of CPUs. I wonder if this is a problem with dual CPU AMD chips.


I had one this morning I aborted:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11730796

Single-CPU Intel 2.4 GHz, 1 GB memory, XP Pro SP2



ID: 12586 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org