Report stuck & aborted WU here please

Message boards : Number crunching : Report stuck & aborted WU here please

Tim Myers
Joined: 11 Jan 06
Posts: 2
Credit: 10,903,965
RAC: 0
Message 12531 - Posted: 22 Mar 2006, 22:53:14 UTC

I've got a job stuck at 40.06% that has been running for 10 hours.

FA_RLXpt_hom002_1ptq_361_370_0

Other than this I only run seti@home, split 50/50. This is the first one to get stuck since I started running Rosetta in January. I had a lot abort, but that stopped when I turned off the screen saver and left the jobs in memory.

So how do I kill this thing?

GimpyOne
Joined: 13 Dec 05
Posts: 2
Credit: 40,123
RAC: 0
Message 12535 - Posted: 22 Mar 2006, 23:46:27 UTC

I just noticed this guy stuck for 29 hours:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11741001

I have restarted BOINC and it appears to be running again. I'll keep an eye on it. Interestingly enough, when I restarted, BOINC reset the CPU time from 29 hrs to 38 min. I judged by the CPU time on the graphic: it wasn't moving the first time and now it is.

I'll post back in a few hours and let you know if it finished.

GimpyOne
Joined: 13 Dec 05
Posts: 2
Credit: 40,123
RAC: 0
Message 12545 - Posted: 23 Mar 2006, 3:21:42 UTC - in response to Message 12535.  

I just noticed this guy stuck for 29 hours:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11741001

I have restarted BOINC and it appears to be running again. I'll keep an eye on it. Interestingly enough, when I restarted, BOINC reset the CPU time from 29 hrs to 38 min. I judged by the CPU time on the graphic: it wasn't moving the first time and now it is.

I'll post back in a few hours and let you know if it finished.



Well, it stalled three more times and I finally aborted it. It would restart when I exited and restarted Boinc, then freeze again after 20-30%.
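
For anyone who would rather not eyeball the graphic, the "is it still moving?" check can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `is_stalled`, the sample format, and the one-hour threshold are all made up here, and how you collect the progress samples is left open.

```python
def is_stalled(samples, min_stall_seconds=3600):
    """Return True if progress has not changed for at least min_stall_seconds.

    samples: list of (timestamp_seconds, progress_percent), oldest first.
    The one-hour default is an arbitrary choice, not a project recommendation.
    """
    if not samples:
        return False
    last_t, last_p = samples[-1]
    # Walk backwards to find the earliest sample that already showed
    # the same progress value as the most recent one.
    stall_start = last_t
    for t, p in reversed(samples):
        if p != last_p:
            break
        stall_start = t
    return last_t - stall_start >= min_stall_seconds


# Hypothetical readings taken every hour; progress freezes at 23.73%.
readings = [(0, 10.0), (3600, 23.73), (7200, 23.73), (10800, 23.73)]
print(is_stalled(readings))
```

A check like this only says that the percentage has stopped changing. It cannot tell a hung process from one that is busy inside a long step, so a generous threshold is the safer choice.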

dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 12547 - Posted: 23 Mar 2006, 4:48:31 UTC

Got one stuck here:

FA_RLXbk_hom010_1bk2__359_406_0

This is the first of these I've had in a long, long time. I'm going to restart it (reboot the machine) and I'll post here what happens.

Bo-Arne
Joined: 16 Dec 05
Posts: 2
Credit: 2,850,415
RAC: 0
Message 12557 - Posted: 23 Mar 2006, 7:37:07 UTC

I had to abort FA_RLXpt_hom006_1ptq_361_427. I restarted it three times, but it got stuck at 23.73%, Model 3, Step 332516, just a few minutes after entering full-atom relax.

MAOJC
Joined: 19 Jan 06
Posts: 15
Credit: 2,727,567
RAC: 0
Message 12565 - Posted: 23 Mar 2006, 12:14:43 UTC
Last modified: 23 Mar 2006, 12:20:09 UTC

FA_RLXpt-hom006_1ptq_361_120_0
FA_RLXpt-hom004_1ptq_361_120_0

Both of the above WUs were stuck at 10+ hours, with 1+ days to completion, at ~14-15% complete on a Linux dual-core Opteron. Aborting the first WU with BOINC Manager caused BOINC to segfault and dump core, leaving a running Rosetta process and the gui_rpc port bound. I had to hard-kill the running Rosetta process.

Here is the killed process from ps -ef; note the -cpu_run_time 7200 parameter.

XXXXXX 23730 1 12 04:50 pts/0 00:00:30 rosetta_4.81_i686-pc-linux-gnu xx 1ptq _ -output_silent_gz -silent -increase_cycles 10 -relax_score_filter -new_centroid_packing -abrelax -output_chi_silent -stringent_relax -vary_omega -omega_weight 0.5 -farlx -ex1 -ex2 -short_range_hb_weight 0.50 -long_range_hb_weight 1.0 -no_filters -nstruct 10 -protein_name_prefix hom008_ -frags_name_prefix hom008_ -filter1 -45 -filter2 -55 -termini -cpu_run_time 7200 -constant_seed -jran 2482751
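
To pull the PID and the run-time target out of a `ps -ef` line like the one above without reading the whole command line by eye, something along these lines works. It is a rough sketch, not anything from the Rosetta or BOINC code: the helper names are invented, it assumes the usual eight-column `ps -ef` layout (UID PID PPID C STIME TTY TIME CMD), and it guesses that a flag takes a value whenever the next token is not another flag.

```python
def _is_number(tok):
    """True for tokens such as '10', '0.50' or '-45'."""
    try:
        float(tok)
        return True
    except ValueError:
        return False


def parse_ps_line(line):
    """Return (pid, flags) for one `ps -ef` line.

    Assumes columns UID PID PPID C STIME TTY TIME CMD...
    Flags without a value are stored as True (a heuristic guess).
    """
    fields = line.split()
    pid = int(fields[1])
    args = fields[7:]              # executable followed by its arguments
    flags = {}
    i = 1                          # skip the executable name
    while i < len(args):
        tok = args[i]
        if tok.startswith("-") and not _is_number(tok):
            nxt = args[i + 1] if i + 1 < len(args) else None
            # Negative numbers such as -45 count as values, not flags.
            if nxt is not None and (not nxt.startswith("-") or _is_number(nxt)):
                flags[tok] = nxt
                i += 2
                continue
            flags[tok] = True
        i += 1
    return pid, flags


# Abbreviated copy of the line posted above.
sample = ("XXXXXX 23730 1 12 04:50 pts/0 00:00:30 rosetta_4.81_i686-pc-linux-gnu "
          "xx 1ptq _ -silent -filter1 -45 -cpu_run_time 7200 -constant_seed -jran 2482751")
pid, flags = parse_ps_line(sample)
print(pid, flags["-cpu_run_time"])   # 23730 7200
```

The 7200 here is in seconds, i.e. a two-hour target, which is what makes a process still running after 10+ hours stand out.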

bulrush
Joined: 14 Mar 06
Posts: 3
Credit: 186,848
RAC: 0
Message 12566 - Posted: 23 Mar 2006, 13:53:47 UTC
Last modified: 23 Mar 2006, 13:54:43 UTC

I am also getting an unrecoverable error from work units which begin with FA_RLX. Here is the error:
3/23/2006 8:00:40 AM|rosetta@home|Unrecoverable error for result FA_RLXvl_hom007_1vls__362_326_0 ( - exit code -1073741811 (0xc000000d))
3/23/2006 8:00:42 AM||request_reschedule_cpus: process exited
3/23/2006 8:00:42 AM|rosetta@home|Computation for result FA_RLXvl_hom007_1vls__362_326_0 finished
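
The log prints the exit code twice: once as a signed 32-bit decimal and once as hex. A small Python sketch shows how the two relate. The lookup table is a hand-picked pair of well-known Windows NTSTATUS values, not a complete list, and the function names are made up for this example. 0xC000000D is the NTSTATUS value STATUS_INVALID_PARAMETER; the log alone does not say which call inside the application triggered it.

```python
# Tiny, deliberately incomplete table of Windows NTSTATUS codes.
KNOWN_NTSTATUS = {
    0xC000000D: "STATUS_INVALID_PARAMETER",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as unsigned."""
    return code & 0xFFFFFFFF


def describe_exit_code(code):
    """Format an exit code as hex plus a name, if the name is in the table."""
    unsigned = to_unsigned32(code)
    name = KNOWN_NTSTATUS.get(unsigned, "unknown")
    return f"0x{unsigned:08x} ({name})"


print(describe_exit_code(-1073741811))   # 0xc000000d (STATUS_INVALID_PARAMETER)
```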


Computer is brand new, set up 2 weeks ago. Here is my config as reported by Editq, a text editor:

====================================================
3/23/2006 (System Information from Editquick)
OS name: WinXP
OS version: 5.1.2600 WinXP, full 32-bit Service Pack 2

Free RAM: 546mb
Total RAM: 1,023mb

Disk info
=========
Sectors Per Cluster:8
Bytes Per Sector: 512
Free Clusters: 57,217,680
Total Clusters: 61,046,992
Free mBytes: ,223,506
Total mBytes: ,238,464

CPU Vendor: AuthenticAMD
CPU Mfr: AMD
CPU Speed (mhz): 2395/2400
CPU type: 586
OEMID: 0
Number of CPUs: 1 (wrong, there are actually 2 CPUs)
ProcessorType: 586 (15)
ProcessorRevision: 9473

BIOS name:
BIOS date: 9/16/2005
BIOS copyright:
Bios Extended info:
IP Address: 192.168.1.129

DirectX version: 4.09.00.0904
Bits per pixel: 32
Display resolution: 1152 x 864

Registry info
=============
Video driver desc: NVIDIA Quadro FX 1400
Video Driver date: 11-4-2005
Video Driver version: 8.1.6.7
System bios date: 09/16/05
System bios version: HP - 20050916
Video bios date: 05/10/05
Video bios version: Version 5.41.02.43.03
Video driver file: nv4_disp.dll
====================================================
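
As a sanity check on the disk figures in the listing above, the cluster counts can be converted to megabytes directly. This sketch assumes the tool's "mBytes" means units of 1,048,576 bytes and that it truncates rather than rounds; under those assumptions the numbers agree with the listing, so the odd leading comma looks like a display quirk rather than bad data.

```python
def clusters_to_mib(clusters, sectors_per_cluster=8, bytes_per_sector=512):
    """Convert a cluster count to whole MiB (1 MiB = 1024 * 1024 bytes)."""
    total_bytes = clusters * sectors_per_cluster * bytes_per_sector
    return total_bytes // (1024 * 1024)


print(clusters_to_mib(61_046_992))   # 238464 (total)
print(clusters_to_mib(57_217_680))   # 223506 (free)
```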

When I first set up Rosetta last week it worked fine with no errors. As soon as I started getting FA_RLX work units, I started getting errors. At this point about 40% of the WUs which start with FA_RLX get an error and abort.

bulrush
Joined: 14 Mar 06
Posts: 3
Credit: 186,848
RAC: 0
Message 12567 - Posted: 23 Mar 2006, 14:04:12 UTC

Here are the 3 errors I found in my log. Perhaps I was a bit hasty in saying I had a 40% error rate for FA_RLX; it's probably around 10%. All 3 errors have the same exit code.

3/21/2006 1:24:16 PM|rosetta@home|Unrecoverable error for result FA_RLXpg_hom016_1pgx__361_292_0 ( - exit code -1073741811 (0xc000000d))
3/21/2006 1:24:18 PM||request_reschedule_cpus: process exited
3/21/2006 1:24:18 PM|rosetta@home|Computation for result FA_RLXpg_hom016_1pgx__361_292_0 finished


3/22/2006 7:57:24 AM|rosetta@home|Unrecoverable error for result FA_RLXti_hom005_1tit__362_317_0 ( - exit code -1073741811 (0xc000000d))
3/22/2006 7:57:26 AM||request_reschedule_cpus: process exited
3/22/2006 7:57:26 AM|rosetta@home|Computation for result FA_RLXti_hom005_1tit__362_317_0 finished


3/23/2006 8:00:40 AM|rosetta@home|Unrecoverable error for result FA_RLXvl_hom007_1vls__362_326_0 ( - exit code -1073741811 (0xc000000d))
3/23/2006 8:00:42 AM||request_reschedule_cpus: process exited
3/23/2006 8:00:42 AM|rosetta@home|Computation for result FA_RLXvl_hom007_1vls__362_326_0 finished
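
Rather than counting by hand, the error lines can be tallied by exit code with a short script. This is a sketch that assumes the message format is exactly what appears in the three entries above; the regular expression and the function name `count_exit_codes` are made up for this example, and other BOINC versions may word the message differently.

```python
import re
from collections import Counter

# Matches e.g. "Unrecoverable error for result NAME ( - exit code -1073741811 (0xc000000d))"
ERROR_RE = re.compile(
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\(\s*-\s*exit code (?P<code>-?\d+) \((?P<hex>0x[0-9a-fA-F]+)\)\)"
)


def count_exit_codes(log_text):
    """Count 'Unrecoverable error' lines per hex exit code."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            counts[match.group("hex").lower()] += 1
    return counts


log = """\
3/21/2006 1:24:16 PM|rosetta@home|Unrecoverable error for result FA_RLXpg_hom016_1pgx__361_292_0 ( - exit code -1073741811 (0xc000000d))
3/21/2006 1:24:18 PM||request_reschedule_cpus: process exited
3/22/2006 7:57:24 AM|rosetta@home|Unrecoverable error for result FA_RLXti_hom005_1tit__362_317_0 ( - exit code -1073741811 (0xc000000d))
3/23/2006 8:00:40 AM|rosetta@home|Unrecoverable error for result FA_RLXvl_hom007_1vls__362_326_0 ( - exit code -1073741811 (0xc000000d))
"""
print(count_exit_codes(log))
```

Dividing the total by the number of FA_RLX results that finished in the same period would give the actual error rate instead of an estimate.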


====================================================
3/23/2006 (System Information from Editquick)
OS name: WinXP
OS version: 5.1.2600 WinNT, full 32-bit Service Pack 2

Free RAM: 548mb
Total RAM: 1,023mb

Disk info
=========
Sectors Per Cluster:8
Bytes Per Sector: 512
Free Clusters: 56,948,933
Total Clusters: 61,046,992
Free mBytes: ,222,456
Total mBytes: ,238,464

CPU Vendor: AuthenticAMD
CPU Mfr: AMD
CPU Speed (mhz): 2395/2400
CPU type: 586
OEMID: 0
Number of CPUs: 1 (wrong, actually 2 CPUs)
ProcessorType: 586 (15)
ProcessorRevision: 9473

BIOS name:
BIOS date: 9/16/2005
BIOS copyright:
Bios Extended info:
IP Address: 192.168.1.129

DirectX version: 4.09.00.0904
Bits per pixel: 32
Display resolution: 1152 x 864

Registry info
=============
Video driver desc: NVIDIA Quadro FX 1400
Video Driver date: 11-4-2005
Video Driver version: 8.1.6.7
System bios date: 09/16/05
System bios version: HP - 20050916
Video bios date: 05/10/05
Video bios version: Version 5.41.02.43.03
Video driver file: nv4_disp.dll
====================================================

bulrush
Joined: 14 Mar 06
Posts: 3
Credit: 186,848
RAC: 0
Message 12568 - Posted: 23 Mar 2006, 14:06:06 UTC

Everyone who has had a WU get stuck, or a PC crash, on a WU that begins with FA_RLX: please report your PC configuration, including the number of CPUs. I wonder if this is a problem with dual-CPU AMD chips.

MarkL.
Joined: 3 Dec 05
Posts: 3
Credit: 2,920
RAC: 0
Message 12585 - Posted: 23 Mar 2006, 21:31:13 UTC - in response to Message 12568.  


Mark L.

Mike Gelvin
Joined: 7 Oct 05
Posts: 65
Credit: 10,612,039
RAC: 0
Message 12586 - Posted: 23 Mar 2006, 22:29:16 UTC - in response to Message 12568.  

Everyone who has had a WU get stuck, or a PC crash, on a WU that begins with FA_RLX: please report your PC configuration, including the number of CPUs. I wonder if this is a problem with dual-CPU AMD chips.


I had one this morning I aborted:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11730796

Single-CPU Intel 2.4 GHz, 1 GB memory, XP Pro SP2



Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 12593 - Posted: 24 Mar 2006, 0:19:40 UTC
Last modified: 24 Mar 2006, 0:52:35 UTC

dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 12609 - Posted: 24 Mar 2006, 8:10:21 UTC - in response to Message 12547.  
Last modified: 24 Mar 2006, 8:13:46 UTC

The last 1% guy that I reported, FA_RLXbk_hom010_1bk2__359_406_0, ran OK after a reboot.

However, another system has a different one stuck: FA_RLXpg_hom005_1pgx__361_334_0. Same drill - rebooting now.


Everyone who has had a WU get stuck, or a PC crash, on a WU that begins with FA_RLX: please report your PC configuration, including the number of CPUs. I wonder if this is a problem with dual-CPU AMD chips.


Probably not. The system this WU is stuck on is a single core Celeron: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=157182

Grutte Pier [Wa Oars]~Nemesis
Joined: 8 Nov 05
Posts: 3
Credit: 386,730
RAC: 0
Message 12624 - Posted: 24 Mar 2006, 16:20:22 UTC
Last modified: 24 Mar 2006, 16:21:35 UTC

This one got stuck at 1%:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11045076

MAOJC
Joined: 19 Jan 06
Posts: 15
Credit: 2,727,567
RAC: 0
Message 12633 - Posted: 24 Mar 2006, 18:28:19 UTC - in response to Message 12565.  
Last modified: 24 Mar 2006, 18:30:15 UTC

FA_RLXpt-hom006_1ptq_361_120_0
FA_RLXpt-hom004_1ptq_361_120_0

Both of the above WUs were stuck at 10+ hours, with 1+ days to completion, at ~14-15% complete on a Linux dual-core Opteron. Aborting the first WU with BOINC Manager caused BOINC to segfault and dump core, leaving a running Rosetta process and the gui_rpc port bound. I had to hard-kill the running Rosetta process.

Here is the killed process from ps -ef; note the -cpu_run_time 7200 parameter.

XXXXXX 23730 1 12 04:50 pts/0 00:00:30 rosetta_4.81_i686-pc-linux-gnu xx 1ptq _ -output_silent_gz -silent -increase_cycles 10 -relax_score_filter -new_centroid_packing -abrelax -output_chi_silent -stringent_relax -vary_omega -omega_weight 0.5 -farlx -ex1 -ex2 -short_range_hb_weight 0.50 -long_range_hb_weight 1.0 -no_filters -nstruct 10 -protein_name_prefix hom008_ -frags_name_prefix hom008_ -filter1 -45 -filter2 -55 -termini -cpu_run_time 7200 -constant_seed -jran 2482751


Got another one, it looks like:

FA_RLXpt_hom003-1ptq_361_274_0 stuck @ 8.73% with 3:45 hrs out of 8 total target hours, but 10+ predicted hours and climbing. After aborting, it shows 0:42 hours computing time.

John Perko
Joined: 1 Jan 06
Posts: 3
Credit: 604,568
RAC: 0
Message 12640 - Posted: 24 Mar 2006, 20:07:15 UTC

3/24/2006 3:04:45 PM|rosetta@home|Unrecoverable error for result FA_RLXch_hom007_2chf__362_214_0 (aborted via GUI RPC)

The above unit was stuck at 1% for about 2:40:00. It is the responsibility of the people who program these projects to find the bugs. My advice to people is to abort them and let the program move on.

dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 12645 - Posted: 24 Mar 2006, 22:45:29 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=14676361

24 hours, ~46%. Had 100% usage of a cpu core for that whole time.
dag
--Finding aliens is cool, but understanding the structure of proteins is useful.

Delk
Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 12647 - Posted: 25 Mar 2006, 1:27:25 UTC

work-units aborted at 1%:

FA_RLXci_hom029_2ci2I_362_311_0 after 109,513.44 secs
FA_RLXpt_hom002_1ptq__361_439_0 after 207,726.50 secs
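
For scale, those CPU times are easier to read in hours. A tiny Python sketch of the conversion follows; the function name `format_cpu_time` is just for illustration, and fractional seconds are dropped.

```python
def format_cpu_time(seconds):
    """Render a CPU time in seconds as 'Hh MMm SSs' (fractions truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


print(format_cpu_time(109513.44))   # 30h 25m 13s
print(format_cpu_time(207726.50))   # 57h 42m 06s
```

So the two units sat at 1% for roughly 30 and 58 hours of CPU time before being aborted.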

Maybe it's time to stop doing longer work units so I can see when servers haven't reported results in the last few hours...

Bo-Arne
Joined: 16 Dec 05
Posts: 2
Credit: 2,850,415
RAC: 0
Message 12661 - Posted: 25 Mar 2006, 7:19:01 UTC - in response to Message 12557.  

I had to abort FA_RLXpt_hom006_1ptq_361_427. I restarted it three times, but it got stuck at 23.73%, Model 3, Step 332516, just a few minutes after entering full-atom relax.

Result ID 14585714. (per moderator request)

Loki
Joined: 9 Dec 05
Posts: 9
Credit: 36,264
RAC: 0
Message 12662 - Posted: 25 Mar 2006, 7:38:54 UTC

Stuck WU (Result ID = 14857109) at 17.5% in the ab initio calculation.



©2025 University of Washington
https://www.bakerlab.org