Message boards : Number crunching : Report stuck & aborted WU here please
Author | Message |
---|---|
Tim Myers Joined: 11 Jan 06 Posts: 2 Credit: 10,903,965 RAC: 0 |
I've got a job stuck at 40.06% that has been running for 10 hours: FA_RLXpt_hom002_1ptq_361_370_0. Other than this I only run seti@home, split 50/50. This is the first one stuck since I started running Rosetta in January. I had a lot abort, but that stopped when I turned off the screen saver and left the jobs in memory. So how do I kill this thing? |
GimpyOne Joined: 13 Dec 05 Posts: 2 Credit: 40,123 RAC: 0 |
I just noticed this guy stuck for 29 hours: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11741001 I have restarted Boinc and it appears to be running again. I'll keep an eye on it. Interestingly enough, when I restarted, Boinc reset the CPU time from 29 hrs to 38 min. I used the CPU time on the graphic to judge: it wasn't moving the first time and now it is. I'll post back in a few hours and let you know if it finished. |
GimpyOne Joined: 13 Dec 05 Posts: 2 Credit: 40,123 RAC: 0 |
I just noticed this guy stuck for 29hours: Well, it stalled three more times and I finally aborted it. It would restart when I exited and restarted Boinc, then freeze again after 20-30%. |
dgnuff Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
Got one stuck here: FA_RLXbk_hom010_1bk2__359_406_0 This is the first of these I've had in a long long time. I'm going to restart it (reboot the machine), I'll post here what happens. |
Bo-Arne Joined: 16 Dec 05 Posts: 2 Credit: 2,850,415 RAC: 0 |
I had to abort FA_RLXpt_hom006_1ptq_361_427. Restarted three times but it got stuck at 23,73%, Model 3, Step 332516, just a few minutes after entering full atom relax. |
MAOJC Joined: 19 Jan 06 Posts: 15 Credit: 2,727,567 RAC: 0 |
FA_RLXpt-hom006_1ptq_361_120_0 FA_RLXpt-hom004_1ptq_361_120_0 Both of the above WUs were stuck at 10+ hours, with 1+ days to completion, at ~14-15% complete on a Linux dual-core Opteron. Aborting the first WU with BOINC Manager caused BOINC to segfault and dump core, leaving a running Rosetta process and the gui_rpc port bound. I had to hard kill the running Rosetta process. Here is the killed process from ps -ef; note the -cpu_run_time 7200 parameter. XXXXXX 23730 1 12 04:50 pts/0 00:00:30 rosetta_4.81_i686-pc-linux-gnu xx 1ptq _ -output_silent_gz -silent -increase_cycles 10 -relax_score_filter -new_centroid_packing -abrelax -output_chi_silent -stringent_relax -vary_omega -omega_weight 0.5 -farlx -ex1 -ex2 -short_range_hb_weight 0.50 -long_range_hb_weight 1.0 -no_filters -nstruct 10 -protein_name_prefix hom008_ -frags_name_prefix hom008_ -filter1 -45 -filter2 -55 -termini -cpu_run_time 7200 -constant_seed -jran 2482751 |
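For anyone checking a stuck process the same way, the target run time can be read straight out of the command line that ps -ef prints. A minimal sketch in Python, assuming -cpu_run_time is given in seconds (the post does not state the unit); flag_value is a hypothetical helper written for this example, not part of BOINC or Rosetta:

```python
import shlex

# Command line copied from the ps -ef output in the post above.
CMDLINE = (
    "rosetta_4.81_i686-pc-linux-gnu xx 1ptq _ -output_silent_gz -silent "
    "-increase_cycles 10 -relax_score_filter -new_centroid_packing -abrelax "
    "-output_chi_silent -stringent_relax -vary_omega -omega_weight 0.5 -farlx "
    "-ex1 -ex2 -short_range_hb_weight 0.50 -long_range_hb_weight 1.0 "
    "-no_filters -nstruct 10 -protein_name_prefix hom008_ "
    "-frags_name_prefix hom008_ -filter1 -45 -filter2 -55 -termini "
    "-cpu_run_time 7200 -constant_seed -jran 2482751"
)


def flag_value(cmdline, flag):
    """Return the token that follows `flag`, or None if the flag is absent."""
    tokens = shlex.split(cmdline)
    for i, token in enumerate(tokens[:-1]):
        if token == flag:
            return tokens[i + 1]
    return None


# Assumption: the value is in seconds.
target_secs = int(flag_value(CMDLINE, "-cpu_run_time"))
target_hours = target_secs / 3600
print(f"target run time: {target_secs} s = {target_hours:.1f} h")
```

If the unit assumption holds, a WU that has been running for 10+ hours is well past its target, which is consistent with it being stuck rather than just slow.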
bulrush Joined: 14 Mar 06 Posts: 3 Credit: 186,848 RAC: 0 |
I am also getting an unrecoverable error from work units which begin with FA_RLX. Here is the error: 3/23/2006 8:00:40 AM|rosetta@home|Unrecoverable error for result FA_RLXvl_hom007_1vls__362_326_0 ( - exit code -1073741811 (0xc000000d)) 3/23/2006 8:00:42 AM||request_reschedule_cpus: process exited 3/23/2006 8:00:42 AM|rosetta@home|Computation for result FA_RLXvl_hom007_1vls__362_326_0 finished Computer is brand new, set up 2 weeks ago. Here is my config as reported by Editq, a text editor: ==================================================== 3/23/2006 (System Information from Editquick) OS name: WinXP OS version: 5.1.2600 WinXP, full 32-bit Service Pack 2 Free RAM: 546mb Total RAM: 1,023mb Disk info ========= Sectors Per Cluster:8 Bytes Per Sector: 512 Free Clusters: 57,217,680 Total Clusters: 61,046,992 Free mBytes: ,223,506 Total mBytes: ,238,464 CPU Vendor: AuthenticAMD CPU Mfr: AMD CPU Speed (mhz): 2395/2400 CPU type: 586 OEMID: 0 Number of CPUs: 1 (wrong, there are actually 2 CPUs) ProcessorType: 586 (15) ProcessorRevision: 9473 BIOS name: BIOS date: 9/16/2005 BIOS copyright: Bios Extended info: IP Address: 192.168.1.129 DirectX version: 4.09.00.0904 Bits per pixel: 32 Display resolution: 1152 x 864 Registry info ============= Video driver desc: NVIDIA Quadro FX 1400 Video Driver date: 11-4-2005 Video Driver version: 8.1.6.7 System bios date: 09/16/05 System bios version: HP - 20050916 Video bios date: 05/10/05 Video bios version: Version 5.41.02.43.03 Video driver file: nv4_disp.dll ==================================================== When I first set up Rosetta last week it worked fine with no errors. As soon as I started getting FA_RLX work units, I started getting errors. At this point about 40% of the WUs which start with FA_RLX get an error and abort. |
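The log prints the same exit code twice, once as a signed decimal and once in hex. A short Python sketch of how the two forms relate, assuming the code is a 32-bit value (the helper name to_unsigned32 is made up for this example):

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


exit_code = -1073741811  # decimal form from the BOINC log
hex_code = f"0x{to_unsigned32(exit_code):08x}"
print(hex_code)  # 0xc000000d
```

On Windows, 0xC000000D matches the NTSTATUS value STATUS_INVALID_PARAMETER. The log itself does not say what triggered it, so this only narrows down what kind of failure was reported, not its cause.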
bulrush Joined: 14 Mar 06 Posts: 3 Credit: 186,848 RAC: 0 |
Here are the 3 errors I found in my log. Perhaps I was a bit hasty in saying I had a 40% error rate for FA_RLX. All 3 errors appear to have the same exit code. It's probably around 10%. 3/21/2006 1:24:16 PM|rosetta@home|Unrecoverable error for result FA_RLXpg_hom016_1pgx__361_292_0 ( - exit code -1073741811 (0xc000000d)) 3/21/2006 1:24:18 PM||request_reschedule_cpus: process exited 3/21/2006 1:24:18 PM|rosetta@home|Computation for result FA_RLXpg_hom016_1pgx__361_292_0 finished 3/22/2006 7:57:24 AM|rosetta@home|Unrecoverable error for result FA_RLXti_hom005_1tit__362_317_0 ( - exit code -1073741811 (0xc000000d)) 3/22/2006 7:57:26 AM||request_reschedule_cpus: process exited 3/22/2006 7:57:26 AM|rosetta@home|Computation for result FA_RLXti_hom005_1tit__362_317_0 finished 3/23/2006 8:00:40 AM|rosetta@home|Unrecoverable error for result FA_RLXvl_hom007_1vls__362_326_0 ( - exit code -1073741811 (0xc000000d)) 3/23/2006 8:00:42 AM||request_reschedule_cpus: process exited 3/23/2006 8:00:42 AM|rosetta@home|Computation for result FA_RLXvl_hom007_1vls__362_326_0 finished ==================================================== 3/23/2006 (System Information from Editquick) OS name: WinXP OS version: 5.1.2600 WinNT, full 32-bit Service Pack 2 Free RAM: 548mb Total RAM: 1,023mb Disk info ========= Sectors Per Cluster:8 Bytes Per Sector: 512 Free Clusters: 56,948,933 Total Clusters: 61,046,992 Free mBytes: ,222,456 Total mBytes: ,238,464 CPU Vendor: AuthenticAMD CPU Mfr: AMD CPU Speed (mhz): 2395/2400 CPU type: 586 OEMID: 0 Number of CPUs: 1 (wrong, actually 2 CPUs) ProcessorType: 586 (15) ProcessorRevision: 9473 BIOS name: BIOS date: 9/16/2005 BIOS copyright: Bios Extended info: IP Address: 192.168.1.129 DirectX version: 4.09.00.0904 Bits per pixel: 32 Display resolution: 1152 x 864 Registry info ============= Video driver desc: NVIDIA Quadro FX 1400 Video Driver date: 11-4-2005 Video Driver version: 8.1.6.7 System bios date: 09/16/05 System bios version: HP - 20050916 Video bios date: 05/10/05 Video bios version: Version 5.41.02.43.03 Video driver file: nv4_disp.dll ==================================================== |
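Going through the log by hand for failed results can be scripted. A rough Python sketch, assuming the BOINC message log keeps the `timestamp|project|message` layout seen in the excerpt above; the regular expression and the failed_results helper are illustrative and were written only against these sample lines:

```python
import re
from collections import Counter

# Sample lines copied from the log excerpt in the post above.
LOG = """\
3/21/2006 1:24:16 PM|rosetta@home|Unrecoverable error for result FA_RLXpg_hom016_1pgx__361_292_0 ( - exit code -1073741811 (0xc000000d))
3/21/2006 1:24:18 PM||request_reschedule_cpus: process exited
3/22/2006 7:57:24 AM|rosetta@home|Unrecoverable error for result FA_RLXti_hom005_1tit__362_317_0 ( - exit code -1073741811 (0xc000000d))
3/23/2006 8:00:40 AM|rosetta@home|Unrecoverable error for result FA_RLXvl_hom007_1vls__362_326_0 ( - exit code -1073741811 (0xc000000d))
"""

# timestamp | project | "Unrecoverable error for result <name> ( - exit code <n>"
ERROR_RE = re.compile(
    r"^(?P<when>[^|]+)\|(?P<project>[^|]*)\|"
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\( - exit code (?P<code>-?\d+)"
)


def failed_results(log_text):
    """Return (result_name, exit_code) for every unrecoverable-error line."""
    found = []
    for line in log_text.splitlines():
        match = ERROR_RE.match(line)
        if match:
            found.append((match.group("result"), int(match.group("code"))))
    return found


failures = failed_results(LOG)
codes = Counter(code for _, code in failures)
for name, code in failures:
    print(name, code)
print("distinct exit codes:", dict(codes))
```

Dividing the number of failures by the total number of FA_RLX results seen in the same log would give the error rate directly instead of an estimate.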
bulrush Joined: 14 Mar 06 Posts: 3 Credit: 186,848 RAC: 0 |
Everyone having WUs stuck or their PC crash on a WU that begins with FA_RLX, please report your PC configuration, including number of CPUs. I wonder if this is a problem with dual CPU AMD chips. |
MarkL. Joined: 3 Dec 05 Posts: 3 Credit: 2,920 RAC: 0 |
Mark L. |
Mike Gelvin Joined: 7 Oct 05 Posts: 65 Credit: 10,612,039 RAC: 0 |
Everyone having WUs stuck or their PC crash on a WU that begins with FA_RLX, please report your PC configuration, including number of CPUs. I wonder if this is a problem with dual CPU AMD chips. I had one this morning I aborted: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11730796 Single CPU Intel 2.4GHZ, 1G Memory, XP PRO SP2 |
Nite Owl Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0 |
|
dgnuff Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
The last 1% guy that I reported, FA_RLXbk_hom010_1bk2__359_406_0, ran OK after a reboot. However, another system has a different one stuck: FA_RLXpg_hom005_1pgx__361_334_0. Same drill - rebooting now. As for whether this is a problem with dual CPU AMD chips: probably not. The system this WU is stuck on is a single core Celeron: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=157182 |
Grutte Pier [Wa Oars]~Nemesis Joined: 8 Nov 05 Posts: 3 Credit: 386,730 RAC: 0 |
This one got stuck on 1%: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11045076 |
MAOJC Joined: 19 Jan 06 Posts: 15 Credit: 2,727,567 RAC: 0 |
FA_RLXpt-hom006_1ptq_361_120_0 Got another one, it looks like: FA_RLXpt_hom003-1ptq_361_274_0, stuck at 8.73% with 3:45 hrs elapsed out of 8 total target hours, but 10+ predicted hours and climbing. After aborting, it shows 0:42 hours of computing time. |
John Perko Joined: 1 Jan 06 Posts: 3 Credit: 604,568 RAC: 0 |
3/24/2006 3:04:45 PM|rosetta@home|Unrecoverable error for result FA_RLXch_hom007_2chf__362_214_0 (aborted via GUI RPC) The above unit stuck at 1% for about 2:40:00. It is the responsibility of the people who program these projects to find the bugs. My advice to people is to abort them and let the program move on. |
dag Joined: 16 Dec 05 Posts: 106 Credit: 1,000,020 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=14676361 24 hours, ~46%. Had 100% usage of a cpu core for that whole time. dag --Finding aliens is cool, but understanding the structure of proteins is useful. |
Delk Joined: 20 Feb 06 Posts: 25 Credit: 995,624 RAC: 0 |
Work units aborted at 1%: FA_RLXci_hom029_2ci2I_362_311_0 after 109,513.44 secs; FA_RLXpt_hom002_1ptq__361_439_0 after 207,726.50 secs. Maybe it's time to stop doing longer work units so I can see when servers haven't reported results in the last few hours... |
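To put those run times in perspective, here is the conversion from the reported seconds to hours, as a small Python sketch (secs_to_hours is a made-up helper name; the inputs are the two figures quoted in the post):

```python
def secs_to_hours(text: str) -> float:
    """Convert a seconds figure such as '109,513.44' to hours."""
    return float(text.replace(",", "")) / 3600


for wu, secs in [
    ("FA_RLXci_hom029_2ci2I_362_311_0", "109,513.44"),
    ("FA_RLXpt_hom002_1ptq__361_439_0", "207,726.50"),
]:
    print(f"{wu}: {secs_to_hours(secs):.2f} h")
```

That works out to roughly 30.4 hours and 57.7 hours of CPU time spent at 1% before the abort.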
Bo-Arne Joined: 16 Dec 05 Posts: 2 Credit: 2,850,415 RAC: 0 |
I had to abort FA_RLXpt_hom006_1ptq_361_427. Restarted three times but it got stuck at 23,73%, Model 3, Step 332516, just a few minutes after entering full atom relax. Result ID 14585714. (per moderator request) |
Loki Joined: 9 Dec 05 Posts: 9 Credit: 36,264 RAC: 0 |
Stuck WU (Result ID = 14857109) at 17.5% in the ab initio calculation. |
©2025 University of Washington
https://www.bakerlab.org