Message boards : Number crunching : Report stuck & aborted WU here please
Author | Message |
---|---|
sharder8 Send message Joined: 2 Feb 06 Posts: 7 Credit: 15,648,378 RAC: 0 |
Someone may want to take a look at the results on this one as well, there's plenty of them and it isn't the 1% "stuck" problem. https://boinc.bakerlab.org/rosetta/results.php?hostid=181476 I've stopped Rosetta on this machine, as it would run through a ton of jobs and client error them until it gave the message "daily quota met". Harder |
dgnuff Send message Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
David, Many moons ago, in a different thread, you gave instructions for manually restarting a "1% stuck" WU. I've got one on one of my systems. Do you want me to restart it, and is there anything else I can do to help identify the problem? Things like taking a snapshot of the rosetta and slots/0 folders, zipping it up and making it available for you to download, or anything else that might help. |
Team_Elteor_Borislavj~Intelligence Send message Joined: 7 Dec 05 Posts: 14 Credit: 56,027 RAC: 0 |
HB_BARCODE_30_1bk2__351_7729_0 is stuck! After 5 hours of crunching still at 1% :( |
dag Send message Joined: 16 Dec 05 Posts: 106 Credit: 1,000,020 RAC: 0 |
These HB_BARCODE_30 were stuck in a slot for ~24 hours without progressing. Some had ~25min CPU time; at least one had ~55min CPU. It's really annoying that I lost around 4-cpu-days of work because of these four. dag https://boinc.bakerlab.org/rosetta/result.php?resultid=14185477 https://boinc.bakerlab.org/rosetta/result.php?resultid=14184543 https://boinc.bakerlab.org/rosetta/result.php?resultid=14100307 https://boinc.bakerlab.org/rosetta/result.php?resultid=14099752 dag --Finding aliens is cool, but understanding the structure of proteins is useful. |
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=14068243 Rather strange: I did not, repeat, did not abort it myself. Didn't touch the machine, it runs on its own. Had more of these and still don't know what happens. |
Team TMR Send message Joined: 2 Nov 05 Posts: 21 Credit: 1,583,679 RAC: 0 |
Another one, aborted after 16 hours stuck on 1% https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11717839 This one was a bit different - it was stuck on 30.19% after 8 hours. After restarting BOINC, it reset back to 38 mins CPU time and 30.19% and got stuck again. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11743501 It's getting increasingly frustrating having to babysit this project all the time. Fingers crossed for those working on a fix. |
Team_Elteor_Borislavj~Intelligence Send message Joined: 7 Dec 05 Posts: 14 Credit: 56,027 RAC: 0 |
I'm experiencing a lot of stuck WUs with FA_RLX**** I'm now at the point where, if a WU is at 1 percent after 1 hour, I manually abort it... I want credit for my CPU time :( |
bruce boytler Send message Joined: 17 Sep 05 Posts: 68 Credit: 3,565,442 RAC: 0 |
This unit was stuck for 9 hours: FA_RLXpt_hom003_1ptq__361_234 I brought up the graphics screen (I don't normally run any graphics) and it was completely frozen, except that the CPU clock was still counting. Resetting BOINC did no good, so I ended up aborting it. Out of 183 results I have 4 errors of the frozen or stuck-at-1-to-15-percent type. Cheers all!!! |
mr.kjellen Send message Joined: 5 Dec 05 Posts: 3 Credit: 1,226,674 RAC: 0 |
This one got stuck at one percent: HBLR_1.0_1dtj_332_2576 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=9848871 I had it crunching for about 350,000 sec / 1,500 credits :( Aborted it. Seems someone did crunch it eventually. No luck for me, though. /anton |
Tim Myers Send message Joined: 11 Jan 06 Posts: 2 Credit: 10,903,965 RAC: 0 |
I've got a job stuck at 40.06% that has been running for 10 hours: FA_RLXpt_hom002_1ptq_361_370_0 The only other project I run is seti@home, split 50/50. This is the first one stuck since I started running Rosetta in January. I had a lot abort, but that stopped when I turned off the screen saver and left the jobs in memory. So how do I kill this thing? |
GimpyOne Send message Joined: 13 Dec 05 Posts: 2 Credit: 40,123 RAC: 0 |
I just noticed this guy stuck for 29hours: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11741001 I have restarted Boinc and it appears to be running again. I'll keep an eye on it. Interestingly enough, when I restarted, Boinc reset the CPU time from 29hrs to 38min. I used the CPU time on the graphic to judge, it wasn't moving the first time and now is. I'll post back in a few hours and let you know if it finished. |
GimpyOne Send message Joined: 13 Dec 05 Posts: 2 Credit: 40,123 RAC: 0 |
"I just noticed this guy stuck for 29 hours:" Well, it stalled three more times and I finally aborted it. It would restart when I exited and restarted Boinc, then freeze again after 20-30%. |
dgnuff Send message Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
Got one stuck here: FA_RLXbk_hom010_1bk2__359_406_0 This is the first of these I've had in a long long time. I'm going to restart it (reboot the machine), I'll post here what happens. |
Bo-Arne Send message Joined: 16 Dec 05 Posts: 2 Credit: 2,850,415 RAC: 0 |
I had to abort FA_RLXpt_hom006_1ptq_361_427. Restarted three times but it got stuck at 23,73%, Model 3, Step 332516, just a few minutes after entering full atom relax. |
MAOJC Send message Joined: 19 Jan 06 Posts: 15 Credit: 2,727,567 RAC: 0 |
FA_RLXpt-hom006_1ptq_361_120_0 FA_RLXpt-hom004_1ptq_361_120_0 Both of the above WUs were stuck for 10+ hours at ~14-15% completion, with 1+ days to complete, on a Linux dual-core Opteron. Aborting the first WU with BOINC Manager caused BOINC to segfault and dump core, and left a running Rosetta process and the gui_rpc port bound. I had to hard-kill the running Rosetta process. Here is the killed process from ps -ef; note the -cpu_run_time 7200 parameter. `XXXXXX 23730 1 12 04:50 pts/0 00:00:30 rosetta_4.81_i686-pc-linux-gnu xx 1ptq _ -output_silent_gz -silent -increase_cycles 10 -relax_score_filter -new_centroid_packing -abrelax -output_chi_silent -stringent_relax -vary_omega -omega_weight 0.5 -farlx -ex1 -ex2 -short_range_hb_weight 0.50 -long_range_hb_weight 1.0 -no_filters -nstruct 10 -protein_name_prefix hom008_ -frags_name_prefix hom008_ -filter1 -45 -filter2 -55 -termini -cpu_run_time 7200 -constant_seed -jran 2482751` |
bulrush Send message Joined: 14 Mar 06 Posts: 3 Credit: 186,848 RAC: 0 |
I am also getting an unrecoverable error from work units which begin with FA_RLX. Here is the error: 3/23/2006 8:00:40 AM|rosetta@home|Unrecoverable error for result FA_RLXvl_hom007_1vls__362_326_0 ( - exit code -1073741811 (0xc000000d)) 3/23/2006 8:00:42 AM||request_reschedule_cpus: process exited 3/23/2006 8:00:42 AM|rosetta@home|Computation for result FA_RLXvl_hom007_1vls__362_326_0 finished Computer is brand new, set up 2 weeks ago. Here is my config as reported by Editq, a text editor: ==================================================== 3/23/2006 (System Information from Editquick) OS name: WinXP OS version: 5.1.2600 WinXP, full 32-bit Service Pack 2 Free RAM: 546mb Total RAM: 1,023mb Disk info ========= Sectors Per Cluster:8 Bytes Per Sector: 512 Free Clusters: 57,217,680 Total Clusters: 61,046,992 Free mBytes: ,223,506 Total mBytes: ,238,464 CPU Vendor: AuthenticAMD CPU Mfr: AMD CPU Speed (mhz): 2395/2400 CPU type: 586 OEMID: 0 Number of CPUs: 1 (wrong, there are actually 2 CPUs) ProcessorType: 586 (15) ProcessorRevision: 9473 BIOS name: BIOS date: 9/16/2005 BIOS copyright: Bios Extended info: IP Address: 192.168.1.129 DirectX version: 4.09.00.0904 Bits per pixel: 32 Display resolution: 1152 x 864 Registry info ============= Video driver desc: NVIDIA Quadro FX 1400 Video Driver date: 11-4-2005 Video Driver version: 8.1.6.7 System bios date: 09/16/05 System bios version: HP - 20050916 Video bios date: 05/10/05 Video bios version: Version 5.41.02.43.03 Video driver file: nv4_disp.dll ==================================================== When I first set up Rosetta last week it worked fine with no errors. As soon as I started getting FA_RLX work units, I started getting errors. At this point about 40% of the WUs which start with FA_RLX get an error and abort. |
bulrush Send message Joined: 14 Mar 06 Posts: 3 Credit: 186,848 RAC: 0 |
Here are the 3 errors I found in my log. Perhaps I was a bit hasty in saying I had a 40% error rate for FA_RLX. All 3 errors appear to have the same exit code. It's probably around 10%. 3/21/2006 1:24:16 PM|rosetta@home|Unrecoverable error for result FA_RLXpg_hom016_1pgx__361_292_0 ( - exit code -1073741811 (0xc000000d)) 3/21/2006 1:24:18 PM||request_reschedule_cpus: process exited 3/21/2006 1:24:18 PM|rosetta@home|Computation for result FA_RLXpg_hom016_1pgx__361_292_0 finished 3/22/2006 7:57:24 AM|rosetta@home|Unrecoverable error for result FA_RLXti_hom005_1tit__362_317_0 ( - exit code -1073741811 (0xc000000d)) 3/22/2006 7:57:26 AM||request_reschedule_cpus: process exited 3/22/2006 7:57:26 AM|rosetta@home|Computation for result FA_RLXti_hom005_1tit__362_317_0 finished 3/23/2006 8:00:40 AM|rosetta@home|Unrecoverable error for result FA_RLXvl_hom007_1vls__362_326_0 ( - exit code -1073741811 (0xc000000d)) 3/23/2006 8:00:42 AM||request_reschedule_cpus: process exited 3/23/2006 8:00:42 AM|rosetta@home|Computation for result FA_RLXvl_hom007_1vls__362_326_0 finished ==================================================== 3/23/2006 (System Information from Editquick) OS name: WinXP OS version: 5.1.2600 WinNT, full 32-bit Service Pack 2 Free RAM: 548mb Total RAM: 1,023mb Disk info ========= Sectors Per Cluster:8 Bytes Per Sector: 512 Free Clusters: 56,948,933 Total Clusters: 61,046,992 Free mBytes: ,222,456 Total mBytes: ,238,464 CPU Vendor: AuthenticAMD CPU Mfr: AMD CPU Speed (mhz): 2395/2400 CPU type: 586 OEMID: 0 Number of CPUs: 1 (wrong, actually 2 CPUs) ProcessorType: 586 (15) ProcessorRevision: 9473 BIOS name: BIOS date: 9/16/2005 BIOS copyright: Bios Extended info: IP Address: 192.168.1.129 DirectX version: 4.09.00.0904 Bits per pixel: 32 Display resolution: 1152 x 864 Registry info ============= Video driver desc: NVIDIA Quadro FX 1400 Video Driver date: 11-4-2005 Video Driver version: 8.1.6.7 System bios date: 09/16/05 System bios version: HP - 20050916 Video bios date: 05/10/05 Video bios version: Version 5.41.02.43.03 Video driver file: nv4_disp.dll ==================================================== |
bulrush Send message Joined: 14 Mar 06 Posts: 3 Credit: 186,848 RAC: 0 |
Everyone having WUs stuck or their PC crash on a WU that begins with FA_RLX, please report your PC configuration, including number of CPUs. I wonder if this is a problem with dual CPU AMD chips. |
MarkL. Send message Joined: 3 Dec 05 Posts: 3 Credit: 2,920 RAC: 0 |
Mark L. |
Mike Gelvin Send message Joined: 7 Oct 05 Posts: 65 Credit: 10,612,039 RAC: 0 |
"Everyone having WUs stuck or their PC crash on a WU that begins with FA_RLX, please report your PC configuration, including number of CPUs. I wonder if this is a problem with dual CPU AMD chips." I had one this morning that I aborted: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11730796 Single CPU Intel 2.4 GHz, 1 GB memory, XP Pro SP2 |
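
For anyone comparing logs across machines, the "Unrecoverable error" lines quoted in this thread can be pulled out of a BOINC message log with a few lines of Python. This is only a rough sketch: the regular expression assumes the log layout shown in the excerpts above, and the helper names (`to_unsigned32`, `failed_results`) are made up for illustration, not part of BOINC or Rosetta. It also shows why the same exit code appears twice in each line: -1073741811 read as an unsigned 32-bit value is 0xC000000D, which is the Windows status code STATUS_INVALID_PARAMETER. That identifies the status only; it does not say what inside the work unit triggered it.

```python
import re

# Assumed line layout, copied from the log excerpts quoted in this thread:
#   <timestamp>|<project>|Unrecoverable error for result <name> ( - exit code <signed> (<hex>))
ERROR_RE = re.compile(
    r"Unrecoverable error for result (?P<name>\S+) \( - exit code (?P<code>-?\d+)"
)


def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def failed_results(log_text):
    """Map each failed result name to its exit code as an unsigned 32-bit value."""
    return {
        m.group("name"): to_unsigned32(int(m.group("code")))
        for m in ERROR_RE.finditer(log_text)
    }


# Two error entries taken from the log lines posted above.
sample_log = """\
3/21/2006 1:24:16 PM|rosetta@home|Unrecoverable error for result FA_RLXpg_hom016_1pgx__361_292_0 ( - exit code -1073741811 (0xc000000d))
3/21/2006 1:24:18 PM||request_reschedule_cpus: process exited
3/22/2006 7:57:24 AM|rosetta@home|Unrecoverable error for result FA_RLXti_hom005_1tit__362_317_0 ( - exit code -1073741811 (0xc000000d))
3/22/2006 7:57:26 AM||request_reschedule_cpus: process exited
"""

for name, code in failed_results(sample_log).items():
    print(f"{name}: 0x{code:08x}")
```

Grouping the resulting dictionary by exit code is a quick way to check whether every failure on a host shares the same status, as the three errors reported above do.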
©2024 University of Washington
https://www.bakerlab.org