Message boards : Number crunching : Minirosetta v1.40 bug thread
Author | Message |
---|---|
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 5 |
P.S.: No... if it continues on those tasks specifically, report it in the correct thread. It's just one of those bugs that shows up at random. I get those now and then. It's a pain in the backside, but that's just life in the DC world. Keep on crunching, there will be others that are better. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 5 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=206368130 IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_2a7m_4683_239_0 CPU time 15318.72 stderr out <core_client_version>6.2.19</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 21600 ====================================================== DONE :: 1 starting structures 15318.6 cpu seconds This process generated 0 decoys from 0 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> ]]> this is odd...it ran about 75% of its time and came up with 0 decoys? and then stopped? what's up with that? |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 5 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=206127181 IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1w2l_4683_215_0 CPU time 12843.17 stderr out <core_client_version>6.2.19</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 21600 # cpu_run_time_pref: 21600 # cpu_run_time_pref: 21600 No heartbeat from core client for 30 sec - exiting # cpu_run_time_pref: 21600 ====================================================== DONE :: 1 starting structures 12842.9 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> ]]> |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 5 |
More of the recovering checkpoint blah blah.... https://boinc.bakerlab.org/rosetta/result.php?resultid=207631513 1xxxA_ZNMP_ABRELAX_tetraL_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1xxxA-_4658_78912_0 CPU time 21412.55 stderr out ----- https://boinc.bakerlab.org/rosetta/result.php?resultid=207390655 1xxxA_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1xxxA-_4658_55517_0 ----- https://boinc.bakerlab.org/rosetta/result.php?resultid=207329937 1xxxA_ZNMP_ABRELAX_tetraL_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1xxxA-_4658_46952_0 to name a few... I think it is all the 1xxxA tasks that produce this message: <core_client_version>6.2.19</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 21600 recovering checkpoint of tag S_00000001 with id abrelax_rg_state recovering checkpoint of tag S_00000001 with id stage_1 recovering checkpoint of tag S_00000001 with id stage_2 recovering checkpoint of tag S_00000001 with id stage_3_iter1_1 recovering checkpoint of tag S_00000001 with id stage_3_iter1_2 recovering checkpoint of tag S_00000001 with id stage_3_iter1_3 recovering checkpoint of tag S_00000001 with id stage_3_iter1_4 recovering checkpoint of tag S_00000001 with id stage_3_iter1_5 recovering checkpoint of tag S_00000001 with id stage_3_iter1_6 recovering checkpoint of tag S_00000001 with id stage_3_iter1_7 recovering checkpoint of tag S_00000001 with id stage_3_iter1_8 recovering checkpoint of tag S_00000001 with id stage_3_iter1_9 recovering checkpoint of tag S_00000001 with id stage_3_iter1_10 recovering checkpoint of tag S_00000001 with id stage4_kk_1 recovering checkpoint of tag S_00000001 with id stage4_kk_2 recovering checkpoint of tag S_00000001 with id stage4_kk_3 recovering checkpoint of tag S_00000001 with id abrelax_relax recovering checkpoint of tag S_00000002 with id abrelax_rg_state recovering checkpoint of tag S_00000002 with id stage_1 recovering checkpoint of tag S_00000002 with id stage_2 recovering checkpoint of tag S_00000002 with id stage_3_iter1_1 recovering checkpoint of tag S_00000002 with id stage_3_iter1_2 and so on..... Of course the end message varies, but they all complete within this time frame and give good credit. DONE :: 1 starting structures 21412.3 cpu seconds This process generated 18 decoys from 18 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> ]]> |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Boy, that *IS* odd. And it gave you credit too; that doesn't look like it was for an error. I'd have to guess that it did some work, then restarted the task, and somehow the stderr info got reset. Rosetta Moderator: Mod.Sense |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 5 |
Some more info behind this: at the time I was running Rosie and Einstein at 175/25 respectively. The cycle time is 60 min, which I believe is the default? So maybe it got interrupted, went to Einstein, and then came back and tripped up. Still strange... no errors and no other info. Maybe you guys can pull something on your end. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 5 |
Default CPU time is 21600; this ran 3146.078. https://boinc.bakerlab.org/rosetta/result.php?resultid=207892330 h001b_BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-11-S3-8--h001b-_4769_556_0 Client state Compute error Exit status 1 (0x1) Computer ID 871217 Report deadline 26 Nov 2008 22:35:22 UTC CPU time 3146.078 stderr out <core_client_version>6.2.19</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> recovering checkpoint of tag S_U11X8X_00000001 with id abrelax_rg_state recovering checkpoint of tag S_U11X8X_00000001 with id stage_1 recovering checkpoint of tag S_U11X8X_00000001 with id stage_2 # cpu_run_time_pref: 21600 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_1 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_2 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_3 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_4 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_5 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_6 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_7 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_8 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_9 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_10 and this repeats, then this stderr: ERROR: NANs occured in hbonding! ERROR:: Exit from: ..\..\src\core\scoring\hbonds\hbonds_geom.cc line: 763 called boinc_finish </stderr_txt> ]]> Validate state Invalid Claimed credit 21.0970375448934 |
Thomas Joined: 20 Feb 06 Posts: 1 Credit: 136,649 RAC: 0 |
Another WU with an extremely bad credit / CPU-time ratio: https://boinc.bakerlab.org/rosetta/result.php?resultid=206839093 7.45 credits for more than 7.5 hours of crunching! I decided to wait until this has been sorted out before crunching more of these WUs, at least on this computer. |
Guido Platteau Joined: 11 Sep 06 Posts: 2 Credit: 283,392 RAC: 0 |
I tried another WU on our Windows Vista Home PC and it failed (again!): WU. This WU also failed on another computer: Details. 19/11/2008 13:40:51|rosetta@home|Sending scheduler request: To fetch work. Requesting 24469 seconds of work, reporting 0 completed tasks 19/11/2008 13:40:56|rosetta@home|Scheduler request succeeded: got 1 new tasks 19/11/2008 13:40:58|rosetta@home|Started download of minirosetta_1.40_windows_intelx86.exe 19/11/2008 13:40:58|rosetta@home|Started download of minirosetta_graphics_1.40_windows_intelx86.exe 19/11/2008 13:41:06|rosetta@home|Finished download of minirosetta_graphics_1.40_windows_intelx86.exe 19/11/2008 13:41:06|rosetta@home|Started download of Helvetica.txf 19/11/2008 13:41:08|rosetta@home|Finished download of Helvetica.txf 19/11/2008 13:41:08|rosetta@home|Started download of minirosetta_database_rev25538.zip 19/11/2008 13:41:24|rosetta@home|Finished download of minirosetta_1.40_windows_intelx86.exe 19/11/2008 13:41:24|rosetta@home|Started download of boinc_yebf_aah012_05_05.200_v1_3.gz 19/11/2008 13:41:31|rosetta@home|Finished download of boinc_yebf_aah012_05_05.200_v1_3.gz 19/11/2008 13:41:31|rosetta@home|Started download of boinc_yebf_aah012_03_05.200_v1_3.gz 19/11/2008 13:41:47|rosetta@home|Finished download of boinc_yebf_aah012_03_05.200_v1_3.gz 19/11/2008 13:41:47|rosetta@home|Started download of yebf_h012_.psipred_ss2 19/11/2008 13:41:49|rosetta@home|Finished download of yebf_h012_.psipred_ss2 19/11/2008 13:41:49|rosetta@home|Started download of yebf_h012_.fasta.gz 19/11/2008 13:41:50|rosetta@home|Finished download of yebf_h012_.fasta.gz 19/11/2008 13:42:06||Suspending computation - user is active 19/11/2008 13:42:29||Resuming computation 19/11/2008 13:43:26|rosetta@home|Finished download of minirosetta_database_rev25538.zip 19/11/2008 14:44:08|rosetta@home|Starting h012__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-5-S3-3--h012_-_4675_98_1 19/11/2008 14:44:10|rosetta@home|Starting task h012__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-5-S3-3--h012_-_4675_98_1 using minirosetta version 140 19/11/2008 14:57:50|rosetta@home|Task h012__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-5-S3-3--h012_-_4675_98_1 exited with zero status but no 'finished' file 19/11/2008 14:57:50|rosetta@home|If this happens repeatedly you may need to reset the project. 19/11/2008 14:57:50|rosetta@home|Restarting task h012__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-5-S3-3--h012_-_4675_98_1 using minirosetta version 140 19/11/2008 14:57:53||Suspending computation - user is active 19/11/2008 14:58:13||Resuming computation 19/11/2008 14:58:54|rosetta@home|Task h012__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-5-S3-3--h012_-_4675_98_1 exited with zero status but no 'finished' file 19/11/2008 14:58:54|rosetta@home|If this happens repeatedly you may need to reset the project. |
Speedy Joined: 25 Sep 05 Posts: 163 Credit: 808,098 RAC: 0 |
No graphics again. 12.9 credits per hour, 6.22 hours, 80.67 credits total. Have a crunching good day!! |
Sid Celery Joined: 11 Feb 08 Posts: 2087 Credit: 40,638,564 RAC: 4,752 |
I tried another WU on our Windows Vista Home system PC and it failed (again!) Outcome Client error Client state Compute error Exit status -226 (0xffffff1e) CPU time 431.4052 stderr out <core_client_version>6.2.19</core_client_version> <![CDATA[ <message> too many exit(0)s </message> <stderr_txt> [...] Can't acquire lockfile - exiting [...] This error is becoming so widespread now (on non-Vista64 systems too) that it really needs some dedicated attention. Can we have some formal comment on it, even if it's just to say you haven't tracked down the source of the problem or a practical workaround? It's just frustrating otherwise. Until it's solved I'm really struggling to see a reason why any Minis should be issued. I could double my output for the project (as could several others) if it was either solved or Beta 5.98 WUs were issued, which run 100% for me. |
Erwin Schlonz Joined: 20 May 07 Posts: 5 Credit: 203,397 RAC: 0 |
What's up with this compute error??? It seems to me that the file name is way too long to handle for WinXP! Isn't there a maximum file name length (including path) of 255 characters? My disc space is definitely not full. <core_client_version>6.2.19</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 7200 WARNING! attempt to create gzipped file ../../projects/boinc.bakerlab.org_rosetta/loopbuild_minimalist_core_control_standardloopfile2_homo_bench_looprelax_cheat_chunk_control_standard_loopfiles_t288__olange_IGNORE_THE_REST_2FNEA_7_4818_50_0_0 failed. ====================================================== DONE :: 1 starting structures 7120.77 cpu seconds This process generated 45 decoys from 45 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> <message> <file_xfer_error> <file_name>loopbuild_minimalist_core_control_standardloopfile2_homo_bench_looprelax_cheat_chunk_control_standard_loopfiles_t288__olange_IGNORE_THE_REST_2FNEA_7_4818_50_0_0</file_name> <error_code>-161</error_code> </file_xfer_error> </message> ]]> WU affected so far: https://boinc.bakerlab.org/rosetta/result.php?resultid=208686725 https://boinc.bakerlab.org/rosetta/result.php?resultid=208672659 |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 5 |
What's up with this compute error??? Read this thread over in ufluids, which references Dr. David Anderson. The article says: -161 means there's a "dangling reference" in your client_state.xml file; for example, there's <file_ref> <name>foobar</name> </file_ref> but there's no <file_info> with name foobar. It looks like the problem is that the ufluids app sometimes doesn't create all of its output files, i.e., the app finishes successfully but some of the output files don't exist. BOINC treats this as an error; the app must create all the files, even if they're empty. The same would apply to Rosie. |
Evan Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
The curse of the NANs strikes again: 208596316 Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> # cpu_run_time_pref: 21600 # cpu_run_time_pref: 21600 # cpu_run_time_pref: 21600 ERROR: NANs occured in hbonding! ERROR:: Exit from: ..\..\src\core\scoring\hbonds\hbonds_geom.cc line: 763 called boinc_finish
netwraith Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
I have several of these loopbuild_minimalist_core3_homo_bench- .... tasks and several of them are way overtime... get to just under 10 minutes to go and stay that way for hours.... What could be up with these ??? All of my machines are Linux 2.6 kernels... Fedora/RedHat EL/CentOS Looking for a team ??? Join BoincSynergy!! |
Not2Nutz Joined: 21 Jan 08 Posts: 1 Credit: 76,372 RAC: 0 |
It looks like this problem has been ongoing for several weeks, and there is not one word about it on the front page of the Rosetta project web site. I am glad my frustration and curiosity finally rose to the level that caused me to visit this forum. I, too, have 8 WUs of Mini 1.40 that have been in progress for 15+ hours and are stuck above 98% completion, still showing 9 hours 57 minutes left to completion. In fact the time-to-completion hasn't changed in over 10 hours. One WU did complete in a timely fashion, with a computation error. I don't think my problem is lack of RAM, as I have 24GB installed. I am running Vista X64 on twin dual-core Xeons at 3.0GHz. I have suspended all but one WU and I have bumped the task priority by two levels, just to see if I could hasten this one WU along. It doesn't seem to be helping, as my CPUs are hardly even taxed at this point. So the problem does not seem to be a shortage of compute power. And I have over 1 Terabyte of free disk space, so it can't be lack of disk space either. I am really at a loss as to what to do here. Should I just abort them all and wait for the detectives to do the forensic thing and for a fix to be implemented? Any suggestions would be appreciated. n2n |
Rifleman Joined: 19 Nov 08 Posts: 17 Credit: 139,408 RAC: 0 |
I just started crunching Rosetta and have 3 tasks that have been running for over 3 hours now, with 15 hours to go. I had to abort one this morning that ran for well over 18 hours. Is this normal? I had one task finish all right, but it took almost 18 hours. My Task Manager shows minirosetta consuming 165000K for each of the 3 cores it is using. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 5 |
I just started crunching Rosetta and have 3 tasks running for over 3 hours now with 15 hours to go. I had to abort one this morning that ran for well over 18 hours. Your memory usage is in line with mine for a dual core. As for time remaining, see my reply in your first thread about newbie questions. There are some things to check; if they are all OK, then let BOINC Manager learn its way around your system. It will settle down over time. Also, could you post links to the work units that you aborted? Show what your CPU run time was set for and what the run time was when you aborted the task. Also show what the stderr out message was, or any other messages. People can comment on what they see from that data. |
robertmiles Joined: 16 Jun 08 Posts: 1230 Credit: 14,178,114 RAC: 1,048 |
Rob, Rosetta@home workunits seem to behave that way when they have a significant underestimate of the amount of CPU time they need to run. I had one about a week ago that needed about 19.5 hours to run, instead of the 6 hours I was then asking for, but it completed normally otherwise. Don't be surprised if the underestimate of the time it needs to run also gives you a rather poor ratio of credits received to credits requested. Also, these Minirosetta v1.40 workunits with such underestimates are poor at recovering from restarting your machine after a shutdown or reboot. Earlier in this thread, you should find about 5 items in workunit names that indicate they are likely to have these problems - for example, zinc as part of the name. Robert |
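
For anyone comparing the stderr excerpts posted in this thread, here is a minimal sketch of how the recurring lines could be pulled out of a pasted stderr text. It assumes the text has been copied from the result page into a plain string. The helper name `summarize_stderr` and the idea that each extra `# cpu_run_time_pref` line marks a restart are assumptions made for illustration; neither is part of BOINC or Rosetta.

```python
import re

# Patterns copied from the stderr excerpts quoted in this thread.
RUN_TIME_PREF = re.compile(r"# cpu_run_time_pref: (\d+)")
CPU_SECONDS = re.compile(r"(\d+(?:\.\d+)?) cpu seconds")
DECOYS = re.compile(r"generated (\d+) decoys from (\d+) attempts")

# Error strings as they appear in the posts above (original spelling kept).
KNOWN_ERRORS = (
    "NANs occured in hbonding",
    "Can't acquire lockfile",
    "No heartbeat from core client",
    "attempt to create gzipped file",
)


def summarize_stderr(text):
    """Summarise one task's stderr text. Hypothetical helper, not a BOINC API."""
    prefs = [int(p) for p in RUN_TIME_PREF.findall(text)]
    cpu = CPU_SECONDS.search(text)
    decoys = DECOYS.search(text)
    cpu_seconds = float(cpu.group(1)) if cpu else None
    run_time_pref = prefs[-1] if prefs else None
    used = None
    if cpu_seconds is not None and run_time_pref:
        used = cpu_seconds / run_time_pref
    return {
        "run_time_pref": run_time_pref,
        # Assumption: the preference line is printed once per (re)start.
        "restarts": max(len(prefs) - 1, 0),
        "cpu_seconds": cpu_seconds,
        "fraction_of_pref_used": used,
        "decoys": int(decoys.group(1)) if decoys else None,
        "attempts": int(decoys.group(2)) if decoys else None,
        "errors": [e for e in KNOWN_ERRORS if e in text],
    }


# Shortened copy of the zero-decoy stderr reported earlier in the thread.
SAMPLE = (
    "# cpu_run_time_pref: 21600 "
    "DONE :: 1 starting structures 15318.6 cpu seconds "
    "This process generated 0 decoys from 0 attempts "
    "called boinc_finish"
)

if __name__ == "__main__":
    print(summarize_stderr(SAMPLE))
```

On that sample the helper reports 0 decoys from 0 attempts after roughly 71% of the 21600-second preference, which is the odd case discussed above. The error list only covers strings that were actually posted here, so it would need extending for other failure modes.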
©2024 University of Washington
https://www.bakerlab.org