Minirosetta v1.40 bug thread

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57058 - Posted: 19 Nov 2008, 7:55:45 UTC - in response to Message 57053.  

P.S.:
19/11/2008 02:40:10|rosetta@home|Starting foldcst_minimalist_core3_homo_bench_foldcst_cheat_chunk_t312__olange_IGNORE_THE_REST_1XV2A_5_4741_186_0
19/11/2008 02:40:14|rosetta@home|Starting task foldcst_minimalist_core3_homo_bench_foldcst_cheat_chunk_t312__olange_IGNORE_THE_REST_1XV2A_5_4741_186_0 using minirosetta version 140
19/11/2008 02:56:51|rosetta@home|Restarting task foldcst_minimalist_core3_homo_bench_foldcst_cheat_chunk_t312__olange_IGNORE_THE_REST_1XV2A_5_4741_186_0 using minirosetta version 140
19/11/2008 02:57:32|rosetta@home|Task foldcst_minimalist_core3_homo_bench_foldcst_cheat_chunk_t312__olange_IGNORE_THE_REST_1XV2A_5_4741_186_0 exited with zero status but no 'finished' file
19/11/2008 02:57:32|rosetta@home|If this happens repeatedly you may need to reset the project.
.
.
.

Again. I should suspend the Rosetta project altogether until this stops happening, right?


No. If it keeps happening on those tasks specifically, report it in the correct thread. It's just one of those bugs that shows up at random; I get those now and then. It's a pain in the backside, but that's just life in the DC world.
Keep on crunching, there will be others that are better.

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57059 - Posted: 19 Nov 2008, 8:02:32 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=206368130
IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_2a7m_4683_239_0

CPU time 15318.72
stderr out

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 15318.6 cpu seconds
This process generated 0 decoys from 0 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>

this is odd...it ran about 75% of its time and came up with 0 decoys? and then stopped? what's up with that?
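
A quick way to check that figure is to divide the CPU seconds on the DONE line by the cpu_run_time_pref value. The snippet below is only a sketch in Python, not anything from Rosetta or BOINC; the function name and the regular expressions are assumptions based on the stderr layout shown in this post. For this task it works out to roughly 71% of the 21600-second target.

```python
import re

# Excerpt of the stderr output quoted above.
STDERR = """\
# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 15318.6 cpu seconds
This process generated 0 decoys from 0 attempts
======================================================
"""

def run_fraction(stderr_text):
    """Return (cpu_seconds, target_seconds, fraction), or None if a value is missing."""
    pref = re.search(r"cpu_run_time_pref:\s*(\d+)", stderr_text)
    done = re.search(r"DONE ::.*?([\d.]+) cpu seconds", stderr_text)
    if not pref or not done:
        return None
    target = int(pref.group(1))
    used = float(done.group(1))
    return used, target, used / target

result = run_fraction(STDERR)
if result is not None:
    used, target, fraction = result
    print(f"{used} of {target} seconds = {fraction:.1%}")
```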

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57060 - Posted: 19 Nov 2008, 8:05:49 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=206127181
IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1w2l_4683_215_0

CPU time 12843.17
stderr out

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
No heartbeat from core client for 30 sec - exiting
# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 12842.9 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57061 - Posted: 19 Nov 2008, 8:07:01 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=206127181
IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1w2l_4683_215_0
CPU time 12843.17
stderr out

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
No heartbeat from core client for 30 sec - exiting
# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 12842.9 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57062 - Posted: 19 Nov 2008, 8:10:30 UTC
Last modified: 19 Nov 2008, 8:13:35 UTC

more of the recovering checkpoint blah blah....

https://boinc.bakerlab.org/rosetta/result.php?resultid=207631513
1xxxA_ZNMP_ABRELAX_tetraL_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1xxxA-_4658_78912_0

CPU time 21412.55
stderr out

-----

https://boinc.bakerlab.org/rosetta/result.php?resultid=207390655
1xxxA_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1xxxA-_4658_55517_0

-----

https://boinc.bakerlab.org/rosetta/result.php?resultid=207329937
1xxxA_ZNMP_ABRELAX_tetraL_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1xxxA-_4658_46952_0

to name a few...i think it is all the 1xxxA that produce this message:
<core_client_version>6.2.19</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
recovering checkpoint of tag S_00000001 with id abrelax_rg_state
recovering checkpoint of tag S_00000001 with id stage_1
recovering checkpoint of tag S_00000001 with id stage_2
recovering checkpoint of tag S_00000001 with id stage_3_iter1_1
recovering checkpoint of tag S_00000001 with id stage_3_iter1_2
recovering checkpoint of tag S_00000001 with id stage_3_iter1_3
recovering checkpoint of tag S_00000001 with id stage_3_iter1_4
recovering checkpoint of tag S_00000001 with id stage_3_iter1_5
recovering checkpoint of tag S_00000001 with id stage_3_iter1_6
recovering checkpoint of tag S_00000001 with id stage_3_iter1_7
recovering checkpoint of tag S_00000001 with id stage_3_iter1_8
recovering checkpoint of tag S_00000001 with id stage_3_iter1_9
recovering checkpoint of tag S_00000001 with id stage_3_iter1_10
recovering checkpoint of tag S_00000001 with id stage4_kk_1
recovering checkpoint of tag S_00000001 with id stage4_kk_2
recovering checkpoint of tag S_00000001 with id stage4_kk_3
recovering checkpoint of tag S_00000001 with id abrelax_relax
recovering checkpoint of tag S_00000002 with id abrelax_rg_state
recovering checkpoint of tag S_00000002 with id stage_1
recovering checkpoint of tag S_00000002 with id stage_2
recovering checkpoint of tag S_00000002 with id stage_3_iter1_1
recovering checkpoint of tag S_00000002 with id stage_3_iter1_2

and so on.....
of course the end message varies, but they all complete within this time frame and give good credit.

DONE :: 1 starting structures 21412.3 cpu seconds
This process generated 18 decoys from 18 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>
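
For anyone who wants to compare these tasks, the checkpoint lines can be tallied per decoy tag. This is a minimal sketch in Python, assuming the stderr text keeps the exact "recovering checkpoint of tag ... with id ..." wording shown above; the function name and the short sample are made up for illustration.

```python
import re
from collections import Counter

# Short hypothetical sample in the same format as the stderr above.
SAMPLE = """\
# cpu_run_time_pref: 21600
recovering checkpoint of tag S_00000001 with id abrelax_rg_state
recovering checkpoint of tag S_00000001 with id stage_1
recovering checkpoint of tag S_00000001 with id stage_2
recovering checkpoint of tag S_00000002 with id abrelax_rg_state
"""

CHECKPOINT_RE = re.compile(r"recovering checkpoint of tag (\S+) with id (\S+)")

def count_recoveries(stderr_text):
    """Count 'recovering checkpoint' lines per decoy tag."""
    counts = Counter()
    for line in stderr_text.splitlines():
        match = CHECKPOINT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

for tag, n in sorted(count_recoveries(SAMPLE).items()):
    print(tag, n)
```

A task whose tags all show the same number of recoveries is probably just logging each stage as it goes; a tag that repeats the same ids over and over would be more suspicious.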

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 57068 - Posted: 19 Nov 2008, 14:21:27 UTC - in response to Message 57059.  


this is odd...it ran about 75% of its time and came up with 0 decoys? and then stopped? what's up with that?


Boy, that *IS* odd. And it gave you credit too, that doesn't look like it was for an error. I'd have to guess that it did some work, then restarted the task and somehow the stderr info. got reset.
Rosetta Moderator: Mod.Sense

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57069 - Posted: 19 Nov 2008, 15:42:51 UTC - in response to Message 57068.  


this is odd...it ran about 75% of its time and came up with 0 decoys? and then stopped? what's up with that?


Boy, that *IS* odd. And it gave you credit too, that doesn't look like it was for an error. I'd have to guess that it did some work, then restarted the task and somehow the stderr info. got reset.


Some more info on this: at the time I was running Rosie and Einstein at 175/25 respectively. The cycle time is 60 min, which I believe is the default.
So maybe it got interrupted, went to Einstein, and then came back and tripped up. Still strange: no errors and no other info. Maybe you guys can pull something on your end.

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57077 - Posted: 19 Nov 2008, 19:37:21 UTC

Default CPU time is 21600; this one ran 3146.078.
https://boinc.bakerlab.org/rosetta/result.php?resultid=207892330
h001b_BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-11-S3-8--h001b-_4769_556_0
Client state Compute error
Exit status 1 (0x1)
Computer ID 871217
Report deadline 26 Nov 2008 22:35:22 UTC
CPU time 3146.078
stderr out

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
recovering checkpoint of tag S_U11X8X_00000001 with id abrelax_rg_state
recovering checkpoint of tag S_U11X8X_00000001 with id stage_1
recovering checkpoint of tag S_U11X8X_00000001 with id stage_2
# cpu_run_time_pref: 21600
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_1
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_2
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_3
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_4
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_5
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_6
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_7
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_8
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_9
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_10

and this repeats

then this stderr:
ERROR: NANs occured in hbonding!
ERROR:: Exit from: ..\..\src\core\scoring\hbonds\hbonds_geom.cc line: 763
called boinc_finish

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 21.0970375448934

Thomas
Joined: 20 Feb 06
Posts: 1
Credit: 137,565
RAC: 13
Message 57078 - Posted: 19 Nov 2008, 19:51:00 UTC

Another WU with extremely bad credit / CPU-time ratio:

https://boinc.bakerlab.org/rosetta/result.php?resultid=206839093

7.45 Credit for more than 7.5 hours of crunching!

I decided to wait until this has been sorted out before crunching more of these WU's, at least on this computer.

Guido Platteau
Joined: 11 Sep 06
Posts: 2
Credit: 283,392
RAC: 0
Message 57079 - Posted: 19 Nov 2008, 20:06:30 UTC
Last modified: 19 Nov 2008, 20:13:58 UTC

I tried another WU on our Windows Vista Home system PC and it failed (again!)
WU
and this WU also failed on another computer:
Details

19/11/2008 13:40:51|rosetta@home|Sending scheduler request: To fetch work. Requesting 24469 seconds of work, reporting 0 completed tasks
19/11/2008 13:40:56|rosetta@home|Scheduler request succeeded: got 1 new tasks
19/11/2008 13:40:58|rosetta@home|Started download of minirosetta_1.40_windows_intelx86.exe
19/11/2008 13:40:58|rosetta@home|Started download of minirosetta_graphics_1.40_windows_intelx86.exe
19/11/2008 13:41:06|rosetta@home|Finished download of minirosetta_graphics_1.40_windows_intelx86.exe
19/11/2008 13:41:06|rosetta@home|Started download of Helvetica.txf
19/11/2008 13:41:08|rosetta@home|Finished download of Helvetica.txf
19/11/2008 13:41:08|rosetta@home|Started download of minirosetta_database_rev25538.zip
19/11/2008 13:41:24|rosetta@home|Finished download of minirosetta_1.40_windows_intelx86.exe
19/11/2008 13:41:24|rosetta@home|Started download of boinc_yebf_aah012_05_05.200_v1_3.gz
19/11/2008 13:41:31|rosetta@home|Finished download of boinc_yebf_aah012_05_05.200_v1_3.gz
19/11/2008 13:41:31|rosetta@home|Started download of boinc_yebf_aah012_03_05.200_v1_3.gz
19/11/2008 13:41:47|rosetta@home|Finished download of boinc_yebf_aah012_03_05.200_v1_3.gz
19/11/2008 13:41:47|rosetta@home|Started download of yebf_h012_.psipred_ss2
19/11/2008 13:41:49|rosetta@home|Finished download of yebf_h012_.psipred_ss2
19/11/2008 13:41:49|rosetta@home|Started download of yebf_h012_.fasta.gz
19/11/2008 13:41:50|rosetta@home|Finished download of yebf_h012_.fasta.gz
19/11/2008 13:42:06||Suspending computation - user is active
19/11/2008 13:42:29||Resuming computation
19/11/2008 13:43:26|rosetta@home|Finished download of minirosetta_database_rev25538.zip
19/11/2008 14:44:08|rosetta@home|Starting h012__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-5-S3-3--h012_-_4675_98_1
19/11/2008 14:44:10|rosetta@home|Starting task h012__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-5-S3-3--h012_-_4675_98_1 using minirosetta version 140
19/11/2008 14:57:50|rosetta@home|Task h012__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-5-S3-3--h012_-_4675_98_1 exited with zero status but no 'finished' file
19/11/2008 14:57:50|rosetta@home|If this happens repeatedly you may need to reset the project.
19/11/2008 14:57:50|rosetta@home|Restarting task h012__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-5-S3-3--h012_-_4675_98_1 using minirosetta version 140
19/11/2008 14:57:53||Suspending computation - user is active
19/11/2008 14:58:13||Resuming computation
19/11/2008 14:58:54|rosetta@home|Task h012__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-5-S3-3--h012_-_4675_98_1 exited with zero status but no 'finished' file
19/11/2008 14:58:54|rosetta@home|If this happens repeatedly you may need to reset the project.
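
To see how often a given task hits this, the BOINC message log can be scanned for the "no 'finished' file" line. A rough sketch in Python follows, assuming the log keeps the "timestamp|project|message" layout shown above; the function name is hypothetical and the sample is just three lines copied from this log.

```python
import re
from collections import Counter

LOG = """\
19/11/2008 14:57:50|rosetta@home|Task h012__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-5-S3-3--h012_-_4675_98_1 exited with zero status but no 'finished' file
19/11/2008 14:57:50|rosetta@home|If this happens repeatedly you may need to reset the project.
19/11/2008 14:58:54|rosetta@home|Task h012__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-5-S3-3--h012_-_4675_98_1 exited with zero status but no 'finished' file
"""

EXIT_RE = re.compile(r"Task (\S+) exited with zero status but no 'finished' file")

def premature_exits(log_text):
    """Count 'exited with zero status but no finished file' events per task name."""
    counts = Counter()
    for line in log_text.splitlines():
        # Each line is "timestamp|project|message"; split at most twice
        # so any '|' inside the message itself is left alone.
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue
        match = EXIT_RE.search(parts[2])
        if match:
            counts[match.group(1)] += 1
    return counts

for task, n in premature_exits(LOG).items():
    print(n, task)
```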

Speedy
Joined: 25 Sep 05
Posts: 163
Credit: 808,098
RAC: 0
Message 57080 - Posted: 19 Nov 2008, 20:23:31 UTC

No graphics again. 12.9 credits per hour, 6.22 hours, 80.67 credits total.
Have a crunching good day!!
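
For comparing results like these, credit per hour is just total credit divided by hours. A tiny Python sketch (the function name is made up) using the numbers from this post and from Thomas's post above: 80.67 credits over 6.22 hours comes to about 12.97 per hour, while 7.45 credits over 7.5 hours is just under 1 per hour, and since that task ran for more than 7.5 hours the real rate is a bit lower still.

```python
def credits_per_hour(credits, hours):
    """Return granted credit divided by CPU hours."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return credits / hours

speedy = credits_per_hour(80.67, 6.22)   # figures from this post
thomas = credits_per_hour(7.45, 7.5)     # figures from Thomas's post (upper bound)
print(f"{speedy:.2f} vs {thomas:.2f} credits per hour")
```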

Sid Celery
Joined: 11 Feb 08
Posts: 2128
Credit: 41,307,917
RAC: 10,728
Message 57082 - Posted: 19 Nov 2008, 20:54:49 UTC - in response to Message 57079.  

I tried another WU on our Windows Vista Home system PC and it failed (again!)
WU
and this WU also failed on another computer:
Details


Outcome Client error
Client state Compute error
Exit status -226 (0xffffff1e)

CPU time 431.4052
stderr out <core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
too many exit(0)s
</message>
<stderr_txt>
[...]
Can't acquire lockfile - exiting
[...]
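
In case the two forms of the exit status look unrelated: the hex value in brackets is the same signed number shown as an unsigned 32-bit value. A minimal Python sketch of the conversion in both directions (the helper names are mine, not BOINC's):

```python
def to_unsigned_hex(status):
    """Show a signed 32-bit exit status as unsigned hex."""
    return hex(status & 0xFFFFFFFF)

def to_signed(value):
    """Interpret an unsigned 32-bit value as a signed exit status."""
    return value - (1 << 32) if value & 0x80000000 else value

print(to_unsigned_hex(-226))   # 0xffffff1e
print(to_signed(0xFFFFFF1E))   # -226
print(to_unsigned_hex(1))      # 0x1
```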

This error is becoming so widespread now (on non-Vista64 systems too) it really needs some dedicated attention.

Can we have some formal comment on it, even if it's just to say you haven't tracked down the source of the problem or a practical workaround? It's just frustrating otherwise.

Until it's solved I'm really struggling to see a reason why any Minis should be issued. I could double my output for the project (as could several others) if it was either solved or Beta 5.98 WUs were issued, which run 100% for me.

Erwin Schlonz
Joined: 20 May 07
Posts: 5
Credit: 203,397
RAC: 0
Message 57094 - Posted: 20 Nov 2008, 10:52:27 UTC

What's up with this compute error???
It seems to me that the file name is way too long to handle for WinXP! Isn't there a maximum file name length (including path) of 255 characters?
My disc space is definitely not full.

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 7200
WARNING! attempt to create gzipped file ../../projects/boinc.bakerlab.org_rosetta/loopbuild_minimalist_core_control_standardloopfile2_homo_bench_looprelax_cheat_chunk_control_standard_loopfiles_t288__olange_IGNORE_THE_REST_2FNEA_7_4818_50_0_0 failed.
======================================================
DONE :: 1 starting structures 7120.77 cpu seconds
This process generated 45 decoys from 45 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>loopbuild_minimalist_core_control_standardloopfile2_homo_bench_looprelax_cheat_chunk_control_standard_loopfiles_t288__olange_IGNORE_THE_REST_2FNEA_7_4818_50_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>

WU affected so far:

https://boinc.bakerlab.org/rosetta/result.php?resultid=208686725
https://boinc.bakerlab.org/rosetta/result.php?resultid=208672659
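
The path length question can be checked with a few lines of Python. This is only a sketch: the file name and the project folder name are taken from the stderr above, but the BOINC data directory is an assumption (the real location on this PC is not shown), and 260 is the usual Windows MAX_PATH limit. The file name on its own is under 255 characters, so whether the whole path fits depends on how long the directory in front of it is.

```python
import ntpath

MAX_PATH = 260  # Windows MAX_PATH, which counts the terminating NUL character

FILE_NAME = ("loopbuild_minimalist_core_control_standardloopfile2_homo_bench_"
             "looprelax_cheat_chunk_control_standard_loopfiles_t288__olange_"
             "IGNORE_THE_REST_2FNEA_7_4818_50_0_0")

# Hypothetical BOINC data directory; the real one on the affected PC is not given.
ASSUMED_DATA_DIR = r"C:\Documents and Settings\All Users\Application Data\BOINC"

def full_path_length(data_dir, file_name):
    """Length of the file's full path inside the Rosetta project folder."""
    full = ntpath.join(data_dir, "projects", "boinc.bakerlab.org_rosetta", file_name)
    return len(full)

def fits_max_path(data_dir, file_name):
    # The path needs room for the NUL terminator, so it must be shorter than MAX_PATH.
    return full_path_length(data_dir, file_name) < MAX_PATH

print("file name length:", len(FILE_NAME))
print("full path length:", full_path_length(ASSUMED_DATA_DIR, FILE_NAME))
print("fits in MAX_PATH:", fits_max_path(ASSUMED_DATA_DIR, FILE_NAME))
```

Swap ASSUMED_DATA_DIR for the actual data directory to get a meaningful answer for a particular machine.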

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57096 - Posted: 20 Nov 2008, 11:12:15 UTC - in response to Message 57094.  

What's up with this compute error???
It seems to me that the file name is way too long to handle for WinXP! Isn't there a maximum file name length (including path) of 255 characters?
My disc space is definitely not full.

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 7200
WARNING! attempt to create gzipped file ../../projects/boinc.bakerlab.org_rosetta/loopbuild_minimalist_core_control_standardloopfile2_homo_bench_looprelax_cheat_chunk_control_standard_loopfiles_t288__olange_IGNORE_THE_REST_2FNEA_7_4818_50_0_0 failed.
======================================================
DONE :: 1 starting structures 7120.77 cpu seconds
This process generated 45 decoys from 45 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>loopbuild_minimalist_core_control_standardloopfile2_homo_bench_looprelax_cheat_chunk_control_standard_loopfiles_t288__olange_IGNORE_THE_REST_2FNEA_7_4818_50_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>

WU affected so far:

https://boinc.bakerlab.org/rosetta/result.php?resultid=208686725
https://boinc.bakerlab.org/rosetta/result.php?resultid=208672659



Read this thread over in ufluids, which references Dr. David Anderson. The article says:
-161 means there's a "dangling reference" in your client_state.xml file, for example there's

<file_ref>
<name>foobar</name>
</file_ref>

but there's no <file_info> with name foobar.
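
To make the "dangling reference" idea concrete, here is a minimal Python sketch that lists every file_ref name with no matching file_info. The XML layout is a simplified assumption modelled on the snippet quoted above; a real client_state.xml is much larger and may use different element names, so treat this as an illustration only.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for client_state.xml.
SAMPLE = """\
<client_state>
  <file_info><name>present_file</name></file_info>
  <file_ref><name>present_file</name></file_ref>
  <file_ref><name>foobar</name></file_ref>
</client_state>
"""

def dangling_refs(xml_text):
    """Return file_ref names that have no file_info with the same name."""
    root = ET.fromstring(xml_text)
    known = {info.findtext("name") for info in root.iter("file_info")}
    referenced = {ref.findtext("name") for ref in root.iter("file_ref")}
    referenced.discard(None)  # ignore refs that carry no <name> at all
    return sorted(referenced - known)

print(dangling_refs(SAMPLE))  # ['foobar']
```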

It looks like the problem is that the ufluids app sometimes doesn't create all of its output files. I.E., the app finishes successfully but some of the output files don't exist. BOINC treats this as an error; the app must create all the files, even if they're empty.

same would apply to Rosie.
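
The rule quoted above (every declared output file must exist, even if empty) could be handled by a small wrapper step before the app finishes. The sketch below is Python and purely illustrative: it is not how minirosetta or the BOINC API does it, and the function name and file names are hypothetical.

```python
from pathlib import Path

def ensure_outputs_exist(expected_names, out_dir="."):
    """Create an empty file for each expected output that is missing.

    Returns the list of names that had to be created.
    """
    created = []
    for name in expected_names:
        path = Path(out_dir) / name
        if not path.exists():
            path.touch()  # empty placeholder so the file is at least present
            created.append(name)
    return created
```

Calling ensure_outputs_exist(["result_0_0"], "some_output_dir") just before exit would leave an empty result_0_0 behind if the real one was never written. Whether an empty result is useful to the project is a separate question.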


Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 57099 - Posted: 20 Nov 2008, 18:33:37 UTC

The curse of the NANS strikes again: 208596316

Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600

ERROR: NANs occured in hbonding!
ERROR:: Exit from: ..\..\src\core\scoring\hbonds\hbonds_geom.cc line: 763
called boinc_finish


netwraith
Joined: 3 Sep 06
Posts: 80
Credit: 13,483,227
RAC: 0
Message 57100 - Posted: 20 Nov 2008, 19:26:55 UTC


I have several of these loopbuild_minimalist_core3_homo_bench- .... tasks and several of them are way overtime... get to just under 10 minutes to go and stay that way for hours....

What could be up with these ???

All of my machines are Linux 2.6 kernels... Fedora/RedHat EL/CentOS


Looking for a team ??? Join BoincSynergy!!



Not2Nutz
Joined: 21 Jan 08
Posts: 1
Credit: 76,372
RAC: 0
Message 57106 - Posted: 20 Nov 2008, 21:51:43 UTC

It looks like this problem has been ongoing for several weeks. And not one word about it on the Rosetta project web site face page. I am glad my frustration and curiosity finally rose to the level that caused me to visit this forum.

I too have 8 WU's of Mini 1.40 in progress for 15+ hours and stuck at above 98% completion, and still showing 9 hours 57 minutes left to completion. In fact the time-to-completion hasn't changed in over 10 hours.

One WU did complete in a timely fashion with a computation error.

I don't think my problem is for lack of RAM as I have 24GB installed. I am running Vista X64 on twin dual-core Xeons at 3.0GHz.

I have suspended all but one WU and I have bumped the task priority by two levels, just to see if I could hasten this one WU along. It doesn't seem to be helping as my CPUs are hardly even taxed at this point. So the problem does not seem to be a shortage of compute power. And I have over 1 Terabyte of free disk space. So it can't be for a lack of disk space either.

I am really at a loss of what to do here. Should I just abort them all and wait for the detectives to do the forensic thing and a fix to be implemented?

Any suggestions would be appreciated.

n2n

Rifleman
Joined: 19 Nov 08
Posts: 17
Credit: 139,408
RAC: 0
Message 57107 - Posted: 20 Nov 2008, 22:05:32 UTC

I just started crunching Rosetta and have 3 tasks running for over 3 hours now with 15 hours to go. I had to abort one this morning that ran for well over 18 hours.
Is this normal? I had one task finish all right, but it took almost 18? hours. My Task Manager shows minirosetta consuming 165000K for each of the 3 cores it is using.

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57108 - Posted: 20 Nov 2008, 23:52:18 UTC - in response to Message 57107.  

I just started crunching Rosetta and have 3 tasks running for over 3 hours now with 15 hours to go. I had to abort one this morning that ran for well over 18 hours.
Is this normal? I had one task finish all right, but it took almost 18? hours. My Task Manager shows minirosetta consuming 165000K for each of the 3 cores it is using.


Your memory usage is in line with mine for a dual core.
As for time remaining, see my reply in your first thread about newbie questions.
There are some things to check; if they are all OK, then let BOINC Manager learn its way around your system. It will settle down over time. Also, could you post links to the work units that you aborted? Show what your CPU run time was set to and what the run time was when you aborted each task. Also show the stderr out message and any other messages. People can comment on what they see from that data.

robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,293,370
RAC: 1,781
Message 57110 - Posted: 20 Nov 2008, 23:58:55 UTC

Rob,

Rosetta@home workunits seem to behave that way when they have a significant underestimate of the amount of CPU time they need to run. I had one about a week ago that needed about 19.5 hours to run, instead of the 6 hours I was then asking for, but otherwise completed normally. Don't be surprised if the underestimate of the run time also gives you a rather poor ratio of credits received to credits requested. These Minirosetta v1.40 workunits with such underestimates are also poor at recovering when your machine restarts after a shutdown or reboot. Earlier in this thread, you should find about 5 items in workunit names that indicate they are likely to have these problems - for example, zinc as part of the name.

Robert