Posts by Kartsa

1) Message boards : Number crunching : minirosetta 2.17 (Message 69825)
Posted 14 Mar 2011 by Kartsa
Post:
So, there are no error messages or anything. Here is the log from the restart to the point when I noticed that a WU had hung:

14/03/2011 20:26:33 Not using a proxy
14/03/2011 20:26:33 rosetta@home Restarting task T0596_rR_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_23234_4284_0 using minirosetta version 217
14/03/2011 20:26:33 rosetta@home Restarting task T0623_rH_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_23166_4307_0 using minirosetta version 217
14/03/2011 20:26:33 rosetta@home Restarting task T0528_rR_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_23193_4413_0 using minirosetta version 217
14/03/2011 20:26:34 rosetta@home Restarting task IF3_like_SAVE_ALL_OUT_i016_008_23333_342_0 using minirosetta version 217
14/03/2011 20:39:24 rosetta@home Computation for task T0596_rR_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_23234_4284_0 finished
14/03/2011 20:39:24 rosetta@home Starting IF3_like_SAVE_ALL_OUT_relax_i016_23334_521_0
14/03/2011 20:39:25 rosetta@home Starting task IF3_like_SAVE_ALL_OUT_relax_i016_23334_521_0 using minirosetta version 217
14/03/2011 20:39:26 rosetta@home Started upload of T0596_rR_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_23234_4284_0_0
14/03/2011 20:39:46 rosetta@home Finished upload of T0596_rR_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_23234_4284_0_0
14/03/2011 21:12:48 rosetta@home Computation for task IF3_like_SAVE_ALL_OUT_relax_i016_23334_521_0 finished
14/03/2011 21:12:48 rosetta@home Starting IF3_like_SAVE_ALL_OUT_i016_009_23333_1081_0
14/03/2011 21:12:48 rosetta@home Starting task IF3_like_SAVE_ALL_OUT_i016_009_23333_1081_0 using minirosetta version 217
14/03/2011 21:12:50 rosetta@home Started upload of IF3_like_SAVE_ALL_OUT_relax_i016_23334_521_0_0
14/03/2011 21:13:13 rosetta@home Finished upload of IF3_like_SAVE_ALL_OUT_relax_i016_23334_521_0_0
14/03/2011 21:41:14 rosetta@home Computation for task T0528_rR_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_23193_4413_0 finished
14/03/2011 21:41:14 rosetta@home Starting IF3_like_SAVE_ALL_OUT_i016_008_23333_703_0
14/03/2011 21:41:15 rosetta@home Starting task IF3_like_SAVE_ALL_OUT_i016_008_23333_703_0 using minirosetta version 217
14/03/2011 21:41:17 rosetta@home Started upload of T0528_rR_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_23193_4413_0_0
14/03/2011 21:41:27 rosetta@home Finished upload of T0528_rR_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_23193_4413_0_0

and it was the task T0623_rH_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_23166_4307_0
that got stuck

Edit, just to add: I had all the other projects suspended before the restart, so only Rosetta was running.
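
For anyone who wants to pull the still-open tasks out of a log like the one in this post, here is a minimal Python sketch. It assumes the message wording is exactly as shown in the log ("Restarting task ... using", "Starting task ... using", "Computation for task ... finished"); other BOINC versions may word these differently. The names `open_tasks`, `START`, `DONE` and `SAMPLE` are made up for illustration. Note that a task with no "finished" line is only a candidate: it may simply still be running, so the log alone does not prove a hang.

```python
import re
from datetime import datetime

# Patterns based only on the message wording seen in the log in this post.
START = re.compile(r"^(\S+ \S+) \S+ (?:Restarting|Starting) task (\S+) using")
DONE = re.compile(r"^(\S+ \S+) \S+ Computation for task (\S+) finished")


def open_tasks(lines):
    """Return {task_name: start_time} for tasks started but not reported finished."""
    running = {}
    for line in lines:
        started = START.match(line)
        if started:
            stamp = datetime.strptime(started.group(1), "%d/%m/%Y %H:%M:%S")
            running[started.group(2)] = stamp
            continue
        finished = DONE.match(line)
        if finished:
            # Drop the task once the client reports its computation finished.
            running.pop(finished.group(2), None)
    return running


# Three lines copied from the log in this post.
SAMPLE = [
    "14/03/2011 20:26:33 rosetta@home Restarting task T0596_rR_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_23234_4284_0 using minirosetta version 217",
    "14/03/2011 20:26:33 rosetta@home Restarting task T0623_rH_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_23166_4307_0 using minirosetta version 217",
    "14/03/2011 20:39:24 rosetta@home Computation for task T0596_rR_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_23234_4284_0 finished",
]

for task, since in open_tasks(SAMPLE).items():
    print(f"{task} open since {since}")
```

Run on the full log, this would list every task that has no "finished" line yet, which still has to be cross-checked against actual CPU usage to find the one that is really stuck.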
2) Message boards : Number crunching : minirosetta 2.17 (Message 69824)
Posted 14 Mar 2011 by Kartsa
Post:
completely exit BOINC and restart it

Yep, this seems to get them running again.

BOINC version 6.10.58, using the default(?) memory settings: 50% when in use and 90% when not in use. I have 8 GB total, and Rosetta rarely uses more than 2 GB in total (4 WUs at ~500 MB each; usually it's a lot less, 250-400 MB each). Apps are not left in memory when suspended. At this point I can't say anything certain about the possible messages since I just restarted the client; I will post the next time some WU 'hangs'.
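
As a rough sanity check of those numbers, here is a small Python sketch. It assumes "8 gigs" means 8 GiB and treats the per-WU figures as MiB; the helper name `boinc_memory_limit_mib` is made up for illustration and is not part of BOINC.

```python
MIB_PER_GIB = 1024


def boinc_memory_limit_mib(total_gib, percent):
    """Memory BOINC may use, in MiB, for a given percentage setting."""
    return total_gib * MIB_PER_GIB * percent / 100


total_gib = 8                                           # assumption: 8 GiB installed
limit_in_use = boinc_memory_limit_mib(total_gib, 50)    # "when in use" setting
limit_idle = boinc_memory_limit_mib(total_gib, 90)      # "when not in use" setting
rosetta_peak = 4 * 500                                  # 4 WUs at ~500 MB each

print(f"limit in use: {limit_in_use:.0f} MiB, idle: {limit_idle:.0f} MiB")
print(f"Rosetta peak: {rosetta_peak} MiB, fits: {rosetta_peak < limit_in_use}")
```

Under these assumptions the worst-case Rosetta usage stays well below even the stricter 50% limit, so the memory settings look like an unlikely cause.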

The other projects I'm running, Seti and Einstein, aren't affected by this problem.
3) Message boards : Number crunching : minirosetta 2.17 (Message 69810)
Posted 13 Mar 2011 by Kartsa
Post:
Workunit T0635_rR_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_23623_2499 has stopped using ANY CPU time, even though BOINC thinks it is still running.

Currently at 00:43:26 CPU time, 17:06:52 Elapsed time, 5.656% Progress, 45:45:04 Estimated time remaining.

Since I ask for 12-hour workunits, I suspect that its timeout procedure has failed as well.

I'm about to restart it from its last checkpoint, in case that will help.

I'm using TThrottle to keep my computers from overheating, so I'm not quite sure just how much of the CPU time BOINC is allowed to use.
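
To put the quoted figures in perspective, here is a minimal Python sketch that compares CPU time with elapsed time. The function names `hms_to_seconds` and `cpu_ratio` are made up for illustration; the two times are the ones quoted above.

```python
def hms_to_seconds(text):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_ratio(cpu_time, elapsed_time):
    """Fraction of the elapsed wall-clock time actually spent on the CPU."""
    return hms_to_seconds(cpu_time) / hms_to_seconds(elapsed_time)


ratio = cpu_ratio("00:43:26", "17:06:52")
print(f"CPU time is {ratio:.1%} of elapsed time")
```

This works out to about 4% of the elapsed time, far below what a task running normally would show, even with some throttling.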

I'm having similar issues with some of the units on two different machines (Windows 7 and XP). BOINC thinks they are still running, but they are not using any CPU and progress doesn't increase. I'm not using any throttling at all, full 100% all the time. I just abort the failed units, since suspending and resuming doesn't make any difference. I've been having these problems for a couple of months, I think; most of the WUs work just fine.
Tried resetting the project; it didn't help.

Two examples:
T0590_rH_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_23137_1570
T0620_rH_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_23164_2093 (someone seems to have successfully finished this one, though...)

©2024 University of Washington
https://www.bakerlab.org