minirosetta 2.05

Mad_Max

Joined: 31 Dec 09
Posts: 206
Credit: 19,454,163
RAC: 1,177
Message 65157 - Posted: 31 Jan 2010, 1:14:32 UTC
Last modified: 31 Jan 2010, 1:17:39 UTC

Some tasks (for example, the abinitio_relax_homfrag_natfrag .... type) run a very large number of steps, up to 200000 - 400000 for 1 model. Is this normal?

And on another type of job (critStubs_profiled_1dnA_...), one task received abnormally small credit (1.97 cr and 9 decoys for 7k CPU seconds): https://boinc.bakerlab.org/rosetta/result.php?resultid=314040179
Other WUs of the same type look normal (about 20 cr and ~80 decoys for the same CPU time).
Is this a bug with partial loss of the calculation results? Or is crediting simply uneven for this type of task? (As with the *gbnnotyr* tasks, where "small" and "huge" models are combined in the same type of task.)

transient
Joined: 30 Sep 06
Posts: 376
Credit: 10,836,395
RAC: 0
Message 65158 - Posted: 31 Jan 2010, 9:31:05 UTC - in response to Message 65155.  



> > Try suspending/resuming, or even entirely turning off and restarting BOINC, before aborting the WU. I have found this works often, not always, for me.
>
> Even if this works, it shouldn't be necessary to babysit BOINC/Rosetta in this way. This hanging certainly seems to be a widespread issue, but one that only affects Windows in its various incarnations. The fact that it's irreproducible means a fix may be some time in coming, but I hope the project team find it soon.

The fact that someone noticed this occurring suggests babysitting to begin with.


Sarel
Joined: 11 May 06
Posts: 51
Credit: 81,712
RAC: 0
Message 65159 - Posted: 31 Jan 2010, 20:32:50 UTC - in response to Message 65157.  

Hello, yes! The critstubs jobs have the same sort of unevenly credited trajectories as the *gbnnotyr* jobs. You can have a look at Eva-Maria Strauch's message on the Protein-interface design thread for details.

Sarel
Joined: 11 May 06
Posts: 51
Credit: 81,712
RAC: 0
Message 65160 - Posted: 31 Jan 2010, 21:43:31 UTC - in response to Message 65159.  

Rosetta @ Home has produced many very high-quality designs for our Protein-interface design team! So we're likely to submit many more jobs to Rosetta @ Home. To help you recognize these jobs, from now on we'll add a _Protein_Interface_Design_ note to the name of every related job. This way you'll be able to follow them. I also hope this will make it easier to see where the variable-credit issue is coming from.

fredmeyer2470

Joined: 6 Jun 09
Posts: 1
Credit: 1,741,466
RAC: 0
Message 65161 - Posted: 31 Jan 2010, 23:20:50 UTC

The Rosetta application is spinning its wheels. It is continually running a task even though the task is 100% complete. There is another task to run, but Rosetta won't switch to it.

Mad_Max
Joined: 31 Dec 09
Posts: 206
Credit: 19,454,163
RAC: 1,177
Message 65162 - Posted: 1 Feb 2010, 3:14:03 UTC

To Sarel:
Thanks for the explanation.

And what about this?
> Some tasks (for example, the abinitio_relax_homfrag_natfrag .... type) run a very large number of steps, up to 200000 - 400000 for 1 model. Is this normal?

And at the same time, another note: it seems that jobs of this type: resa_sel_core_1.5_low200_beta_low200_nostart_texcst_05_hb_t328__IGNORE_THE_REST_17378_267_0 ignore the target CPU time. For example, this WU calculated 1 model in about 2.5 hours (already longer than the target time), but after the 1st model, instead of sending the result, it started calculating a 2nd model. Total: 18850 seconds vs cpu_run_time_pref = 7200 seconds.
In this example all ended well, but in other circumstances this could exceed cpu_run_time_pref by more than 3 times, trigger the watchdog, and lose the results. In addition, some members may think that the task is stuck and abort it...

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65165 - Posted: 1 Feb 2010, 16:27:56 UTC

Max, the watchdog should be expected to kick in if a model exceeds target runtime by more than 4 hours. And so with a short 2 hour preference, you will tend to see much more variation in runtimes than if you had a longer preference.

However, I would expect that once the target runtime is exceeded and a model is completed, work on the WU should end and it should be reported back. So if you have a 2hr preference, and the first model took more than 2hrs to complete, I would have expected the WU to end once the first model was completed. In fact, if the first model took more than an hour, the predicted completion time of the second model would have exceeded your preference, and so I would expect that scenario to end after the first model as well.

Have you changed your runtime preference? Perhaps this machine hadn't received the updated preference yet? (looking at the WU you linked, it appears to have had the 7200sec preference during the entire run). So, it sounds like your expectations are correct. More observations would be helpful. Reference to the specific WU (as you've done) is very helpful.
Rosetta Moderator: Mod.Sense

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 65169 - Posted: 1 Feb 2010, 21:39:59 UTC

A couple of t287__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901 WUs on two different Linux machines failed after a few seconds claiming "process got signal 11".

https://boinc.bakerlab.org/rosetta/result.php?resultid=314826769
https://boinc.bakerlab.org/rosetta/result.php?resultid=314751622

Mad_Max
Joined: 31 Dec 09
Posts: 206
Credit: 19,454,163
RAC: 1,177
Message 65170 - Posted: 2 Feb 2010, 1:09:41 UTC

To Mod.Sense:
Thanks for the clarification on the watchdog. Previously I had seen it trigger after 6 hours of calculation and thought that it fired after exceeding CPU TT x 3 (2h * 3 = 6h in my case). So in fact the correct formula is CPU TT + 4h, right? (In my case it just happens to give the same result: 2h + 4h = 6h.)

> In fact, if the first model took more than an hour, the predicted completion time of the second model would have exceeded your preference, and so I would expect that scenario to end after the first model as well.

Yes, it usually does. Here's an example of such a task: https://boinc.bakerlab.org/rosetta/result.php?resultid=313861637
Calculation of the 1st model took 5145 sec and the program ended processing, because a second model would exceed the CPU TT (5145 * 2 = 10290 > 7200).
Or another example: https://boinc.bakerlab.org/rosetta/result.php?resultid=314455813
Calculation of two models took 4995 sec and the program ended processing, because a third model would exceed the CPU TT ((4995 / 2) * 3 = 7492.5 > 7200).
In these cases (and most others) the program logic works correctly.
But in the example above, this algorithm seems to fail.
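
The arithmetic in these two examples can be written out as a small check. This is only a sketch of the rule as it is described in this thread, assuming the next model is predicted to cost the average CPU time of the models finished so far; `should_start_next_model` and its parameter names are hypothetical and are not taken from the minirosetta source.

```python
def should_start_next_model(cpu_seconds_used, models_done, target_seconds):
    """Return True if one more model is predicted to fit within the target.

    Assumption (not from the minirosetta source): the next model is expected
    to cost the average CPU time of the models completed so far.
    """
    if models_done == 0:
        # Nothing to average yet, so the first model always starts.
        return True
    average = cpu_seconds_used / models_done
    predicted_total = average * (models_done + 1)
    return predicted_total <= target_seconds


# The two results linked above, with cpu_run_time_pref = 7200 seconds.
print(should_start_next_model(5145, 1, 7200))  # 5145 * 2 = 10290 > 7200
print(should_start_next_model(4995, 2, 7200))  # (4995 / 2) * 3 = 7492.5 > 7200
```

Both calls report that no further model should be started, which matches what those two tasks actually did.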

> Have you changed your runtime preference? Perhaps this machine hadn't received the updated preference yet? (looking at the WU you linked, it appears to have had the 7200sec preference during the entire run). So, it sounds like your expectations are correct. More observations would be helpful. Reference to the specific WU (as you've done) is very helpful.

No, I have not changed the runtime preference in the last 2 weeks.
I have no more recent examples yet, but earlier I had 2 other tasks that also seemed to ignore the runtime preference. (Although I'm not 100% sure about it, because I did not follow their progress - perhaps just a 1st model was designed quickly, and the last took much longer than expected...)
Here they are:
cst2.loopbuild_threading_hb_i1496_IGNORE_THE_REST_17154_387_0
t364__boinc_filtered_loopbuild_threading_cst_all_tex_IGNORE_THE_REST_16902_4455_0

KnopperHarley
Joined: 1 Nov 06
Posts: 2
Credit: 788,560
RAC: 0
Message 65175 - Posted: 2 Feb 2010, 11:18:57 UTC

Hey there!

I've got a problem with two tasks at the moment.
Yesterday I wondered why the remaining time was set to 30.5h per WU when I saw it, but I didn't care about it ... perhaps a test with more work per WU ... who knows. ;-)

But now one task is 'stuck' at 58.285% (+0.002% in more than 12h now) and the other one at 82.419% work done.
Runtime for these WUs is at around 28h and 11.75h, counting on and on (both elapsed and remaining -_- ).

So I asked the task manager for help and it says the following:
these two WUs are using 218 MB and 300 MB of memory ... not using ANY CPU resources any more ... 0% both (CPU time is still counting up at 1 sec/sec).

Did something go wrong on my PC while crunching? Or what's the matter here?

Tasks
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=286264240
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=287080918


greetings

PS: both paused for now

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65177 - Posted: 2 Feb 2010, 15:21:50 UTC
Last modified: 2 Feb 2010, 15:24:05 UTC

Max:
> perhaps just a 1st model was designed quickly, and the last took much longer than expected

Right, and that is exactly what Sarel's new tasks do. They run 5 models in 5 minutes, then hit one that looks interesting and run for (for example) 80 minutes. Now 6 models have been completed in 85 minutes, and with a 2hr runtime preference we guess we can complete more models in the 2 hours. If that next one happens to be interesting as well, you run long.
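
To make that scenario concrete, here is a small simulation of it. It is a sketch under the same assumption as the description above, namely that the next model is predicted from the average time of the models completed so far; `simulate_run` is a hypothetical helper, and the durations are the illustrative figures from this post rather than measurements.

```python
def simulate_run(model_minutes, target_minutes):
    """Run models in order until the average-based prediction says stop.

    Returns (models_completed, total_minutes). The prediction rule is an
    assumption for illustration, not the actual minirosetta logic.
    """
    total = 0.0
    done = 0
    for minutes in model_minutes:
        if done > 0:
            predicted = total / done * (done + 1)
            if total >= target_minutes or predicted > target_minutes:
                break
        total += minutes
        done += 1
    return done, total


# Five quick models (1 minute each), then two "interesting" 80-minute ones.
durations = [1, 1, 1, 1, 1, 80, 80]
print(simulate_run(durations, target_minutes=120))
```

After 6 models in 85 minutes the predicted total is 85 / 6 * 7, about 99 minutes, so a seventh model is started; if it also takes 80 minutes the run ends at 165 minutes against a 120 minute preference.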

Some of the improvements Sarel is making and working on will help the longer models run faster. So this should avoid some of those that were taking several hours for a single model, and make completion times closer to your preference.

Yes, Max. The watchdog USED to be based on 4 times the runtime preference. This was fine for short runtime preferences, but those with preference set to over 12 hours wanted to kill the task sooner and get on with others. Now it is runtime pref. plus 4 hrs, with the thought that all properly running models will complete in less than 4 hours.
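
The difference between the two watchdog rules is easiest to see with numbers. The sketch below only encodes the two formulas as stated in this thread; the function names are hypothetical and nothing here is taken from the application itself.

```python
HOUR = 3600


def watchdog_limit_old(runtime_pref_seconds):
    """Earlier rule as described in this thread: 4 times the preference."""
    return 4 * runtime_pref_seconds


def watchdog_limit_new(runtime_pref_seconds):
    """Current rule as described in this thread: preference plus 4 hours."""
    return runtime_pref_seconds + 4 * HOUR


# Compare both limits (in hours) for a few runtime preferences.
for pref_hours in (2, 12, 24):
    pref = pref_hours * HOUR
    old_hours = watchdog_limit_old(pref) // HOUR
    new_hours = watchdog_limit_new(pref) // HOUR
    print(pref_hours, old_hours, new_hours)
```

For a 2 hour preference the current limit works out to 6 hours (21600 seconds), so the 18850 second task reported earlier stays under it, while a 12 hour preference drops from a 48 hour limit to 16 hours.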

The rest of what you say sounds right to me. So keep watching, and jotting notes. This information will be key to resolving this issue.

KnopperHarley
This is one of the few remaining problems that some people are seeing in version 2.05. It seems to be rather rare, and perhaps only to occur on Windows. I see you are running Win XP (I highlight that just to make it easy for the Project Team to see it, not because it should be a problem). I believe suspending and resuming the tasks seems to get them going again.

Could I ask you how your machine is configured? Specifically, do you leave tasks in memory while preempted? Do you run other BOINC projects? Do you allow BOINC to run 100% of CPU? Do you power your machine off each day?
Rosetta Moderator: Mod.Sense

KnopperHarley
Joined: 1 Nov 06
Posts: 2
Credit: 788,560
RAC: 0
Message 65178 - Posted: 2 Feb 2010, 15:47:13 UTC

Uhm, well ...

I tried a few things (restarted BOINC) and (you might guess) it works. ^^'

CPU time jumped back to 3h and 6h or something, and it's using the cores again.
Seems like something really screwed up the Rosetta apps while working.

So never mind ... ignore my posting above. ;-)

I lost a bit of time, but the WUs are obviously (hopefully?!) undamaged and one has been completed in the meantime, so happy crunching again. o/


greetings

PS: Would it make sense to send the WUs a second time to another participant to confirm the results ... just to be sure?!
Especially the second WU mentioned in my post above (probably more than 7.5h in the end), plus another WU with almost 6.75h
(t293__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901_4919)
that finished last night, are, let's say ... (maybe not impossible but) 'unusual' (to me :-) ).

PPS: for the record *g*
- Leave applications in memory while suspended? no
- Rosetta + SETI (50:50)
- Use at most 100 percent of CPU time
- the machine is powered off for a while almost every day (except the occasional weekend)


Greg_BE
Joined: 30 May 06
Posts: 5652
Credit: 5,622,096
RAC: 0
Message 65182 - Posted: 2 Feb 2010, 23:12:43 UTC

compute error
t323__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_2006_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=314347348

<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
]]>

Greg_BE
Joined: 30 May 06
Posts: 5652
Credit: 5,622,096
RAC: 0
Message 65183 - Posted: 2 Feb 2010, 23:14:45 UTC

compute error with unhandled exception dump
https://boinc.bakerlab.org/rosetta/result.php?resultid=310017128
homopt_fa_cstmc_1.t370_.t370_.IGNORE_THE_REST.S_00003_0000784_010.pdb.JOB_16898_23_0

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x7C90120E


l_mckeon
Joined: 5 Jun 07
Posts: 44
Credit: 180,717
RAC: 0
Message 65184 - Posted: 3 Feb 2010, 0:40:23 UTC

I aborted lr15clus_opt_.1ptq.1ptq.IGNORE_THE_REST.c.3.21.pdb.pdb.JOB_17471_3_0 after 50 minutes.

Stuck on model 1, step 0, with funny looking graphics.

I no longer have the patience to see how these turn out.

Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 65187 - Posted: 3 Feb 2010, 11:54:35 UTC - in response to Message 65184.  


Instead of aborting, just try closing and restarting BOINC. That often does the trick.

John Hunt
Joined: 18 Sep 05
Posts: 446
Credit: 200,755
RAC: 0
Message 65189 - Posted: 3 Feb 2010, 15:20:10 UTC
Last modified: 3 Feb 2010, 15:21:06 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=287053961
has been running now for 56 hrs and still only 57.019% complete.

Core2Quad Q6600 @ 2.4GHz & Windows XP Home.

Keep going or abort?

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65190 - Posted: 3 Feb 2010, 16:50:20 UTC

> Keep going or abort?


As Evan points out, often such conditions get reset if you suspend and resume the task, or end and restart BOINC...

But first, I'd like to ask you to go to the advanced view, tasks tab, select the task that's been running so long, and then click the properties button that appears over on the left. There are three time figures there that I would like you to report:

CPU time at last checkpoint:
CPU time:
and Elapsed time:

It will take you a minute or so to jot that down, then close the window, and click again on the properties button for the task and see if the CPU time has changed at all.
Rosetta Moderator: Mod.Sense

John Hunt
Joined: 18 Sep 05
Posts: 446
Credit: 200,755
RAC: 0
Message 65191 - Posted: 3 Feb 2010, 17:52:51 UTC
Last modified: 3 Feb 2010, 18:12:49 UTC

O.K. I've suspended the WU and then re-started.

Here are the figures requested (when suspended) -
CPU time at last checkpoint: 02:05:26
CPU time: 02:05:27
and Elapsed time: 58:38:24

After re-start -
CPU time at last checkpoint: 02:05:26
CPU time: 02:10:22
and Elapsed time: 58:43:35

WU completed shortly afterwards with a computation error.

Thank you!
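
The two sets of readings above can be compared numerically to see whether the task was actually using the CPU between them. This is a minimal sketch; `to_seconds` and `cpu_share` are hypothetical helper names, and the only inputs are the time strings reported in this post.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_share(cpu_before, cpu_after, elapsed_before, elapsed_after):
    """Fraction of wall-clock time between two readings spent on the CPU."""
    cpu_delta = to_seconds(cpu_after) - to_seconds(cpu_before)
    elapsed_delta = to_seconds(elapsed_after) - to_seconds(elapsed_before)
    if elapsed_delta <= 0:
        raise ValueError("second reading must be later than the first")
    return cpu_delta / elapsed_delta


# Readings from the task properties window, before and after the restart.
share = cpu_share("02:05:27", "02:10:22", "58:38:24", "58:43:35")
print(f"{share:.2f}")
```

Between the two readings the CPU time advanced by 295 seconds while 311 seconds elapsed, so the task was running at close to full speed after the restart. Over the whole run, by contrast, only about 2 hours of CPU time had accumulated in more than 58 hours of elapsed time.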

Admin
Joined: 13 Apr 07
Posts: 42
Credit: 260,782
RAC: 0
Message 65195 - Posted: 4 Feb 2010, 1:09:32 UTC

Just took a look at my graphics and saw this. Is it normal? I've been watching it for a while now and it seems to be stuck on model 2, step 0. Any ideas on what I should do?
