Message boards : Number crunching : minirosetta 2.05
Author | Message |
---|---|
Sarel Send message Joined: 11 May 06 Posts: 51 Credit: 81,712 RAC: 0 |
Hello, yes! The critstubs jobs are of the same sort of unevenly crediting trajectories as the *gbnnotyr*. You can have a look at Eva-Maria Strauch's message on the Protein-interface design thread for details. Some tasks (for example type of abinitio_relax_homfrag_natfrag ....) make a lot of steps, up to 200000 - 400000 for 1 model. Is this normal? |
Sarel Send message Joined: 11 May 06 Posts: 51 Credit: 81,712 RAC: 0 |
Rosetta @ Home has produced many very high-quality designs for our Protein-interface design team! So we're likely to submit many more jobs to Rosetta @ Home. To help you recognize these jobs, we'll add a _Protein_Interface_Design_ note to every job name that is related to these jobs from now on. This way you'll be able to follow these jobs. I also hope that this will help you see where the variable-credit issue is coming from more easily. Hello, yes! The critstubs jobs are of the same sort of unevenly crediting trajectories as the *gbnnotyr*. You can have a look at Eva-Maria Strauch's message on the Protein-interface design thread for details. |
fredmeyer2470 Send message Joined: 6 Jun 09 Posts: 1 Credit: 1,741,466 RAC: 0 |
The Rosetta application is spinning its wheels. It is continually running a task even though the task is 100% complete. There is another task to run, but Rosetta won't switch to it. |
Mad_Max Send message Joined: 31 Dec 09 Posts: 209 Credit: 26,574,068 RAC: 11,918 |
2 Sarel Thanks for the explanation. And what about this?: > Some tasks (for example type of abinitio_relax_homfrag_natfrag ....) make a great many steps, up to 200000 - 400000 for 1 model. Is this normal? And at the same time, another note: it seems that jobs of this type: resa_sel_core_1.5_low200_beta_low200_nostart_texcst_05_hb_t328__IGNORE_THE_REST_17378_267_0 ignore the target CPU time. For example, this WU spent about 2.5 hours calculating 1 model (already longer than the target time), but after the 1st model, instead of sending the result, it started calculating a 2nd model. Total: 18850 seconds vs cpu_run_time_pref = 7200 seconds. In this example all ended well, but in other circumstances this can lead to exceeding cpu_run_time_pref by more than 3 times, triggering the watchdog and losing results. In addition, some members may think that the task is stuck and abort it... |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Max, the watchdog should be expected to kick in if a model exceeds target runtime by more than 4 hours. And so with a short 2 hour preference, you will tend to see much more variation in runtimes than if you had a longer preference. However, I would expect that once the target runtime is exceeded, and a model is completed, work on the WU should be ended and it would be reported back. So if you have a 2hr preference, and the first model took more than 2hrs to complete, I would have expected the WU to end once the first model was completed. In fact, if the first model took more than an hour, the predicted completion time of the second model would have exceeded your preference, and so I would expect that scenario to end after the first model as well. Have you changed your runtime preference? Perhaps this machine hadn't received the updated preference yet? (looking at the WU you linked, it appears to have had the 7200sec preference during the entire run). So, it sounds like your expectations are correct. More observations would be helpful. Reference to the specific WU (as you've done) is very helpful. Rosetta Moderator: Mod.Sense |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
A couple of t287__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901 WUs on two different Linux machines failed after a few seconds claiming "process got signal 11". https://boinc.bakerlab.org/rosetta/result.php?resultid=314826769 https://boinc.bakerlab.org/rosetta/result.php?resultid=314751622 |
Mad_Max Send message Joined: 31 Dec 09 Posts: 209 Credit: 26,574,068 RAC: 11,918 |
2 Mod.Sense Thanks for the clarification on the watchdog. Previously I had seen it hit after exceeding 6 hours of calculations and thought that it fired after exceeding CPU TT x 3 (2h * 3 = 6h in my case). So in fact the correct formula is CPU TT + 4h, right? (In my case it just gives the same result: 2h + 4h = 6h.) "In fact, if the first model took more than an hour, the predicted completion time of the second model would have exceeded your preference, and so I would expect that scenario to end after the first model as well." Yes, it usually does so. Here's an example of such a task: https://boinc.bakerlab.org/rosetta/result.php?resultid=313861637 Calculation of the 1st model took 5145 sec and the program ended the processing, because a second model would exceed the CPU TT (5145 * 2 = 10290 > 7200). Or another example: https://boinc.bakerlab.org/rosetta/result.php?resultid=314455813 Calculation of the two models took 4995 sec and the program ended the processing, because a third model would exceed the CPU TT ((4995 / 2) * 3 ≈ 7492 > 7200). In these (and most other) cases the logic of the program is working correctly. But in the example above, this algorithm seems to fail. "Have you changed your runtime preference? Perhaps this machine hadn't received the updated preference yet? (looking at the WU you linked, it appears to have had the 7200sec preference during the entire run). So, it sounds like your expectations are correct. More observations would be helpful. Reference to the specific WU (as you've done) is very helpful." No, I have not changed the runtime preference in the last 2 weeks. I have no more recent examples yet, but before that I had 2 other tasks that also seemed to ignore the runtime preference. (Although I'm not 100% sure about it, because I did not follow their progress - perhaps just a 1st model was designed quickly, and the last took much longer than expected...) 
Here they are: cst2.loopbuild_threading_hb_i1496_IGNORE_THE_REST_17154_387_0 t364__boinc_filtered_loopbuild_threading_cst_all_tex_IGNORE_THE_REST_16902_4455_0 |
KnopperHarley Send message Joined: 1 Nov 06 Posts: 2 Credit: 788,560 RAC: 0 |
Hey there! I've got a problem with two tasks at the moment. Yesterday I wondered why the remaining time was set to 30,5h per WU when I saw it, but I didn't care about it ... perhaps a test with more work per WU ... who knows. ;-) But now one task is 'stuck' at 58.285% (+0.002% in more than 12h now) and the other one at 82.419% work done. Runtimes for these WUs are at around 28h and 11,75h, counting on and on (elapsed and remaining -_- ). So I asked the task manager for help and it says the following: these two WUs are using 218mb and 300mb memory ... not using ANY cpu resources any more ... 0% both (cpu-time is still counting on at 1sec/sec). Did something go wrong on my pc while crunching? Or what's the matter here? Tasks https://boinc.bakerlab.org/rosetta/workunit.php?wuid=286264240 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=287080918 greetings PS: both paused for now |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Max: "perhaps just a 1st model was designed quickly, and the last took much longer than expected" Right, and that is exactly what Sarel's new tasks do. Run 5 models in 5 minutes, then hit one that looks interesting and run for (for example) 80 minutes. Now 6 models have been completed in 85 minutes and with a 2hr runtime preference, we guess we can complete more models in the 2 hours. If that next one happens to be interesting as well, you run long. Some of the improvements Sarel is making and working on will help the longer models run faster. So this should avoid some of those that were taking several hours for a single model, and make completion times closer to your preference. Yes, Max. The watchdog USED to be based on 4 times the runtime preference. This was fine for short runtime preferences, but those with preference set to over 12 hours wanted to kill the task sooner and get on with others. Now it is runtime pref. plus 4 hrs, with the thought that all properly running models will complete in less than 4 hours. The rest of what you say sounds right to me. So keep watching, and jotting notes. This information will be key to resolving this issue. KnopperHarley: This is one of the few remaining problems that some people are seeing in version 2.05. It seems to be rather rare, and perhaps only to occur on Windows. I see you are running Win XP (I highlight that just to make it easy for the Project Team to see it, not because it should be a problem). I believe suspending and resuming the tasks seems to get them going again. Could I ask you how your machine is configured? Specifically, do you leave tasks in memory while preempted? Do you run other BOINC projects? Do you allow BOINC to run 100% of CPU? Do you power your machine off each day? Rosetta Moderator: Mod.Sense |
KnopperHarley Send message Joined: 1 Nov 06 Posts: 2 Credit: 788,560 RAC: 0 |
Uhm, well ... I tried a few things (restarted BOINC) and (you might guess): it works. ^^' Cpu-time jumped back to 3h and 6h or something and it's using the cores again. Seems like something really screwed up the Rosetta apps while working. So never mind ... ignore my posting above. ;-) I lost a bit of time, but the WUs are obviously (hopefully?!) undamaged and one has been completed in the meantime, so happy crunching again. o/ greetings PS: Would it make sense to send the WUs a second time to another participant to confirm the results ... just to be sure?! Especially the second WU mentioned in my post above (probably more than 7,5h in the end) plus another WU with almost 6,75h (t293__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901_4919) that finished last night are, let's say ... (maybe not impossible but) 'unusual' (to me :-) ). PPS: for the record *g* - Leave applications in memory while suspended? no - Rosetta + SETI (50:50) - Use at most 100 percent of CPU time - it's powered off for a period of time almost every day (except on a weekend once in a while) |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
compute error t323__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_2006_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=314347348 <core_client_version>6.10.18</core_client_version> <![CDATA[ <message> Maximum elapsed time exceeded </message> ]]> |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
compute error with unhandled exception dump https://boinc.bakerlab.org/rosetta/result.php?resultid=310017128 homopt_fa_cstmc_1.t370_.t370_.IGNORE_THE_REST.S_00003_0000784_010.pdb.JOB_16898_23_0 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Breakpoint Encountered (0x80000003) at address 0x7C90120E |
l_mckeon Send message Joined: 5 Jun 07 Posts: 44 Credit: 180,717 RAC: 0 |
I aborted lr15clus_opt_.1ptq.1ptq.IGNORE_THE_REST.c.3.21.pdb.pdb.JOB_17471_3_0 after 50 minutes. Stuck on model 1, step 0, with funny looking graphics. I no longer have the patience to see how these turn out. |
Evan Send message Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
"I aborted lr15clus_opt_.1ptq.1ptq.IGNORE_THE_REST.c.3.21.pdb.pdb.JOB_17471_3_0 after 50 minutes." Instead of aborting, just try closing and restarting BOINC. That often does the trick. |
John Hunt Send message Joined: 18 Sep 05 Posts: 446 Credit: 200,755 RAC: 0 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=287053961 has been running now for 56 hrs and still only 57.019% complete. Core2Quad Q6600 @ 2.4GHz & Windows XP Home. Keep going or abort? |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Keep going or abort? As Evan points out, often such conditions get reset if you suspend and resume the task, or end and restart BOINC... But first, I'd like to ask you to go to the advanced view, tasks tab, select the task that's been running so long, and then click the properties button that appears over on the left. There are three time figures there that I would like you to report: CPU time at last checkpoint: CPU time: and Elapsed time: It will take you a minute or so to jot that down, then close the window, and click again on the properties button for the task and see if the CPU time has changed at all. Rosetta Moderator: Mod.Sense |
John Hunt Send message Joined: 18 Sep 05 Posts: 446 Credit: 200,755 RAC: 0 |
O.K. I've suspended the WU and then re-started. Here are the figures requested (when suspended) - CPU time at last checkpoint: 02:05:26 CPU time: 02:05:27 and Elapsed time: 58:38:24 After re-start - CPU time at last checkpoint: 02:05:26 CPU time: 02:10:22 and Elapsed time: 58:43:35 WU completed shortly afterwards with a computation error. Thank you! |
Admin Send message Joined: 13 Apr 07 Posts: 42 Credit: 260,782 RAC: 0 |
|
Admin Send message Joined: 13 Apr 07 Posts: 42 Credit: 260,782 RAC: 0 |
Strange, it seems to be fine now; you can disregard my earlier post. |
Mad_Max Send message Joined: 31 Dec 09 Posts: 209 Credit: 26,574,068 RAC: 11,918 |
I do not think that should be ignored. This type of task is behaving very strangely on my computer, too. Here's an example where the protein is coiled into a ring. The model has already been in this state for about 30 minutes. Sometimes the ring begins to unfold, but then it rolls back into a ring. |
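
The runtime rules discussed in this thread (finish the WU once another model is predicted to overrun the runtime preference; watchdog at runtime preference plus 4 hours) can be sketched roughly as below. This is only an illustration, not minirosetta's actual code: the function names are hypothetical, and the projection rule (average time per completed model) is an assumption inferred from the arithmetic in Mad_Max's and Mod.Sense's posts.

```python
# Hypothetical sketch of the runtime logic described in the thread.
# Names and the averaging rule are assumptions, not the real minirosetta code.

WATCHDOG_GRACE_S = 4 * 3600  # "runtime pref. plus 4 hrs" (Mod.Sense)


def should_start_next_model(elapsed_s: float, models_done: int, target_s: float) -> bool:
    """Return True if one more model is predicted to finish within the target.

    The prediction assumes the next model takes the average time of the
    models completed so far.
    """
    if models_done == 0:
        return True  # nothing to base a prediction on yet
    avg_per_model = elapsed_s / models_done
    return elapsed_s + avg_per_model <= target_s


def watchdog_limit_s(target_s: float) -> float:
    """CPU time after which the watchdog is expected to end the task."""
    return target_s + WATCHDOG_GRACE_S


if __name__ == "__main__":
    target = 7200  # cpu_run_time_pref from Mad_Max's posts
    print(should_start_next_model(5145, 1, target))  # 1 model in 5145 s
    print(should_start_next_model(4995, 2, target))  # 2 models in 4995 s
    print(should_start_next_model(85 * 60, 6, target))  # 6 models in 85 min
    print(watchdog_limit_s(target) / 3600)  # watchdog limit in hours
```

With a 7200 s preference, the first two cases stop (10290 s and about 7492 s projected), while the "6 models in 85 minutes" case starts a seventh model. That last case is how a fast early average can let a long "interesting" model push the total well past the preference, as with the 18850 s run reported above, which is still under the 6 hour watchdog limit.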
©2025 University of Washington
https://www.bakerlab.org