Message boards : Number crunching : Report long-running models here
Author | Message |
---|---|
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Greg, if Dennis is seeing long-running models, they tend to not take checkpoints. So, any time BOINC is ended and restarted, the task might lose lots of work. And so the next time it starts, the cycle repeats. This is part of why the team is working to eliminate the long-running models. Anyway, I just wanted to point out that part of why he's getting those symptoms is because of the long-running models, not the other way around. Dennis, the runtime preference is not the time per model. The per-model times should be as described in the beginning of this thread. And then more models are done if the runtime preference allows time for it. So, your approach of aborting anything racking up more than 30 hours is good. And in fact if you see one with only 6 hours, but still on model one, I would abort that too. Or one with two models that's up to 8 or 9 hours, etc. Rosetta Moderator: Mod.Sense |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Greg, if Dennis is seeing long-running models, they tend to not take checkpoints. So, any time BOINC is ended and restarted, the task might lose lots of work. And so the next time it starts, the cycle repeats. This is part of why the team is working to eliminate the long-running models. mod- thanks for the clarification. |
ramostol Send message Joined: 6 Feb 07 Posts: 64 Credit: 584,052 RAC: 0 |
The t060 (beta 5.98) wus usually last 4.5 to nearly 6 hours per model on my PPC. This model (t060_1_NMRREF_1_t060_1_id_model_07IGNORE_THE_REST_idl_5381_1234_0) lasted almost 10 hours. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=216802506 Name cc2_1_8_mammoth_fa_cst_hb_t303__IGNORE_THE_REST_2AH5A_4_6138_17_0 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 14400 ====================================================== DONE :: 1 starting structures 15270.1 cpu seconds This process generated 2 decoys from 2 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish and before that: https://boinc.bakerlab.org/rosetta/result.php?resultid=216802506 Name cc2_1_8_mammoth_fa_cst_hb_t303__IGNORE_THE_REST_2AH5A_4_6138_17_0 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 14400 ====================================================== DONE :: 1 starting structures 15270.1 cpu seconds This process generated 2 decoys from 2 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> ]]> Credit sucks on both of these. Both tasks were granted these exact amounts: Claimed credit 106.166115188458 Granted credit 74.8691857584611 |
senatoralex85 Send message Joined: 27 Sep 05 Posts: 66 Credit: 169,644 RAC: 0 |
My Preferences are set to 4 hours per workunit. This workunit lasted 12 hours. Task ID 217245063 Name cc2_1_8_mammoth_mix_fa_cst_hb_t313__IGNORE_THE_REST_1BG2A_7_6180_31_0 Workunit 197983062 Created 27 Dec 2008 3:47:57 UTC Sent 27 Dec 2008 4:19:00 UTC Received 27 Dec 2008 19:09:47 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 792930 Report deadline 6 Jan 2009 4:19:00 UTC CPU time 43297.02 stderr out <core_client_version>5.10.45</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 21600 ====================================================== DONE :: 1 starting structures 43296 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> ]]> Validate state Valid Claimed credit 118.702972598747 Granted credit 59.6309170544315 application version 1.47 |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
rifleman had a zinc task run for 37 hrs before the watchdog terminated it. He has a 12 hr run time set. See this thread for more information. The task can be found at https://boinc.bakerlab.org/rosetta/result.php?resultid=216862173 |
Ian_D Send message Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=217325144 Nearly 16 hrs in when I spotted it and now it reports, after a manual abort, it has done 0 CPU time ?!?! |
Ian_D Send message Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=217459230 <core_client_version>6.2.15</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 14400 # cpu_run_time_pref: 7200 # cpu_run_time_pref: 7200 ********************************************************************** Rosetta is going too long. Watchdog is ending the run! CPU time: 26682.9 seconds. Greater than 3X preferred time: 7200 seconds ********************************************************************** called boinc_finish </stderr_txt> ]]> |
Aegis Maelstrom Send message Joined: 29 Oct 08 Posts: 61 Credit: 2,137,555 RAC: 0 |
Task cc_nonideal_0_6_nocst4_hb_t328__IGNORE_THE_REST_2GVKA_7_5916_29 Workunit 197391713. Terminated automatically as a long run, reported and counted as a "success". <core_client_version>6.4.5</core_client_version> <![CDATA[ <stderr_txt> ********************************************************************** Rosetta is going too long. Watchdog is ending the run! CPU time: 32978.8 seconds. Greater than 3X preferred time: 10800 seconds ********************************************************************** called boinc_finish </stderr_txt> ]]> Minirosetta 1.47, the latest BOINC client. A recently and temporarily added Vista laptop with a Core 2 Duo, underclocked/normally clocked (due to power management), certainly not overclocked. Stable system. The task ran for over 9 hours, still on Model 1, with over 2 million steps made. I've been watching the task - it looked like some kind of infinite loop - mostly classic "big" moves but also the second step - smallMoverMoverBase... - as far as I can remember. What is more, as Mod.Sense wrote above, no checkpoints were made, so I had to wait with this obviously wrong simulation just to be sure it wouldn't work. By the way - I've seen some other complaints about cc_nonideal tasks in the MiniRosetta 1.47 bug thread. Good luck on the bug hunt! :) |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
This WU: https://boinc.bakerlab.org/rosetta/result.php?resultid=217250916 took 20 hours on an Athlon XP 2400+ to crunch one decoy. |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
I was looking at the tasks running on one of my computers and noticed that this task seemed to have overshot the 6 hour time set. So, I stopped BOINC and restarted it, and it restarted the task at 25% done and 1:36 of compute time vice the real total of over 6 hours. A long time ago I de-emphasised Rosetta because the project went from a reliable application to some of the worst-behaving applications in the BOINC universe. I don't mind occasional errors, but I lost about 8 tasks on one machine because of a lock file error ... the machine is stable and works well on all other projects ... Now I am finding that there are tasks that never seem to want to finish (now I am paying attention) and, worse, do not properly checkpoint. And so, once more I am going to downgrade Rosetta ... I have neither the health nor time to babysit what is supposed to be a mature and PRODUCTION project. I suppose we need to change the way we classify projects because this application is not production by any means ... |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Well, the task died with too many exits ... I suppose one is way too many ... the file also suffers from the lock file problem. Well, the queues are draining ... |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=217585069 1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_73016_0 Outcome Success CPU time 20686.61 - actual run time <core_client_version>6.4.5</core_client_version> # cpu_run_time_pref: 14400 <-- my set run time ====================================================== DONE :: 1 starting structures 20686.3 cpu seconds This process generated 3 decoys from 3 attempts |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 1,496 |
I'd like to start a thread for reports of long running models. These appear to be related more to the specific batches of work released than to any given specific application version. So, I've moved the problems with v1.34 posts that seemed more about runtime into this thread. A workunit that has already taken 30 hours even though I asked for 14 hour workunits: 1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_78916 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=198338075 Model 2 now running. Can't find any information on whether model 1 was also slow. Vista SP1 32-bit BOINC 5.10.45 Still running. Currently using 94 MB memory, not counting any that's swapped out. Graphics windows opened at least once, will not open again now. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 1,496 |
This workunit (already finished) gave fewer decoys than expected for a 14-hour expected length workunit - only 2. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=198155961 I've already given details about my machine when reporting a different problem of this type, with a workunit name similar enough to suggest that it's for the same protein. The information I can still see doesn't pin this down to any particular model within the workunit. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 1,496 |
A workunit that has already taken 30 hours even though I asked for 14 hour workunits: The graphics window finally opened again, although so slowly I was already doing something else then. Since then, the workunit finally finished, after about 31 CPU hours. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 1,496 |
This makes 3 workunits in a row where I got only 2 decoys in a workunit expected to last 14 CPU hours: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=198390275 The names of all three of those workunits began with 1nkuA. Do you need to add these workunits to a list of workunit types expected to take significantly more than 2 hours per decoy? |
Ian_D Send message Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=198560045 One whole decoy in 12.6 hrs - nice !! Ubuntu 8.10 on a P4 2.6GHz HT with 512 MB memory, with POEM@home running as well. CPU run time preference set to 8 hrs ("Home" in this case - have now changed to 6 hrs "Work" preference) |
LizzieBarry Send message Joined: 25 Feb 08 Posts: 76 Credit: 201,862 RAC: 0 |
MiniRosetta 1.47 task 217935249 cc2_1_8_mammoth_mix_cen_cst_hb_t305__IGNORE_THE_REST_2AHSA_1_5874_98_0 BOINC client version 6.4.5 for windows_intelx86 Processor: 2 AuthenticAMD AMD Athlon(tm) 64 X2 Dual-Core Processor TK-55 [x86 Family 15 Model 104 Stepping 1] OS: Microsoft Windows Vista: Home Premium x86 Editon, Service Pack 1, (06.00.6001.00) Normal runtime: 3 hours Currently at: 7 hours 56m Currently on: Model 1 Step 1335490 |
LizzieBarry Send message Joined: 25 Feb 08 Posts: 76 Credit: 201,862 RAC: 0 |
MiniRosetta 1.47 task 217935249 Typical of my luck, there was a power outage here, the WU restarted from about 1h 45m, but completed about 2h 45m. Claimed credit 28.0678988927865 Granted credit 63.9157762593521 ...so compensated on credit. Just strange it didn't finish earlier on its first run. |
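
Several reports above compare the `cpu_run_time_pref` value in the stderr output against the CPU seconds actually used and the number of decoys produced. As a rough aid for reading such logs, here is a minimal Python sketch that pulls those numbers out of a pasted stderr dump. The function name `summarize_stderr`, the result keys, and the choice to use the last `cpu_run_time_pref` line when several appear are assumptions made for illustration. The factor of 3 comes only from the watchdog message quoted in this thread ("Greater than 3X preferred time"), not from any project documentation.

```python
import re

# Multiplier taken from the watchdog message quoted in the thread
# ("Greater than 3X preferred time"). Treat it as an assumption.
WATCHDOG_FACTOR = 3

NUMBER = r"(\d+(?:\.\d+)?)"


def summarize_stderr(stderr_text):
    """Summarize run-time figures found in a Rosetta stderr dump.

    Hypothetical helper: it only recognizes the log lines that
    appear in the posts of this thread.
    """
    # Some logs list the preference several times; assume the last one applies.
    prefs = re.findall(r"cpu_run_time_pref:\s*(\d+)", stderr_text)
    pref = int(prefs[-1]) if prefs else None

    # Normal completion: "... 15270.1 cpu seconds"
    done = re.search(NUMBER + r"\s+cpu seconds", stderr_text)
    # Watchdog termination: "CPU time: 26682.9 seconds."
    killed = re.search(r"CPU time:\s*" + NUMBER + r"\s+seconds", stderr_text)
    match = done or killed
    cpu = float(match.group(1)) if match else None

    decoy_match = re.search(r"generated\s+(\d+)\s+decoys", stderr_text)
    decoys = int(decoy_match.group(1)) if decoy_match else None

    per_decoy = cpu / decoys if cpu is not None and decoys else None
    ratio = cpu / pref if cpu is not None and pref else None
    past_limit = ratio > WATCHDOG_FACTOR if ratio is not None else None

    return {
        "pref_seconds": pref,
        "cpu_seconds": cpu,
        "decoys": decoys,
        "seconds_per_decoy": per_decoy,
        "overrun_ratio": ratio,
        "past_watchdog_limit": past_limit,
    }


if __name__ == "__main__":
    sample = (
        "# cpu_run_time_pref: 21600 DONE :: 1 starting structures "
        "43296 cpu seconds This process generated 1 decoys from 1 attempts"
    )
    print(summarize_stderr(sample))
```

A task that ran roughly twice its preference with a single decoy, like the 43296-second example, stays under the 3X limit, which would fit it being reported as a normal success rather than a watchdog termination.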
©2024 University of Washington
https://www.bakerlab.org