Message boards : Number crunching : Report long-running models here
Author | Message |
---|---|
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
AdeB, since that single model has been running longer than 6 hours, I would suggest you abort it... unless you have no other tasks to work on. Idan, what is the normal runtime preference for the host that is running that task? (note to self, why hasn't watchdog ended it? If pref. is <24hrs) Rosetta Moderator: Mod.Sense |
AdeB Send message Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
AdeB, since that single model has been running longer than 6 hours, I would suggest you abort it... After another crash the task has been aborted. |
Idan Shifres Send message Joined: 12 Dec 08 Posts: 11 Credit: 126,517 RAC: 0 |
The WU 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_51691_0 finally finished after more than 66 hours only to get 80 credits... quite disappointing... :( |
ramostol Send message Joined: 6 Feb 07 Posts: 64 Credit: 584,052 RAC: 0 |
The WU 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_51691_0 finally finished after more than 66 hours only to get 80 credits... quite disappointing... :( Well, this is the old problem of the dog that did nothing in the nighttime, as documented in the watchdog message: <core_client_version>6.4.5</core_client_version> <![CDATA[ <stderr_txt> ********************************************************************** Rosetta is going too long. Watchdog is ending the run! CPU time: 238970 seconds. Greater than 3X preferred time: 10800 seconds ********************************************************************** called boinc_finish </stderr_txt> ]]> 3X 10800 seconds = 32400 seconds, why did the watchdog need 238970 seconds to interfere? |
Idan Shifres Send message Joined: 12 Dec 08 Posts: 11 Credit: 126,517 RAC: 0 |
Idan, what is the normal runtime preference for the host that is running that task? (note to self, why hasn't watchdog ended it? If pref. is <24hrs) The computer's working on BOINC 24/7... better watch that watchdog... :) I have another WU with 40+ hours running right now, I'll let it finish to see if it gets the same watchdog message... Whoops, looks like it just finished, as you can see: HERE. <core_client_version>6.4.5</core_client_version> <![CDATA[ <stderr_txt> ********************************************************************** Rosetta is going too long. Watchdog is ending the run! CPU time: 183132 seconds. Greater than 3X preferred time: 10800 seconds ********************************************************************** called boinc_finish </stderr_txt> ]]> Same watchdog being late again, this time only after 50 hours... gave me 80 credits... which is 1.6 credits per hour... I think you should change your credit system a bit... :P |
pramo Send message Joined: 21 Oct 05 Posts: 4 Credit: 339,337 RAC: 342 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=198269130 70+ hours, 9+ hours to go, deadline in 7 hours. I killed it. |
pramo Send message Joined: 21 Oct 05 Posts: 4 Credit: 339,337 RAC: 342 |
don't know if this helps, or what you might want to look at on my end. as you can see, a long time member and first time poster:) stderr.txt was a bunch of this- # cpu_run_time_pref: 21600 # cpu_run_time_pref: 21600 # cpu_run_time_pref: 21600 # cpu_run_time_pref: 21600 # cpu_run_time_pref: 21600 # cpu_run_time_pref: 21600 # cpu_run_time_pref: 21600 stdout.txt was this- [2008-12-29 6:19:24:] :: BOINC :: boinc_init() Created shared memory segment Created semaphore Starting watchdog... . . . <snip> . . . [2009- 1- 7 5:23:12:] :: BOINC :: boinc_init() Created shared memory segment Created semaphore Starting watchdog... [2009- 1- 7 8:40: 5:] :: BOINC :: boinc_init() Created shared memory segment Created semaphore Starting watchdog... |
AdeB Send message Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
I aborted lr5_score12_rlbd_2fls_IGNORE_THE_REST_DECOY_5559_1293_1 after running for more than 30 hours. AdeB |
Idan Shifres Send message Joined: 12 Dec 08 Posts: 11 Credit: 126,517 RAC: 0 |
I would really like to know if it would be worth it to let the WU finish by itself or should we just abort it after 25-30 hours? Do you look at the problems reported by the long-running models? Does it contribute anything to let them run, or not? If it does contribute, I'll happily keep running them to the bitter end, but if not, I'd like to do some real science-helping work... :) |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
I would really like to know if it would be worth it to let the WU finish by itself or should we just abort it after 25-30 hours? not sure what your run time is, but if it is way over the maximum time, you might as well abort it as you won't get all that much credit for it anyway. you will regain your loss with the better running tasks. be sure to post a link to the specific task, so the team knows which task it was that ran so long. |
Idan Shifres Send message Joined: 12 Dec 08 Posts: 11 Credit: 126,517 RAC: 0 |
Thanks for the reply, I was talking about the scientific value of letting the WU run for a long time till it aborts itself, not about the points. I would like to know if it is important for you guys to see that the watchdog stopped the application later than 3 times the time it was supposed to run, or does it not matter to you whatsoever? |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Thanks for the reply, I was talking about the scientific value of letting the WU run for a long time till it aborts itself, not about the points. If the task is done, but ran over your time settings by more than 3x, be sure to post the link here so that the team can have a look at the task. Same thing if you aborted the task. They will take any results that have been completed before the abort or the watchdog termination, if the task uploads. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Thanks for the reply, I was talking about the scientific value of letting the WU run for a long time till it aborts itself, not about the points. The main benefit to the science is going to be in understanding how to eliminate the long-running models, not specifically in the result that yours produces. It is certainly useful for the Project Team to see the completed result (and/or a report in this thread). Otherwise it just looks like many other "user aborted" tasks, and does not lead one to enter "track down problem" mode, or to suspect there are specific reasons people aborted the task. At this point, due to the reports in this thread, some of the causes for long-running models have been identified and coding changes are in the works to resolve them. So at this point, I'd suggest following the guidelines as originally posted in this thread: abort the task, report its behavior, and continue on. Rosetta Moderator: Mod.Sense |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
From the "my run time is set for 4 hrs but tasks are running 6 hrs" thread: My settings are for 4 hr run times, but these specific tasks ran 6 hrs without me touching a thing. abinitio_norelax_homfrag_129_B_4icbA_SAVE_ALL_OUT_4626_4561_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=218675744 abinitio_norelax_homfrag_129_B_5croA_SAVE_ALL_OUT_4626_4561_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=218675745 abinitio_norelax_homfrag_129_B_1a8oA_SAVE_ALL_OUT_4626_4562_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=218675747 Credit was granted; in 2 out of 3 cases it was higher than claimed, and in the other it was lower than claimed. All completed ok. |
rochester new york Send message Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
|
robertmiles Send message Joined: 16 Jun 08 Posts: 1235 Credit: 14,341,506 RAC: 433 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=219050690 Looks like you've got a combination of problems. 1. Using a machine where the time between interrupts for other work is less than the time between checkpoints, without also using the leave in memory option (which may also need an increase in the upper limit of disk space allowed and the percentage of the swap space allowed to be effective without causing problems). 2. A recently completed workunit (although with an error) which took such a short time that the software overestimates how many workunits your machine can complete within the queue length you have selected. 3. A queue length so long relative to the deadline length that your machine has problems catching up when you get too many jobs at once. 4. A group of workunits from a recent batch with poor estimates of how long they will run. I'm not sure what to do about it except to abort some of the workunits in the queue that your machine hasn't started working on yet, before they reach the deadline. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
You can also reduce your runtime preference and then update to the project. The new preference is applied to your existing tasks, even if they've already started. But this makes the rest of future scheduling a mess. So, aborting a few may be easier. But it looks like you've got plenty of time to complete them. There are only three days of work there if your machine is on and crunching Rosetta all the time. I think that particular task must have had a long-running model without a checkpoint. Or perhaps the machine was restarted several times in a row to install service packs etc. while this task was in progress. Rosetta Moderator: Mod.Sense |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
You can also reduce your runtime preference and then update to the project. The new preference is applied to your existing tasks, even if they've already started. Is it still true that running WUs use the new runtime? I recently changed my runtime, and it seemed to me that the rosetta 5.98 WUs that were running did indeed use the new runtime, but the running minirosetta WUs did not. They seemed to use the preference that was set when they started. I didn't try stopping and restarting BOINC. That might have forced minirosetta to read the new runtime when it restarted. |
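
Note on the watchdog arithmetic raised earlier in the thread (3 x 10800 seconds = 32400 seconds, versus the 238970 and 183132 seconds of CPU time actually reported): the comparison can be reproduced with a small script. This is only a rough sketch, not anything from the Rosetta or BOINC code. It assumes the watchdog message has exactly the wording quoted in the posts above, and it follows ramostol's reading that the number printed after "preferred time" is the runtime preference itself rather than three times it. The names `WATCHDOG_RE` and `watchdog_overrun` are made up for this illustration.

```python
import re

# Matches the watchdog message as it is quoted in this thread (assumed wording).
WATCHDOG_RE = re.compile(
    r"CPU time:\s*(\d+)\s*seconds\.\s*"
    r"Greater than 3X preferred time:\s*(\d+)\s*seconds"
)


def watchdog_overrun(stderr_text):
    """Return (cpu_seconds, limit_seconds, overrun_factor), or None if no watchdog message is found.

    limit_seconds is 3 x the preferred runtime, assuming the printed value
    is the runtime preference itself (the reading used in the thread).
    """
    match = WATCHDOG_RE.search(stderr_text)
    if match is None:
        return None
    cpu_seconds = int(match.group(1))
    preferred_seconds = int(match.group(2))
    limit_seconds = 3 * preferred_seconds
    return cpu_seconds, limit_seconds, cpu_seconds / limit_seconds


if __name__ == "__main__":
    sample = (
        "Rosetta is going too long. Watchdog is ending the run! "
        "CPU time: 238970 seconds. Greater than 3X preferred time: 10800 seconds"
    )
    cpu, limit, factor = watchdog_overrun(sample)
    print(f"CPU time {cpu / 3600:.1f} h, limit {limit / 3600:.1f} h, "
          f"overrun {factor:.1f}x the limit")
```

Pasting the stderr text from a task's result page into `watchdog_overrun` gives a quick number to include when reporting a long-running task here, on the assumption that the message format has not changed.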
©2025 University of Washington
https://www.bakerlab.org