Report long-running models here

Author	Message
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 58532 - Posted: 5 Jan 2009, 18:17:43 UTC
AdeB, since that single model has been running longer than 6 hours, I would suggest you abort it... unless you have no other tasks to work on while
Idan, what is the normal runtime preference for the host that is running that task? (Note to self: why hasn't the watchdog ended it, if the preference is <24hrs?)
Rosetta Moderator: Mod.Sense

AdeB Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0	Message 58543 - Posted: 5 Jan 2009, 21:06:44 UTC - in response to Message 58532.
"AdeB, since that single model has been running longer than 6 hours, I would suggest you abort it..."
After another crash the task has been aborted.

Idan Shifres Joined: 12 Dec 08 Posts: 11 Credit: 126,517 RAC: 0	Message 58557 - Posted: 6 Jan 2009, 8:22:08 UTC
The WU 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_51691_0 finally finished after more than 66 hours only to get 80 credits... quite disappointing... :(

Idan Shifres Joined: 12 Dec 08 Posts: 11 Credit: 126,517 RAC: 0	Message 58558 - Posted: 6 Jan 2009, 8:23:54 UTC
The WU 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_51691_0 finally finished after more than 66 hours only to get 80 credits... quite disappointing... :(

ramostol Joined: 6 Feb 07 Posts: 64 Credit: 584,052 RAC: 0	Message 58608 - Posted: 7 Jan 2009, 11:25:16 UTC - in response to Message 58558.
"The WU 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_51691_0 finally finished after more than 66 hours only to get 80 credits... quite disappointing... :("
Well, this is the old problem of the dog that did nothing in the night-time, as documented in the watchdog message:
<core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 238970 seconds. Greater than 3X preferred time: 10800 seconds
********************************************************************
called boinc_finish
</stderr_txt>
]]>
3X 10800 seconds = 32400 seconds, so why did the watchdog need 238970 seconds to interfere?
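Editor's sketch of the arithmetic in the post above. This is only an illustration of the rule the watchdog message states ("Greater than 3X preferred time"); it is not Rosetta's actual watchdog code, and the function names are made up for the example.

```python
# Hypothetical illustration of the "3X preferred time" rule quoted in the
# watchdog message. Not taken from the Rosetta source.
WATCHDOG_FACTOR = 3  # from "Greater than 3X preferred time"


def watchdog_limit(preferred_seconds, factor=WATCHDOG_FACTOR):
    """CPU time (seconds) at which the message says the run should end."""
    return factor * preferred_seconds


def overrun_ratio(cpu_seconds, preferred_seconds):
    """How many multiples of the preferred runtime were actually used."""
    return cpu_seconds / preferred_seconds


limit = watchdog_limit(10800)
ratio = overrun_ratio(238970, 10800)
print(limit)                        # 32400
print(round(ratio, 1))              # 22.1
print(round(238970 / 3600, 1))      # 66.4 (hours of CPU time)
```

With the numbers from the log, the task ran for roughly 22 times the preferred runtime, not 3 times, which is the discrepancy ramostol is pointing at.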

Idan Shifres Joined: 12 Dec 08 Posts: 11 Credit: 126,517 RAC: 0	Message 58612 - Posted: 7 Jan 2009, 13:01:22 UTC - in response to Message 58532. Last modified: 7 Jan 2009, 13:06:50 UTC
"Idan, what is the normal runtime preference for the host that is running that task? (Note to self: why hasn't the watchdog ended it, if the preference is <24hrs?)"
The computer's working on BOINC 24/7... better watch that watchdog... :)
I have another WU with 40+ hours running right now; I'll let it finish to see if it gets the same watchdog message...
Whoops, looks like it just finished, as you can see: HERE.
<core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 183132 seconds. Greater than 3X preferred time: 10800 seconds
********************************************************************
called boinc_finish
</stderr_txt>
]]>
Same watchdog being late again, now only after 50 hours... gave me 80 credits... which is 1.6 credits per hour... I think you should change your credit system a bit... :P
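Editor's sketch: for anyone comparing several of these reports, the two numbers in the watchdog message can be pulled out with a regular expression and turned into hours and credits per hour. The sample text is copied from the post above; the helper name `parse_watchdog` and the patterns are assumptions made for this example, not part of BOINC or Rosetta.

```python
import re

# stderr text as posted above (only the lines that matter here).
STDERR = """
Rosetta is going too long. Watchdog is ending the run!
CPU time: 183132 seconds. Greater than 3X preferred time: 10800 seconds
called boinc_finish
"""


def parse_watchdog(text):
    """Return (cpu_seconds, preferred_seconds), or None if not found."""
    cpu = re.search(r"CPU time:\s*(\d+)\s*seconds", text)
    pref = re.search(r"preferred time:\s*(\d+)\s*seconds", text)
    if cpu is None or pref is None:
        return None
    return int(cpu.group(1)), int(pref.group(1))


cpu_seconds, preferred_seconds = parse_watchdog(STDERR)
cpu_hours = cpu_seconds / 3600
credits = 80  # credit granted, as reported in the post

print(round(cpu_hours, 1))              # 50.9
print(round(credits / cpu_hours, 2))    # 1.57
```

The result agrees with the "1.6 credits per hour" figure in the post once rounded to one decimal place.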

pramo Joined: 21 Oct 05 Posts: 4 Credit: 362,513 RAC: 0	Message 58617 - Posted: 7 Jan 2009, 14:50:40 UTC
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=198269130
70+ hours, 9+ hours to go, deadline in 7 hours. I killed it.

pramo Joined: 21 Oct 05 Posts: 4 Credit: 362,513 RAC: 0	Message 58618 - Posted: 7 Jan 2009, 14:56:50 UTC - in response to Message 58617.
"https://boinc.bakerlab.org/rosetta/workunit.php?wuid=198269130 70+ hours, 9+ hours to go, deadline in 7 hours. I killed it."
Don't know if this helps, or what you might want to look at on my end. As you can see, a long-time member and first-time poster :)
stderr.txt was a bunch of this:
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
stdout.txt was this:
[2008-12-29 6:19:24:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
. . . <snip> . . .
[2009- 1- 7 5:23:12:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1- 7 8:40: 5:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
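Editor's sketch: each `boinc_init()` line in stdout.txt appears to mark the application starting up again, so counting them gives a rough idea of how often the task was restarted. The snippet below parses the timestamps as posted (note the space-padded fields such as "8:40: 5"). The excerpt was snipped by the poster, so the count here is only for the lines shown; the regular expression and the helper name `init_times` are assumptions for this example.

```python
import re
from datetime import datetime

# Excerpt from the post above; the poster snipped the middle of the log.
STDOUT_EXCERPT = """\
[2008-12-29 6:19:24:] :: BOINC :: boinc_init()
Starting watchdog...
[2009- 1- 7 5:23:12:] :: BOINC :: boinc_init()
Starting watchdog...
[2009- 1- 7 8:40: 5:] :: BOINC :: boinc_init()
Starting watchdog...
"""

# Fields may be padded with spaces instead of zeros, hence the \s* parts.
STAMP = re.compile(
    r"\[(\d{4})-\s*(\d+)-\s*(\d+)\s+(\d+):\s*(\d+):\s*(\d+):\]"
    r"\s*:: BOINC :: boinc_init\(\)"
)


def init_times(text):
    """Return a datetime for every boinc_init() line found in the text."""
    return [datetime(*map(int, m.groups())) for m in STAMP.finditer(text)]


starts = init_times(STDOUT_EXCERPT)
print(len(starts))               # 3 (in this excerpt only)
print(starts[-1] - starts[0])    # 9 days, 2:20:41
```

A task that has been on the machine for nine days and keeps re-initialising is consistent with the restart-without-checkpoint explanation raised later in the thread, though the log alone does not prove it.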

AdeB Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0	Message 58661 - Posted: 7 Jan 2009, 23:11:12 UTC
I aborted lr5_score12_rlbd_2fls_IGNORE_THE_REST_DECOY_5559_1293_1 after it had run for more than 30 hours.
AdeB

Idan Shifres Joined: 12 Dec 08 Posts: 11 Credit: 126,517 RAC: 0	Message 58665 - Posted: 8 Jan 2009, 10:05:42 UTC
I would really like to know whether it is worth letting the WU finish by itself, or whether we should just abort it after 25-30 hours. Do you look at the problems reported by the long-running models? Does letting them run contribute anything or not? If it does contribute, I'll happily keep running them to the bitter end, but if not, I'd like to do some work that really helps the science... :)

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 58669 - Posted: 8 Jan 2009, 11:39:36 UTC - in response to Message 58665. Last modified: 8 Jan 2009, 11:39:54 UTC
"I would really like to know whether it is worth letting the WU finish by itself, or whether we should just abort it after 25-30 hours. Do you look at the problems reported by the long-running models? Does letting them run contribute anything or not? If it does contribute, I'll happily keep running them to the bitter end, but if not, I'd like to do some work that really helps the science... :)"
Not sure what your run time is, but if it is way over the maximum time, you might as well abort it, as you won't get all that much credit for it anyway. You will regain your loss with the better-running tasks. Be sure to post a link to the specific task, so the team knows which task it was that ran so long.

Idan Shifres Joined: 12 Dec 08 Posts: 11 Credit: 126,517 RAC: 0	Message 58672 - Posted: 8 Jan 2009, 11:57:24 UTC
Thanks for the reply. I was talking about the scientific value of letting the WU run for a long time until it aborts itself, not about the points. I would like to know whether it is important for you guys to see that the watchdog stopped the application later than 3 times the time it was supposed to run, or whether it doesn't matter to you whatsoever.

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 58674 - Posted: 8 Jan 2009, 12:03:39 UTC - in response to Message 58672.
"Thanks for the reply. I was talking about the scientific value of letting the WU run for a long time until it aborts itself, not about the points. I would like to know whether it is important for you guys to see that the watchdog stopped the application later than 3 times the time it was supposed to run, or whether it doesn't matter to you whatsoever."
If the task is done, but ran over your time settings by more than 3x, be sure to post the link here so that the team can have a look at the task. Same thing if you aborted the task. They will take any results that were completed before the abort or the watchdog termination, if the task uploads.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 58678 - Posted: 8 Jan 2009, 14:04:59 UTC - in response to Message 58672.
"Thanks for the reply. I was talking about the scientific value of letting the WU run for a long time until it aborts itself, not about the points. I would like to know whether it is important for you guys to see that the watchdog stopped the application later than 3 times the time it was supposed to run, or whether it doesn't matter to you whatsoever."
The main benefit to the science is going to be in understanding how to eliminate the long-running models, not specifically in the result that yours produces. It is certainly useful for the Project Team to see the completed result (and/or a report in this thread). Otherwise it just looks like many other "user aborted" tasks, and does not lead one to enter "track down problem" mode, or to suspect there are specific reasons people aborted the task.
At this point, due to the reports in this thread, some of the causes of long-running models have been identified and coding changes are in the works to resolve them. So I'd suggest following the guidelines as originally posted in this thread: abort the task, report its behavior, and continue on.
Rosetta Moderator: Mod.Sense

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 58687 - Posted: 8 Jan 2009, 16:51:05 UTC
From my "run time is set for 4 hrs but tasks are running 6 hrs" thread:
My settings are for 4 hr run times, but these specific tasks ran 6 hrs without me touching a thing.
abinitio_norelax_homfrag_129_B_4icbA_SAVE_ALL_OUT_4626_4561_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=218675744
abinitio_norelax_homfrag_129_B_5croA_SAVE_ALL_OUT_4626_4561_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=218675745
abinitio_norelax_homfrag_129_B_1a8oA_SAVE_ALL_OUT_4626_4562_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=218675747
Credit was granted; in 2 out of 3 cases it was higher than claimed, and in the other it was lower than claimed. All completed OK.

rochester new york Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0	Message 58688 - Posted: 8 Jan 2009, 19:53:35 UTC
https://boinc.bakerlab.org/rosetta/results.php?hostid=267483

rochester new york Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0	Message 58734 - Posted: 11 Jan 2009, 19:29:42 UTC - in response to Message 58688. Last modified: 11 Jan 2009, 19:31:03 UTC
https://boinc.bakerlab.org/rosetta/result.php?resultid=219050690
https://boinc.bakerlab.org/rosetta/results.php?hostid=267483

robertmiles Joined: 16 Jun 08 Posts: 1264 Credit: 14,421,737 RAC: 0	Message 58745 - Posted: 12 Jan 2009, 1:18:18 UTC - in response to Message 58734. Last modified: 12 Jan 2009, 1:21:59 UTC
"https://boinc.bakerlab.org/rosetta/result.php?resultid=219050690 https://boinc.bakerlab.org/rosetta/results.php?hostid=267483"
Looks like you've got a combination of problems.
1. Using a machine where the time between interrupts for other work is less than the time between checkpoints, without also using the leave-in-memory option (which may also need an increase in the upper limit of disk space allowed and the percentage of the swap space allowed to be effective without causing problems).
2. A recently completed workunit (although with an error) which took such a short time that the software overestimates how many workunits your machine can complete within the queue length you have selected.
3. A queue length so long relative to the deadline length that your machine has problems catching up when you get too many jobs at once.
4. A group of workunits from a recent batch with poor estimates of how long they will run.
I'm not sure what to do about it except to abort some of the workunits in the queue that your machine hasn't started working on yet, before they reach the deadline.
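Editor's sketch of the queue-versus-deadline reasoning in points 2 and 3 above. This is a back-of-the-envelope estimate, not how the BOINC scheduler actually plans work; the function names and all the numbers in the example (12 queued tasks, a 6-hour runtime preference, a 2-day deadline, one core) are made up for illustration and assume every task runs for exactly the preferred runtime.

```python
def work_days_needed(task_count, runtime_pref_hours, cores=1, hours_per_day=24.0):
    """Days of crunching needed to clear the queue at the preferred runtime."""
    total_hours = task_count * runtime_pref_hours
    return total_hours / (cores * hours_per_day)


def tasks_to_abort(task_count, runtime_pref_hours, days_to_deadline,
                   cores=1, hours_per_day=24.0):
    """How many unstarted tasks cannot finish before the deadline."""
    capacity_hours = days_to_deadline * cores * hours_per_day
    can_finish = int(capacity_hours // runtime_pref_hours)
    return max(0, task_count - can_finish)


# Hypothetical host: 12 queued tasks, 6-hour runtime preference, one core,
# running around the clock.
print(work_days_needed(12, 6))                     # 3.0
print(tasks_to_abort(12, 6, days_to_deadline=2))   # 4
print(tasks_to_abort(12, 6, days_to_deadline=5))   # 0
```

If the estimate says the queue fits inside the deadline, nothing needs aborting; long-running models or a machine that is not on all day would shrink the real capacity below this figure.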

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 58749 - Posted: 12 Jan 2009, 3:46:07 UTC
You can also reduce your runtime preference and then update to the project. The new preference is applied to your existing tasks, even if they've already started. But this makes the rest of future scheduling a mess, so aborting a few may be easier.
But it looks like you've got plenty of time to complete them. There are only three days of work there if your machine is on and crunching Rosetta all the time. I think that particular task must have had a long-running model without a checkpoint. Or perhaps the machine was restarted several times in a row to install service packs etc. while this task was in progress.
Rosetta Moderator: Mod.Sense

AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 58755 - Posted: 12 Jan 2009, 13:34:53 UTC - in response to Message 58749.
"You can also reduce your runtime preference and then update to the project. The new preference is applied to your existing tasks, even if they've already started."
Is it still true that running WUs use the new runtime? I recently changed my runtime, and it seemed to me that the rosetta 5.98 WUs that were running did indeed use the new runtime, but the running minirosetta WUs did not. They seemed to use the preference that was set when they started.
I didn't try stopping and restarting BOINC. That might have forced minirosetta to read the new runtime when it restarted.