Discussion on increasing the default run time

Message boards : Number crunching : Discussion on increasing the default run time



robertmiles

Joined: 16 Jun 08
Posts: 1232
Credit: 14,272,718
RAC: 1,488
Message 57052 - Posted: 19 Nov 2008, 2:02:47 UTC
Last modified: 19 Nov 2008, 2:12:36 UTC

If you are going to pack more than one model into each workunit, could you also score the results for each of those models separately, then add these scores together, so you don't automatically discard all the results for a workunit if one of its later models fails? In that case, after some types of errors, you might then have the workunit go on to the next model not yet tried.

That might, however, call for at least one new outcome state, such as one indicating partially successful.

I don't mind longer workunits, as long as I can still choose the length of the workunits so that I can complete at least one workunit each day. Currently, I've already set my preferred workunit time to 8 hours, since that seems to balance my desires and yours better than my previous setting.
ID: 57052 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 57055 - Posted: 19 Nov 2008, 4:36:22 UTC - in response to Message 57052.  
Last modified: 19 Nov 2008, 4:38:40 UTC

> If you are going to pack more than one model into each workunit, could you also score the results for each of those models separately, then add these scores together...


Yes, that's how it works now.

> ...so you don't automatically discard all the results for a workunit if one of its later models fails?


Depending upon how and why it fails, prior models completed are generally preserved, and credited appropriately.

> In that case, after some types of errors, you might then have the workunit go on to the next model not yet tried.


If the task failed, then for some reason it is not running well on your machine. It is more conservative to replace it with another task that may run better for your environment. In other words if model 1 or 2 failed from this task, let's not push our luck with more. Better to get word back to the project server about the failure sooner. Perhaps there is a trend that will indicate similar future work should be held until a specific issue is resolved.
Rosetta Moderator: Mod.Sense
ID: 57055 · Rating: 0
AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 57066 - Posted: 19 Nov 2008, 14:04:31 UTC

If the first model fails, then the WU should error out. But if most models are successful, why should a single bad model cause the whole WU to be invalid? As long as most models are successful, I see no reason the WU can't continue crunching for the normal length of time and then be marked valid.
ID: 57066 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 57067 - Posted: 19 Nov 2008, 14:15:06 UTC

At present, without any of the proposed change, most tasks that complete some models and are then ended are marked "Success". But if the task was ended by the watchdog due to too many restarts, that possibly indicates that the models run for a long time between checkpoints, and that the person using the machine does not retain tasks in memory while suspended and/or is turning the machine on and off frequently. So there certainly are cases where the next task may run better than the current one. And, as I said previously, the sooner the error report gets back to the project, the more chance there is to assess that information before sending out more similar work. It could be that everyone is failing on those specific tasks, in which case continuing the task doesn't make any sense.

Either way, the goal is certainly to have tasks run normally and complete within the target runtime. So establishing the behavior and default values with that scenario in mind makes sense.
Rosetta Moderator: Mod.Sense
ID: 57067 · Rating: 0
AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 57070 - Posted: 19 Nov 2008, 16:36:31 UTC - in response to Message 57067.  

> At present, without any of the proposed change, most tasks that complete some models and are then ended are marked "Success".

Let's consider a WU that I recently had which was marked as invalid due to NANs in hbonding:
h001b_BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-11-S3-4--h001b-_4769_1224

Note that my WU ended after nearly 13 hours of my 16 hour preference and was marked invalid:
https://boinc.bakerlab.org/rosetta/result.php?resultid=207873078

Then someone else crunched it successfully using the default 3 hour crunch time:
https://boinc.bakerlab.org/rosetta/result.php?resultid=208374490

He successfully did 4 models in that 3 hour crunch time. I would guess that I had done around 14 models successfully and then model 15 got the NANs error. Thus, after 14 good models, a single bad model caused the whole WU to be marked invalid. The other person had a valid WU because he stopped crunching after only 4 models.

So why couldn't minirosetta just discard the bad model #15 and continue crunching models until the 16 hours were up, then report the WU as valid?
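To make the suggestion concrete, here is a minimal sketch of the "discard the bad model and keep the rest" policy. It is not how minirosetta actually behaves. The function name `classify_workunit`, the `partial_success` state, and the 50% threshold are all hypothetical, and a NaN score stands in for a model that hit the NaN error.

```python
import math

def classify_workunit(model_scores, min_good_fraction=0.5):
    """Score each model separately and drop models whose score is NaN.

    model_scores: per-model scores in the order the models were produced.
    Returns (status, kept_scores). All names and the threshold are
    hypothetical; this is an illustration, not project code.
    """
    # If there are no models, or the very first model fails, error out.
    if not model_scores or math.isnan(model_scores[0]):
        return "error", []
    # Keep every model that produced a usable score.
    kept = [s for s in model_scores if not math.isnan(s)]
    if len(kept) == len(model_scores):
        return "success", kept
    # Most models were fine: report the good ones instead of discarding all.
    if len(kept) / len(model_scores) > min_good_fraction:
        return "partial_success", kept
    return "error", kept

# 14 good models followed by one NaN model, as in the case described above.
status, kept = classify_workunit([1.0] * 14 + [float("nan")])
print(status, len(kept))
```

With this policy the 14 good models would be kept and reported, and only model 15 would be thrown away.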
ID: 57070 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 57071 - Posted: 19 Nov 2008, 17:33:55 UTC - in response to Message 57070.  

> So why couldn't minirosetta just discard the bad model #15 and continue crunching models until the 16 hours were up, then report the WU as valid?


I agree with you. It could and should preserve any valid work completed and give credit for it. But the fact that this does not always happen is really unrelated to the topic of this thread.
Rosetta Moderator: Mod.Sense
ID: 57071 · Rating: 0
AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 57089 - Posted: 19 Nov 2008, 22:22:41 UTC - in response to Message 57071.  

> But the fact that this does not always happen is really unrelated to the topic of this thread.

This thread seems to be about the effects of increasing the default run time.

Currently, we have some WUs that sometimes bomb out with NAN errors in random models. With a 3 hour run time, only a few models will be crunched, so the chance of one of them bombing out with this error is small. If a model does bomb out and cause the WU to be invalid, the number of good models that are lost will be small as well. For instance, if 4 models are normally crunched in the 3 hours, then between 0 and 3 good models will have been crunched before the bad model (average 1.5 good models lost).

With a 6 hour run time, there will be twice as many models crunched, so the chance of hitting a bad model is doubled. There will also likely be many more good models lost. For instance, if 8 models are crunched in the 6 hours, then between 0 and 7 good models will be lost (average 3.5 good models lost).

So increasing the default run time will increase the number of good models that are lost.
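The averages above can be checked with a short sketch. It assumes the bad model is equally likely to be any one of the models in the task, and it uses an illustrative per-model failure probability of 1%, which is an assumption rather than a measured figure. The function names are hypothetical.

```python
def expected_good_models_lost(n_models):
    """Average number of good models completed before the bad one,
    assuming the bad model is equally likely to be model 1..n_models."""
    # If the bad model is number k, then k - 1 good models precede it.
    return sum(range(n_models)) / n_models

def chance_of_bad_model(n_models, p_fail):
    """Probability that at least one of n_models independent models fails."""
    return 1 - (1 - p_fail) ** n_models

print(expected_good_models_lost(4))   # 3 hour run, 4 models
print(expected_good_models_lost(8))   # 6 hour run, 8 models

# p_fail = 0.01 is an assumed value, for illustration only.
ratio = chance_of_bad_model(8, 0.01) / chance_of_bad_model(4, 0.01)
print(round(ratio, 2))
```

For small failure probabilities the chance of hitting a bad model is very nearly, though not exactly, doubled when the number of models doubles.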

ID: 57089 · Rating: 0
DJStarfox

Joined: 19 Jul 07
Posts: 145
Credit: 1,250,162
RAC: 0
Message 57098 - Posted: 20 Nov 2008, 18:17:53 UTC

Limited to just changing this runtime parameter...

There are a lot of assumptions in choosing these numbers.... Such as, "What percentage of clients request work that have a runtime preference less than 3 hours?" If that number is small, changing the lower bound won't help you. If a large majority of the users have the default setting of 3 hours, then changing the default to 4 hours should reduce the work request rate to the servers by about 1/4 (each task runs 1/3 longer). Would that be enough?
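A rough check of the request-rate arithmetic, as a minimal sketch. It assumes every host crunches tasks back to back and that each task runs for exactly the default runtime, which real hosts will not do. The function names are hypothetical.

```python
def requests_per_day(runtime_hours):
    """Tasks one continuously busy core requests per day (idealized)."""
    return 24 / runtime_hours

def relative_reduction(old_hours, new_hours):
    """Fractional drop in request rate when the runtime is lengthened."""
    return 1 - requests_per_day(new_hours) / requests_per_day(old_hours)

print(relative_reduction(3, 4))   # default moved from 3 to 4 hours
print(relative_reduction(3, 6))   # default moved from 3 to 6 hours
```

Under this model a move from 3 to 4 hours cuts requests by a quarter, and a move to 6 hours cuts them in half. A one-third reduction would need a default of 4.5 hours.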

I believe the project should increase it by 1 hour every 3 weeks until the desired setting is reached. I believe that the minimum should be 2 hours, but the default should be 4 hours. Try it out for at least 3 weeks before drawing any conclusions.

Another idea:
Just have a default "number of models to generate in each task" setting in preferences. Task would stop after successfully generating that number of models. The watchdog would be configured appropriately for each task's expected time; it could be generic = 24 hours. Default # models to generate = 3.

Assumptions: Scientists approximately know how long a model takes to compute for any given WU. Can calculate watchdog hard limits based on this knowledge.
ID: 57098 · Rating: 0
funkydude

Joined: 15 Jun 08
Posts: 28
Credit: 397,934
RAC: 0
Message 57101 - Posted: 20 Nov 2008, 20:02:56 UTC - in response to Message 57098.  
Last modified: 20 Nov 2008, 20:08:03 UTC

The post on the front page specifies that the default will be 6 hours, but the minimum 3 hours. So I'd like to know where exactly we set this. I can't find it anywhere on the website. Currently, my times have not changed at all; they are still 2-3 hours.

EDIT: A picture is worth a thousand words: http://img156.imageshack.us/img156/2679/boincrn3.jpg
ID: 57101 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 57102 - Posted: 20 Nov 2008, 20:54:05 UTC

yo, whoa, funky: when it says "...planning to..." it means it hasn't happened yet. It also says "When and how this will occur has not yet been decided."

You can change your runtime preference by clicking the "[Participants]" link at the top of this message board webpage. Then click the link for "Rosetta@home preferences". If you have not done that, your runtimes default to 3 hours. And if the project suddenly changes that to 6hrs, this might cause you some issues with how you like to use your machine. Hence, discussing this beforehand in this thread.
Rosetta Moderator: Mod.Sense
ID: 57102 · Rating: 0
DJStarfox

Joined: 19 Jul 07
Posts: 145
Credit: 1,250,162
RAC: 0
Message 57105 - Posted: 20 Nov 2008, 21:44:42 UTC - in response to Message 57102.  

Mod.Sense,

Do you have any thoughts about my ideas?
(Go up three posts.)
ID: 57105 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 57111 - Posted: 21 Nov 2008, 0:06:38 UTC

I prefer to remain in the facilitator role here and not attempt to influence the discussion.

I would point out that establishing a fixed number of models, say for example I set it to 20, doesn't really help you much, because some proteins can do 20 models in an hour and others would take days. Although if the model runtime were rather predictable, perhaps the WU flops count could be tailored for each different protein of a given batch of work. So it is still possible that some form of this approach could be implemented, and result in more predictable runtimes.
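To illustrate why a fixed model count gives unpredictable runtimes, and what a per-protein model count might look like, here is a minimal sketch. The per-model times are made-up values chosen only to match the "an hour" versus "days" contrast, and both function names are hypothetical. Nothing here reflects how the scheduler works.

```python
def task_runtime_hours(n_models, minutes_per_model):
    """Total runtime if a task always produces n_models models."""
    return n_models * minutes_per_model / 60

def models_for_target(target_hours, minutes_per_model):
    """Model count that fits a target runtime for one protein.

    Always returns at least 1, so a slow protein still yields a result.
    """
    return max(1, int(target_hours * 60 // minutes_per_model))

# Fixed count of 20 models, with assumed per-model times of 3 and 180 minutes.
print(task_runtime_hours(20, 3))     # a fast protein, in hours
print(task_runtime_hours(20, 180))   # a slow protein, in hours

# Per-protein counts for a 3 hour target, using assumed per-model times.
print(models_for_target(3, 45))
print(models_for_target(3, 180))
```

The same count of 20 gives one hour for the fast protein and 60 hours for the slow one, while a per-protein count keeps both close to the target.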

I'm not sure how many changes DK is planning to incorporate into this runtime change project. To modify flops counts, watchdog timeouts, etc. on the fly as work is sent out would probably require some serious changes to the scheduler.

So, anyway, I think I see where you are coming from with that, but I'm not sure how quickly such concepts could be incorporated. It seems likely they are beyond the scope of the immediate goal.
Rosetta Moderator: Mod.Sense
ID: 57111 · Rating: 0
DJStarfox

Joined: 19 Jul 07
Posts: 145
Credit: 1,250,162
RAC: 0
Message 57119 - Posted: 21 Nov 2008, 2:34:14 UTC - in response to Message 57111.  

Mod.Sense,

Thank you for the feedback.

In response to the original question (by Dave E K), the minimum runtime should be increased by 1 hour every 2-3 weeks (at least 10 days + time to task deadline allotted). This is enough time to allow even the biggest crunchers to empty their BOINC cache using the new runtime preference.

I also think that an announcement in the site news and a link to the "Runtime preference FAQ" would be very helpful. The FAQ would just explain what the runtime preference is and where to change it on the website. The news and link would encourage and help users make a wise decision for this setting (and keep the redundant questions to a minimum).
ID: 57119 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2122
Credit: 41,183,435
RAC: 10,025
Message 57194 - Posted: 24 Nov 2008, 2:03:21 UTC - in response to Message 57098.  

> There are a lot of assumptions in choosing these numbers.... Such as, "What percentage of clients request work that have a runtime preference less than 3 hours?" If that number is small, changing the lower bound won't help you. If a large majority of the users have the default setting of 3 hours, then changing the default to 4 hours should reduce the work request rate to the servers by about 1/4 (each task runs 1/3 longer). Would that be enough?
>
> I believe the project should increase it by 1 hour every 3 weeks until the desired setting is reached. I believe that the minimum should be 2 hours, but the default should be 4 hours. Try it out for at least 3 weeks before drawing any conclusions.

I support this change, particularly the minimum run-time change to 2 hours rather than 3. The problems I've repeatedly mentioned mean my failure rate would vastly increase if the minimum were 3 hours.
ID: 57194 · Rating: 0
funkydude

Joined: 15 Jun 08
Posts: 28
Credit: 397,934
RAC: 0
Message 57252 - Posted: 26 Nov 2008, 18:27:25 UTC

I found a new hours section in the cPanel, I'm wondering, does increasing my hours on work increase the size of the download and the RAM taken?
ID: 57252 · Rating: 0
DJStarfox

Joined: 19 Jul 07
Posts: 145
Credit: 1,250,162
RAC: 0
Message 57254 - Posted: 26 Nov 2008, 19:04:04 UTC - in response to Message 57252.  

> I found a new hours section in the cPanel, I'm wondering, does increasing my hours on work increase the size of the download and the RAM taken?


No. You'll just get more credits for each task you compute.
ID: 57254 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 57256 - Posted: 26 Nov 2008, 19:20:11 UTC - in response to Message 57252.  
Last modified: 26 Nov 2008, 19:20:35 UTC

> I found a new hours section in the cPanel, I'm wondering, does increasing my hours on work increase the size of the download and the RAM taken?


Actually, your downloads will likely be reduced, because each task will keep your machine busy for a longer period of time. The RAM used is the same either way. It just controls the number of models you produce for the task. The models themselves and calculations performed are the same either way.
Rosetta Moderator: Mod.Sense
ID: 57256 · Rating: 0
Nothing But Idle Time

Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 57825 - Posted: 12 Dec 2008, 16:49:09 UTC

Anyone out there? Is this discussion over? Then let's have a decision.
ID: 57825 · Rating: 0
Otto

Joined: 6 Apr 07
Posts: 27
Credit: 3,567,665
RAC: 0
Message 57827 - Posted: 12 Dec 2008, 17:02:41 UTC

What's the decision? (I'm personally fine with the minimum of 3 hours.)
ID: 57827 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 57828 - Posted: 12 Dec 2008, 18:00:16 UTC

We are still waiting to test out some bug fixes.
ID: 57828 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org