Message boards : Number crunching : Discussion on increasing the default run time
Author | Message |
---|---|
robertmiles Joined: 16 Jun 08 Posts: 1231 Credit: 14,238,388 RAC: 3,880 |
If you are going to pack more than one model into each workunit, could you also score the results for each of those models separately, then add these scores together, so you don't automatically discard all the results for a workunit if one of its later models fails? In that case, after some types of errors, you might then have the workunit go on to the next model not yet tried. That might, however, call for at least one new outcome state, such as one indicating partially successful. I don't mind longer workunits, as long as I can still choose the length of the workunits so that I can complete at least one workunit each day. Currently, I've already set my preferred workunit time to 8 hours, since that seems to balance my desires and yours better than my previous setting. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"If you are going to pack more than one model into each workunit, could you also score the results for each of those models separately, then add these scores together..." Yes, that's how it works now. "...so you don't automatically discard all the results for a workunit if one of its later models fails?" Depending upon how and why it fails, prior models completed are generally preserved, and credited appropriately. "In that case, after some types of errors, you might then have the workunit go on to the next model not yet tried." If the task failed, then for some reason it is not running well on your machine. It is more conservative to replace it with another task that may run better for your environment. In other words, if model 1 or 2 failed from this task, let's not push our luck with more. Better to get word back to the project server about the failure sooner. Perhaps there is a trend that will indicate similar future work should be held until a specific issue is resolved. Rosetta Moderator: Mod.Sense |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
If the first model fails, then the WU should error out. But if most models are successful, why should a single bad model cause the whole WU to be invalid? As long as most models are successful, I see no reason the WU can't continue crunching for the normal length of time and then be marked valid. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
At present, without any of the proposed change, most tasks that complete some models and are then ended are marked "Success". But if the task was ended by the watchdog due to too many restarts, then it possibly indicates that the models run for a long time between checkpoints, and the person using the machine does not retain tasks in memory while suspended and/or is turning the machine on and off frequently. So, there certainly are cases where the next task may run better than the current one. And, as I said previously, the sooner the error report gets back to the project, the more chance to assess that information prior to sending out more similar work. It could be that everyone is failing on those specific tasks. In that case, continuing it doesn't make any sense. Either way, the goal is certainly to have tasks run normally, and to have them complete within the target runtime. So, establishing the behavior and default values with that scenario in mind makes sense. Rosetta Moderator: Mod.Sense |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
"At present, without any of the proposed change, most tasks that complete some models and are then ended are marked 'Success'." Let's consider a WU that I recently had which was marked as invalid due to NANs in hbonding: h001b_BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-11-S3-4--h001b-_4769_1224 Note that my WU ended after nearly 13 hours of my 16 hour preference and was marked invalid: https://boinc.bakerlab.org/rosetta/result.php?resultid=207873078 Then someone else crunched it successfully using the default 3 hour crunch time: https://boinc.bakerlab.org/rosetta/result.php?resultid=208374490 He successfully did 4 models in that 3 hour crunch time. I would guess that I had done around 14 models successfully and then model 15 got the NANs error. Thus, after 14 good models, a single bad model caused the whole WU to be marked invalid. The other person had a valid WU because he stopped crunching after only 4 models. So why couldn't minirosetta just discard the bad model #15 and continue crunching models until the 16 hours were up, then report the WU as valid? |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"So why couldn't minirosetta just discard the bad model #15 and continue crunching models until the 16 hours were up, then report the WU as valid?" I agree with you. It could and should preserve any valid work completed and give credit for it. But the fact that this does not always happen is really unrelated to the topic of this thread. Rosetta Moderator: Mod.Sense |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
"But the fact that this does not always happen is really unrelated to the topic of this thread." This thread seems to be about the effects of increasing the default run time. Currently, we have some WUs that sometimes bomb out with NAN errors in random models. With a 3 hour run time, only a few models will be crunched, so the chance of one of them bombing out with this error is small. If a model does bomb out and cause the WU to be invalid, the number of good models that are lost will be small as well. For instance, if 4 models are normally crunched in the 3 hours, then between 0 and 3 good models will have been crunched before the bad model (average 1.5 good models lost). With a 6 hour run time, there will be twice as many models crunched, so the chance of hitting a bad model is doubled. There will also likely be many more good models lost. For instance, if 8 models are crunched in the 6 hours, then between 0 and 7 good models will be lost (average 3.5 good models lost). So increasing the default run time will increase the number of good models that are lost. |
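The arithmetic in the post above can be reproduced with a short Python sketch. It assumes, as the post does, that a failing model is equally likely to fall at any position in the task. The per-model failure probability `p_fail` is a hypothetical value chosen only for illustration, not a measured Rosetta@home figure, and the function names are made up for this example.

```python
from fractions import Fraction


def mean_good_models_lost(models_per_task: int) -> Fraction:
    """Average number of good models completed before the bad one,
    assuming the bad model is equally likely to sit at any position."""
    # If the bad model is at position k (0-based), k good models precede it.
    good_before_failure = range(models_per_task)
    return Fraction(sum(good_before_failure), models_per_task)


def chance_of_bad_model(models_per_task: int, p_fail: float) -> float:
    """Probability that at least one model in the task fails,
    assuming each model fails independently with probability p_fail."""
    return 1.0 - (1.0 - p_fail) ** models_per_task


if __name__ == "__main__":
    P_FAIL = 0.01  # hypothetical per-model failure rate, for illustration only
    for hours, models in ((3, 4), (6, 8)):
        lost = float(mean_good_models_lost(models))
        risk = chance_of_bad_model(models, P_FAIL)
        print(f"{hours} h, {models} models: mean lost = {lost}, risk = {risk:.4f}")
```

Under these assumptions the average loss is 1.5 models for a 4-model task and 3.5 for an 8-model task, matching the figures in the post. The chance of hitting a bad model comes out slightly less than doubled, because the exact probability is 1 - (1 - p)^n rather than n times p.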
DJStarfox Joined: 19 Jul 07 Posts: 145 Credit: 1,250,162 RAC: 0 |
Limited to just changing this runtime parameter... There are a lot of assumptions in choosing these numbers, such as, "What percentage of clients request work that have a runtime preference less than 3 hours?" If that number is small, changing the lower bound won't help you. If a large majority of the users have the default setting of 3 hours, then changing the default to 4 hours should reduce the work request rate to the servers by 1/3. Would that be enough? I believe the project should increase it by 1 hour every 3 weeks until the desired setting is reached. I believe that the minimum should be 2 hours, but the default should be 4 hours. Try it out for at least 3 weeks before drawing any conclusions. Another idea: Just have a default "number of models to generate in each task" setting in preferences. The task would stop after successfully generating that number of models. The watchdog would be configured appropriately for each task's expected time; it could be generic = 24 hours. Default # models to generate = 3. Assumptions: Scientists approximately know how long a model takes to compute for any given WU, and can calculate watchdog hard limits based on this knowledge. |
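As a rough check on how run time affects server load, the sketch below assumes each core is busy around the clock and requests one new task whenever the previous one finishes. That is a simplification: real hosts cache work and often run part-time, and the function names here are hypothetical. Under that assumption, moving the default from 3 to 4 hours makes each task one third longer, which cuts the number of requests by one quarter.

```python
from fractions import Fraction


def requests_per_day(run_time_hours: int) -> Fraction:
    """Tasks one continuously busy core requests per day
    (simplifying assumption: one request per completed task)."""
    return Fraction(24, run_time_hours)


def relative_reduction(old_hours: int, new_hours: int) -> Fraction:
    """Fraction of task requests saved by lengthening the run time."""
    old = requests_per_day(old_hours)
    new = requests_per_day(new_hours)
    return (old - new) / old


if __name__ == "__main__":
    for new_hours in (4, 6):
        saved = relative_reduction(3, new_hours)
        print(f"3 h -> {new_hours} h: requests reduced by {saved}")
```

With these assumptions a 3 to 4 hour change saves 1/4 of the requests, and a 3 to 6 hour change saves 1/2.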
funkydude Joined: 15 Jun 08 Posts: 28 Credit: 397,934 RAC: 0 |
The post on the front page specifies that the default will be 6 hours, but the minimum 3 hours. So I'd like to know where exactly do we set this? I can't find it anywhere on the website. Currently, my times have not changed at all, they are still 2-3 hours. EDIT: A picture is worth a thousand words: http://img156.imageshack.us/img156/2679/boincrn3.jpg |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
yo, whoa, funky: when it says "...planning to..." it means it hasn't happened yet. It also says "When and how this will occur has not yet been decided." You can change your runtime preference by clicking the "[Participants]" link at the top of this message board webpage. Then click the link for "Rosetta@home preferences". If you have not done that, your runtimes default to 3 hours. And if the project suddenly changes that to 6 hours, this might cause you some issues with how you like to use your machine. Hence, discussing this beforehand in this thread. Rosetta Moderator: Mod.Sense |
DJStarfox Joined: 19 Jul 07 Posts: 145 Credit: 1,250,162 RAC: 0 |
Mod.Sense, Do you have any thoughts about my ideas? (Go up three posts.) |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I prefer to remain in the facilitator role here and not attempt to influence the discussion. I would point out that establishing a fixed number of models (say, for example, I set it to 20) doesn't really help you much, because some proteins can do 20 models in an hour, and others would take days. Although if the model runtime were rather predictable, perhaps the WU flops count could be tailored for each different protein of a given batch of work. So it is still possible that some form of this approach could be implemented, and result in more predictable runtimes. I'm not sure how many changes DK is planning to incorporate into this runtime change project. To modify flops counts, watchdog timeouts, etc. on the fly as work is sent out would probably require some serious changes to the scheduler. So, anyway, I think I see where you are coming from with that, but I'm not sure how quickly such concepts could be incorporated. It seems likely they are beyond the scope of the immediate goal. Rosetta Moderator: Mod.Sense |
DJStarfox Joined: 19 Jul 07 Posts: 145 Credit: 1,250,162 RAC: 0 |
Mod.Sense, Thank you for the feedback. In response to the original question (by Dave E K), the minimum runtime should be increased by 1 hour every 2-3 weeks (at least 10 days + time to task deadline allotted). This is enough time to allow even the biggest crunchers to empty their BOINC cache using the new runtime preference. I also think that an announcement in the site news and a link to the "Runtime preference FAQ" would be very helpful. The FAQ would just explain what the runtime preference is and where to change it on the website. The news and link would encourage and help users make a wise decision for this setting (and keep the redundant questions to a minimum). |
Sid Celery Joined: 11 Feb 08 Posts: 2106 Credit: 40,947,636 RAC: 18,521 |
"There are a lot of assumptions in choosing these numbers.... Such as, 'What percentage of clients request work that have a runtime preference less than 3 hours?' If that number is small, changing the lower bound won't help you. If a large majority of the users have the default setting of 3 hours, then changing the default to 4 hours should reduce the work request rate to the servers by 1/3. Would that be enough?" I support this change, particularly the minimum run-time change to 2 hours rather than 3. The problems I've repeatedly mentioned mean my failure rate would vastly increase if the minimum were 3 hours. |
funkydude Joined: 15 Jun 08 Posts: 28 Credit: 397,934 RAC: 0 |
I found a new hours section in the cPanel, I'm wondering, does increasing my hours on work increase the size of the download and the RAM taken? |
DJStarfox Joined: 19 Jul 07 Posts: 145 Credit: 1,250,162 RAC: 0 |
"I found a new hours section in the cPanel, I'm wondering, does increasing my hours on work increase the size of the download and the RAM taken?" No. You'll just get more credits for each task you compute. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"I found a new hours section in the cPanel, I'm wondering, does increasing my hours on work increase the size of the download and the RAM taken?" Actually, your downloads will likely be reduced, because each task will keep your machine busy for a longer period of time. The RAM used is the same either way. It just controls the number of models you produce for the task. The models themselves and calculations performed are the same either way. Rosetta Moderator: Mod.Sense |
Nothing But Idle Time Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0 |
Anyone out there? Is this discussion over? Then let's have a decision. |
Otto Joined: 6 Apr 07 Posts: 27 Credit: 3,567,665 RAC: 0 |
What's the decision? (I'm personally fine with the minimum of 3 hours.) |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
We are still waiting to test out some bug fixes. |