Rosetta@home

Discussion on increasing the default run time



Message boards : Number crunching : Discussion on increasing the default run time

David E K Profile
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jul 1 05
Posts: 942
ID: 14
Credit: 2,303,046
RAC: 485
Message 56932 - Posted 14 Nov 2008 19:58:43 UTC
Last modified: 15 Nov 2008 3:56:34 UTC

We are planning to increase the default run time from 3 hours to 6 hours and the minimum from 1 to 3 hours to reduce the load on our servers. There will be a transition period where your client will adjust to the new run time which will affect the number of tasks that are queued on your client. I've created this thread for a discussion on what would be the best way to transition to an increased run time. This obviously will only affect people with default run times (people who have not bothered to set this preference) or people who have set their run time to be less than 3 hours. (edit: not 6, whoops!)

Feet1st Profile
Avatar

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,350,568
RAC: 3,695
Message 56934 - Posted 14 Nov 2008 20:55:11 UTC
Last modified: 14 Nov 2008 21:03:33 UTC

For people who pull a week of work at a time, due to infrequent internet connections, increasing the runtime from 3 to 6 hours would mean they get twice as much work as they can crunch.

Would it be possible to increase the default by, say, 5 minutes a day? That would be so gradual that after a week you would be at 3:35 as compared to the previous 3hrs (i.e. a maximum of about 19% variance). It would take about 6 weeks to get all the way up to 6hrs, but the work flow should stay fairly steady for the client; it shouldn't noticeably over- or under-load it with work.

[edit]
I guess for all the same reasons, a gradual change to the min. runtime would be required too.
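The ramp proposed above can be sketched as a few lines (an illustration only; the numbers are this post's suggestion, not actual project settings):

```python
# Illustrative sketch of the gradual ramp suggested above.
# All numbers are the post's proposal, not project policy.
START_MIN = 180    # current 3 h default, in minutes
TARGET_MIN = 360   # proposed 6 h default
STEP_MIN = 5       # suggested increase per day

def default_after(days: int) -> int:
    """Default runtime (in minutes) after ramping for `days` days."""
    return min(START_MIN + STEP_MIN * days, TARGET_MIN)

print(default_after(7))                      # 215 minutes, i.e. 3:35 after one week
print((TARGET_MIN - START_MIN) // STEP_MIN)  # 36 days (roughly 5-6 weeks) to reach 6 h
```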
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

Feet1st Profile
Avatar

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,350,568
RAC: 3,695
Message 56935 - Posted 14 Nov 2008 20:58:39 UTC

Anyone that wants to avoid such problems could always change their runtime from the default at a time of their choosing, either before or during such a transition.
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

adrianxw Profile
Avatar

Joined: Sep 18 05
Posts: 535
ID: 402
Credit: 1,057,641
RAC: 1,674
Message 56937 - Posted 14 Nov 2008 21:40:23 UTC
Last modified: 14 Nov 2008 21:48:52 UTC

...assuming they knew about the proposed change. How many crunchers actively read the forums? I suspect very few. How about a "Rosetta News Letter" mass mailing?

If it was a problem, why didn't the project ask "the regulars" to change their default run time ages ago? That might have bought some time or even alleviated the issue.

I've just changed all mine.
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Chilean Profile
Avatar

Joined: Oct 16 05
Posts: 651
ID: 5008
Credit: 10,238,180
RAC: 4,709
Message 56941 - Posted 15 Nov 2008 1:25:27 UTC

Actually, it would be awesome to have a fixed number of models per WU. That way it would be much easier to compare CPU performance, just by seeing how fast a CPU can crunch a WU.

I miss that from SETI@Home :(
____________

Gavin Shaw Profile
Avatar

Joined: Feb 1 07
Posts: 10
ID: 144828
Credit: 506,456
RAC: 0
Message 56942 - Posted 15 Nov 2008 1:52:26 UTC - in response to Message ID 56932.

This obviously will only affect people with default run times or people who have set their run time to be less than 6 hours.


Perhaps I'm just thick or slow (it is the weekend where I am), but how does changing the min time to 3hr and the default to 6hr affect me when I have my run time set to 4hr? It is still greater than the min time, so nothing should change, right?

____________
Never surrender and never give up. In the darkest hour there is always hope.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 56943 - Posted 15 Nov 2008 3:33:38 UTC - in response to Message ID 56942.

It is still greater than the min time so nothing should change right?


Right. You are not impacted by the proposed change to default run time, because you are not using the default. And you are not impacted by the proposed change to minimum runtime, because you are over the proposed new minimum runtime.
____________
Rosetta Moderator: Mod.Sense

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 56945 - Posted 15 Nov 2008 3:47:23 UTC

There might be a downside to increasing the default run time: if a task runs abnormally long for any reason, the only thing that stops it is the watchdog thread, which ends the run once it exceeds 3 times the preferred time (see below for an example). So if Rosetta gets stuck in an infinite loop or something, the time wasted can approach 3 times the preferred time: shorter preferred times are clearly preferable in such a case.


206764478
Name 1hzh_2cxh_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_289_0
Workunit 188615593

<core_client_version>6.2.18</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
**********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 48690.3 seconds. Greater than 3X preferred time: 14400 seconds
**********************************************************************
called boinc_finish

</stderr_txt>
]]>
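The watchdog rule visible in this log can be sketched as a simple check (a hypothetical illustration based only on the values in the log above, not Rosetta's actual source):

```python
# Hypothetical sketch of the watchdog rule, not Rosetta's actual code.
WATCHDOG_FACTOR = 3.0  # abort once CPU time exceeds this multiple of the preference

def watchdog_should_abort(cpu_time_s: float, run_time_pref_s: float) -> bool:
    """True once a task has run past WATCHDOG_FACTOR x the preferred runtime."""
    return cpu_time_s > WATCHDOG_FACTOR * run_time_pref_s

# The values from the log: a 14400 s (4 h) preference, ended at 48690.3 s.
print(watchdog_should_abort(48690.3, 14400))  # True: 48690.3 > 43200
```

Note the log shows the run actually ending at about 3.4x the preference, so the check is evidently applied periodically rather than at the exact threshold.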

____________

Gavin Shaw Profile
Avatar

Joined: Feb 1 07
Posts: 10
ID: 144828
Credit: 506,456
RAC: 0
Message 56946 - Posted 15 Nov 2008 4:00:03 UTC

Though the watchdog doesn't seem to kick in until about 3.5x the desired time has elapsed. Perhaps it is giving the unit some time to finish off before booting it?

____________
Never surrender and never give up. In the darkest hour there is always hope.

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 56957 - Posted 15 Nov 2008 13:54:50 UTC - in response to Message ID 56945.

There might be a downside to increasing the default run time: if a task takes abnormally long for any reason it relies on the watchdog thread to stop it if it exceeds 3 times the preferred time (see below for an example). So if rosetta gets stuck in an infinite loop or something the amount of time wasted will be equal to 3 times the preferred time: clearly shorter preferred times are preferable in such a case.

That's a good point. Perhaps the Watchdog should be more aggressive about aborting stuck workunits. Maybe it could abort the WU after 2x, or even 1.5x the specified crunching time. The old 3x with 3 hours is 9 hours, and 1.5x with the new 6 hours would still be 9 hours.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 56960 - Posted 15 Nov 2008 15:23:13 UTC

Yes, if the default runtimes were changed, the watchdog could be revised as well. The watchdog used to wait for 4x the preferred runtime.
____________
Rosetta Moderator: Mod.Sense

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,546,016
RAC: 7,460
Message 56983 - Posted 16 Nov 2008 4:02:54 UTC

I understand completely the motivation behind increasing the default run time and if I only received Rosetta Beta 5.98 WUs I'm sure I'd hold to that default successfully.

But as I report here (and previously) I get Mini Rosetta WUs constantly crashing out with "Can't acquire lockfile - exiting" error messages - maybe 60% failure rate with a 3-hour runtime, reducing to 40% failure rate with a 2-hour run time.

I've seen this reported by several other people running a 64-bit OS - not just on Vista or with an AMD machine. That said, I don't know how widespread it is. Perhaps you can analyse results at your end.

As stated in the post linked above, I get no errors at all with Rosetta Beta, so I'm inclined to think it's not some aberration with my machine. I'd really like to see some feedback on this issue and some assurance it's being investigated in some way.

I'd ask that a minimum run time of 2 hours be allowed (I can just about handle that), or some mechanism that allows me to reject all Mini Rosetta WUs. If not, I'm prepared to abort all Mini Rosetta WUs before they run. It's really a waste of time for me to receive them if 60% are going to crash out anyway.

I've commented on this before here, here, here and first of all and more extensively here - see follow-up messages in that thread.

No such issues arose for me with my old AMD single core XPSP2 machine - only when I got this new AMD quad-core Vista64 machine.

Any advice appreciated. It's a very big Rosetta issue for me, so while I'm sure you'll save a whole load of bandwidth if you go ahead with the proposed changes I just hope some allowance can be made for people in my situation.
____________

adrianxw Profile
Avatar

Joined: Sep 18 05
Posts: 535
ID: 402
Credit: 1,057,641
RAC: 1,674
Message 56994 - Posted 16 Nov 2008 14:12:38 UTC
Last modified: 16 Nov 2008 14:38:21 UTC

"Can't acquire lockfile - exiting"

That's familiar. Go to "Your Account", then "Computing Preferences", and check that at the bottom of the first block "Use at most" is set to 100%. That lock file error is common at some projects on systems where this is not set to 100%.
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,546,016
RAC: 7,460
Message 56998 - Posted 16 Nov 2008 14:46:06 UTC - in response to Message ID 56994.

adrianxw wrote:

"Can't acquire lockfile - exiting"

That's familiar. Go to "Your Account" then "Computing Preferences" check that at the bottom of the first block "Use at most" is set to 100%. That lock file error is common on systems where this is not set to 100%.

Thanks for the comment - very promising. I'm showing (sorry for the layout):

                                                            Default    Home
Processor usage
  Use at most ... % of CPU time (enforced by 5.6+)             50       100

Disk and memory usage
  Use at most ... GB disk space                                 5       100
  Leave at least ... GB disk space free                         0.1     0.001
  Use at most ... % of total disk space                        50        50
  Write to disk at most every ... seconds                      60        60
  Use at most ... % of page file (swap space)                  50        75
  Use at most ... % of memory when computer in use (5.8+)      50        50
  Use at most ... % of memory when computer not in use (5.8+)  90        90

Specifically, which "Use at most" are you referring to? The one under processor usage?

My default computer location is set to 'Home', if that makes a difference.
____________

FalconFly Profile
Avatar

Joined: Jan 11 08
Posts: 23
ID: 234757
Credit: 2,163,056
RAC: 0
Message 57012 - Posted 16 Nov 2008 23:55:09 UTC - in response to Message ID 56998.
Last modified: 17 Nov 2008 0:09:10 UTC

I don't mind 6h default runtime, as that's what I'm using right now anyway.

I also wouldn't mind setting it higher, but :
Is it still correct that the Rosetta client can enter a deadlock and won't abort the WorkUnit until 2x (or even 4x?) of the scheduled runtime has elapsed?

At least that's what I remember from reading the Q&A a long time ago.
I don't have any problem with an occasional Computing Error or stalled WorkUnit, but I would mind wasting 24h (or even more) of runtime.

If that's all history already and not valid anymore, I'd happily switch to 24h runtime.

Just thought I'd ask, as I'm about to set Rosetta to full throttle in my network.

-- edit --
I'm also seeing h001b_BOINC_ABRELAX_RANGE_yebf failing with Compute Errors (on different systems, including other hosts of the quorum)... Losing 2-5h of work is one thing; losing 12-23h would be more disappointing.

Right now (pending any "max time exceeded" related problems), that would be my only concern in increasing the runtime significantly beyond what I have right now.

(It would be cool if the correct/complete predictions produced before a failed WorkUnit hit its error could be credited and counted; that way a model-induced compute error wouldn't really matter any more, regardless of runtime.)
____________

ejuel Profile

Joined: Feb 8 07
Posts: 78
ID: 146186
Credit: 4,447,069
RAC: 0
Message 57016 - Posted 17 Nov 2008 1:08:39 UTC - in response to Message ID 56932.

We are planning to increase the default run time from 3 hours to 6 hours and the minimum from 1 to 3 hours to reduce the load on our servers.



Can you please explain this in layman's terms? Are you stating that you are making changes on the server or on our clients? If on our clients, please explain what you mean. For example, are you making the Work Units twice as big/complex, which means my machine will take twice as long to crunch each WU? If you are talking about the server, are you stating that our client must wait at least 6 hours before connecting again for reporting or new WUs?

Again, your quote is very open-ended and can mean a number of things.

Thanks.

-Eric
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 57019 - Posted 17 Nov 2008 2:15:00 UTC
Last modified: 17 Nov 2008 2:16:26 UTC

ejuel, DK is talking about the Rosetta-specific preference for how long each task runs on your client machine. If you express no preference, the default is presently for tasks to run for 3 hours. But the drop-down list lets you choose a preference from 1 hour through 24 hours for each task.

This is just a preference. It's not a hard limit and, as is often discussed on the message boards, there are cases where tasks run well past the runtime preference. By increasing the minimum from 1hr to 3hrs, and the default from 3hrs to 6hrs, more tasks will execute predictably and consistently within the established preference.

The net result is that your client (if running with default settings) runs through 4 tasks per day per core, rather than 8. It is still doing 24hrs of useful work to help the science of Rosetta@home, just running more models in each task before reporting the results back.

So, it is a change to the definition of the default value for your runtime preference, which is defined on the server side, and it affects every task run under the profile the setting pertains to.
____________
Rosetta Moderator: Mod.Sense

ejuel Profile

Joined: Feb 8 07
Posts: 78
ID: 146186
Credit: 4,447,069
RAC: 0
Message 57021 - Posted 17 Nov 2008 2:38:16 UTC - in response to Message ID 57019.

ejuel, DK is talking about the Rosetta specific preference for how long each task runs on your client machine. If you express no preference, the default is for tasks to run for 3 hours presently. But the drop down list lets you chose from a preference of 1 hour, through 24 hours for each task.


Thanks...but a few follow-up questions:

1) Why are all my not-yet-processed WUs now predicting 9 hours 41 mins to process rather than 6 hours? 6:00 vs 9:41 is a big difference.

2) What will happen to the 15+ WUs I have that are not completed yet but are due within 48 hours? Mathematically there is no way I can crunch through 15+ WUs in 48 hours if each WU takes 9:41 to finish.

3) I assume RAC will not change... since RAC is not counting the quantity of WUs but rather the work/time ratio done on those WUs.

Any other pitfalls we should consider?

-Eric
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 57025 - Posted 17 Nov 2008 14:56:38 UTC

1) BOINC "learns" how long it takes your machine to complete tasks for each project. One or more of your very recent tasks took closer to 10hrs to complete. And so BOINC estimates that future tasks may take about as long (not a valid assumption).

2) Existing WUs in your cache *ARE* affected by runtime changes. That is one of many reasons to discuss and consider the topic carefully before making such a change in the project. And so, if the change were made today, and you've got all that work due in 2 days, your machine would miss some deadlines and those tasks would not receive credit. Then things would be back to normal (or you would have to manually abort a few of them until your machine adjusts to the new runtime).

3) Correct. RAC will not be directly impacted by the change.
____________
Rosetta Moderator: Mod.Sense

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 57045 - Posted 18 Nov 2008 17:26:12 UTC

Is there a reason why the watchdog couldn't work at the level of each individual model rather than the task as a whole? That way, you'd avoid the potential extra time wastage that might happen with longer run times if a model goes haywire.
____________

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 57052 - Posted 19 Nov 2008 2:02:47 UTC
Last modified: 19 Nov 2008 2:12:36 UTC

If you are going to pack more than one model into each workunit, could you also score the results for each of those models separately, then add these scores together, so you don't automatically discard all the results for a workunit if one of its later models fails? In that case, after some types of errors, you might then have the workunit go on to the next model not yet tried.

That might, however, call for at least one new outcome state, such as one indicating partially successful.

I don't mind longer workunits, as long as I can still choose the length of the workunits so that I can complete at least one workunit each day. Currently, I've already set my preferred workunit time to 8 hours, since that seems to balance my desires and yours better than my previous setting.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 57055 - Posted 19 Nov 2008 4:36:22 UTC - in response to Message ID 57052.
Last modified: 19 Nov 2008 4:38:40 UTC

If you are going to pack more than one model into each workunit, could you also score the results for each of those models separately, then add these scores together...


Yes, that's how it works now.

...so you don't automatically discard all the results for a workunit if one of its later models fails?


Depending upon how and why it fails, prior models completed are generally preserved, and credited appropriately.

In that case, after some types of errors, you might then have the workunit go on to the next model not yet tried.


If the task failed, then for some reason it is not running well on your machine. It is more conservative to replace it with another task that may run better for your environment. In other words if model 1 or 2 failed from this task, let's not push our luck with more. Better to get word back to the project server about the failure sooner. Perhaps there is a trend that will indicate similar future work should be held until a specific issue is resolved.
____________
Rosetta Moderator: Mod.Sense

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 57066 - Posted 19 Nov 2008 14:04:31 UTC

If the first model fails, then the WU should error out. But if most models are successful, why should a single bad model cause the whole WU to be invalid? As long as most models are successful, I see no reason the WU can't continue crunching for the normal length of time and then be marked valid.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 57067 - Posted 19 Nov 2008 14:15:06 UTC

At present, without any of the proposed changes, most tasks that complete some models and are then ended are marked "Success". But if the task was ended by the watchdog due to too many restarts, then it possibly indicates that the models run for a long time between checkpoints, and the person using the machine does not retain tasks in memory while suspended and/or is turning the machine on and off frequently. So, there certainly are cases where the next task may run better than the current one. And, as I said previously, the sooner the error report gets back to the project, the more chance to assess that information before sending out more similar work. It could be that everyone is failing on those specific tasks, in which case continuing doesn't make any sense.

Either way, the goal is certainly to have tasks run normally, and to have them complete within the target runtime. So, establishing the behavior and default values with that scenario in mind makes sense.
____________
Rosetta Moderator: Mod.Sense

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 57070 - Posted 19 Nov 2008 16:36:31 UTC - in response to Message ID 57067.

At present, without any of the proposed change, most tasks that complete some models and are then ended are marked "Success".

Let's consider a WU that I recently had which was marked as invalid due to NANs in hbonding:
h001b_BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-11-S3-4--h001b-_4769_1224

Note that my WU ended after nearly 13 hours of my 16 hour preference and was marked invalid:
http://boinc.bakerlab.org/rosetta/result.php?resultid=207873078

Then someone else crunched it successfully using the default 3 hour crunch time:
http://boinc.bakerlab.org/rosetta/result.php?resultid=208374490

He successfully did 4 models in that 3 hour crunch time. I would guess that I had done around 14 models successfully and then model 15 got the NANs error. Thus, after 14 good models, a single bad model caused the whole WU to be marked invalid. The other person had a valid WU because he stopped crunching after only 4 models.

So why couldn't minirosetta just discard the bad model #15 and continue crunching models until the 16 hours were up, then report the WU as valid?

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 57071 - Posted 19 Nov 2008 17:33:55 UTC - in response to Message ID 57070.

So why couldn't minirosetta just discard the bad model #15 and continue crunching models until the 16 hours were up, then report the WU as valid?


I agree with you. It could and should preserve any valid work completed and give credit for it. But the fact that this does not always happen is really unrelated to the topic of this thread.
____________
Rosetta Moderator: Mod.Sense

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 57089 - Posted 19 Nov 2008 22:22:41 UTC - in response to Message ID 57071.

But the fact that this does not always happen is really unrelated to the topic of this thread.

This thread seems to be about the effects of increasing the default run time.

Currently, we have some WUs that sometimes bomb out with NAN errors in random models. With a 3 hour run time, only a few models will be crunched, so the chance of one of them bombing out with this error is small. If a model does bomb out and cause the WU to be invalid, the number of good models that are lost will be small as well. For instance, if 4 models are normally crunched in the 3 hours, then between 0 and 3 good models will have been crunched before the bad model (average 1.5 good models lost).

With a 6 hour run time, there will be twice as many models crunched, so the chance of hitting a bad model is doubled. There will also likely be many more good models lost. For instance, if 8 models are crunched in the 6 hours, then between 0 and 7 good models will be lost (average 3.5 good models lost).

So increasing the default run time will increase the number of good models that are lost.
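The averages quoted above follow from assuming the failing model is equally likely to be any one in the run; a quick illustrative check (the model counts per run are this post's rough estimates):

```python
# Illustrative check of the expected-loss argument above: if a run crunches
# n models and exactly one of them (uniformly at random) fails, invalidating
# the whole WU, the good models completed before the failure average (n-1)/2.
def avg_good_models_lost(n_models: int) -> float:
    lost_per_case = range(n_models)  # failure at position k loses k good models
    return sum(lost_per_case) / n_models

print(avg_good_models_lost(4))  # 1.5 for a 3 h run (~4 models)
print(avg_good_models_lost(8))  # 3.5 for a 6 h run (~8 models)
```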

DJStarfox

Joined: Jul 19 07
Posts: 140
ID: 191721
Credit: 560,560
RAC: 21
Message 57098 - Posted 20 Nov 2008 18:17:53 UTC

Limited to just changing this runtime parameter...

There are a lot of assumptions in choosing these numbers... Such as, "What percentage of clients requesting work have a runtime preference below 3 hours?" If that number is small, changing the lower bound won't help you. If a large majority of users have the default setting of 3 hours, then changing the default to 4 hours should reduce the work request rate to the servers by a quarter. Would that be enough?

I believe the project should increase it by 1 hour every 3 weeks until the desired setting is reached. I believe the minimum should be 2 hours, but the default should be 4 hours. Try it out for at least 3 weeks before drawing any conclusions.

Another idea:
Just have a default "number of models to generate in each task" setting in preferences. The task would stop after successfully generating that number of models. The watchdog would be configured appropriately for each task's expected time, or could be a generic 24 hours. The default number of models to generate could be 3.

Assumptions: the scientists know approximately how long a model takes to compute for any given WU, and can calculate watchdog hard limits from that knowledge.
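The server-load arithmetic in this post can be sketched under one stated assumption: that the work request rate scales inversely with the runtime preference (an assumption for illustration, not a measured figure):

```python
# Sketch under the assumption that work requests per day scale as 1 / runtime:
# a client doing 3 h tasks reports 8 per core per day, 4 h tasks only 6.
def request_rate_reduction(old_hours: float, new_hours: float) -> float:
    """Fraction by which work requests drop when the runtime preference grows."""
    return 1.0 - old_hours / new_hours

print(request_rate_reduction(3, 4))  # 0.25: a 4 h default cuts requests by a quarter
print(request_rate_reduction(3, 6))  # 0.5: a 6 h default halves them
```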

funkydude

Joined: Jun 15 08
Posts: 12
ID: 264493
Credit: 146,106
RAC: 0
Message 57101 - Posted 20 Nov 2008 20:02:56 UTC - in response to Message ID 57098.
Last modified: 20 Nov 2008 20:08:03 UTC

The post on the front page says the default will be 6 hours and the minimum 3 hours. So I'd like to know: where exactly do we set this? I can't find it anywhere on the website. Currently, my times have not changed at all; they are still 2-3 hours.

EDIT: A picture is worth a thousand words: http://img156.imageshack.us/img156/2679/boincrn3.jpg

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 57102 - Posted 20 Nov 2008 20:54:05 UTC

yo, whoa, funky: when it says "...planning to..." it means it hasn't happened yet. It also says "When and how this will occur has not yet been decided."

You can change your runtime preference by clicking the "[Participants]" link at the top of this message board webpage. Then click the link for "Rosetta@home preferences". If you have not done that, your runtimes default to 3 hours. And if the project suddenly changes that to 6hrs, this might cause you some issues with how you like to use your machine. Hence, discussing this beforehand in this thread.
____________
Rosetta Moderator: Mod.Sense

DJStarfox

Joined: Jul 19 07
Posts: 140
ID: 191721
Credit: 560,560
RAC: 21
Message 57105 - Posted 20 Nov 2008 21:44:42 UTC - in response to Message ID 57102.

Mod.Sense,

Do you have any thoughts about my ideas?
(Go up three posts.)

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 57111 - Posted 21 Nov 2008 0:06:38 UTC

I prefer to remain in the facilitator role here and not attempt to influence the discussion.

I would point out that establishing a fixed number of models (say, for example, I set it to 20) doesn't really help you much, because some proteins can do 20 models in an hour, and others would take days. Although if the model runtime were rather predictable, perhaps the WU flops count could be tailored for each different protein of a given batch of work. So it is still possible that some form of this approach could be implemented and result in more predictable runtimes.

I'm not sure how many changes DK is planning to incorporate into this runtime change project. Modifying flops counts, watchdog timeouts, etc. on the fly as work is sent out would probably require some serious changes to the scheduler.

So, anyway, I think I see where you are coming from with that, but I'm not sure how quickly such concepts could be incorporated. It seems likely they are beyond the scope of the immediate goal.
____________
Rosetta Moderator: Mod.Sense

DJStarfox

Joined: Jul 19 07
Posts: 140
ID: 191721
Credit: 560,560
RAC: 21
Message 57119 - Posted 21 Nov 2008 2:34:14 UTC - in response to Message ID 57111.

Mod.Sense,

Thank you for the feedback.

In response to the original question (by David E K), the minimum runtime should be increased by 1 hour every 2-3 weeks (at least 10 days plus the allotted time to the task deadline). This is enough time to allow even the biggest crunchers to empty their BOINC cache using the new runtime preference.

I also think that an announcement in the site news and a link to the "Runtime preference FAQ" would be very helpful. The FAQ would just explain what the runtime preference is and where to change it on the website. The news and link would encourage and help users make a wise decision for this setting (and keep the redundant questions to a minimum).

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,546,016
RAC: 7,460
Message 57194 - Posted 24 Nov 2008 2:03:21 UTC - in response to Message ID 57098.

There are a lot of assumptions in choosing these numbers... Such as, "What percentage of clients requesting work have a runtime preference below 3 hours?" If that number is small, changing the lower bound won't help you. If a large majority of users have the default setting of 3 hours, then changing the default to 4 hours should reduce the work request rate to the servers by a quarter. Would that be enough?

I believe project should increase it by 1 hour every 3 weeks until desired setting is reached. I believe that the minimum should be 2 hours, but the default should be 4 hours. Try it out for at least 3 weeks before drawing any conclusions.

I support this change, particularly the minimum run-time change to 2 hours rather than 3. The problems I've repeatedly mentioned mean my failure rate would vastly increase if the minimum was 3 hours.
____________

funkydude

Joined: Jun 15 08
Posts: 12
ID: 264493
Credit: 146,106
RAC: 0
Message 57252 - Posted 26 Nov 2008 18:27:25 UTC

I found a new hours section in the cPanel. I'm wondering: does increasing my hours of work increase the size of the download and the RAM taken?

DJStarfox

Joined: Jul 19 07
Posts: 140
ID: 191721
Credit: 560,560
RAC: 21
Message 57254 - Posted 26 Nov 2008 19:04:04 UTC - in response to Message ID 57252.

I found a new hours section in the cPanel, I'm wondering, does increasing my hours on work increase the size of the download and the RAM taken?


No. You'll just get more credits for each task you compute.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 57256 - Posted 26 Nov 2008 19:20:11 UTC - in response to Message ID 57252.
Last modified: 26 Nov 2008 19:20:35 UTC

I found a new hours section in the cPanel, I'm wondering, does increasing my hours on work increase the size of the download and the RAM taken?


Actually, your downloads will likely be reduced, because each task will keep your machine busy for a longer period of time. The RAM used is the same either way. It just controls the number of models you produce for the task. The models themselves and the calculations performed are the same either way.
____________
Rosetta Moderator: Mod.Sense

Nothing But Idle Time

Joined: Sep 28 05
Posts: 209
ID: 1675
Credit: 139,545
RAC: 0
Message 57825 - Posted 12 Dec 2008 16:49:09 UTC

Anyone out there? Is this discussion over? Then let's have a decision.

Otto

Joined: Apr 6 07
Posts: 27
ID: 163281
Credit: 1,908,137
RAC: 1,443
Message 57827 - Posted 12 Dec 2008 17:02:41 UTC

What's the decision? (I'm personally fine with the minimum of 3 hours.)

David E K Profile
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jul 1 05
Posts: 942
ID: 14
Credit: 2,303,046
RAC: 485
Message 57828 - Posted 12 Dec 2008 18:00:16 UTC

We are still waiting to test out some bug fixes.

Matthew Lei
Avatar

Joined: Jun 5 06
Posts: 4
ID: 87065
Credit: 258,058
RAC: 0
Message 58316 - Posted 1 Jan 2009 4:00:14 UTC - in response to Message ID 57828.

We are still waiting to test out some bug fixes.


Does that mean you guys are going ahead with the change?
____________

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 58437 - Posted 3 Jan 2009 22:19:02 UTC

Hi.

The sooner they bring this in the better, I say: fewer people hammering the servers.

pete.

____________


6dj72cn8

Joined: Apr 18 06
Posts: 3
ID: 77268
Credit: 13,531
RAC: 0
Message 58778 - Posted 13 Jan 2009 4:04:51 UTC

My preference is for a minimum of two hours.


____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 58783 - Posted 13 Jan 2009 14:37:47 UTC

Harry, could you please talk a bit about WHY that is your preference? What is it about how you use your machine that makes this better for you?
____________
Rosetta Moderator: Mod.Sense

6dj72cn8

Joined: Apr 18 06
Posts: 3
ID: 77268
Credit: 13,531
RAC: 0
Message 58793 - Posted 14 Jan 2009 0:50:15 UTC

After some thought I am unable to justify my preference in a fashion likely to be helpful or meaningful to the project. I therefore withdraw my previous comment and ask you to ignore it.
____________

LizzieBarry

Joined: Feb 25 08
Posts: 76
ID: 243949
Credit: 201,862
RAC: 0
Message 58794 - Posted 14 Jan 2009 1:58:54 UTC

It may be that some users choose a short run-time to better reflect the time they spend at a computer before shutting down, enabling a WU to complete without keeping the computer on longer than wished. While checkpointing has been infrequent, this may be a way of preventing the same WU from restarting over and over.

But if checkpointing is made more frequent with the new Mini Rosetta version being tested soon, that part of the problem may disappear.

Going back through this thread, the idea of increasing the minimum from 1 to 2 hours and the default from 3 to 4 hours as a first interim change makes sense. Not as drastic as doubling it all at once. It could be assessed for unexpected results before increasing the default from 4 to 5 hours maybe a month later, then again to 6 hours for a while before changing the minimum from 2 to 3 hours.

Each change would go some way towards easing the server load.

Virtual Boss*
Avatar

Joined: May 10 08
Posts: 35
ID: 257766
Credit: 700,682
RAC: 148
Message 58799 - Posted 14 Jan 2009 11:51:58 UTC
Last modified: 14 Jan 2009 12:00:11 UTC

For those who may be interested in the effect of runtime changes.

Below is a list which shows credit vs down/up traffic for 3 weeks pre/post runtime change.

Weekly Dates      Credit   Down MB   UP MB
26Oct08-01Nov08   2079     150.1     6.0
02Nov08-08Nov08   2519     235.9     8.7
09Nov08-15Nov08   3222     211.2     29.1
16Nov08-22Nov08   3894     118.2     6.4
23Nov08-29Nov08   4348     120.0     12.3
30Nov08-06Dec08   2839     117.3     4.9

After this thread started I changed my runtime preferences.
Before 15Nov all my hosts were default 3hrs.
On 15Nov I changed runtimes as follows:

10hrs - 1 Host (~80% of RAC)
6hrs - 2 Hosts (~11% of RAC)
4hrs - 3 Hosts (~9% of RAC)

The list shows the obvious drop in internet traffic and the increased credit output due to changing the runtime (which was the only change made).

[EDIT] The increase in credit shown here is more likely due to variation in project crunching ratio - but overall shows ~10-15% increase since change.[/EDIT]

Bruce

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 58802 - Posted 14 Jan 2009 13:30:17 UTC

Yes Virtual, I wouldn't "sell it" as a RAC improvement. It should be basically unmeasurable. And, as you can imagine, if a new protein comes in to study during your test, then you had an extra couple of 2-3MB files to download. It really varies. So, it isn't even really intended to be much of a bandwidth saving. It really boils down to the number of hits on the scheduler. After that, the specific file transfers are not the main focus of changing these values.

You can also reduce file transfer bandwidth (and scheduler hits) if you keep more days of work on your machine. If you, say, connect about every 3 days rather than the 0.1 days default, that can make a nice reduction in hits on the servers. Not that setting 0.1 days means you actually always do 240 hits per day, but the higher value will reduce the number of requests.
____________
Rosetta Moderator: Mod.Sense

Virtual Boss*
Avatar

Joined: May 10 08
Posts: 35
ID: 257766
Credit: 700,682
RAC: 148
Message 58805 - Posted 14 Jan 2009 14:41:46 UTC - in response to Message ID 58802.

Yes Virtual, I wouldn't "sell it" as a RAC improvement. It should be basically unmeasurable. And, as you can imagine, if a new protein comes in to study during your test, then you had an extra couple of 2-3MB files to download. It really varies. So, it isn't even really intended to be much of a bandwidth saving. It really boils down to the number of hits on the scheduler. After that, the specific file transfers are not the main focus of changing these values.

You can also reduce file transfer bandwidth (and scheduler hits) if you keep more days of work on your machine. If you, say, connect about every 3 days rather than the 0.1 days default, that can make a nice reduction in hits on the servers. Not that setting 0.1 days means you actually always do 240 hits per day, but the higher value will reduce the number of requests.


Hi Mod.Sense

I agree that there is a large number of variables, but they do tend to average out over the longer term.

Below are the figures for 2 months pre/post, which still show a significant decrease in traffic.

During the post-change period I noticed increased numbers of new proteins, several series which repeatedly 'crashed out' on my hosts, and problems with the credit granted, all of which would have the effect of reducing the traffic savings I have seen.

These figures indicate a 33% increase in credit per MB of Download.

Date ranges      Credit DownMB  Ratio
16Sep08-14Nov08  25373  1196.2  21.21
15Nov08-13Jan09  30202  1057.2  28.57

Simple maths will tell you that, for a particular protein, if you double your runtime then you will roughly double the number of models completed, thereby roughly doubling your credit (per MB DL).

If you still crunch for the same number of hours per day, this means your traffic is roughly halved.
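
The arithmetic can be sketched like this (a minimal model with illustrative numbers assumed for the example, not my measured stats):

```python
# Rough model of one core crunching Rosetta tasks.
# Assumptions (illustrative only): each task downloads a fixed-size
# input, and models produced per CPU-hour do not depend on runtime.

def daily_stats(runtime_hours, crunch_hours_per_day=24,
                download_mb_per_task=3.0, models_per_hour=4):
    """Return (tasks/day, download MB/day, models/day) for one core."""
    tasks_per_day = crunch_hours_per_day / runtime_hours
    download_mb = tasks_per_day * download_mb_per_task
    models_per_day = crunch_hours_per_day * models_per_hour  # runtime-independent
    return tasks_per_day, download_mb, models_per_day

for rt in (3, 6):
    tasks, dl, models = daily_stats(rt)
    print(f"{rt}h runtime: {tasks:.0f} tasks/day, {dl:.0f} MB down, {models} models")
```

Doubling the runtime halves the tasks fetched per day, and with them the download traffic, while the models produced per day (and hence the credit) stay the same.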

I believe in the longer term my stats will approach that figure.

I was also wondering where the total credit increase came from, and suspect it may partly be due to less CPU time 'wasted' on 1 - network traffic and 2 - loading and initialising the work unit before it can start actually crunching any useful data.

I guess more time will give more accurate findings.

And yes - my overall server hits have reduced considerably (maybe by 30-40%, at a guesstimate).

Bruce

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 58807 - Posted 14 Jan 2009 16:05:30 UTC

I was referring to RAC. You are using a credit per MB of BW system. So, yes, I'd expect that the more factors you can bring together that reduce MB of download, the better your credit per MB. But it will not make a material difference in credit per day of crunching, which is what RAC amounts to.

I'm curious too, how did you measure your bandwidth? Are you using a proxy server that recorded that? I'm not questioning your figures, just looking for more ways to measure it myself :)
____________
Rosetta Moderator: Mod.Sense

Virtual Boss*
Avatar

Joined: May 10 08
Posts: 35
ID: 257766
Credit: 700,682
RAC: 148
Message 58847 - Posted 16 Jan 2009 13:18:22 UTC - in response to Message ID 58807.

I was referring to RAC. You are using a credit per MB of BW system. So, yes, I'd expect that the more factors you can bring together that reduce MB of download, the better your credit per MB. But it will not make a material difference in credit per day of crunching, which is what RAC amounts to.

I'm curious too, how did you measure your bandwidth? Are you using a proxy server that recorded that? I'm not questioning your figures, just looking for more ways to measure it myself :)



I am using a commercial program called BWMeter, primarily to control bandwidth allocations to each host on my network and stop any host 'hogging' the internet.

It also has quite good statistics, among many other features.

P.Henry
Avatar

Joined: Oct 27 08
Posts: 39
ID: 285528
Credit: 876,073
RAC: 0
Message 59477 - Posted 9 Feb 2009 4:32:30 UTC - in response to Message ID 58847.

I'm going quad :D

Feet1st Profile
Avatar

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,350,568
RAC: 3,695
Message 59982 - Posted 5 Mar 2009 2:00:19 UTC

So, what was the decision on increasing the minimum and default runtime?

Did you decide to upgrade the DB server instead?
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

mike46360

Joined: May 21 07
Posts: 10
ID: 178869
Credit: 18,011
RAC: 0
Message 61792 - Posted 16 Jun 2009 18:09:27 UTC

We are planning to increase the default run time from 3 hours to 6 hours and the minimum from 1 to 3 hours to reduce the load on our servers.


I increased the run time from 3 hours to 6 hours last night..

Does this help the folding at all or is it just to ease the pain on the servers?

ByRad Profile
Avatar

Joined: Apr 12 08
Posts: 8
ID: 252633
Credit: 8,231,131
RAC: 13,507
Message 61794 - Posted 16 Jun 2009 20:45:48 UTC

But there will also be a problem... Just to try it, I changed my runtime from the default (3 hours) to the maximum value of 24 hours for a couple of days. The effect was that only 4 of 14 tasks finished without errors (I tried 2 days on WinXP x86 and 2 days on Win7 x64, so it doesn't depend on the version of Rosetta (I mean x64 / x86, not v1.74)). In that period I was running my PC all the time (24h a day), restarting it once or twice a day. So increasing the runtime will also reduce the number of work units finishing properly.
Because of this, I think it would be a really nice idea if the result of every finished WU (valid or erroneous) were sent to the server, because an error can occur in the first model but also after 100 models have finished properly. And if some information about the error were sent as well, it would give the developers some debug information (without a huge increase in traffic).
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 61832 - Posted 18 Jun 2009 13:18:33 UTC

mike, it just keeps your machine busy with less overhead on the project servers.

ByRad, the Rosetta application does send partial successes. If you complete 50 models and then number 51 fails, the task is reported back and should show as a success. It also sends some information back to help diagnose what caused the problem with the 51st model. So, the system may not always work perfectly, but the suggestions you have made are already in the code.
____________
Rosetta Moderator: Mod.Sense

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 62671 - Posted 31 Jul 2009 6:27:03 UTC

Hi.

I see that nothing has been done about this. It might help with the type of server problems you're having at the moment, putting the default to at least four hours.

I.M.H.O.


____________


Warped

Joined: Jan 15 06
Posts: 44
ID: 50853
Credit: 1,336,001
RAC: 225
Message 63031 - Posted 24 Aug 2009 15:32:11 UTC

I live in a bandwidth-impoverished part of the world, with high prices and low speed. Consequently, I have selected a 16-hour run time.

However, I find this thread, as well as the others discussing long-running models, of little relevance when my work units run for only about 4 hours. Is the preferred run time really applied?
____________
Warped

dcdc Profile

Joined: Nov 3 05
Posts: 1596
ID: 8948
Credit: 33,802,222
RAC: 17,340
Message 63032 - Posted 24 Aug 2009 15:55:08 UTC
Last modified: 24 Aug 2009 15:56:55 UTC

I'd happily change my run-time prefs so that computers that are on a lot have a high run-time and the others a low run-time, but I find this really difficult as they're tied to the BOINC work/home/school settings (which I think are poor, but not the project's fault ;) ).

I also use BAM, but that doesn't allow changes to the run-time, so I'm left with the default. Being able to select run-time preferences per machine would be useful, but probably only for a minority, I guess...

(Just noticed the project hasn't posted on this for a while!)
____________

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 63042 - Posted 25 Aug 2009 23:34:48 UTC - in response to Message ID 63031.

I live in a bandwidth-impoverished part of the world, with high prices and low speed. Consequently, I have selected 16 hours run time.

However, I find this thread as well as the others discussing long-running models to be of little interest when I have work units running for about 4 hours. Is the preferred run time really applied?


I have noticed that on my faster machine the limit of 99 decoys is usually reached before the 12-hour expected runtime I've requested. You might want to check the report on the Rosetta@home website of how well the workunit succeeded, to see if your workunits also often stop at the 99-decoy limit instead of near the requested run time.

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 63043 - Posted 25 Aug 2009 23:48:35 UTC - in response to Message ID 63032.

I'd happily change my run-time prefs so that computers that are on lots have a high run-time and the others have a low run-time but I find this really difficult as they're tied to the BOINC work/home/school settings (which I think are poor, but not the project's fault ;) ).

I also use BAM but that doesn't allow changes to the run-time, so I'm left with the default. Being able to select a run-time preferences per machine would be useful, but probably only for a minority i guess...

(just noticed the project haven't posted on this for a while!)


I've noticed that the World Community Grid project lets you make some machine-specific settings through their web site, but several other settings then do not propagate to that machine if changed in other ways. I don't use BAM, so I don't know whether this is compatible with BAM.

However, it looks like I may soon need to switch managers so that I can control BOINC on my two desktops from my laptop, which appears to lack the power to run longer workunits well. Could you tell me if BAM seems suitable for that purpose?

Warped

Joined: Jan 15 06
Posts: 44
ID: 50853
Credit: 1,336,001
RAC: 225
Message 63068 - Posted 28 Aug 2009 17:05:24 UTC - in response to Message ID 63042.

I live in a bandwidth-impoverished part of the world, with high prices and low speed. Consequently, I have selected 16 hours run time.

However, I find this thread as well as the others discussing long-running models to be of little interest when I have work units running for about 4 hours. Is the preferred run time really applied?


I have noticed that on my faster machine, the limit of 99 decoys is usually reached before the 12-hour expected runtime I've requested. You might want to check the report visible on the Rosetta@home of how well the workunit succeeded to see if your workunits also often stop at the 99 decoys limit instead of near the requested run time.


The workunits ending before the selected run-time stop at 100 decoys, whereas the one recent workunit which made it to the selected 16 hours stopped at 88 decoys. Is there anything I can do to adjust this, or is it a lucky-dip?

dgnuff Profile
Avatar

Joined: Nov 1 05
Posts: 347
ID: 8170
Credit: 23,006,762
RAC: 6,734
Message 63069 - Posted 28 Aug 2009 18:08:22 UTC - in response to Message ID 63068.
Last modified: 28 Aug 2009 18:28:50 UTC


I have noticed that on my faster machine, the limit of 99 decoys is usually reached before the 12-hour expected runtime I've requested. You might want to check the report visible on the Rosetta@home of how well the workunit succeeded to see if your workunits also often stop at the 99 decoys limit instead of near the requested run time.


The workunits ending before the selected run-time get to a stop at 100 decoys, whereas the one recent workunit which made it to the selected 16 hours stopped at 88 decoys. Is there anything I can do to adjust this or is it a lucky-dip?


I've noticed this too. As far as I can tell, it's a "lucky-dip" as you so accurately describe it. Also known as a crap-shoot in other parts of the world. ;)

In another thread, I suggested increasing the maximum number of decoys from 100 to something higher, but that idea was rejected. I still find the reason for staying with the 100 decoy max totally counter-intuitive, and in fact I'm not at all sure the reasoning given is correct.

That said, I'll make the suggestion again to increase the max decoys to 200 (or even higher), and see where the suggestion goes. For those of us with fast machines, willing to do long run times it will reduce the load on the servers. I admit it will change the "shape" of the uploaded data, but it will not change the amount - this last point is the one where I think people haven't thought the problem through correctly.
____________

cenit Profile

Joined: Apr 1 07
Posts: 13
ID: 161706
Credit: 1,630,287
RAC: 0
Message 63070 - Posted 28 Aug 2009 21:34:07 UTC - in response to Message ID 63069.


I have noticed that on my faster machine, the limit of 99 decoys is usually reached before the 12-hour expected runtime I've requested. You might want to check the report visible on the Rosetta@home of how well the workunit succeeded to see if your workunits also often stop at the 99 decoys limit instead of near the requested run time.


The workunits ending before the selected run-time get to a stop at 100 decoys, whereas the one recent workunit which made it to the selected 16 hours stopped at 88 decoys. Is there anything I can do to adjust this or is it a lucky-dip?


I've noticed this too. As far as I can tell, it's a "lucky-dip" as you so accurately describe it. Also known as a crap-shoot in other parts of the world. ;)

In another thread, I suggested increasing the maximum number of decoys from 100 to something higher, but that idea was rejected. I still find the reason for staying with the 100 decoy max totally counter-intuitive, and in fact I'm not at all sure the reasoning given is correct.

That said, I'll make the suggestion again to increase the max decoys to 200 (or even higher), and see where the suggestion goes. For those of us with fast machines, willing to do long run times it will reduce the load on the servers. I admit it will change the "shape" of the uploaded data, but it will not change the amount - this last point is the one where I think people haven't thought the problem through correctly.


The 99-decoy maximum was introduced some months ago when Rosetta@home was in "debug mode" (I think around v1.50: no new features, only bugs solved). It was used as the easy way to solve some bugs that arose with large uploads (if I remember correctly, they didn't even investigate whether the problem was in BOINC or somewhere in their own code, because this trick easily worked around the bug). I don't think that, atm, it's so important to solve this problem drastically; anyway, it would be interesting to know whether they have any problem with server load now...

dgnuff Profile
Avatar

Joined: Nov 1 05
Posts: 347
ID: 8170
Credit: 23,006,762
RAC: 6,734
Message 63073 - Posted 29 Aug 2009 9:55:27 UTC - in response to Message ID 63070.


Snip ...

In another thread, I suggested increasing the maximum number of decoys from 100 to something higher, but that idea was rejected. I still find the reason for staying with the 100 decoy max totally counter-intuitive, and in fact I'm not at all sure the reasoning given is correct.

That said, I'll make the suggestion again to increase the max decoys to 200 (or even higher), and see where the suggestion goes. For those of us with fast machines, willing to do long run times it will reduce the load on the servers. I admit it will change the "shape" of the uploaded data, but it will not change the amount - this last point is the one where I think people haven't thought the problem through correctly.


"Maximum number of decoys" at 99 was introduced some months ago when Rosetta@home was in "debug mode" (I think around v1.50, no new features only bugs solved). It was used as the easy way to solve some bugs that arose with large uploads (if I remember correctly, they didn't even investigate if the problem was in BOINC or somewhere in their code, because this trick solved easily the bug). I don't think that, atm, it's so important to solve drastically this problem; anyway, it should be interesting to know if they have any problem with server load now...


Interesting. If anyone is looking for fairly reliable repro steps for getting uploads to fail, try the following.

Set up a machine, and set the maximum upload rate to 2 kbytes/sec on the advanced preferences page. Grab yourself a task like this one:

276073593

Let it complete, and then try to upload it. The key section of the name appears to be the "ddg_predictions" string. I've seen a few of these going by; they seem to produce very large result files. I've had two that are in excess of 7 MB and one that was over 11 MB.

It's worth noting that if I temporarily adjust the upload speed to something over my connection's max (384 kbits/sec, i.e. ~48 kbytes/sec), the transfer then goes through without problems.

However, it's a bit of a pain doing this. I'm about at the point that if I find another of these WUs stuck uploading, I'm going to force that upload through and then abort any of these jobs that I see in the queue.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 63076 - Posted 29 Aug 2009 13:44:35 UTC

So yes, Warped, you are seeing what I would expect. When your preference of 16 hrs is near, the task ends. And if 99 models are reached prior to that, the task will end at that point (at least in the "mini" application).

The amount of data reported back on the uploads varies by the type of protein and the type of study being done. But the primary factor, or multiplier, on the size is the number of models. At one point there were batches of WUs that were running 20 models an hour. The upload size, and the potential for hitting the maximum outfile size, were very large for long runtime preferences. 99 models was just a way to strike a compromise between giving the desired runtime and having a predictable, reasonable upload size.
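
A rough sketch of how the model count drives upload size (the per-model size and rates here are assumed, illustrative figures, not project numbers):

```python
# Illustrative only: upload size grows with the number of models,
# so capping the model count bounds the upload regardless of the
# user's runtime preference.

def upload_size(runtime_hours, models_per_hour, mb_per_model=0.1, model_cap=99):
    """Return (models reported, upload size in MB) for one task."""
    models = min(int(runtime_hours * models_per_hour), model_cap)
    return models, models * mb_per_model

print(upload_size(3, 20))   # fast batch at a short runtime: 60 models
print(upload_size(24, 20))  # capped at 99 models, not 480
```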

dgnuff, from what you are describing, it sounds like the only issue with any given type of work unit is the resulting size of the result file. Any time you have a large file that must move over very limited bandwidth, there is a conflict to be resolved. The BOINC client can do partial file transfers and continue where it has left off. But I believe it also times out connections that are actually still moving data. I've seen connections ended after 5 minutes and then restarted, at least on downloads; I presume uploads are similar. I am not sure why Berkeley made the client work that way. It seems to me that an active connection that is still successfully moving data should be left alone.

So when you say you can get an upload to "fail", do you mean a retry occurs? Or do you mean that so many retries occur that... well, the WU you linked looks like it arrived intact, so eventually the upload was completed. I am unclear what you mean about the upload being "stuck". I think what you are seeing sounds normal for a connection with very limited bandwidth, and the client will continue working on it and complete the transfer all by itself.

This is part of why they decided to limit to 99 models too. The uploads on tasks that produced many many models were approaching 100MB, which is large enough to cause difficulty in many environments.
____________
Rosetta Moderator: Mod.Sense

thatoneguy

Joined: Jun 8 06
Posts: 3
ID: 93346
Credit: 2,636,731
RAC: 0
Message 64601 - Posted 26 Dec 2009 3:45:42 UTC - in response to Message ID 56932.

Back to the main issue...

what would be the best way to transition to an increased run time.

If it is possible to do so, temporarily decrease the amount of work that can be downloaded. I think it is possible to fudge the report deadline so that computers don't ask for more work but still receive credit for past-due WUs. Following the change, simply increasing the deadline would ease almost all problems stemming from long run-times. The problem remains, of course, that WUs may take a long time to return to the server.
As long as credit is given for late work, I think most people won't care about the change (except for the few people whose computers are on so seldom that they won't be able to complete any work on time).
____________

S_Koss

Joined: Jan 7 10
Posts: 4
ID: 365793
Credit: 37,252
RAC: 0
Message 64914 - Posted 11 Jan 2010 16:07:22 UTC

I have a serious problem with changing the default times. I shut 2 of my 3 crunching computers off at night because they are in my bedroom. Last night I had a 3-hour WU that was 99% done. But I was tired and did not want to wait 5, 10, or 15 minutes for it to finish, so I exited BOINC and went to bed. This morning the said WU restarted, but from 0%. I lost 3 hours of work on just this WU, not taking into consideration the other WUs that also restarted. I find that unacceptable and have turned the default time down to 1 hour. If you are going to change the minimum default time to 3 hours then I will be changing projects, because I will not continue to lose countless hours of work.

S_Koss

Joined: Jan 7 10
Posts: 4
ID: 365793
Credit: 37,252
RAC: 0
Message 64915 - Posted 11 Jan 2010 16:32:14 UTC

On second thought, you can do whatever you want. I am outa here.............

transient
Avatar

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,590,569
RAC: 2,277
Message 64916 - Posted 11 Jan 2010 17:01:01 UTC

Ah well, you can always choose to lower your over-clock.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 64917 - Posted 11 Jan 2010 17:12:46 UTC

Steve, what you are describing is a very new issue that has turned up with a new type of work unit that seems to be having some checkpointing issues.

Transient, I don't think an overclock would be needed to cause the symptoms he's reporting. I've asked Sarel to look in to it.

Steve, I'm curious how the runtime of a task is affecting your user experience (other than the loss of work, which I clearly already understand). You appear to have racked up 25,000 credits in just 4 days, so clearly you have machines running 24x7; how does running one task for 3 hours have a disadvantage over running 3 tasks for an hour each?
____________
Rosetta Moderator: Mod.Sense

David E K Profile
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jul 1 05
Posts: 942
ID: 14
Credit: 2,303,046
RAC: 485
Message 64918 - Posted 11 Jan 2010 18:24:48 UTC - in response to Message ID 64914.

I have a serious problem with changing the default times. I shut 2 of my 3 crunching computers off at night because they are in my bedroom. Last night I had a 3-hour WU that was 99% done. But I was tired and did not want to wait 5, 10, or 15 minutes for it to finish, so I exited BOINC and went to bed. This morning the said WU restarted, but from 0%. I lost 3 hours of work on just this WU, not taking into consideration the other WUs that also restarted. I find that unacceptable and have turned the default time down to 1 hour. If you are going to change the minimum default time to 3 hours then I will be changing projects, because I will not continue to lose countless hours of work.



can you give us more information. what was the job id? can you link us to your job information?

DK

S_Koss

Joined: Jan 7 10
Posts: 4
ID: 365793
Credit: 37,252
RAC: 0
Message 64921 - Posted 11 Jan 2010 20:57:40 UTC

Hi, so let me try to explain this better. If you have 4, 8, or 12 WUs in varying degrees of completion and you shut down for the night (I shut 2 of my 3 computers down at night), the average loss will be higher than with 1-hour WUs. When you restart the next morning and lose everything you did the night before, it gets frustrating, and it has been so for the past 4 days. That is why I am not really interested in your project. I have since detached from your project, so I cannot give you WU numbers.

Thank you.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 64924 - Posted 12 Jan 2010 4:21:20 UTC

Steve, most Rosetta work units save a checkpoint every 15 minutes or so, striking a balance between losing CPU effort and keeping checkpoint overhead and disk writes low (even in your case, where you power off each day, 90% of the checkpoints are never actually needed). So, on average, you should see less than 7.5 min of CPU time lost when powering off.
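
As a sanity check on that figure (a minimal simulation with assumed values): with a checkpoint written every 15 minutes and a power-off landing at a uniformly random moment within the interval, the expected loss is half the interval:

```python
import random

def average_loss(checkpoint_min=15, trials=100_000, seed=1):
    """Average CPU minutes lost when shutdown falls at a random
    point between two checkpoints."""
    rng = random.Random(seed)
    total = sum(rng.uniform(0, checkpoint_min) for _ in range(trials))
    return total / trials

print(f"~{average_loss():.1f} min lost on average")  # close to 7.5
```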

Sarel is making the needed changes (posted here) so this will be true for his new type of work units as well. I just don't want to see you leave a very worthwhile project for the wrong reasons. Mad Max's Post on Saturday the 9th was one of the first posts that was specific enough to identify the problem, and yours then confirmed the issue. And here we are Monday the 11th, and the problem is being addressed.
____________
Rosetta Moderator: Mod.Sense

DJStarfox

Joined: Jul 19 07
Posts: 140
ID: 191721
Credit: 560,560
RAC: 21
Message 64939 - Posted 12 Jan 2010 20:23:25 UTC

Even if you ignore Steve's experience with this project, I hope you recognize that one point has been made clear repeatedly. Checkpoints are a critical feature of BOINC applications. If you need to make checkpoints work within a single decoy's generation, then make it happen.

Given that, there's nothing wrong with doubling the default/minimum run times.

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 4,704,855
RAC: 9,235
Message 64957 - Posted 14 Jan 2010 1:51:55 UTC - in response to Message ID 64924.
Last modified: 14 Jan 2010 2:18:58 UTC

Steve, most Rosetta work units save a checkpoint every 15 minutes or so, striking a balance between losing CPU effort and keeping checkpoint overhead and disk writes low (even in your case, where you power off each day, 90% of the checkpoints are never actually needed). So, on average, you should see less than 7.5 min of CPU time lost when powering off.


From my observations (after I faced a similar problem, I watched the disk writes of the Rosetta application for a while), the majority of WUs wrote checkpoints even more often - about once every 1-2 minutes (I think according to the setting in BOINC, which defaults to 60 seconds).
The exceptions were two types of WUs - one did not write checkpoints at all (as you noted, this problem is already localised and a fix for it should be included in the new version, Rosetta Mini 2.05), and another wrote checkpoints as usual but, after restarting for any reason, could not use them (or did not try at all).
If a job of the 2nd type comes my way again, I will try to catch it.
I think an indirect marker of such tasks should be a bad ratio between "claimed credit" and "granted credit" (on the scale of a particular computer). As in this case: http://boinc.bakerlab.org/rosetta/result.php?resultid=309578283

I think that with the complete server statistics it should be possible to sort tasks by this ratio and see in which types of tasks bad ratios occur most often.
By this criterion, tasks with one (or both) of the following defects should "surface":
1. Problems with the checkpoint mechanism
2. Poor optimisation (executing more slowly than the others)

But so far I have no ideas on how to separate the one from the other...

P.S.
I am impressed by the speed of the response (only a few days between the bug report and its fix); compared with many other projects, that is very fast feedback.

Link
Avatar

Joined: May 4 07
Posts: 260
ID: 173059
Credit: 337,463
RAC: 237
Message 64966 - Posted 14 Jan 2010 14:33:13 UTC
Last modified: 14 Jan 2010 14:35:02 UTC

@S_Koss: why don't you just hibernate the systems instead of shutting them down? It works perfectly for me and I don't lose even one second of work.

BTT: no problem for me if the default run times are increased; I run WUs for 12-24 hours.
____________
.

Rabinovitch Profile
Avatar

Joined: Apr 28 07
Posts: 28
ID: 170444
Credit: 1,377,008
RAC: 1,448
Message 64973 - Posted 14 Jan 2010 17:07:41 UTC - in response to Message ID 56932.

We are planning to increase the default run time from 3 hours to 6 hours and the minimum from 1 to 3 hours to reduce the load on our servers.


Nice idea.

And what about increasing the maximum crunching time? I am ready to crunch even for several days if necessary, or to crunch until all models are processed. What about a checkbox like "Work till the end"? :-)

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 4,704,855
RAC: 9,235
Message 64981 - Posted 14 Jan 2010 21:44:43 UTC - in response to Message ID 64957.
Last modified: 14 Jan 2010 21:53:40 UTC


From my observations (after I faced a similar problem, I watched the disk writes of the Rosetta application for a while), the majority of WUs wrote checkpoints even more often - about once every 1-2 minutes (I think according to the setting in BOINC, which defaults to 60 seconds).
The exceptions were two types of WUs - one did not write checkpoints at all (as you noted, this problem is already localised and a fix for it should be included in the new version, Rosetta Mini 2.05), and another wrote checkpoints as usual but, after restarting for any reason, could not use them (or did not try at all).
If a job of the 2nd type comes my way again, I will try to catch it.
I think an indirect marker of such tasks should be a bad ratio between "claimed credit" and "granted credit"


I did not have to wait long - it seems I got one of these tasks just now. I will post the "report" in the appropriate topic a bit later: minirosetta 2.03

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 64982 - Posted 14 Jan 2010 21:46:36 UTC - in response to Message ID 64973.

And what about increasing the maximum crunching time? I am ready to crunch even for several days if necessary, or to crunch until all models are processed. What about a checkbox like "Work till the end"? :-)


There is no end. There are literally trillions of trillions of possible models. The current 24hr maximum attempts to strike a balance between getting results back to the project with a fast turnaround time, and minimizing burden on servers and bandwidth for downloads. Originally the maximum was 4 days, but just think if a problem arose and you ran for 4 days before the watchdog realized it and kicked in to end the task.
____________
Rosetta Moderator: Mod.Sense

Nuadormrac

Joined: Sep 27 05
Posts: 37
ID: 1352
Credit: 75,798
RAC: 0
Message 66124 - Posted 15 May 2010 0:54:02 UTC - in response to Message ID 56983.

This also brings up another issue with such a possible increase; though it's a credit-related one, so it might not take the same precedence. And yet, depending on how the units are treated, it might affect the science as well.

If the processing time is increased, and the unit deadlocks, hangs, or in some way crashes after the initial model(s) have been successfully processed, it will, after whatever time is spent hanging, error out. And yet not everything in the WU was bad. Since these units don't involve the run times of a CPDN unit, it's unlikely that trickles would be introduced.

However, one effect of lengthening the runtime is that a unit has a higher chance of erroring out later on; if this occurs, any science accumulated in the models completed before the failing one could be lost, and the credits for those models most assuredly would be, unless something along the lines of trickles or partial validation/crediting could be implemented to allow the successfully processed models within the unit to be validated and counted as such.

I understand completely the motivation behind increasing the default run time and if I only received Rosetta Beta 5.98 WUs I'm sure I'd hold to that default successfully.

But as I report here (and previously), I get Mini Rosetta WUs constantly crashing out with "Can't acquire lockfile - exiting" error messages - maybe a 60% failure rate with a 3-hour runtime, reducing to a 40% failure rate with a 2-hour runtime.

I've seen this reported by several other people running a 64-bit OS - not just on Vista or with an AMD machine. That said, I don't know how widespread it is. Perhaps you can analyse results at your end.

As stated in the post linked above, I get no errors at all with Rosetta Beta, so I'm inclined to think it's not some aberration with my machine. I'd really like to see some feedback on this issue and some assurance it's being investigated in some way.

I'd ask that a minimum run time of 2 hours be allowed (I can just about handle that), or some mechanism that allows me to reject all Mini Rosetta WUs. If not, I'm prepared to abort all Mini Rosetta WUs before they run. It's really a waste of time for me to receive them if 60% are going to crash out on me anyway.

I've commented on this before here, here, here and first of all and more extensively here - see follow-up messages in that thread.

No such issues arose for me with my old AMD single core XPSP2 machine - only when I got this new AMD quad-core Vista64 machine.

Any advice appreciated. It's a very big Rosetta issue for me, so while I'm sure you'll save a whole load of bandwidth if you go ahead with the proposed changes I just hope some allowance can be made for people in my situation.


____________

Nuadormrac

Joined: Sep 27 05
Posts: 37
ID: 1352
Credit: 75,798
RAC: 0
Message 66125 - Posted 15 May 2010 1:05:59 UTC - in response to Message ID 57055.
Last modified: 15 May 2010 1:10:03 UTC

If the task failed, then for some reason it is not running well on your machine. It is more conservative to replace it with another task that may run better for your environment. In other words if model 1 or 2 failed from this task, let's not push our luck with more. Better to get word back to the project server about the failure sooner. Perhaps there is a trend that will indicate similar future work should be held until a specific issue is resolved.


If model one failed, then yes, it wouldn't serve people well, and it can reasonably be argued that the tasks aren't working well on the machine. But then a longer WU time wouldn't affect things much if the unit was aborted early on and a new unit needed to be downloaded (for instance, 5 minutes after starting). That's well below even the existing preferences.

Where this is more likely to be an issue is if, let's say for the sake of argument, 20 models completed successfully and for whatever reason model number 21 failed. Say the unit had been running 2.5 hours by then. Only if partial validation for the 20 models occurs would one avoid losing 20 models (vs. just 1); otherwise the user would lose the whole 2.5 hours' worth of credits, vs. just the amount lost for the one model.

Now, arguably, I haven't tended to see units fail much on Rosetta (though some have, for there to be discussion, along with a recommendation on the team page for the Pentathlon challenge). But in the past I had seen it from time to time on RALPH, which is good, because it means many problems are being caught in the alpha/beta stage before being released to people in general. But it can be a consideration.

For crunchers, there can be 2 big considerations with this proposed change. One is the effect on the BOINC queue; the other is a reason shorter run times can be chosen/preferred: less likelihood of running into the odd error, since there is a smaller span of time in which it can occur, and a smaller impact in terms of lost credits if it does happen.

For you, there's server load on the one hand, but also the potential for losing models/work already completed on the other. (Given we're talking about a change from a 1- to a 3-hour minimum and from a 3- to a 6-hour default, units which error out in under 1 hour aren't a consideration with such a change, as they'd get thrown out and a new download would occur anyhow. Hence I'm presuming the first model or two has had a successful run, for the task to then error out prior to either 1 or 3 hours respectively. And yes, I know a few models do run for 2 hours or so, though many end earlier.)
____________

Warped

Joined: Jan 15 06
Posts: 44
ID: 50853
Credit: 1,336,001
RAC: 225
Message 67549 - Posted 2 Sep 2010 14:22:08 UTC
Last modified: 2 Sep 2010 14:24:14 UTC

Am I correct in assuming that this proposal has been shelved?

Furthermore - please excuse my ignorance about the way the project works - am I correct in the following statements?
1. Each work unit is pre-populated with 99 (or 100) models.
2. The work unit stops when the pre-selected run time is reached or the 99 models have been run, whichever comes first.
3. In the case that the run time causes the work unit to end, the models remaining untested are discarded and not used for future work units.
4. There are (for practical purposes) an infinite number of possible models, so, assuming point 3 to be correct, discarding the untested models is not an issue.
5. Given that the possible models are "infinite", there should never be a shortage of work units.
6. Shorter work units increase the server load but reduce the risk of crashing before completion or of the watchdog picking up an error.
7. Longer work units reduce the server load and reduce the risk of running out of work in the case of server issues such as we have recently experienced.
____________
Warped

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 67551 - Posted 2 Sep 2010 16:05:52 UTC - in response to Message ID 67549.

Am I correct in assuming that this proposal has been shelved?

Furthermore - please excuse my ignorance about the way the project works - am I correct in the following statements?
1. Each work unit is pre-populated with 99 (or 100) models.
2. The work unit stops when the pre-selected run time is reached or the 99 models have been run, whichever comes first.
3. In the case that the run time causes the work unit to end, the models remaining untested are discarded and not used for future work units.
4. There are (for practical purposes) an infinite number of possible models, so, assuming point 3 to be correct, discarding the untested models is not an issue.
5. Given that the possible models are "infinite", there should never be a shortage of work units.
6. Shorter work units increase the server load but reduce the risk of crashing before completion or of the watchdog picking up an error.
7. Longer work units reduce the server load and reduce the risk of running out of work in the case of server issues such as we have recently experienced.


Let me take my best shot at these; I cannot confirm the status of the original proposal.

1. Not entirely correct. It just starts with a seed to a random number generator. It can generate any number of starting models from that. But some specific protocols were limited to producing 100 models, because the upload file sizes became quite large.

2. It won't interrupt a model in progress just to cut off at exactly the configured runtime preference. But it will try to avoid beginning the next model if it would be predicted (as based on the prior models in your own task) to run too long. And if it doesn't stop running within 4 hours of the configured runtime preference, that is when the watchdog is there to wrap things up.

3. Correct. Using a Monte Carlo approach means that a sampling of the search space yields your estimate of the answer, and so the specific models that would have been run had that specific task continued (or had it run on a faster CPU) are not specifically relevant. So long as the overall search space is adequately sampled, exactly which models are examined is not critical.

4. Correct.

5. Not correct. Any server is going to have some limit on the amount of outstanding and completed work it can keep track of. And any project team is going to have to review the results to try and gain insights. So if everyone goes on vacation for the holidays, it doesn't make any sense to keep sending out work just because there is no limit to the sampling of the search space that is POSSIBLE. It only makes sense to send work you will have enough staff to review. And it only makes sense to sample the search space to some limited degree. The objective, ultimately, is to be able to come up with a better, more accurate answer with fewer samples.

6. Some volunteers have reported this. "Crashing" is a relative term. Generally, models completed prior to any problem encountered are reported back and granted credit, so the specific fate of the last model of the task is not going to lose the good models you completed before it. If a given protocol has a quirk where some fraction of models end up running longer than 4 hours, then yes, by running with a longer runtime you increase the number of models you begin, and therefore the odds of encountering one that runs for a long time. But if your alternative is to pick up another short-runtime work unit with the same odds and begin running on it... you are exposed to the same chance of hitting a long-running model that requires watchdog intervention. There have been cases where errors were not wrapped up as cleanly as desired. But many of the reports of "crashing" fail to take into account the nightly credit-granting script, which grants credit even after the validator has run on the task.

7. A longer-running task and less frequent server contact help reduce server load, certainly. But if you reduce your server contact to once per day rather than 10 times, and the server is not available at that time, you are still out of work (if you have no additional buffer of work). On the other hand, the BOINC client tends to contact the server several hours before it estimates the current work will complete, and so, on average, a longer runtime would tend to help you ride through short outages given the same "additional days" of work settings. Odds improve that you will still be crunching on a 24hr task during a short outage, and so it will pass without you even knowing it. If you hit the server 10 times per day, odds are you will notice any 3hr or longer outage. It's just a question of whether you still have the same few hours of work left.

In other words, if you have no cache of work, the cushion the client builds in when it requests work before you absolutely run out goes a long way, because if the server is still up, you'll probably be set for another day. And if the server is down, there's a reasonable chance you'll still get more work before completing the tasks you have in progress. So a day-long runtime would be similar to a short runtime with a 1-day additional buffer; say, a 3hr runtime with a 21hr additional buffer would be roughly the same as a 24hr runtime and a zero buffer.
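The model-selection and watchdog behaviour described in points 2 and 6 above can be sketched roughly as follows. This is a simplified illustration only: the 4-hour watchdog margin is taken from this thread, but the function names and the "predict the next model from the prior ones" rule are assumptions, not the actual Rosetta code.

```python
# Sketch of the per-task model loop: keep starting models until the
# next one is predicted to overshoot the target runtime; a watchdog
# ends the task if it runs 4 hours past the target.

WATCHDOG_MARGIN_H = 4.0  # margin quoted in this thread

def run_task(target_runtime_h, model_times_h):
    """Simulate which models a task would run.

    model_times_h: per-model runtimes (hours) this host would need.
    Returns (models_completed, watchdog_fired).
    """
    elapsed = 0.0
    done = []
    for t in model_times_h:
        # Predict the next model's cost from models already finished
        # in *this* task; the first model always starts.
        predicted = sum(done) / len(done) if done else 0.0
        if done and elapsed + predicted > target_runtime_h:
            break  # would likely overshoot the target: stop cleanly
        elapsed += t
        if elapsed > target_runtime_h + WATCHDOG_MARGIN_H:
            return len(done), True  # watchdog wraps things up
        done.append(t)
    return len(done), False

# A 3h target with half-hour models stops cleanly after 6 models:
print(run_task(3.0, [0.5] * 20))  # (6, False)
# A single model that would take 8h trips the simulated watchdog:
print(run_task(3.0, [8.0]))       # (0, True)
```

The second call also shows why completed models are not lost: the count of finished models is known before the watchdog fires.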
____________
Rosetta Moderator: Mod.Sense

Warped

Joined: Jan 15 06
Posts: 44
ID: 50853
Credit: 1,336,001
RAC: 225
Message 67553 - Posted 2 Sep 2010 19:43:35 UTC

Thanks for the detailed response, Mod.Sense. It certainly helps me understand how best I can contribute.

John M. Kendall Profile

Joined: Dec 8 05
Posts: 3
ID: 32433
Credit: 3,737,127
RAC: 2,304
Message 68741 - Posted 3 Dec 2010 5:39:10 UTC

The "To completion" time needs to be longer. Most of the work units end up running at High Priority.
____________

transient
Avatar

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,590,569
RAC: 2,277
Message 68742 - Posted 3 Dec 2010 7:16:36 UTC

Maybe you should consider setting your work buffer (days of work) to something lower.
____________

sgaboinc

Joined: Apr 2 14
Posts: 170
ID: 498515
Credit: 125,409
RAC: 0
Message 76994 - Posted 12 Jul 2014 3:08:52 UTC

it looks like this is an old thread.
however, i'm against increasing the default run time beyond 3 hours.

the reasons are that i think many in the home community do not leave their PCs on 24x7, let alone crunch boinc round the clock. i think there are also many who crunch boinc/rosetta@home occasionally, and there are new users. having too long a run time would discourage these groups, who may abandon the project altogether because it is too long a wait to see results/feedback (e.g. on the tasks web page), or simply requires the PC to be on for too long to get results. electricity is not necessarily free or cheap around the world for those who participate in the project. and, very importantly, poorly configured PCs can run with loud fans, which would irritate participants having to put up with longer periods of the PC being on and processing data.

changing the run time to 6 hours also does not resolve the issue of work being retrieved or submitted concurrently. i'd think the high-traffic situation occurs in spikes.

in addition, average consumer CPUs today show staggering improvements in processing speed compared to even just 5 years ago; benchmarks range from 10 to 100 times faster than the old single-core Pentiums, P4s, Athlons etc. that means a single iteration (decoy/model) of the same task now takes 1/10 to 1/100 of its original run time on a modern CPU.

i'd think what could be done is to look at the protocols (e.g. boinc, or even the rosetta app itself) to see if some improvements can be made so that submissions/retrievals may perhaps be staggered.

other possibilities could be mirrors, possibly hosted by partners, or even a more sophisticated peer-to-peer protocol. after all, as rosetta@home is a distributed computing project, it should be possible to have distributed boinc servers handling distributed work issue and submission. there are many examples of such successes (e.g. bit-torrent file distribution networks), but it would require some protocol changes, perhaps at the boinc level and even in the clients.

sgaboinc

Joined: Apr 2 14
Posts: 170
ID: 498515
Credit: 125,409
RAC: 0
Message 77020 - Posted 16 Jul 2014 13:22:40 UTC
Last modified: 16 Jul 2014 13:33:15 UTC

i'd like to present a suggestion:
i reviewed some docs on boinc, and apparently this can be somewhat of a challenge to implement, but i'd like to share the thought anyway:

a minimum run time of 3 hours and a default of 6 hours can be the standard values.

however, these values could be provided as *computing preferences* which users can update in their user accounts on the 'computing preferences' page.

the idea is that the minimum and maximum run times are a sort of 'custom preference' specific to the project (rosetta@home) and to the user (and even to the host).

when the user's boinc client connects, it downloads the 'custom prefs' and saves them in an xml configuration file, perhaps in the project directory.

when minirosetta starts on the user's PC, it reads the 'custom prefs' as part of initialization. it can fall back to the global defaults if the values are not specified or fall outside the 'valid ranges'.

other possible custom prefs could have users indicate their preferences for (priorities of) small/medium/large/complex tasks, which the scheduler on the server might use to send out the relevant tasks. however, i'm not sure whether this is already part of boinc today, just fully automated.

--------------
i hope this may solve some problems:
1) i have noted that some recent tasks/models are apparently pretty large and possibly very complex. on my pc, which runs a recent Intel Haswell i7 (probably considered a decently 'fast' consumer CPU), i've seen it complete only a single (or very few) models/decoys in the 3-hour time frame, while for some of the simpler jobs it completes as many as close to a hundred models/decoys in that same time frame.

this may result in too few results being produced for the larger, more complicated structures.

having a longer minimum run time could help the large, complex tasks complete more models/decoys.

2) some users may own somewhat slower PCs and need more time to complete the tasks, or may want the tasks to produce more models/decoys for each job (which would mean needing a longer run time).

---------------
however, as i elaborated previously, commitment to this 'default run time' duration depends very much on the user's specific circumstances, and the global defaults should not be so 'onerous' that they discourage new users or the occasional 'light' volunteers,

while there are also others who probably leave a host (PC/server) crunching boinc/rosetta round the clock, 24x7.

i.e. one size fits all is probably a bad idea, and letting users specify this as a custom 'computing preference' relevant to rosetta@home is probably a way to alleviate/resolve it.

just my 2 cents
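the fall-back-to-defaults idea above could look something like this sketch. to be clear, the element names (<min_runtime_h>, <target_runtime_h>) and the file layout are made up for illustration; they are not real boinc or rosetta settings.

```python
# Sketch: read hypothetical run-time preferences from a project XML
# file, falling back to global defaults when a value is missing,
# malformed, or outside the valid range.
import xml.etree.ElementTree as ET

DEFAULTS = {"min_runtime_h": 3.0, "target_runtime_h": 6.0}
VALID_RANGE = (1.0, 24.0)  # hours; an assumed 'valid range'

def load_runtime_prefs(xml_text):
    root = ET.fromstring(xml_text)
    prefs = {}
    for key, default in DEFAULTS.items():
        node = root.find(key)
        try:
            value = float(node.text)
        except (AttributeError, TypeError, ValueError):
            value = default  # missing or malformed: global default
        else:
            lo, hi = VALID_RANGE
            if not lo <= value <= hi:
                value = default  # out of range: global default
        prefs[key] = value
    return prefs

# min_runtime_h is absent, so it falls back to 3.0;
# target_runtime_h is taken from the file.
print(load_runtime_prefs(
    "<prefs><target_runtime_h>12</target_runtime_h></prefs>"))
```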

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 77027 - Posted 17 Jul 2014 21:10:08 UTC

My first thought on reading your post was that this is what is currently supported (via the Rosetta-specific preferences, configured via the website). But I think what you are suggesting, which would be different, is essentially to tag the tasks with some relative size and then allow the user to configure whether or not they want to accommodate that size.

I guess even better would be to get BOINC Manager to do such selection for you. So it would be automatic that if you have a smaller machine or don't run very long each day that these tasks would not be sent. And it would seem as though this could be done by establishing appropriate memory and FLOP estimates on the tasks. Perhaps based upon reported results from Ralph@home. But that gets to be sticky with the existing runtime preference setting, and how BOINC Manager normalizes the runtimes.
____________
Rosetta Moderator: Mod.Sense

sgaboinc

Joined: Apr 2 14
Posts: 170
ID: 498515
Credit: 125,409
RAC: 0
Message 77623 - Posted 31 Oct 2014 11:57:50 UTC
Last modified: 31 Oct 2014 12:55:48 UTC

today:
Total queued jobs: 9,095,679
In progress: 298,286
Successes last 24h: 115,117

perhaps it is time to reduce the default run time for everyone?

thanks modsense, would try out the boinc-client/manager preferences 1st

shanen Profile
Avatar

Joined: Apr 16 14
Posts: 83
ID: 500533
Credit: 3,879,280
RAC: 5,674
Message 78885 - Posted 7 Oct 2015 1:55:45 UTC

Seems to me like there are several issues mixed together here, and I've just spent a long time trying to sort out the thread without being able to find a clear focus. Let me try to break it down another way:

There seem to be three objectives:

1. Doing good by solving problems.
2. Earning credit for doing that good.
3. Productively using computer cycles that would otherwise be wasted.

I was going to start breaking it down into tradeoffs, but almost all of the cases kept coming back to wasted effort on my part (or on my computer's part). For example, the original idea of this thread was to have longer run times, but that causes more conflicts with my normal usage of my computers. I'm already noticing how the long checkpoint intervals tend to result in lost work each time a computer is started or shut down.

I've mostly been focused on projects that don't do any checkpoint for an hour or longer. For several reasons I feel it is better to shut down properly rather than sleep, but if I check the status of the in-progress work, I often find that hours of work will be discarded unless I sleep the machine... The system is complicated and unreliable, and the only safe guideline seems to be favoring the smallest work units with the most frequent checkpoints and the longest deadlines.

Just seeming too complicated and confusing, which is why I dropped my previous projects (after earning over 1 million and almost 300,000 "Work done" points). Right now I'm leaning towards doing what I can to help their bandwidth problems by dropping rosetta@home (after earning 1.4 million points). (In the ancient pre-BOINC days when seti@home was the only game, I had worked my way up to top 1% status, but I always felt that project was pointless.)

Not likely, but maybe someone can represent a BOINC project that is NOT so troublesome?

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 78886 - Posted 7 Oct 2015 3:19:22 UTC

@shanen, I'll just point out that the target runtime you set on R@h has no impact on the frequency of a task checkpointing. Most R@h tasks checkpoint every 15 minutes or so. Some types of tasks can take over an hour, but they are less common.

The runtime just determines how many models your machine will work on for a given protein challenge. More models completed means more credit, and more crunch time before using bandwidth to get a new task. This thread is discussing the runtime. It actually sounds like you are more interested in the frequency of checkpointing. If that is the case, feel free to open a new thread. The number crunching board is probably the best place to discuss that topic further.
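As a quick sanity check of the "half the checkpoint interval" figure quoted earlier in this thread, here is a toy simulation (not anything from the Rosetta code): if checkpoints land every 15 minutes and the machine is switched off at a uniformly random moment, the expected loss is interval/2 = 7.5 minutes.

```python
# Toy Monte Carlo check: with checkpoints every `interval_min` minutes
# and a shutdown at a uniformly random moment, the average work lost
# since the last checkpoint converges to interval_min / 2.
import random

def average_loss(interval_min, trials=100_000, seed=1):
    rng = random.Random(seed)
    # time elapsed since the last checkpoint at the moment of shutdown
    losses = (rng.uniform(0.0, interval_min) for _ in range(trials))
    return sum(losses) / trials

print(average_loss(15.0))  # close to 7.5 minutes
```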
____________
Rosetta Moderator: Mod.Sense

Link
Avatar

Joined: May 4 07
Posts: 260
ID: 173059
Credit: 337,463
RAC: 237
Message 78890 - Posted 8 Oct 2015 7:36:52 UTC - in response to Message ID 78885.

shanen wrote:
Not likely, but maybe someone can represent a BOINC project that is NOT so troublesome?

Well, Seti@Home is still checkpointing pretty much as often as you want, or Milkyway@Home was checkpointing on the CPU as often as you want (IIRC, I run it mostly on my GPU without any checkpointing at all).

You can simply use the results of WUProp@home to find a project; after all, that's why people let it run on their computers: to help others find suitable projects for their hardware.

... or you could simply hibernate your computer instead of shutting it down; there's nothing wrong with that. Since XP came out, I restart only when really needed, for example after some updates.
____________
.

shanen Profile
Avatar

Joined: Apr 16 14
Posts: 83
ID: 500533
Credit: 3,879,280
RAC: 5,674
Message 78896 - Posted 11 Oct 2015 2:23:47 UTC

Well, I just made a second attempt to start a thread in the new direction as suggested, but it seems I failed again.

Let me try to put a quick wrapper around the problem. I do NOT want to spend a lot of time trying to figure out why a BOINC project seems to be failing to make any progress, or even to understand how the project website works. Nor do I want to substantially modify my computer usage habits for the greater convenience of the BOINC projects.

I just want to donate the available cycles to do some good. The main reason I abandoned the last two BOINC projects I was supporting was because of complexities in their operations.
____________
Freedom = (Meaningful - Constrained) Choice != (Beer^3 | Speech)

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,546,016
RAC: 7,460
Message 78898 - Posted 12 Oct 2015 2:16:46 UTC - in response to Message ID 78896.

Well, I just made a second attempt to start a thread in the new direction as suggested, but it seems I failed again.

Let me try to put a quick wrapper around the problem. I do NOT want to spend a lot of time trying to figure out why a BOINC project seems to be failing to make any progress, or even to understand how the project website works. Nor do I want to substantially modify my computer usage habits for the greater convenience of the BOINC projects.

I just want to donate the available cycles to do some good. The main reason I abandoned the last two BOINC projects I was supporting was because of complexities in their operations.

Maybe I'm being very stupid, but I looked at a couple of your machines and you seem to complete 15-30 tasks a day. These will have had zero downtime. On the assumption that you shut down once a day (I may be wrong), you might be losing just a few minutes since the last checkpoint of each of your running tasks. That's as close to zero (for the day) as makes no difference.

I agree the FKRP tasks seem to take a long time - some hours - to reach their first checkpoint, but only those tasks.

Dare I say it, you seem to be wasting far more time trying to micro-manage tasks (and writing about them) than if you just let them run. If you want to save a whole heap of time, don't go checking your downloads at all, cherry-picking ones to delete. As long as there's enough time left in your day to complete them, they'll sort themselves out without any downtime and without micro-managing what Boinc and the tasks do routinely anyway.

In answer to a question you asked in one of your threads: there isn't a problem, so I don't do anything about them and just get on with my day.

That said, maybe I've misunderstood or completely missed the issue. It wouldn't be the first time.
____________

Message boards : Number crunching : Discussion on increasing the default run time



Copyright © 2017 University of Washington

Last Modified: 10 Nov 2010 1:51:38 UTC