Discussion on increasing the default run time

Message boards : Number crunching : Discussion on increasing the default run time

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 56932 - Posted: 14 Nov 2008, 19:58:43 UTC
Last modified: 15 Nov 2008, 3:56:34 UTC

We are planning to increase the default run time from 3 hours to 6 hours and the minimum from 1 to 3 hours to reduce the load on our servers. There will be a transition period during which your client adjusts to the new run time, which will affect the number of tasks that are queued on your client. I've created this thread for a discussion on what would be the best way to transition to an increased run time. This obviously will only affect people with default run times (people who have not bothered to set this preference) or people who have set their run time to be less than 3 hours. (edit: not 6, whoops!)

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 56934 - Posted: 14 Nov 2008, 20:55:11 UTC
Last modified: 14 Nov 2008, 21:03:33 UTC

For people that pull a week of work at a time, due to infrequent internet connections, increasing the runtime from 3 to 6 hours would mean they get twice as much work as they can crunch.

Would it be possible to increase the default by something like 5 minutes a day? That would be so gradual that after a week you would be at 3:35, compared to the 3 hours previously (i.e. only a max of 18% variance). It would take a little over 5 weeks (36 days) to get all the way up to 6 hours, but the work flow should be pretty steady for the client. It shouldn't noticeably over- or underload with work.

[edit]
I guess for all the same reasons, a gradual change to the min. runtime would be required too.
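To make the gradual ramp concrete, here is a minimal sketch that steps the default run time up by 5 minutes a day. The function name and its parameters are made up for illustration; nothing here reflects how the project would actually roll out such a change.

```python
def ramp_schedule(start_min=180, target_min=360, step_min=5):
    """Return the default run time (in minutes) at the end of each ramp day.

    Hypothetical helper: start at 3 hours, add step_min per day,
    and stop once the 6-hour target is reached.
    """
    schedule = []
    current = start_min
    while current < target_min:
        current = min(current + step_min, target_min)
        schedule.append(current)
    return schedule


schedule = ramp_schedule()
after_one_week = schedule[6]      # run time at the end of day 7
days_to_target = len(schedule)    # number of daily steps needed

print(f"After 7 days: {after_one_week // 60}h{after_one_week % 60:02d}m")
print(f"Days to reach 6h: {days_to_target}")
```

The same function could be reused for the minimum run time by passing `start_min=60, target_min=180`.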
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 56935 - Posted: 14 Nov 2008, 20:58:39 UTC

Anyone that wants to avoid such problems could always change their runtime from the default at a time of their choosing, either before or during such a transition.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,632,350
RAC: 1,054
Message 56937 - Posted: 14 Nov 2008, 21:40:23 UTC
Last modified: 14 Nov 2008, 21:48:52 UTC

...assuming they knew about the proposed change. How many of the crunchers actively read the forums? I suspect a very small number. How about a "Rosetta News Letter" mass mailing?

If it was a problem, why didn't the project ask "the regulars" to change their default run time ages back? That might have bought some time or even alleviated the issue.

I've just changed all mine.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 56941 - Posted: 15 Nov 2008, 1:25:27 UTC

Actually, it would be awesome to have a fixed number of models per WU. That way it would be way easier to compare CPU performance by just seeing how fast the CPU can crunch a WU.

I miss that from SETI@Home :(

Gavin Shaw
Joined: 1 Feb 07
Posts: 10
Credit: 506,456
RAC: 0
Message 56942 - Posted: 15 Nov 2008, 1:52:26 UTC - in response to Message 56932.  

This obviously will only affect people with default run times or people who have set their run time to be less than 6 hours.


Perhaps I'm just thick or slow (it is the weekend where I am), but how does changing the min time to 3hr and the default to 6hr affect me when I have my run time set to 4hr? It is still greater than the min time so nothing should change right?

Never surrender and never give up. In the darkest hour there is always hope.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 56943 - Posted: 15 Nov 2008, 3:33:38 UTC - in response to Message 56942.  

It is still greater than the min time so nothing should change right?


Right. You are not impacted by the proposed change to default run time, because you are not using the default. And you are not impacted by the proposed change to minimum runtime, because you are over the proposed new minimum runtime.
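The rule described here can be sketched in a few lines. This is only an illustration: `effective_run_time` is a hypothetical helper, and it assumes that a preference below the new minimum is simply raised to that minimum, which the thread does not spell out.

```python
def effective_run_time(preference_hours, default_hours, minimum_hours):
    """Return the run time a task would target, in hours.

    preference_hours is None when the user never set the preference.
    Assumption: values below the minimum are raised to the minimum.
    """
    chosen = default_hours if preference_hours is None else preference_hours
    return max(chosen, minimum_hours)


OLD = {"default_hours": 3, "minimum_hours": 1}
NEW = {"default_hours": 6, "minimum_hours": 3}

for pref in (None, 1, 2, 4, 8):
    before = effective_run_time(pref, **OLD)
    after = effective_run_time(pref, **NEW)
    status = "changes" if before != after else "unchanged"
    print(f"preference={pref}: {before}h -> {after}h ({status})")
```

Under these assumptions a 4-hour preference stays at 4 hours, while an unset preference moves from 3 to 6 hours.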
Rosetta Moderator: Mod.Sense

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 11,805,838
RAC: 0
Message 56945 - Posted: 15 Nov 2008, 3:47:23 UTC

There might be a downside to increasing the default run time: if a task takes abnormally long for any reason, it relies on the watchdog thread to stop it once it exceeds 3 times the preferred time (see below for an example). So if Rosetta gets stuck in an infinite loop or something, the amount of time wasted will be equal to 3 times the preferred time: clearly shorter preferred times are preferable in such a case.


206764478
Name 1hzh_2cxh_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_289_0
Workunit 188615593

<core_client_version>6.2.18</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
**********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 48690.3 seconds. Greater than 3X preferred time: 14400 seconds
**********************************************************************
called boinc_finish

</stderr_txt>
]]>
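
The numbers in the log above can be checked with a short sketch. This is not Rosetta's watchdog code; `watchdog_should_stop` is a hypothetical function that only mirrors the "greater than 3X preferred time" test quoted in the stderr output.

```python
def watchdog_should_stop(cpu_seconds, preferred_seconds, multiplier=3.0):
    """Return True once CPU time exceeds multiplier x the preferred run time."""
    return cpu_seconds > multiplier * preferred_seconds


# Values copied from the stderr output above.
cpu_seconds = 48690.3
preferred_seconds = 14400          # cpu_run_time_pref: 4 hours

limit_seconds = 3.0 * preferred_seconds
ratio = cpu_seconds / preferred_seconds
overrun_hours = (cpu_seconds - limit_seconds) / 3600

print(f"limit: {limit_seconds:.0f} s")
print(f"stopped at {ratio:.2f}x the preferred time")
print(f"time past the 3x limit: {overrun_hours:.2f} h")
print(f"should stop: {watchdog_should_stop(cpu_seconds, preferred_seconds)}")
```

In this one log the task was ended at roughly 3.4 times the preferred time, about an hour and a half past the 3x limit, so the real check evidently does not fire the instant the limit is crossed.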

Gavin Shaw
Joined: 1 Feb 07
Posts: 10
Credit: 506,456
RAC: 0
Message 56946 - Posted: 15 Nov 2008, 4:00:03 UTC

Though the watchdog doesn't seem to kick in until about 3.5x the desired time has elapsed. Perhaps it is giving the unit some time to finish off before booting it?

Never surrender and never give up. In the darkest hour there is always hope.

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 56957 - Posted: 15 Nov 2008, 13:54:50 UTC - in response to Message 56945.  

There might be a downside to increasing the default run time: if a task takes abnormally long for any reason, it relies on the watchdog thread to stop it once it exceeds 3 times the preferred time (see below for an example). So if Rosetta gets stuck in an infinite loop or something, the amount of time wasted will be equal to 3 times the preferred time: clearly shorter preferred times are preferable in such a case.

That's a good point. Perhaps the Watchdog should be more aggressive about aborting stuck workunits. Maybe it could abort the WU after 2x, or even 1.5x the specified crunching time. The old 3x with 3 hours is 9 hours, and 1.5x with the new 6 hours would still be 9 hours.
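The trade-off between run time and watchdog multiplier is easy to tabulate. The sketch below is plain arithmetic under the assumption that a stuck task burns exactly multiplier x run time before being aborted; `worst_case_hours` is a name invented for this example.

```python
def worst_case_hours(run_time_hours, multiplier):
    """Hours lost if a stuck task runs until the watchdog aborts it."""
    return run_time_hours * multiplier


cases = [
    (3, 3.0),   # current default, current watchdog
    (6, 3.0),   # proposed default, unchanged watchdog
    (6, 2.0),   # proposed default, 2x watchdog
    (6, 1.5),   # proposed default, 1.5x watchdog
]

for run_time, multiplier in cases:
    lost = worst_case_hours(run_time, multiplier)
    print(f"{run_time}h run time at {multiplier}x -> up to {lost:.1f}h wasted")
```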

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 56960 - Posted: 15 Nov 2008, 15:23:13 UTC

Yes, if the default runtimes were changed, the watchdog could be revised as well. The watchdog used to wait for 4x the preferred runtime.
Rosetta Moderator: Mod.Sense

Sid Celery
Joined: 11 Feb 08
Posts: 1960
Credit: 38,076,311
RAC: 6,958
Message 56983 - Posted: 16 Nov 2008, 4:02:54 UTC

I understand completely the motivation behind increasing the default run time and if I only received Rosetta Beta 5.98 WUs I'm sure I'd hold to that default successfully.

But as I report here (and previously) I get Mini Rosetta WUs constantly crashing out with "Can't acquire lockfile - exiting" error messages - maybe 60% failure rate with a 3-hour runtime, reducing to 40% failure rate with a 2-hour run time.

I've seen this reported by several other people running a 64-bit OS - not just on Vista or with an AMD machine. That said, I don't know how widespread it is. Perhaps you can analyse results at your end.

As stated in the post linked above, I get no errors at all with Rosetta Beta, so I'm inclined to think it's not some aberration with my machine. I'd really like to see some feedback on this issue and some assurance it's being investigated in some way.

I'd ask that a minimum run time of 2 hours is allowed (I can just about handle that) or some mechanism that allows me to reject all Mini Rosetta WUs. If not, I'm prepared to abort all Mini Rosetta WUs before they run. It's really a waste of time me receiving them if 60% are going to crash out on me anyway.

I've commented on this before here, here, here and first of all and more extensively here - see follow-up messages in that thread.

No such issues arose for me with my old AMD single core XPSP2 machine - only when I got this new AMD quad-core Vista64 machine.

Any advice appreciated. It's a very big Rosetta issue for me, so while I'm sure you'll save a whole load of bandwidth if you go ahead with the proposed changes I just hope some allowance can be made for people in my situation.

adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,632,350
RAC: 1,054
Message 56994 - Posted: 16 Nov 2008, 14:12:38 UTC
Last modified: 16 Nov 2008, 14:38:21 UTC

"Can't acquire lockfile - exiting"

That's familiar. Go to "Your Account", then "Computing Preferences", and check that "Use at most", at the bottom of the first block, is set to 100%. At some projects, that lock file error is common on systems where this is not set to 100%.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Sid Celery
Joined: 11 Feb 08
Posts: 1960
Credit: 38,076,311
RAC: 6,958
Message 56998 - Posted: 16 Nov 2008, 14:46:06 UTC - in response to Message 56994.  

adrianxw wrote:
"Can't acquire lockfile - exiting"

That's familiar. Go to "Your Account", then "Computing Preferences", and check that "Use at most", at the bottom of the first block, is set to 100%. That lock file error is common on systems where this is not set to 100%.

Thanks for the comment - very promising. I'm showing (sorry for the layout):

Processor usage
  Default  Home   Setting
  50       100    Use at most N per cent of CPU time (enforced by version 5.6+)

Disk and memory usage
  Default  Home   Setting
  5        100    Use at most N GB disk space
  0.1      0.001  Leave at least N GB disk space free
  50       50     Use at most N % of total disk space
  60       60     Write to disk at most every N seconds
  50       75     Use at most N % of page file (swap space)
  50       50     Use at most N % of memory when computer is in use (enforced by version 5.8+)
  90       90     Use at most N % of memory when computer is not in use (enforced by version 5.8+)

Specifically, which 'use at most' are you referring to? The one under processor usage?

My Default Computer Location is set to 'Home' if that makes a difference.

FalconFly
Joined: 11 Jan 08
Posts: 23
Credit: 2,163,056
RAC: 0
Message 57012 - Posted: 16 Nov 2008, 23:55:09 UTC - in response to Message 56998.  
Last modified: 17 Nov 2008, 0:09:10 UTC

I don't mind 6h default runtime, as that's what I'm using right now anyway.

I also wouldn't mind setting it higher, but:
Is it still correct that the Rosetta client can enter a deadlock and will not abort the WorkUnit before 2x (or even 4x?) of the scheduled runtime has elapsed?

At least that's what I remember from reading the Q&A a long time ago.
I don't have any problems getting an occasional Computing Error or stalled WorkUnit but would mind wasting 24h (or even more) of runtime.

If that's all history already and not valid anymore, I'd happily switch to 24h runtime.

Just thought I'd ask, as I'm about to set Rosetta to full throttle in my network.

-- edit --
I'm also seeing h001b_BOINC_ABRELAX_RANGE_yebf failing with Compute Errors (on different Systems including other Hosts of the Quorum)... Losing 2-5h of work is one thing, losing 12-23h would be more disappointing.

Right now (pending any "max time exceeded" related problems), that would be my only concern about increasing runtime significantly beyond what I have right now.

(It would be cool if correct/complete predictions from a failed WorkUnit, made before the error occurred, could be credited and counted. That way a model-induced compute error wouldn't really matter anymore, regardless of runtime.)

ejuel
Joined: 8 Feb 07
Posts: 78
Credit: 4,447,069
RAC: 0
Message 57016 - Posted: 17 Nov 2008, 1:08:39 UTC - in response to Message 56932.  

We are planning to increase the default run time from 3 hours to 6 hours and the minimum from 1 to 3 hours to reduce the load on our servers.



Can you please explain this in layman's terms? Are you stating that you are making changes on the server or on our clients? If on our clients, please explain what you mean. For example, are you making the Work Units twice as big/complex, which means my machine will take twice as long to crunch each WU? If you are talking about the server, are you stating that our client must wait at least 6 hours before connecting again for reporting or new WUs?

Again, your quote is very open ended and can mean a number of things.

Thanks.

-Eric

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 57019 - Posted: 17 Nov 2008, 2:15:00 UTC
Last modified: 17 Nov 2008, 2:16:26 UTC

ejuel, DK is talking about the Rosetta-specific preference for how long each task runs on your client machine. If you express no preference, the default is presently for tasks to run for 3 hours. But the drop-down list lets you choose a preference from 1 hour through 24 hours for each task.

This is just a preference. It's not a hard limit and, as is often discussed on the message boards, there are cases where tasks run well past the runtime preference. By increasing the minimum from 1hr to 3hrs, and the default from 3hrs to 6hrs, more tasks will execute more predictably and consistently within the established preference.

The net result of that is that your client (if running with default settings) runs through 4 tasks per day per core, rather than 8. It is still doing 24hrs of useful work to help the science of Rosetta@home, just running more models against each task before reporting the results back.

So, it is a change to the definition of the default value for your runtime preference, which is defined on the server side and affects every task run under the profile the setting pertains to.
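As a rough illustration of the "4 tasks per day per core, rather than 8" figure, here is a small sketch. It assumes the machine crunches around the clock and that every task runs for exactly its target time; `tasks_per_day` and the 4-core example are illustrative only.

```python
def tasks_per_day(run_time_hours, cores=1):
    """Tasks completed (and reported) per day when crunching 24 hours a day."""
    return 24 * cores / run_time_hours


for run_time in (3, 6):
    per_core = tasks_per_day(run_time)
    quad = tasks_per_day(run_time, cores=4)   # hypothetical 4-core host
    print(f"{run_time}h run time: {per_core:.0f} tasks/day per core, "
          f"{quad:.0f} on a 4-core host")
```

Half as many tasks per day means half as many downloads and result uploads for the same amount of CPU time, which is where the reduced server load comes from.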
Rosetta Moderator: Mod.Sense

ejuel
Joined: 8 Feb 07
Posts: 78
Credit: 4,447,069
RAC: 0
Message 57021 - Posted: 17 Nov 2008, 2:38:16 UTC - in response to Message 57019.  

ejuel, DK is talking about the Rosetta-specific preference for how long each task runs on your client machine. If you express no preference, the default is presently for tasks to run for 3 hours. But the drop-down list lets you choose a preference from 1 hour through 24 hours for each task.


Thanks...but a few follow-up questions:

1) Why are all my not-yet-processed WUs now predicting 9 hours 41 mins to process rather than 6 hours? 6 vs 9:41 is a big difference.

2) What will happen to the 15+ WUs I have that are not completed yet, but are due within 48 hours? Mathematically there is no way I can crunch through 15+ WUs in 48 hours if each WU will take 9:41 to finish.

3) I assume RAC will not change, since RAC is not counting the quantity of WUs but rather the work/time ratio done on those WUs.

Any other pitfalls we should consider?

-Eric

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 57025 - Posted: 17 Nov 2008, 14:56:38 UTC

1) BOINC "learns" how long it takes your machine to complete tasks for each project. One or more of your very recent tasks took closer to 10hrs to complete. And so BOINC estimates that future tasks may take about as long (not a valid assumption).

2) Existing WUs in your cache *ARE* affected by runtime changes. That is one of many reasons to discuss and consider the topic carefully before making such a change in the project. And so, if the change were made today, and you've got all that work due in 2 days, your machine would miss some deadlines and the tasks would not receive credit (or you would have to manually abort a few of them until your machine adjusts to the new runtime). Then things would be back to normal.

3) Correct. RAC will not be directly impacted by the change.
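To see how many cached tasks can make a deadline, a back-of-the-envelope sketch helps. It assumes every task shares the same deadline, the machine crunches 24/7, and each core runs one task at a time; `tasks_finishable` is a made-up helper and the core counts are hypothetical, since the thread does not say how many cores ejuel's host has.

```python
import math


def tasks_finishable(hours_until_deadline, hours_per_task, cores):
    """Number of whole tasks that can finish before a shared deadline."""
    per_core = math.floor(hours_until_deadline / hours_per_task)
    return per_core * cores


queued = 15
deadline_hours = 48
estimates = {
    "6h target": 6.0,
    "9h41m BOINC estimate": 9 + 41 / 60,
}

for label, hours_per_task in estimates.items():
    for cores in (1, 2, 4):   # hypothetical core counts
        done = tasks_finishable(deadline_hours, hours_per_task, cores)
        missed = max(queued - done, 0)
        print(f"{label}, {cores} core(s): {done} finish, {missed} miss the deadline")
```

Keep in mind that the 9:41 figure is BOINC's estimate rather than the actual run time, so the real outcome depends on how long the tasks end up running.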
Rosetta Moderator: Mod.Sense

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 11,805,838
RAC: 0
Message 57045 - Posted: 18 Nov 2008, 17:26:12 UTC

Is there a reason why the watchdog couldn't work at the level of each individual model rather than the task as a whole? That way, you'd avoid the potential extra time wastage that might happen with longer run times if a model goes haywire.

©2024 University of Washington
https://www.bakerlab.org