Shorter WU deadlines

Nightbird
Joined: 17 Sep 05
Posts: 70
Credit: 32,418
RAC: 0
Message 10489 - Posted: 5 Feb 2006, 22:11:37 UTC
Last modified: 5 Feb 2006, 22:12:23 UTC

@ David Baker
another change is that the maximum work unit length has been increased to eliminate (hopefully!) the time out problem.

Ehm, can you explain?
I'm not sure that I understand this.


Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 10495 - Posted: 6 Feb 2006, 9:28:32 UTC
Last modified: 6 Feb 2006, 9:31:14 UTC

When work is created it has a number which essentially says how many OPS it will take to complete. From this the max time is calculated. If the number was too low, there is not enough time to do all the operations so the Result times out.

==== edit

Results that would never complete are what this is meant to address. The problem is that if the time to complete/# operations to complete is highly variable, choosing the "right" value is difficult.
Nightbird
Joined: 17 Sep 05
Posts: 70
Credit: 32,418
RAC: 0
Message 10497 - Posted: 6 Feb 2006, 11:50:27 UTC - in response to Message 10495.  
Last modified: 6 Feb 2006, 11:51:52 UTC

When work is created it has a number which essentially says how many OPS it will take to complete. From this the max time is calculated. If the number was too low, there is not enough time to do all the operations so the Result times out.

==== edit

Results that would never complete are what this is meant to address. The problem is that if the time to complete/# operations to complete is highly variable, choosing the "right" value is difficult.

Thanks for your answer :)
But OPS? Operations, I guess, and a new acronym for your "BOINC Acronyms" ;)
And how is the max time calculated?



River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 10499 - Posted: 6 Feb 2006, 13:41:40 UTC - in response to Message 10497.  

When work is created it has a number which essentially says how many OPS it will take to complete. From this the max time is calculated. If the number was too low, there is not enough time to do all the operations so the Result times out.

==== edit

Results that would never complete are what this is meant to address. The problem is that if the time to complete/# operations to complete is highly variable, choosing the "right" value is difficult.

Thanks for your answer :)
But OPS? Operations, I guess, and a new acronym for your "BOINC Acronyms" ;)

Ops is a fairly standard abbreviation within computing. Ambiguously it can stand for 'operations' or 'operations per second' - you have to figure out which from the context.

Also Flops, which depending on context can mean floating point operations, or floating point operations per second.

And how is the max time calculated?

The project specifies a max number of ops for each workunit (or is it each type of WU?), and the client uses the benchmark to turn this into a max cpu time.
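As a rough sketch of that conversion, assuming the bound is a plain operation count and the benchmark is an operations-per-second figure: the function name and all numbers below are made up for illustration and are not taken from the BOINC client.

```python
def max_cpu_seconds(ops_bound: float, benchmark_ops_per_sec: float) -> float:
    """Turn a project-supplied operation bound into a CPU-time limit.

    Hypothetical helper: it only illustrates the division described above.
    """
    if benchmark_ops_per_sec <= 0:
        raise ValueError("benchmark must be positive")
    return ops_bound / benchmark_ops_per_sec


# Assumed numbers: one bound of 9e13 operations, evaluated on two hosts.
fast_host = max_cpu_seconds(9e13, 2.5e9)  # host benchmarked at 2.5e9 ops/s
slow_host = max_cpu_seconds(9e13, 0.5e9)  # host benchmarked at 0.5e9 ops/s

print(f"fast host limit: {fast_host / 3600:.1f} h")
print(f"slow host limit: {slow_host / 3600:.1f} h")
```

The same bound gives each host a different time limit, so a bound that is too low times out on every host, fast or slow.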
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 10505 - Posted: 6 Feb 2006, 14:49:31 UTC

In theory you can specify IntOPS and FLOPS separately, but few projects do a good study of this to set the proportions. Almost all concentrate on FLOPS only.

So, I was using OPS generically, as operations per second encompassing both floating point operations and integer operations. If you dig deep you can also find that there are other classes of number systems used, and each of these will also have characteristic speeds.

These systems include:
BCD - Binary coded decimal, used for exact decimal math
Boolean - though usually lumped with integer, can be slightly different in speed, usually not enough to matter (which is why it is lumped with integer)
Fixed Point - which is another number encoding scheme.

I know that there are a couple of others that are even less common, but I can't remember them off the top of my head.

And, FLOPS is actually sometimes different if you talk single precision vs. double precision. With IEEE 754 compliant systems this is usually not an issue, as the FPUs (usually) use 80-bit internal representations, and so the only difference is in the final output result.

This is another reason that optimizing code can change the output values. If I stay in 80-bit precision over more operations my numerical error propagation is reduced as I have more digits of accuracy. If I am pulling the numbers out of the FPU and converting them back to single precision and then back, well, the values will TEND to have slightly more error.
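A small sketch of that effect. Python has no 80-bit type, so this assumes a double stands in for the wide accumulator, and uses `struct` to force the running total back to single precision after every step. The helper names are illustrative only.

```python
import struct
from fractions import Fraction


def to_single(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step: float, n: int, narrow_each_step: bool) -> float:
    """Add `step` n times, optionally rounding the total to single each time."""
    total = 0.0
    for _ in range(n):
        total += step
        if narrow_each_step:
            total = to_single(total)
    return total


N = 100_000
step = to_single(0.1)        # identical input for both runs
exact = Fraction(step) * N   # exact rational reference value

wide = accumulate(step, N, narrow_each_step=False)
narrow = accumulate(step, N, narrow_each_step=True)

err_wide = abs(Fraction(wide) - exact)
err_narrow = abs(Fraction(narrow) - exact)

print(f"error keeping the wide total:   {float(err_wide):.3e}")
print(f"error narrowing at every step:  {float(err_narrow):.3e}")
```

Both runs perform the same additions on the same input. Only the precision of the intermediate total differs, and that alone changes the final answer.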

Not an issue in most cases. However in iterative systems minor changes in the accumulation of error can give amazing differences in final outputs because of these seemingly trivial changes. There are some good references in the Wiki ... look up floating point numbers. I try to summarize how a lot of this "works", though I am sure that the simplification makes it rather incorrect in detail.

The difficulties lie in the fact that floating point is a scheme that encodes numbers. It can only precisely define certain values. All other values cannot be represented (shown in the Wiki example). More interesting is the fact that the intervals between numbers that are representable are not constant. Where you are on the number line determines the "distance". Factors like these easily catch people off guard, including many of us that should know better. What is more distressing is the fact that many scientists do not really understand the fundamental issues and how they can affect the system they are coding.
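A quick way to see the uneven spacing, sketched with Python's standard `math.ulp` (available from Python 3.9), which returns the gap between a double and the next representable one. The sample values are arbitrary.

```python
import math

# The gap between neighbouring doubles grows with the magnitude of the value.
for x in (1.0, 1_000.0, 1e16):
    print(f"x = {x:g}: gap to the next double = {math.ulp(x):g}")

# 0.1, 0.2 and 0.3 are not exactly representable, so the sum misses 0.3.
print(0.1 + 0.2 == 0.3)

# Near 1e16 the gap is 2.0, so adding 1.0 is lost in rounding.
print(1e16 + 1.0 == 1e16)
```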

Part of these issues are what drove the debate between myself and Jack S. with regard to the random number generator used. I worked with a mathematician once on a system that derived a polynomial that represented the curves represented by a matrix of numbers. Over the decade I worked with him, never really understanding what the heck we were doing, I developed an acute sensitivity to the questions around and about as they are used in iterative systems.

With a RND Gen you have two concerns a) period, and b) distribution ...

An example of period is the Apple II, where the RND function was believed to have a period of 1G or so ... it turns out to have been roughly 17,000 before it began to repeat itself. One of the reasons many games on the early Apples were not that exciting ... :)
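To make "period" concrete, here is a toy sketch using a linear congruential generator. This is not the Apple II routine, and the parameters are chosen only to keep the numbers small. It covers the period concern only, not distribution.

```python
def lcg(a: int, c: int, m: int):
    """Return the step function x -> (a*x + c) mod m of a toy generator."""
    def step(x: int) -> int:
        return (a * x + c) % m
    return step


def cycle_length(step, seed: int) -> int:
    """Length of the cycle that the sequence starting at `seed` falls into."""
    first_seen = {}
    x, i = seed, 0
    while x not in first_seen:
        first_seen[x] = i
        x = step(x)
        i += 1
    return i - first_seen[x]


# Same modulus (16 possible states), different multiplier and increment.
good = cycle_length(lcg(a=5, c=3, m=16), seed=1)  # visits all 16 states
poor = cycle_length(lcg(a=3, c=0, m=16), seed=1)  # repeats after 4 values

print(f"good parameters: period {good}")
print(f"poor parameters: period {poor}")
```

Both generators have 16 possible states, yet the second one cycles through only 4 of them. That is the same kind of gap as between the believed and the actual period above.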

anyway, more than you wanted to know I am sure ...
Nightbird
Joined: 17 Sep 05
Posts: 70
Credit: 32,418
RAC: 0
Message 10512 - Posted: 6 Feb 2006, 20:49:56 UTC

anyway, more than you wanted to know I am sure ...

Indeed, but I don't have the feeling that I understand you. ;)


Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 10519 - Posted: 7 Feb 2006, 0:39:20 UTC - in response to Message 10497.  

When work is created it has a number which essentially says how many OPS it will take to complete. From this the max time is calculated. If the number was too low, there is not enough time to do all the operations so the Result times out.

==== edit

Results that would never complete are what this is meant to address. The problem is that if the time to complete/# operations to complete is highly variable, choosing the "right" value is difficult.

Thanks for your answer :)
But OPS? Operations, I guess, and a new acronym for your "BOINC Acronyms" ;)
And how is the max time calculated?

The max time is set by the project. It is a parameter that is part of the WU that is sent to your computer. It is only a maximum number, and is expressed in number of operations. It really does not matter too much in human terms if it is CPU cycles or floating point calculations. The system keeps count of whichever it is, and when it hits the number set by the project, it aborts the WU with a Max time error.

The project sets the value by giving their "best estimate" of the time a WU should run, and then converts the value to a number of operations, or flops, or whatever you want to call them. As a result of having a lot of Max time errors, the project recently raised the Max time setting (bounds) for many WU types, and the Max time errors have almost stopped. I have only seen one in the last week. Before they made the adjustment I was seeing 5 or 6 a day.

Regards
Phil


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
mikus
Joined: 7 Nov 05
Posts: 58
Credit: 700,115
RAC: 0
Message 10607 - Posted: 9 Feb 2006, 23:01:36 UTC - in response to Message 9729.  

In normal running (when projects are run in Round Robin mode) the cache for each project does indeed run first-in-first-out.

However, as soon as that is predicted to make any WU late for a deadline, the machine switches to EDF - earliest deadline first - mode, and then the WU from all projects are put into deadline order. In this mode the short deadline WU from LHC get crunched before long deadline WU that came in earlier.

In particular, if you have a cache of (say) 3 days work and you get a WU with a 2 day deadline, it instantly puts the box into EDF precisely because the new work would otherwise go to the back of the queue.

On Pirates@home they have sent out WU with 6-hour deadlines - these usually run instantly on download as with almost any mix of work they would otherwise be 'late'.

In addition, the server will not send out more short deadline work than your box can crunch in 90% of the allotted time - so a batch of short deadline work gets spread around many boxes even when each box is asking for a lot of work.
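The switch described in the quoted text can be sketched as follows. This is a deliberately simplified model (one CPU, fixed run times, hypothetical class and function names), not the real BOINC scheduler, which also tracks resource shares across projects.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    name: str
    hours_needed: float
    deadline_hours: float  # hours from now


def misses_deadline(queue) -> bool:
    """True if running the queue in the given order makes any WU late."""
    clock = 0.0
    for wu in queue:
        clock += wu.hours_needed
        if clock > wu.deadline_hours:
            return True
    return False


def choose_order(queue):
    """Keep first-in-first-out order unless it would miss a deadline."""
    if misses_deadline(queue):
        return "EDF", sorted(queue, key=lambda wu: wu.deadline_hours)
    return "FIFO", list(queue)


# Assumed cache: two long-deadline WUs, then a late arrival with a 2-day deadline.
cache = [
    WorkUnit("long-A", 30.0, 30 * 24),
    WorkUnit("long-B", 30.0, 30 * 24),
    WorkUnit("short-C", 10.0, 2 * 24),
]

mode, order = choose_order(cache)
print(mode, [wu.name for wu in order])
```

In first-in-first-out order the short-deadline WU would finish after 70 hours against a 48-hour deadline, so the sketch reorders by deadline and runs it first.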


Two comments:

(1) Is this information (and the earlier explanation that EDF mode is entered if the deadline for a WU is more than half the cache size) __available__ to people who want to join the project ? I came to Rosetta from DF; if I had known that Rosetta would start playing "priority games" I would probably not have joined.

(2) I myself RESENT that now with 7-day WUs I would have to set my "cache size" (actually, the interval between connects) to THREE days to avoid EDF mode. I run offline; I have very little liking for a project that now FORCES me to connect every three days. WHY force __participating volunteers__ to "jump" as soon as the project comes up with a "bright idea" ?

To me, a more reasonable way to run a railroad would be to make longer-deadline WUs still available, and to set up the download process such that for users specifying an interval between connects of 7 days or more, the WUs preferentially downloaded would be such that the client is __not__ precipitated into EDF mode.

mikus


--------
p.s. I intend to run only one BOINC project per computer. My computers are normally off-line. The computer on which I run Rosetta has a specified 'time to connect' of eight days. With the new 7-day WUs, that computer is put into EDF mode as soon as the first WU starts downloading. But the only project *is* Rosetta, and *all* Rosetta WUs currently have seven-day deadlines. On my computer EDF mode accomplishes NOTHING (except aggravation).

Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 10608 - Posted: 10 Feb 2006, 1:39:13 UTC - in response to Message 10607.  

Two comments:

(1) Is this information (and the earlier explanation that EDF mode is entered if the deadline for a WU is more than half the cache size) __available__ to people who want to join the project ? I came to Rosetta from DF; if I had known that Rosetta would start playing "priority games" I would probably not have joined.

(2) I myself RESENT that now with 7-day WUs I would have to set my "cache size" (actually, the interval between connects) to THREE days to avoid EDF mode. I run offline; I have very little liking for a project that now FORCES me to connect every three days. WHY force __participating volunteers__ to "jump" as soon as the project comes up with a "bright idea" ?

To me, a more reasonable way to run a railroad would be to make longer-deadline WUs still available, and to set up the download process such that for users specifying an interval between connects of 7 days or more, the WUs preferentially downloaded would be such that the client is __not__ precipitated into EDF mode.

mikus


--------
p.s. I intend to run only one BOINC project per computer. My computers are normally off-line. The computer on which I run Rosetta has a specified 'time to connect' of eight days. With the new 7-day WUs, that computer is put into EDF mode as soon as the first WU starts downloading. But the only project *is* Rosetta, and *all* Rosetta WUs currently have seven-day deadlines. On my computer EDF mode accomplishes NOTHING (except aggravation).


The information is available from a range of sources. The easiest one to point out is the WIKI. There is a link to the WIKI from the project home page.

I have no doubt that the situation you describe is occurring on your systems. But it is caused by the way in which you have chosen to operate your machines. BOINC was never designed as a single project system. The whole point of BOINC is to share otherwise unused cycles between a number of projects. While it is possible to run BOINC with intermittent network connections, the system works best with a permanent connection. The choices made by the project as to reporting deadlines are being made because there is a reason for them. This has been discussed in the science forum.

With a project such as this, the project team has to run the project by balancing what is required by the science, against how the majority of the user community is configured. This by definition means that there will be systems and configurations at either end of the spectrum that will have significant inconvenience, or be unable to run the project. The vast majority of the users of R@H have no difficulty at all. They set the system up, adjust their preferences, and the thing just runs. On occasion, a WU will fail, or some other transient problem occurs, but this is just the nature of the beast and they take it in stride.

It seems as though for your setup a project like Einstein, or predictor may be more appropriate. But considering that there are somewhere around 36,000 users and most of them are not having significant difficulty, it is unlikely that a lot of changes would be made to accommodate 50 or even a few hundred users who wish to operate their systems near the outer edges of the normal BOINC environment.

Your problem could be solved by just setting the connection interval to 6 days, letting the system connect when it needs to connect, and letting the system run itself. If you have only the one modem, then perhaps a $50 LinkSys switch would allow you to network your systems so they could share the connection. The fact that the system enters EDF mode when it is only running one project has no impact whatsoever in practical terms. If none of this will work for you, then I think you have decided that R@H won't work for you.

Regards
Phil


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
ecafkid
Joined: 5 Oct 05
Posts: 40
Credit: 15,177,319
RAC: 0
Message 10609 - Posted: 10 Feb 2006, 1:41:56 UTC

Not to burst your bubble, but I just downloaded some WUs with 30-day deadlines. So not all WUs have the 7-day deadline. And yes, they are some of the earlier WUs, but because of the 30-day turnaround it took them that long to cycle back through.



nasher
Joined: 5 Nov 05
Posts: 98
Credit: 618,288
RAC: 0
Message 10704 - Posted: 12 Feb 2006, 23:31:30 UTC

Well, since I am currently deployed and can't check on my computers besides looking at the results sent by all projects, they seem to me to be running fine.

As for the short 7-day work units: in another post they told us that there were some jobs they needed results for quicker than normal, and that they would be going back to the longer times later.

I understand that from time to time there is a reason that a project may need to get a certain job or set of WUs done quickly. For instance, when Pirates put out its last set of jobs, they ran about 5-50 minutes and had a 6-hour deadline since they needed answers now. Of course, people running Pirates knew this.

I think a 7-day turnaround for some WUs sometimes is reasonable, but I hope it doesn't become the norm. If it is now and again, I have no problem with it, since it will crunch my jobs as I desire without me adjusting schedules.

Oh, another thing: if you have it set to get 8 days of work, of course it's going to grab a lot of Rosetta or whatever you ask it to connect to. Then, since the Rosetta WUs you grabbed were of the 7-day variety, yes, it went to EDF timing. Personally I have myself set up for 0.1-0.5 days depending on the machine and what it's used for. This tends to keep at least 2-3 jobs of one project or another and not put me into EDF that often.
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 10714 - Posted: 13 Feb 2006, 6:25:25 UTC
Last modified: 13 Feb 2006, 6:28:40 UTC

EDF (earliest deadline first) isn't an error. It's just an explanation of what the scheduler is doing. It still keeps track of all project resource shares by the amount of time spent, and balances the work done long term by this number. EDF just means you'll do a bunch of one, then switch to another, then switch again. All the time it will balance out the resource share you've requested. In EDF it just doesn't happen "intraday", but can take days/weeks to balance. There is nothing wrong with running in EDF mode.

Amongst us BOINC alpha testers and developers, we are currently discussing name changes to get people to understand that "overcommitted" and "EDF" aren't bad. Thus far we haven't decided on the proper terminology, or methodology to achieve this.


Los Alcoholicos~La Muis
Joined: 4 Nov 05
Posts: 34
Credit: 1,041,724
RAC: 0
Message 10732 - Posted: 13 Feb 2006, 14:52:26 UTC
Last modified: 13 Feb 2006, 14:57:15 UTC

EDF may be a feature, but it doesn't work properly with the variations in WU sizes of R@h.

This computer (only R@h, 24/7) got 18 PRODUCTION_ABINITIO WUs on 01-18 with a deadline of 02-15 (these WUs take over 10 hours to finish). Then R@h started to send out these short-deadline WUs.
Thanks to EDF, this computer did the earliest deadlines first (about 150 WUs) and only started on these big WUs yesterday. It has finished 2 so far, so it will probably finish another 3 WUs in time.
BOINC will then let it work on WUs whose deadline has already passed. And since this computer is located elsewhere, I have to travel 60 miles to abort those WUs to prevent this computer from wasting about 150 hours of CPU time.



©2024 University of Washington
https://www.bakerlab.org