BOINC - known issues - so why use it?

Message boards : Number crunching : BOINC - known issues - so why use it?

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7594 - Posted: 25 Dec 2005, 11:56:03 UTC

Hi,

People coming to a BOINC project for the first time often find a number of things about BOINC very irritating. If they come in with past experience of a non-BOINC distributed computing project, they immediately say "it was much better on Zetagrid / CPDN classic / etc etc". I know, that's exactly what I said 11 months ago when I arrived at Einstein, my first BOINC project after (you guessed already) Zetagrid and CPDN classic.

So this thread documents the known issues with BOINC, and also tries to answer the obvious next question: 'so why do we use BOINC then?'

I've only listed the irritants that irritate me - feel free to add your own. I've only listed the advantages that I like - feel free to add your own.

I personally have come to feel BOINC's benefits outweigh its irritants. If you feel differently, it's OK to say so -- but please criticise the software, not the people who wrote it or chose it.

There are many fine DC projects out there that do not use BOINC. There are several sites that contain whole lists of DC projects, see this list of lists for example.

I won't slag off any project team that decides to build their own result handler, and I won't slag you off if after thinking about the issues you feel you will be happier donating your spare cycles to one of those projects because you don't like BOINC.

Likewise I rely on *your* good will not to slag off those of us who decide that, on balance, we like BOINC, warts and all.

My next post in this thread lists a couple of the 'warts'. In the posting after that I will try to say what (for me) tips the balance the other way.

FWIW this thread is my Christmas present to Rosetta.

Happy new year to all crunchers, both within and without BOINC.

River~~

Keck_Komputers
Joined: 17 Sep 05
Posts: 211
Credit: 4,246,150
RAC: 0
Message 7597 - Posted: 25 Dec 2005, 12:39:18 UTC

My number one plus for BOINC is its multi-project, project-independent nature. One interface to learn for several projects, and no need to babysit it to make sure the different projects all get their fair shares. That puts BOINC far ahead of any other distributed computing endeavour I have tried.
BOINC WIKI

BOINCing since 2002/12/8

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7599 - Posted: 25 Dec 2005, 12:43:25 UTC
Last modified: 25 Dec 2005, 13:28:19 UTC

Known issues. Most annoying (to me personally) listed first.

1. The ghost WU problem.

Sometimes a BOINC server thinks it has issued one or more results to a client, but the client doesn't think it got them. These 'ghost' results count against that CPU's quota for the day, and when they don't get returned (how could they be?) the quota for future days is reduced as well.

This effect is mitigated by the fact that whenever you return valid results your quota doubles, up to the limit set by the project. So it does not take many valid WUs to get your quota back to the max (there is a tiny worked example at the end of this list).


2. Counter-intuitive scheduling decisions

If you have more than one project, BOINC schedules how these projects interleave on your computer. Some of its decisions are counter-intuitive, and at least one is plain wrong (how it schedules long jobs on a multi-CPU box).


3. Separate upload & update

A source of much confusion to newcomers. Data is uploaded at the end of the run, but it does not 'count' until a separate 'update' step, also known as the 'report' step. It confuses newcomers, and still irritates old hands, when work that has been successfully uploaded gets lost because the report is delayed.

Moreover, this effect is all the more irritating because BOINC deliberately delays the report step in most cases. This effect is mitigated by the fact that the delay is linked to the cache size (the connect every xx days setting in the general prefs) which has a small default size.


4. Counterintuitive user interface.

Counter-intuitive for some users, that is. The buttons for suspend/resume, allow/no more work, etc toggle. The label shows the effect of clicking the button (intuitive for some users) and thus shows the *opposite* of the current setting (counter-intuitive to other users). A better interface would have a pair of radio buttons for each, or a single tick box where the presence or absence of a tick made the setting obvious.


5. Screen real-estate of GUI.

The GUI uses a lot of screen, and large parts of it cannot be reduced in size - the column of buttons, for example. (Why can't these be replaced with right-click context menus?)

This is especially a problem for people using monitors at low resolution (either old monitors, or people who *need* low resolution for eyesight reasons).


6. Limited provision for having diverse settings for users with many boxes

You can make settings for home, school, work. There is a kludge to get a fourth setting (general settings distinct from any of the above), but if you have a fifth flavour of box then you are stuck. My own preference would be to have a 'local' setting, where the settings were set on the machine itself as well as the web-derived settings.


7. Inability to schedule for numbers of cpus

The number of CPUs is changeable in the preferences, but can't be varied by time of day or by machine in use. For example, we can tell BOINC not to use the machine at all while it is in use for something else, and set how long BOINC waits before it comes back. We can't say 'keep one CPU free for five minutes after the machine has been in use', which would be a very useful option where most usage is by single-threaded tasks like Word. Nor can we say 'use all CPUs overnight, but only 1 CPU during working hours'.


8. Copy-protected password fields

In several operating systems, the dialog boxes used to log in or attach refuse to allow you to copy & paste. This is especially annoying on the 'attach to new project' box, which kindly suggests you copy & paste the account key, but then won't let you. Typing a 32-character hex number with only stars for feedback ain't fun!
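
Going back to issue 1, here is the tiny worked example promised there, showing how quickly the quota recovers. The starting quota of 1 and the daily cap of 100 are made-up numbers, not any particular project's settings; the doubling is as described above.

[code]
quota, cap = 1, 100       # assumed: quota knocked down to 1, project cap of 100 results/day
valid_returned = 0
while quota < cap:
    valid_returned += 1
    quota = min(quota * 2, cap)   # each valid result doubles the quota, up to the cap
print(valid_returned, "valid results to get back to the cap")   # prints 7
[/code]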


OK, so it ain't perfect. Why do I put up with it? See my next post, coming soon to a thread near you... R~~

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7605 - Posted: 25 Dec 2005, 13:55:45 UTC
Last modified: 25 Dec 2005, 14:01:51 UTC

The first reason I used BOINC was pragmatic. I wanted to be part of Einstein (my work background includes teaching a general relativity course, so I had a professional interest in it). Einstein used BOINC, so I had to.

At the time, early Jan 2005, this was under sufferance. I was still running CPDN, but the classic version. I even had classic CPDN and Einstein running on the same 2-CPU box, one CPU each, and that worked very well.

Then I started to notice the advantages of BOINC, which are not so obvious as some of the disadvantages. There are eight of them, falling neatly into two groups of four.

1. Project startup.

1a. Project Startup - Programming

A project can download the BOINC package, install it, and have the non-science code all written and working right out of the box. When Pasquale installed the package for Orbit, by the time he got to the message boards someone else had already got the first message in! (Misfit, who else?)

Not only is this over half the work (result handling, credit handling, top participants, forums, profiles, user of the day, etc etc), it is the very part of the code that the project scientists are least likely to get right first time. We can trust David Baker on the bio stuff, we can trust Pasquale on the celestial mechanics, but would they get a bulletin board package right first time? If they chose one, would it integrate well with the project? (It is nice that the same login works for both the forums and the preferences, for example.)

Orbit is a one-person show. Six months later he still doesn't have the science apps complete - not because of the apps but because of time spent doing non-science stuff like grant applications, approaching potential data sources, etc etc. If he had had to write the equivalent of BOINC he would not be predicting a beta in Feb 2006 but sometime in 2007, I reckon.

Rosetta: how much extra time (or how many extra programmers?) would David B have needed to get where he is now without BOINC code?


1b. Project startup advice

Invisible to users, but important to them because it quietly prevents many things going wrong. This is a bad time for Rosetta just now, but things would be even worse at times on a project with no past experience to draw on.

Rosetta aims to be able to cope with as many users as SETI. An ambitious recruiting goal - but if they achieve it, Rosetta will avoid the limitations that have hit SETI recently with their huge influx of users to BOINC. The only thing better than learning from your mistakes is to learn from other people's!


1c. Project startup participants

When Orbit called for alpha testers with no code and only the website as it came out of the box from BOINC, he got 1000 people through the door before he knew what was happening, and had to close the door. Word-of-forum within BOINC. In his case, very useful in political terms. Pasquale has used the existence of his large user base to convince data-providers to give him data - even before he had crunched a single byte.


1d. Project startup funding.

No, BOINC does not fund projects. But its existence encourages funders to believe that a DC project is feasible. Many projects would not exist without BOINC because the funding would not be there. Funders like the idea of adding value to their funds by re-using existing BOINC code.

<aside>
Interestingly, not even SETI would now exist without BOINC. They were refused funding for a continuation of SETI some years ago, and came up with BOINC as an idea they could sell to the funders - reusable code, added value, etc etc. Of course they needed a 'worked example' for how the science part would interface, so guess what area of science they chose? SETI. For years classic SETI has continued unfunded, drawing on the 'spare time' of the people whose officially funded work was BOINC.

I've thanked SETI before, and will do so again, for their foresight and generosity in creating BOINC out of the ashes of SETI@home when the funding was pulled. They deserve that thanks, and if they had another motive as well, like survival, well, they deserve their chance to do their science after they created BOINC for the rest of us. And I write that as someone who has never been attracted by their science.
</aside>


2. Multi-project handling

2a. Multi-project user interface

Stated very well by John Keck, who got into my thread while I was typing 1 above. Merry Christmas John :-)

I'd add that this also applies to the forums - it is good to know the same BBcode works on each forum, and to find the same basic structure: a cafe for the off-topic stuff that actually creates human bonds (well, sometimes), plus science and number crunching boards, are the basics of all the project forums, so as soon as I got here I knew where to put which postings.

Also noteworthy is how easy it is to attach to another BOINC project once I've got one working. If I know the website of the project, that is all I need - the GUI does the rest. Neat.

2b. Multi-project scheduling.

BOINC interleaves as many projects as you care to throw at it. Sometimes it does it well, sometimes not. But compared to anything else out there, the fact that it does it 'at all' gives it a huge advantage. Best of all is the cover for projects (like LHC at present) that have work only intermittently. BOINC automatically fills in the gaps by doing more work on the other projects, and whenever LHC comes back BOINC gives it the lion's share till it catches up again. All without effort from me once I have set the resource shares.

2c. Multi-project third party add ons

By creating a uniform interface, and making that interface and much of the code open source, and the data files XML, BOINC has deliberately encouraged unofficial add ons to be created that then apply across all BOINC projects.

Some of these are web-based info (BOINCstats, MundayWeb, etc), some are additional utilities (eg BOINCview) and some are team based (the BOINC synergy website/stats for example).

2d. Multi-box control

The official BOINC client download can control other boxes. You can control a Win2k box from a WinXP one, control Linux from Windows, and even control a Windows client from a Unix command line. And if that's not enough, the third-party BOINCview lets you control all your boxes from one screen.


So, it's 8 issues and 8 advantages - does that leave it evenly balanced? For me, no. Any one of those 8 advantages [i]taken by itself[/i] outweighs, for me, the irritations. Yes, I'll go on asking JM7 et al to remove the irritations, but in order to make a good package better, not to make a bad one good.

BOINC will never be as good for a single project as a tailor made results handler. That is the nature of one-size fits all. There is only so much JM7 can do. Projects that need that perfect fit will look at BOINC and say "no thanks". Users that want that perfect fit will go with those projects.

Myself, I prefer the combination of diverse science and uniform infrastructure.
I am a critical but unconditional supporter of BOINC.

River~~

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 7619 - Posted: 25 Dec 2005, 17:12:43 UTC

I will try to capture this in the Wiki when it settles a bit and we have more contributions.

For me, there is only one problem that I find intolerable...

When CPUs are scheduled due to an event, such as the ending of a work unit, ALL the CPUs are rescheduled. With SETI@Home work units currently taking less than an hour on my systems, that means the 'switch every' setting IS NOT HONORED.

Why is this an issue? Because each time the work unit is started and stopped there is overhead. AND, it causes me to have lots of partially completed work lying around. Sometimes WITH ONLY SECONDS OR MINUTES TO GO TILL COMPLETE.

That is right. A result suspended with less than 5 minutes to go ... so, we switch, load up this work unit, and 5 minutes later do this all over again.

If Santa was nice he would bring me a "per-project" setting that I could use to say, "Run work to completion before switching" ... for all projects but CPDN I would set this to yes ... THEN THE DECISION TO SWITCH WOULD BE MADE ON A RATIONAL BASIS.

With a 25% resource share on a 4-processor system, one CPU should be doing that project at all times ...

Anyway, I feel much better now ...

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7628 - Posted: 25 Dec 2005, 20:41:13 UTC - in response to Message 7619.  
Last modified: 25 Dec 2005, 21:20:28 UTC

I will try to capture this in the Wiki when it settles a bit and we have more contributions.

I thought you might, hoped you would

For me, there is only one problem that I find intolerable...

When CPUs are scheduled due to an event, such as the ending of a work unit, ALL the CPUs are rescheduled. With SETI@Home work units currently taking less than an hour on my systems, that means the switch every setting IS NOT HONORED.


yes, on reflection I think the only time an all-CPU reschedule is appropriate is when the client goes from non-EDF into EDF mode.

There is another disadvantage of this reschedule-everything behaviour - it is particularly unfortunate on boxes that are switched off every night and that run low-checkpoint projects and/or projects whose apps do not survive a re-boot. This behaviour makes it more likely that a WU will be caught part-way through when the box next powers down.

River~~

Santa Claus
Joined: 25 Dec 05
Posts: 2
Credit: 0
RAC: 0
Message 7630 - Posted: 25 Dec 2005, 20:53:37 UTC - in response to Message 7619.  

[enters whistling 'Rudolph the Red nosed reindeer']

ahh, what hear I from the Buck household


If Santa was nice he would bring me a "per-project" setting that I could use to say, "Run work to completion before switching" ... for all projects but CPDN I would set this to yes ... THEN THE DECISION TO SWITCH WOULD BE MADE ON A RATIONAL BASIS.


well Paul, you have been fairly good, so you can have your wish on single-CPU boxes right now. But to get it on the multi-CPU boxes you will have to be good a little longer, especially to that clever JM7.

Set the interval between swaps to be longer than the longest WU that you'd like to be uninterrupted. Then rescheduling will be forced by end-of-wu rather than the clock

[exits whistling Jingle Bells]

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7631 - Posted: 25 Dec 2005, 21:11:19 UTC - in response to Message 7630.  
Last modified: 25 Dec 2005, 21:12:03 UTC

...Set the interval between swaps to be longer than the longest WU that you'd like to be uninterrupted. Then rescheduling will be forced by end-of-wu rather than the clock

or by the downloading of new work, or any of the other interrupts. But it is a good start.

Download of work should trigger a test for EDF, but in my view should then only trigger a reschedule if EDF is actually entered, not in other cases, as I said in my previous post.

R~~

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 7632 - Posted: 25 Dec 2005, 21:33:30 UTC - in response to Message 7630.  
Last modified: 25 Dec 2005, 21:34:26 UTC

Set the interval between swaps to be longer than the longest WU that you'd like to be uninterrupted. Then rescheduling will be forced by end-of-wu rather than the clock

[exits whistling Jingle Bells]

Well, Jingle Bells or not ...

Sorry, does not work. I have that setting and have had it for a couple months.

When, for example, one of my dual Xeons is running, it is doing 4 work units for whatever project. When any one of those completes, short run time or whatever, ALL the CPUs get rescheduled. Now, on occasion, one of my high % projects will stay on the CPU it was running on. But the most likely outcome for the other 3 CPUs is that they will ALL be set to other work. In many cases, all to whichever project is in most debt.

Once out of synch, and that will happen all the time, you will see this behavior. It is not quite as bad on the single CPUs, even with HT (effectively a dual), so it is not so noticeable. I once had the switch time at 6 HOURS; nope, was seeing a change about once every 20 minutes on average ...

With lots of short work units the situation is worse. For example, SETI@Home is now down to about 30-45 minutes, SDG the same, PG are about 3 hours, Rosetta are 4-7 hours ...

Anyway, the problem is not the time switch, I never make it to that number. Remember, with a 4 hour run time, you will see one work unit done an hour on average. Since the CPU scheduler does not restrict itself to scheduling only the CPU that finished work, you will see turnover as short as 1/4 of the shortest run time of work on the system. Or in my case, as often as every 10 minutes.

Again, this is all dependent on the exact work on the system, debts, and so forth. But the switches happen far too often. RIGHT NOW, I have 18 work units in partial status on my two 4-CPU systems; of those, only 3 should be there, the CPDNs ...

One had 24 seconds to go, 4 more with less than an hour, 11 total with less than 4 hours. The setting is meaningless because it only means 'force a change every x seconds' - BOINC can still get a whim and change sooner, just for the heck of it.

----

Later we can talk about being smarter about scheduling work so as to take the best advantage of the HT capability ... in other words, is it more efficient to run two CPDN, CPDN with SETI, SETI with SETI, or SETI with Einstein? This is a matrix question, and may also be affected indirectly by cache size, but which projects are done most efficiently over time when scheduled to run on the two logical CPUs?

At this point we have opinions but no hard data. Just like we think that the new FLOP counting is going to be a substantial improvement over the current system. But, back in beta we had hopes that mild tinkering with the benchmark would fix it also ...

Anyway, Just a peeve. I suppose if we want hard numbers I can scan the logs and show how bad it is ...

==== edit

My current setting has been 120 minutes ...

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 7649 - Posted: 26 Dec 2005, 7:54:06 UTC

Could not sleep, so ...

I did an analysis of the logs off the two Xeons. The logs cover a non-contiguous period from Oct/Nov to today, and I have not yet tried to do an analysis of the scheduling interval (mostly because I have not figured out how to take the data and resolve it).

But, each of the combined logs is about 70K lines.

For Xeon-64:
1513 Results finished
5878 Starts, restarts, resumes
2759 Unique time stamps

Or, an average of 3.88 starts, restarts, resumes per result.

Breaking down as:
Starts at that time stamp 	 Time stamps 	 Percentage
1 	924 	33.49
2 	924 	33.49
3 	560 	20.30
4 	342 	12.40
5 	3 	0.11
6 	3 	0.11
8 	2 	0.07
9 	1 	0.04


For the Xeon-32:
1508 Results finished
5690 Starts, restarts, resumes
2899 Unique time stamps

Or, an average of 3.77 starts, restarts, resumes per result.

Breaking down as:
Starts at that time stamp 	 Time stamps 	 Percentage
1 	1127 	38.88
2 	989 	34.12
3 	575 	19.83
4 	191 	6.59
5 	7 	0.24
6 	9 	0.31
7 	1 	0.03


Aside from the fact that one is a Xeon-64 and the other is Xeon-32 they are almost identical Dell machines, same model, same memory size, etc. Not that this should affect the issue.
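
If anyone wants to repeat this on their own logs, here is a rough Python sketch of the sort of counting involved - not what I actually ran. The message wording ('Starting result', 'Restarting result', 'Resuming result', 'Computation for result ... finished') is assumed and varies between client versions, so the patterns will probably need adjusting, and the per-result breakdown at the end is a different cut from the per-time-stamp tables above.

[code]
import re
from collections import Counter

# Assumed message wording - adjust to match what your client version writes.
START_RE = re.compile(r"(?:Starting|Restarting|Resuming) result (\S+)")
DONE_RE  = re.compile(r"Computation for result (\S+) finished")

def count_starts(log_path):
    starts = Counter()      # result name -> number of starts/restarts/resumes
    finished = set()        # names of results that completed
    with open(log_path, errors="replace") as log:
        for line in log:
            m = START_RE.search(line)
            if m:
                starts[m.group(1)] += 1
                continue
            m = DONE_RE.search(line)
            if m:
                finished.add(m.group(1))
    done = [r for r in finished if starts[r] > 0]
    total = sum(starts[r] for r in done)
    print(f"{len(done)} results finished, {total} starts/restarts/resumes, "
          f"{total / max(len(done), 1):.2f} per result")
    # how many times each finished result was (re)started
    for n, count in sorted(Counter(starts[r] for r in done).items()):
        print(f"{n}\t{count}\t{100 * count / len(done):.2f}%")

# count_starts("stdoutdae.txt")   # the client's message log file
[/code]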

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 7656 - Posted: 26 Dec 2005, 10:35:51 UTC

Ok, 4 hours later ...

I did figure out how to move the start/restart data into a table with an index field for a record number and the time stamp in the remaining columns.

With that, you can do an aliased join of the table to itself using record number = record number - 1; in other words, you can find the time difference between adjacent rows.
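
In Python the same adjacent-row trick looks roughly like this - a sketch rather than the SQL I actually used; it just sorts the start/restart time stamps and buckets the gaps between neighbours:

[code]
from datetime import timedelta

def reschedule_gaps(timestamps):
    """timestamps: one datetime per start/restart/resume event."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]     # difference between adjacent rows
    zero = sum(1 for g in gaps if g == timedelta(0))
    print(f"{zero} of {len(gaps)} gaps are 00:00:00 (several CPUs rescheduled together)")
    nonzero = [g for g in gaps if g > timedelta(0)]
    labels = ["< 15 min", "15-30 min", "30-45 min", "45-60 min", "1-2 h", "2-6 h"]
    edges  = [timedelta(minutes=m) for m in (15, 30, 45, 60, 120, 360)]
    lower = timedelta(0)
    for label, upper in zip(labels, edges):
        n = sum(1 for g in nonzero if lower <= g < upper)
        print(f"{label}\t{n}\t{100 * n / max(len(nonzero), 1):.2f}%")
        lower = upper
    print(f"> 6 h\t{sum(1 for g in nonzero if g >= timedelta(hours=6))}")
[/code]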

Using that I can extract this data:

   	   	
Gap between reschedules 	 Count 	 Percentage
00:00:00 (multiple restarts) 	2791 	49.06
> 00:00:00 	2898 	50.94

What this says is that about 50% of the time, multiple CPUs are rescheduled at the same instant in time. So, if we ignore the multiple restarts we can look at the distribution of time differences between reschedules.

Leading to this table:

Gap between reschedules 	 Count 	 Percentage
< 00:15:00 	1311 	45.24
>= 00:15:00 but < 00:30:00 	523 	18.05
>= 00:30:00 but < 00:45:00 	329 	11.35
>= 00:45:00 but < 01:00:00 	250 	8.63
>= 01:00:00 but < 02:00:00 	427 	14.73
>= 02:00:00 but < 06:00:00 	57 	1.97
> 06:00:00 (log break) 	1 	0.03


Even if the swap interval were 1 hour, and I know I have been running either 2 or 6 hours for months, it should be obvious that the highest percentage of the time there is swapping at much less than a 60-minute schedule. Oh, and as the CPU count goes up, the numbers WILL get worse. And there ARE people looking at 8-CPU systems as a matter of the next generation of home computers.

I mean, my next one is likely to be either a PowerMac Quad, or a dual, dual core, HT Xeon ... or 8 CPUs ... of course I could elect to only run projects on it that will not force a change every 10 seconds, like CPDN ... Just for amusement's sake, is there anyone out there that can remember things at some date in the future?

I can set my swap schedule to, say, 4 hours today; remind me in a month and I can redo the analysis (though it will only be on one month's data). And if anyone is interested in the raw data/SQL code I can send it to you - gonna be large even as a zip archive, as the logs are about 7M alone ...

Oh, and notice the swap interval is about where I estimated it? ... :)

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7658 - Posted: 26 Dec 2005, 11:13:14 UTC - in response to Message 7632.  

Set the interval between swaps to be longer than the longest WU that you'd like to be uninterrupted. Then rescheduling will be forced by end-of-wu rather than the clock


Sorry, does not work. I have that setting and have had it for a couple months.


Hi Paul. The big guy in the red coat did say that at present the advice only works on a single CPU box.

I think if we can persuade JM7 to avoid the cross-cpu reschedule in all cases where EDF has not just started, then most of your pre-emptions will go away. As you correctly say it is the effect of one cpu on the other that makes BOINC progressively more jittery in proportion to the number of CPUs.

I use a similar setting on the cafe machines, interval = 777 which is longer than a working day. When as normal the boxes are turned off at night, they usually run a Rosetta in the morning and then when that is complete they flick over to CPDN. About once a week CPDN runs all day.

The boxes are hyperthreaded, but the response for the cafe customers is unacceptable if I allow BOINC to have both virtual CPUs, so I have limited BOINC to running on just one CPU. When occasionally I put it up to 2 CPUs again, it generated the kind of chaos you describe.

Given Dr A's reluctance to add more options to the prefs, I think that tidying up the multi-cpu issue is a more likely way forward.

R~~

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7659 - Posted: 26 Dec 2005, 11:24:15 UTC - in response to Message 7632.  
Last modified: 26 Dec 2005, 11:45:39 UTC


Later we can talk about being smarter about scheduling work so as to take the best advantage of the HT capability ... in other words, is it more efficient to run two CPDN, CPDN with SETI, SETI with SETI, or SETI with Einstein. This is a matrix question, and may also be affected indirectly by cache size


by cache size (both BOINC cache and cpu cache) and by RAM. Users would have to check it out for themselves on the particular box.

On Zetagrid we found the bizarre effect that in some cases it ran slower with more RAM on otherwise identical boxes - there are so many complexities and interactions that there is, in my opinion, no sense in trying to calculate for them in advance.

On the cafe boxes, 2.8GHz HT, it is better for CPDN to run it with Einstein, but it is also better for Einstein to run with another Einstein. So if you had a 50-50 split, which way should the autopilot go?

You and I would prefer loads of settings so we could set it up to run exactly right. Dr A's view seems to be that more people would get the setting wrong than right, or would set them once carefully then not check back often enough so the settings would become wrong over time, and therefore the project should have minimal user configuration. Although it is not what I personally want on my machines, I can see the point.

edit - added:

So in short I think the optimal allocation of projects to cpus is unlikely to happen. The lack of it will be another of those irritants we put up with in order to receive the benefits of BOINC.

River~~

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7661 - Posted: 26 Dec 2005, 12:48:52 UTC - in response to Message 7656.  

Oh, and as the CPU count goes up, the numbers WILL get worse. And, there ARE people looking at 8 CPU systems as a matter of the next generation of home computers.


An important point. Equally, as the proportion of HT machines goes up, more people will be affected by all the multi-CPU issues.

I'd say, with the mix of machines around at present, JM7 made the right choice when he got things working first for the single cpu box. On that front, in my opinion, he did a brilliant job (NB - not perfect, brilliant).

Within a year, that will not be looking so good, as the mix of single and dual/virtual cpu boxes changes.

So here are two changes that would improve handling for multi-cpu boxes.

1. Separate rescheduling except where EDF demands an across-the-box rethink. My guess is that this is reasonably easy - main difference would be the need to keep separate timers running for each cpu.

2. Better scheduling in EDF to account for the 'long job' problem. This is a lot more complex and would need some careful testing in alpha (example and initial suggestion below)

Example, Job A has 7 hours to run, deadline 24hours. Job B has 6 hours to run, deadline 24 hours. Job C has 46 hours to run, deadline 48 hours.

On two cpus these can all finish on time if you put the long job in to bat right away, and run the short jobs on the other cpu. The current EDF puts both the short jobs in first as they have shorter deadlines.

Suggestion: Here is one algorithm which is better than the current one because when it fails it falls back onto the current one. Or at least so I hope ;-)


0. Create a set of lists, as many as you have CPUs.
1. Put each currently running job in the list for its current CPU.
2. Sort the 'ready to run' jobs into decreasing order of size.
3. Starting with the biggest, put each job into the CPU list with the smallest slack that can still run the job without overrunning a deadline (overrun might mean getting to 90%, or whatever criterion you like).
4. If you can successfully allocate all jobs without apparently going over a deadline, then on any vacant CPU start the job from its own list that has the nearest deadline, but leave the occupied CPUs running as they are. We have shown the thing works without rescheduling the ongoing jobs, so leave them be.
5. If the first attempt is unsuccessful, refill the lists but omit step 1, now allowing more than one running task to appear in the same list.
6. If step 5 successfully allocates all jobs within deadline, then take the running tasks in order of nearest deadline and allocate the list containing each task to that task's CPU. But if the task is in a list that has already been allocated, mark this task/CPU for pre-emption.
7. After allocating or marking all running tasks, allocate the remaining lists to the remaining CPUs, and start up the job with the nearest deadline from each of those lists.
8. If both attempts to fill the lists predict an overrunning deadline, fall back to the current algorithm. Maybe give a distinctive warning message that even EDF is likely to overrun a deadline, which allows the user to abort jobs manually if desired. Until such time as the user does this, the current EDF is used, as it has the best chance of recovery if one of the early jobs runs short.
9. When each job completes, check if we are still in deadline trouble. If not, resume round robin; if so, do steps 1-8 over. Don't work from the previous lists - don't even keep them.


You will see that step 8 ensures that in no case will this system do worse than the current, providing that no job overruns the estimated run time (or estimate plus whatever margin you build in).

I am not wedded to this particular algorithm - if there is a better one, fine. I am asking for any algorithm that solves the simpler cases of the long jobs problem.

If it also cuts down on task switches (as I try to do in step 6) that is a bonus. The priority is to avoid unnecessary deadline overruns.
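
For anyone who wants to play with the idea, here is a rough Python sketch of steps 0-5 and 8: the per-CPU lists, biggest-job-first filling, and the fall-back to plain EDF. It is illustrative only - the Job fields, the 'fullest CPU that still meets the deadline' reading of step 3, and the simplified feasibility check are my own shorthand, not anything from the BOINC code, and the pre-emption marking of steps 6-7 is left out.

[code]
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    name: str
    remaining_hours: float             # estimated run time still to do
    deadline_hours: float              # hours from now until the deadline
    running_on: Optional[int] = None   # CPU index if currently running

def fits(cpu_load, job):
    # Simplified check: the work already queued on this CPU plus this job
    # must finish before the job's own deadline (no 90% margin, and jobs
    # already on the list are not re-checked).
    return cpu_load + job.remaining_hours <= job.deadline_hours

def fill_lists(jobs, n_cpus, pin_running):
    """Steps 0-3 (step 5 when pin_running is False): one list per CPU,
    biggest pending job placed first, onto the fullest CPU that can still
    meet its deadline.  Returns the lists, or None if some job will not fit."""
    lists = [[] for _ in range(n_cpus)]
    loads = [0.0] * n_cpus
    pending = []
    for job in jobs:
        if pin_running and job.running_on is not None:        # step 1
            lists[job.running_on].append(job)
            loads[job.running_on] += job.remaining_hours
        else:
            pending.append(job)
    for job in sorted(pending, key=lambda j: j.remaining_hours, reverse=True):  # step 2
        ok = [i for i in range(n_cpus) if fits(loads[i], job)]
        if not ok:
            return None
        cpu = max(ok, key=lambda i: loads[i])                  # step 3: least slack left
        lists[cpu].append(job)
        loads[cpu] += job.remaining_hours
    return lists

def schedule(jobs, n_cpus):
    """The job each CPU should run next, falling back to plain EDF across
    the whole box when neither list-filling attempt meets every deadline (step 8)."""
    for pin_running in (True, False):                          # first attempt, then step 5
        lists = fill_lists(jobs, n_cpus, pin_running)
        if lists is None:
            continue
        choice = []
        for cpu, lst in enumerate(lists):
            running = next((j for j in lst if j.running_on == cpu), None)
            if pin_running and running is not None:
                choice.append(running)                         # step 4: leave it be
            elif lst:
                choice.append(min(lst, key=lambda j: j.deadline_hours))
            else:
                choice.append(None)
        return choice
    by_deadline = sorted(jobs, key=lambda j: j.deadline_hours)  # step 8: plain EDF
    return (by_deadline + [None] * n_cpus)[:n_cpus]

# The long-job example above: A and B are short with 24 hour deadlines,
# C needs 46 of the next 48 hours.  Plain EDF would start A and B first;
# this version puts C in to bat on one CPU straight away.
jobs = [Job("A", 7, 24), Job("B", 6, 24), Job("C", 46, 48)]
print([j.name if j else "-" for j in schedule(jobs, 2)])       # ['C', 'A']
[/code]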


River~~

Nothing But Idle Time
Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 7664 - Posted: 26 Dec 2005, 15:38:35 UTC

I don't know where, or if, this fits into the current discussion of scheduling but I'll drop a line anyway; ignore it if you wish.

My computer is Intel P4, 3 GHz, w/HT, Windows Xp, blah, blah.
My queue size is set to 0.01 days or 14.4 minutes. Why?

I discovered that when I run EaH WUs on both logical cpus simultaneously it takes about 13 1/4 hours to complete them both.

If I run 1 EaH WU on 1 cpu, and run only Rosetta on the other cpu, then the EaH WU completes in 11 1/4 hours, 2 hours sooner.

Maybe this has to do with competing resources on my system, or with floating point vs integer, I don't know. But that is why my queue size is 0.01 days and my resource shares are 50/50. The small queue size guarantees that I only have 1 EaH WU and 1 Rosetta WU on board at any given time, one WU for each processor.

Is it possible for the scheduling modules to take under advisement whether a WU/app is predominantly floating point or integer and run the WUs opposite each other to maximize cpu resources and throughput?

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 7669 - Posted: 26 Dec 2005, 19:13:51 UTC

Is it possible for the scheduling modules to take under advisement whether a WU/app is predominantly floating point or integer and run the WUs opposite each other to maximize cpu resources and throughput?

That was my part about what is the most efficient way to process work ...

In theory, in your case, with 50/50 allocation the work should progress with one CPU always running one project and the other the other, assuming that nothing else runs on the system. But, that minor "loss" where other processes "steal" time is what throws these things out of balance.

Again, this is as River~~ says, more tuning parameters that would allow more control over the use of the system. In your case you don't want 50/50 allocation, you want CPU1 to run EAH 100% of available time and RAH on CPU2 ... which may or may not be 50/50 ...

In general, I agree with Dr. Anderson; except ... most people would not care and would not use the settings. Just like there are many that insist on running only one project. Now, CPDN I can see, in that the chances of them being down just when you needed another year's worth of work is pretty slim ... :)

But SETI@Home, for example, is still suffering from an embarrassment of riches. More participants almost than they have work. And for whatever reason they still seem to gain participants faster than any other project ... 1K yesterday, 300 EAH, 200 CPDN, 185 RAH ... Though the new longer-running application is going to relieve this pressure for some time ... well, who knows for how long?

Anyway, those that are more sophisticated, AND can get more done with better control are hampered... ah well ...

Hmmm, I have also hijacked the thread ...

keputnam
Joined: 18 Sep 05
Posts: 24
Credit: 2,084,465
RAC: 0
Message 7689 - Posted: 26 Dec 2005, 23:32:43 UTC

A minor but irritating gripe is that we STILL cannot select to run as a service under the Local System Account from the installer. We have to install to a named user, then stop the boinc service and change the startup.

Also, if doing a version upgrade, why doesn't the installer just update the current install by default, why make us tell it what install option to use?

But I agree - the multi-project pluses far outweigh this.

Lee Carre
Joined: 6 Oct 05
Posts: 96
Credit: 79,331
RAC: 0
Message 8155 - Posted: 2 Jan 2006, 4:22:04 UTC - in response to Message 7689.  
Last modified: 2 Jan 2006, 4:40:32 UTC

A minor but irritating gripe is that we STILL cannot select to run as a service under the Local System Account from the installer. We have to install to a named user, then stop the boinc service and change the startup.

Also, if doing a version upgrade, why doesn't the installer just update the current install by default, why make us tell it what install option to use?

But I agree - the multi-project pluses far out-weight this.

As for the whole service install thing - very annoying. Why isn't the default to use the 'local system account'? That enables graphics, and gets round (or at least should get round) the access/permission problems I've seen for some users.

But the installer definitely needs some improvement. I just want to be able to run it and have all my existing settings honoured - I want to 'update' rather than 'reinstall', as Ken suggested.