Something wrong with Server-Side-Scheduler

Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 13
Message 6281 - Posted: 15 Dec 2005, 2:04:15 UTC - in response to Message 6276.  

The scheduler USED to do the resource fraction calculation, and clients do not do it. Now that the scheduler no longer seems to be doing the resource fraction calculation, the client code should be updated to do it.


dgnuff, I think that's a pretty definitive answer... I wouldn't presume to guess "when", but since the man in charge of doing it just said it should be done, I think it's safe to say it will be! :-)

dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 6286 - Posted: 15 Dec 2005, 3:04:07 UTC - in response to Message 6202.  

Here's some food for thought. I took a look at the source, and found the following comment in client/cs_scheduler.c:

// determine work requests for each project
// NOTE: don't need to divide by active_frac etc.;
// the scheduler does that (see sched/sched_send.C)

Aha, I think, let's take a look in sched/sched_send.C

Nowhere in that file can I find anything that references a project's individual resource share. Could this be a bug?


Have you talked with JMVII? If he hasn't been by here, that's something he should see, and something only he could answer... if you can't contact him and he doesn't comment here, I'll email him, but I imagine he's pretty busy.


No, I've not talked to anyone about this. Since I'm a software engineer myself, I tend to regard this as a fairly low-priority bug. It's not preventing anything from running, and due to the long-term behavior of LTD (tautology anyone?) I do get where I want to be eventually. It just surprised me a time or two before I got to the bottom of what's going on.

Since there are higher-priority issues on the radar right now, I wasn't going to make too much noise.

It should also be noted that I'm not that familiar with the code; I was assuming that if something is called xyz at one point in the code, it'll be called xyz elsewhere. In particular, if I see PROJECT *p; ... ... ... p->resource_share in one file, I'd expect that searching for "resource_share" would be a reasonable way of tracking down the code that is supposed to exist in the second file.

However, I've been in the business long enough to know that while it looks reasonable in theory, this should not be assumed to be the case in reality. :)
Ingleside
Joined: 25 Sep 05
Posts: 107
Credit: 1,514,472
RAC: 0
Message 6389 - Posted: 16 Dec 2005, 1:36:15 UTC - in response to Message 6276.  
Last modified: 16 Dec 2005, 1:38:03 UTC

The scheduler USED to do the resource fraction calculation, and clients do not do it. Now that the scheduler no longer seems to be doing the resource fraction calculation, the client code should be updated to do it.



v4.19 and earlier clients did the calculations on the client, while later clients do not. For some time the server didn't do these calculations either.

There was still a problem where not all work was counted, but this was also fixed...
But a couple of days afterwards, after complaints by users on Einstein@home, a "hole" was opened whereby a project completely out of work could get 1 result, even with no hope of returning it by the deadline.

This "hole" was largely fixed by the new client scheduling, but there is still a possibility the client can be given a result that, based on active_frac and so on, the computer has no hope of finishing, even if it runs only this result.

Anyway, on 05.07.2005 yet another "hole" was made: the scheduling server stopped using the resource-share fraction when deciding how much work to send.

Now, something needed to change, since otherwise a computer connected to many projects could be too slow to get any work from any project (if it hadn't been for the 1st hole), but this 2nd "hole" means it gets too much work.

A quick fix for this would be for the client to again start calculating the resource share when deciding how much work to ask for...
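A rough sketch of what that quick fix might look like (hypothetical names, not the actual BOINC client code; the idea is just to scale the request by the project's share of the total resource share and by the host's availability):

// Hypothetical client-side "quick fix", for illustration only.
double work_request_seconds(double queue_seconds,   // seconds of buffered work the user wants
                            double on_frac,         // fraction of time BOINC is running
                            double active_frac,     // fraction of that time work is allowed to run
                            double project_share,   // this project's resource share
                            double total_share)     // sum of resource shares over all attached projects
{
    if (total_share <= 0.0) return 0.0;
    double share_fraction = project_share / total_share;
    return queue_seconds * on_frac * active_frac * share_fraction;
}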

But a better fix would be to implement the necessary server changes so the scheduling server uses the detailed "other_results" and "in_progress_results", coupled with active_frac, on_frac, resource_share, benchmarks and so on, when deciding whether it can handle another result or not...

Yes, this method is more time-consuming to implement, but it will also mean it's possible to plug both "holes" again...

Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 13
Message 6390 - Posted: 16 Dec 2005, 1:52:18 UTC
Last modified: 16 Dec 2005, 2:00:09 UTC

But a better fix would be to implement the necessary server changes so the scheduling server uses the detailed "other_results" and "in_progress_results", coupled with active_frac, on_frac, resource_share, benchmarks and so on, when deciding whether it can handle another result or not...


Would that handle projects that are suspended? Example: resource_share on Rosetta, SETI, Einstein, and CPDN all 100, but all of them suspended except Rosetta. Would it get 1/4 of the work or the full load? Does the Rosetta server even have the resource shares for the _other_ projects available, to know the "1/4" part?

Something tells me it'd be easier on the client... don't know which is "better".

-----

Yeti, you didn't know what you started when you opened this thread... :-)

Now we've heard from the client-side scheduler developer, and the server-side developer. The only question now is if Rom will chime in!

Ingleside
Joined: 25 Sep 05
Posts: 107
Credit: 1,514,472
RAC: 0
Message 6399 - Posted: 16 Dec 2005, 4:04:00 UTC - in response to Message 6390.  


Would that handle projects that are suspended? Example: resource_share on Rosetta, SETI, Einstein, and CPDN all 100, but all of those suspended except Rosetta. Would it get 1/4 the work or the full load? Does the Rosetta server even have the resource share for the _other_ projects available, to know the "1/4" part?

Something tells me it'd be easier on the client... don't know which is "better".



When asking for work, apart from the benchmarks, active_frac and so on, you're including:

<work_req_seconds>172.800000</work_req_seconds>
<resource_share_fraction>0.000987</resource_share_fraction>
<rrs_fraction>1.000000</rrs_fraction>
<prrs_fraction>0.500000</prrs_fraction>
<estimated_delay>0.000000</estimated_delay>

You also include the listing of all results on the host, for example:

<ip_result>
<report_deadline>1133930821.000000</report_deadline>
<cpu_time_remaining>8654.345419</cpu_time_remaining>
</ip_result>
<ip_result>
<report_deadline>1133820081.000000</report_deadline>
<cpu_time_remaining>770004.485646</cpu_time_remaining>
</ip_result>
<ip_result>
<report_deadline>1132842026.000000</report_deadline>
<cpu_time_remaining>29443.455954</cpu_time_remaining>
</ip_result>
<ip_result>
<report_deadline>1134519655.000000</report_deadline>
<cpu_time_remaining>7835.917076</cpu_time_remaining>
</ip_result>


Based on all this info, I would expect the scheduling server, which is the only one that has any idea of what type of results is available to be assigned to a computer, could check, for example:

1: Adding one result will not break anything, since even though the "effective" resource share is only 0.2, the other cached work has so long until its deadlines that it can wait... but adding 2 results will blow past the deadline.
2: The computer already has 5 days of work in other projects that needs to be crunched within 6 days; since my shortest result takes 1.1 days of CPU time on this computer and the deadline is 7 days, there is no hope of crunching everything before the deadline, meaning no work.
3: A variation on #2: let's say you've got 600 h of work with a 5-month deadline, and 12 h with a 7-day deadline. Let's also say one of my results has a 7-day deadline and takes 6.6 days of CPU time on this computer. On the server side, the scheduling server can easily see there is no way to finish everything before the deadlines, meaning no work.


The client-side scheduler can easily find out in example #1 that, based on everything, it needs to ask for 1234 seconds or so of work, and this will give no problem with deadlines.

But how would you solve #2? Even if you ask for only 1 second of work, if you get assigned anything, at least one result will not be finished before the deadline.

Well, I'm not sure you'll ask for work at all in #2, but at least in #3 there's a good chance you will. In both #2 and #3, if you get assigned anything, you're screwed...
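For illustration, a rough sketch of the kind of check #2 and #3 describe (hypothetical names, not the actual sched_send.C code; it only uses the <ip_result> fields shown above plus on_frac/active_frac, and assumes a single CPU):

#include <algorithm>
#include <vector>

struct IpResult {
    double report_deadline;      // Unix time the result is due
    double cpu_time_remaining;   // CPU seconds still needed
};

// Earliest-deadline-first simulation: if EDF can't meet every deadline,
// no other order can either (single-CPU simplification).
bool can_finish_everything(std::vector<IpResult> results,
                           double now,          // current Unix time
                           double on_frac,      // fraction of time BOINC runs
                           double active_frac)  // fraction of that time work runs
{
    double speed = on_frac * active_frac;       // CPU seconds delivered per wall-clock second
    if (speed <= 0.0) return false;

    std::sort(results.begin(), results.end(),
              [](const IpResult& a, const IpResult& b) {
                  return a.report_deadline < b.report_deadline;
              });

    double finish = now;
    for (const IpResult& r : results) {
        finish += r.cpu_time_remaining / speed;  // wall time to crunch this result
        if (finish > r.report_deadline) return false;
    }
    return true;
}

The server would run this once with the results already on the host and once with the candidate result appended, and send work only if the second run still says yes. In #2 and #3 above the second run fails, so the host gets no work instead of getting screwed.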



Now we've heard from the client-side scheduler developer, and the server-side developer. The only question now is if Rom will chime in!


I can't claim to have had much to do with the server-side scheduler apart from some small bug fixing...
nasher
Joined: 5 Nov 05
Posts: 98
Credit: 618,288
RAC: 0
Message 6404 - Posted: 16 Dec 2005, 6:38:08 UTC

Well, I think both the client's computer and the server-side computer need to talk to each other to determine the correct amount of work...

The client sends "I can produce XX cycles per day for your project", and the server should take that, compare it to its deadlines, and say "by the deadlines and the number of days of work you want, I can send you YY work units". The client's computer knows how many cycles it has available per hour (CPU speed and fraction of computer time for the project), while the server side knows the deadlines of the project and has a guess at how long its work units should take.

If it's just one or the other guessing what the other can do, you get too many or too few.

I don't know about the disk space problems... it's kind of scary realizing that the smallest drives out there that you can buy new are 40 GB or larger (I know it's been a few months since I looked for a HD, so it may be larger now), but it still shocks me how much room the files today need.
Yeti
Joined: 2 Nov 05
Posts: 45
Credit: 14,945,062
RAC: 0
Message 6423 - Posted: 16 Dec 2005, 10:50:08 UTC

@Ingleside

When calculating how much work to fetch or send, it is important to respect which projects are suspended on the client or have the "no new work" switch set.

I have 5-10 projects attached to my clients; several are suspended or switched to "no new work". Normally, I let my clients crunch two main projects and 1 or 2 backup projects.

This looks like:

Rosetta 48%
LHC 48%
P@H 1%
E@H 1%
...

So, LHC often has no work, and when Rosetta then has no work either, I often saw from the project schedulers: won't finish in time, since the project gets only 1% of the resource share.

That may have been fine with the old 4.x clients, but with the new client scheduler this should be changed. The client scheduler now keeps its eyes on the deadlines and, as far as I could see, it does this very well!



Supporting BOINC, a great concept !
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 6430 - Posted: 16 Dec 2005, 12:59:54 UTC

@Ingleside,

Don't we also have to take into account the number-of-CPUs problem?

What I mean is that you cannot just take the deadlines, add them up, and divide by the number of CPUs. On my dual Xeons with my CPDN work, it can look like I have sufficient work if you don't take into account the fact that this one work unit can "fill" only one CPU, regardless of the deadlines...

However, even if I were to "blow" that one deadline due to an overcommit on that work unit alone (not that CPDN really cares about the deadlines as long as you are "trickling"), it is possible that I am undercommitted on all the other CPUs...

Hmmm, sounds like the processing needs to create "tracks" of the CPU commitments from the work units/results present... then look for idle "room".
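A rough sketch of that "tracks" idea (hypothetical names, not real BOINC code; it assumes each result is committed to exactly one CPU and ignores deadlines, which would still have to be checked per track):

#include <algorithm>
#include <vector>

struct Result { double cpu_time_remaining; };   // CPU seconds still needed

// Greedily put each result on the least-loaded "track" (CPU) and report
// how many CPU seconds of idle room remain on each track within the
// given horizon.  A single huge CPDN result fills only one track, so
// the other tracks can still show plenty of room.
std::vector<double> idle_room(const std::vector<Result>& results,
                              int ncpus, double horizon_seconds)
{
    std::vector<double> committed(ncpus, 0.0);
    for (const Result& r : results) {
        int best = 0;
        for (int i = 1; i < ncpus; i++)
            if (committed[i] < committed[best]) best = i;
        committed[best] += r.cpu_time_remaining;
    }
    std::vector<double> room(ncpus);
    for (int i = 0; i < ncpus; i++)
        room[i] = std::max(0.0, horizon_seconds - committed[i]);
    return room;
}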
dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 6486 - Posted: 17 Dec 2005, 0:12:46 UTC - in response to Message 6423.  

@Ingleside

When calculating how much work to fetch or send, it is important to respect which projects are suspended on the client or have the "no new work" switch set.

I have 5-10 projects attached to my clients; several are suspended or switched to "no new work". Normally, I let my clients crunch two main projects and 1 or 2 backup projects.

This looks like:

Rosetta 48%
LHC 48%
P@H 1%
E@H 1%
...

So, LHC often has no work, and when Rosetta then has no work either, I often saw from the project schedulers: won't finish in time, since the project gets only 1% of the resource share.

That may have been fine with the old 4.x clients, but with the new client scheduler this should be changed. The client scheduler now keeps its eyes on the deadlines and, as far as I could see, it does this very well!



The solution I'd use for this is that the client needs to make a decision that "Project X won't give me work right now", and so client side it removes Project X from the resource share calculation handed to the scheduler.

So in the case you've cited above, assuming LHC and R@H are both 48, but out of work, the remaining 4 projects, which are normally 1%, all jump to 25%.
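A rough sketch of that renormalization (hypothetical names, not the actual client code; "out of work" here just means whatever signal the client uses to decide a project won't give it work right now):

#include <vector>

struct Project {
    double resource_share;
    bool suspended;
    bool no_new_work;
    bool out_of_work;   // e.g. recent scheduler replies had no work
};

// Share fraction handed to the scheduler, counting only projects that
// can actually supply work.  With the numbers above (LHC and R@H both
// excluded), each remaining 1% project comes out at 0.25.
double effective_share_fraction(const std::vector<Project>& projects,
                                const Project& p)
{
    double total = 0.0;
    for (const Project& q : projects)
        if (!q.suspended && !q.no_new_work && !q.out_of_work)
            total += q.resource_share;
    if (total <= 0.0) return 0.0;
    return p.resource_share / total;
}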

Of course, this runs into the problem already noted: if S@H has downloads that get jammed and R@H pulls a lot of work, then if the S@H downloads "unjam" you could wind up in a deadline crisis.

My solution to that was to allow the client to unilaterally abort a WU if there's no way it can be completed in time.

OK, tear that solution apart at your leisure - I'm sure I've missed something in there. :)