Message boards : Number crunching : Something wrong with Server-Side-Scheduler
Author | Message |
---|---|
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 7 |
"The scheduler USED to do the resource fraction calculation, and clients do not do it. Now that the scheduler no longer seems to be doing the resource fraction calculation, the client code should be updated to do it." dgnuff, I think that's a pretty definitive answer... I wouldn't presume to guess "when", but since the man in charge of doing it just said it should be done, I think it's safe to say it will be! :-) |
dgnuff Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
Here's some food for thought. I took a look at the source, and found the following comment in client/cs_scheduler.c: No, I've not talked to anyone about this. Since I'm a software engineer myself, I tend to regard this as a fairly low-priority bug. It's not preventing anything from running, and due to the long-term behavior of LTD (tautology anyone?) I do get where I want to be eventually. It just surprised me a time or two before I got to the bottom of what's going on. Since there are higher-priority issues on the radar right now, I was not going to make too much noise. It should also be noted that I'm not that familiar with the code; I was assuming that if something is called xyz at one point in the code, it'll be called xyz elsewhere. In particular, if I see PROJECT *p; ... ... ... p->resource_share in one file, I'd expect that searching for "project_share" would be a reasonable way of tracking down the code that is supposed to exist in the second file. However, I've been in the business long enough to know that while it looks reasonable in theory, this should not be assumed to be the case in reality. :) |
Ingleside Joined: 25 Sep 05 Posts: 107 Credit: 1,514,472 RAC: 0 |
"The scheduler USED to do the resource fraction calculation, and clients do not do it. Now that the scheduler no longer seems to be doing the resource fraction calculation, the client code should be updated to do it." v4.19 and earlier clients did the calculation on the client, while later clients do not. For some time the server didn't calculate anything either. There was still a problem where not all work was counted, but that was also fixed... But a couple of days afterwards, after complaints by users on Einstein@home, a "hole" was made where a project completely out of work could get 1 result, even with no hope of returning it by the deadline. This "hole" was largely fixed by the new client scheduling, but it is still possible for the client to be given a result that, based on active_frac and so on, the computer has no hope of finishing, even if it runs only this result. Anyway, on 05.07.2005 yet another "hole" was made: the scheduling server stopped using the resource-share fraction when deciding how much work to send. Something did need to change, since otherwise a computer connected to many projects could be too slow to get any work from any project (if it hadn't been for the 1st hole), but this 2nd "hole" means it gets too much work. A quick fix would be for the client to start calculating resource share again when deciding how much work to ask for... But a better fix would be to implement the necessary server changes so the scheduling server uses the detailed "other_results" and "in_progress_results", coupled with active_frac, on_frac, resource_share, benchmark and so on, when deciding whether the host can handle another result or not... Yes, this method is more time-consuming to implement, but it would also make it possible to plug both "holes" again... |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 7 |
"But, a better fix would be to implement the necessary server-changes so scheduling-server uses the detailed "other_results" and "in_progress_results" coupled with active_frac, on_frac, resource_share, benchmark and so on, when deciding if can handle another result or not..." Would that handle projects that are suspended? Example: resource_share on Rosetta, SETI, Einstein, and CPDN all 100, but all of those suspended except Rosetta. Would it get 1/4 of the work or the full load? Does the Rosetta server even have the resource share for the _other_ projects available, to know the "1/4" part? Something tells me it'd be easier on the client... I don't know which is "better". ----- Yeti, you didn't know what you started when you opened this thread... :-) Now we've heard from the client-side scheduler developer, and the server-side developer. The only question now is whether Rom will chime in! |
Ingleside Joined: 25 Sep 05 Posts: 107 Credit: 1,514,472 RAC: 0 |
When asking for work, apart from benchmark, active_frac and so on, you're including:
<work_req_seconds>172.800000</work_req_seconds>
<resource_share_fraction>0.000987</resource_share_fraction>
<rrs_fraction>1.000000</rrs_fraction>
<prrs_fraction>0.500000</prrs_fraction>
<estimated_delay>0.000000</estimated_delay>
You also include the listing of all results on the host, for example:
<ip_result> <report_deadline>1133930821.000000</report_deadline> <cpu_time_remaining>8654.345419</cpu_time_remaining> </ip_result>
<ip_result> <report_deadline>1133820081.000000</report_deadline> <cpu_time_remaining>770004.485646</cpu_time_remaining> </ip_result>
<ip_result> <report_deadline>1132842026.000000</report_deadline> <cpu_time_remaining>29443.455954</cpu_time_remaining> </ip_result>
<ip_result> <report_deadline>1134519655.000000</report_deadline> <cpu_time_remaining>7835.917076</cpu_time_remaining> </ip_result>
Based on all this info, I would expect that the scheduling server, which is the only one that has any idea what type of results are available to be assigned to a computer, can check, for example:
1: Adding one result will not break anything, since even though the "effective" resource share is only 0.2, the other cached work is so far from its deadline that it can wait... but adding 2 results will blow past the deadline.
2: The computer already has 5 days of work from other projects that needs to be crunched within 6 days. Since my shortest result takes 1.1 days of CPU time on this computer and the deadline is 7 days, there is no hope of crunching everything before the deadline, meaning no work.
3: A variation on #2: let's say you've got 600h with a 5-month deadline, and 12h with a 7-day deadline. Let's also say 1 result has a 7-day deadline and on this computer takes 6.6 days of CPU time. On the server side, the scheduling server can easily see there is no way to finish everything before the deadline, meaning no work.
The client-side scheduler can easily find out in example #1 that, based on everything, it needs to ask for 1234 seconds or so of work, and this will give no problem with deadlines. But how would you solve #2? Even if you ask for only 1 second of work, if you get assigned anything, at least one result will not be finished before the deadline. Well, I'm not sure you'd ask for work at all in #2, but at least in #3 there's a good chance you'll ask for work. In both #2 and #3, if you get assigned anything, you're screwed...
I can't claim to have had much to do with the server-side scheduler apart from some small bug-fixing... |
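The server-side check described in case #3 can be sketched as an earliest-deadline-first feasibility test. This is only an illustration under stated assumptions, not BOINC's scheduler code: the names `Result`, `can_meet_deadlines` and `can_accept` are hypothetical, the host is treated as a single CPU, and `availability` stands in for factors such as on_frac and active_frac.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical stand-in for one <ip_result> entry.
    seconds_to_deadline: float   # report_deadline minus "now"
    cpu_time_remaining: float    # estimated CPU seconds still needed


def can_meet_deadlines(results, availability=1.0):
    """Run the results in deadline order on one CPU and check each deadline.

    availability is the assumed fraction of wall-clock time the host
    actually computes (1.0 means the CPU is always crunching).
    """
    elapsed = 0.0
    for r in sorted(results, key=lambda r: r.seconds_to_deadline):
        elapsed += r.cpu_time_remaining / availability
        if elapsed > r.seconds_to_deadline:
            return False
    return True


def can_accept(queued, candidate, availability=1.0):
    """True if the queue is still feasible after adding the candidate."""
    return can_meet_deadlines(list(queued) + [candidate], availability)


HOUR = 3600.0
DAY = 24 * HOUR

# Case #3: 600h due in roughly 5 months, 12h due in 7 days,
# and a candidate result that needs 6.6 days of CPU with a 7-day deadline.
queued = [Result(150 * DAY, 600 * HOUR), Result(7 * DAY, 12 * HOUR)]
candidate = Result(7 * DAY, 6.6 * DAY)
```

With these numbers `can_meet_deadlines(queued)` holds, but `can_accept(queued, candidate)` does not, because 12h plus 6.6 days is more than 7 days. The server is the natural place for this test since only it knows how long the candidate result is.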
nasher Joined: 5 Nov 05 Posts: 98 Credit: 618,288 RAC: 0 |
Well, I think both the client's computer and the server-side computer need to talk to each other to determine the correct amount of work... The client sends "I can produce xx cycles per day for your project", and the server should take that, compare it to its deadlines, and say "by the deadlines and the number of days of work you want, I can send you YY work units". The client's computer knows how many cycles it has available per hour (CPU speed and fraction of computer time for the project), while the server side knows the deadlines of the project and has a guess at how long the work should take. If it's just one or the other guessing what the other can do, you get too many or too few. I don't know about the disk space problems... It's kind of scary realizing the smallest drives you can buy new are 40GB or larger (I know it's been a few months since I looked for a HD, so it may be larger now), but it still shocks me how much room the files today need. |
Yeti Joined: 2 Nov 05 Posts: 45 Credit: 14,945,062 RAC: 0 |
@Ingleside When calculating how much work to fetch or send, it is important to respect which projects are suspended on the client or set to "no new work". I have 5 - 10 projects attached to my clients; several are suspended or switched to "no new work". Normally, I let my clients crunch two main projects and 1 or 2 backup projects. This looks like: Rosetta 48% LHC 48% P@H 1% E@H 1% ... LHC often has no work, and when Rosetta then has no work either, I have often seen this from the project schedulers: Won't finish in time, project gets only 1% of ResourceShare. That may have been good with the old 4.x clients, but with the new client scheduler this should be changed. The client scheduler now keeps its eyes on the deadlines and, as far as I can see, it does this very well! Supporting BOINC, a great concept ! |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
@Ingleside, Don't we also have to take into account the number-of-CPUs problem? What I mean is that you cannot just take the deadlines, add them up, and divide by the number of CPUs. On my dual Xeons with my CPDN work it can look like I have sufficient work if you don't take into account the fact that this one work unit can "fill" only one CPU, regardless of the deadlines ... However, even if I were to "blow" that one deadline due to an overcommit on that work unit alone (not that CPDN really cares about the deadlines as long as you are "trickling"), it is possible that I am undercommitted on all other CPUs ... Hmmm, sounds like the processing needs to create "tracks" of the CPU commitments by Work Unit/Results present .. then look for idle "room". |
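The "tracks" idea can be sketched roughly as follows. This is an assumption-laden illustration, not anything from the BOINC source: `build_tracks` and `idle_seconds` are hypothetical names, each result is a `(seconds_to_deadline, cpu_time_remaining)` pair, and results are placed greedily in deadline order on whichever CPU frees up first, which is a simple heuristic rather than an optimal schedule.

```python
import heapq


def build_tracks(results, ncpus):
    """Place each result on the CPU that becomes free first.

    results: iterable of (seconds_to_deadline, cpu_time_remaining) pairs.
    Returns (tracks, missed): tracks is the sorted list of per-CPU
    busy-until times, missed lists the results that would finish late.
    """
    tracks = [0.0] * ncpus
    heapq.heapify(tracks)
    missed = []
    for deadline, remaining in sorted(results):
        start = heapq.heappop(tracks)      # earliest-free CPU
        finish = start + remaining
        if finish > deadline:
            missed.append((deadline, remaining))
        heapq.heappush(tracks, finish)
    return sorted(tracks), missed


def idle_seconds(tracks, horizon):
    """Total unused CPU time across all tracks within the next `horizon` seconds."""
    return sum(max(0.0, horizon - busy_until) for busy_until in tracks)


DAY = 86400.0

# One long CPDN-style result on a dual-CPU host: summing the work and
# dividing by 2 suggests 50 days per CPU, but one CPU is actually idle.
tracks, missed = build_tracks([(365 * DAY, 100 * DAY)], ncpus=2)
room = idle_seconds(tracks, horizon=10 * DAY)
```

Here `room` comes out as a full 10 days of idle time on the second CPU, even though the naive total-divided-by-CPUs figure would make the host look fully committed.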
dgnuff Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
@Ingleside The solution I'd use for this is that the client needs to make a decision that "Project X won't give me work right now", and so client side it removes Project X from the resource share calculation handed to the scheduler. So in the case you've cited above, assuming LHC and R@H are both 48, but out of work, the remaining 4 projects, which are normally 1%, all jump to 25%. Of course, this heads for the problem already noted where if S@H has downloads that get jammed, and R@H pulls a lot of work, if the S@H D/L then "unjam" you could wind up in a deadline crisis. My solution to that was to allow the client to unilaterally abort a WU if there's no way it can be completed in time. OK, tear that solution apart at your leisure - I'm sure I've missed something in there. :) |
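The renormalization proposed above can be sketched like this. It is a minimal illustration, not the client's actual code: `effective_fractions` is a hypothetical helper, and "Project C" and "Project D" are made-up names standing in for the two unnamed 1% projects.

```python
def effective_fractions(resource_shares, unavailable):
    """Recompute share fractions over projects that can currently supply work.

    resource_shares: dict of project name -> configured resource share.
    unavailable: set of project names the client has decided to skip for now
                 (suspended, "no new work", or currently out of work).
    """
    active = {
        name: share
        for name, share in resource_shares.items()
        if name not in unavailable and share > 0
    }
    total = sum(active.values())
    if total == 0:
        return {}
    return {name: share / total for name, share in active.items()}


shares = {
    "Rosetta": 48, "LHC": 48,
    "P@H": 1, "E@H": 1,
    "Project C": 1, "Project D": 1,   # hypothetical names
}

# LHC and Rosetta are both out of work, so the four 1% projects split the host.
fractions = effective_fractions(shares, unavailable={"Rosetta", "LHC"})
```

Each of the four remaining projects ends up with a fraction of 0.25, matching the "all jump to 25%" case. The same function also covers the earlier suspended-project example: four projects at 100 with three suspended leaves the one active project with a fraction of 1.0.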
©2024 University of Washington
https://www.bakerlab.org