Message boards : Number crunching : Something wrong with Server-Side-Scheduler
Author | Message |
---|---|
Yeti Joined: 2 Nov 05 Posts: 45 Credit: 14,945,062 RAC: 1 |
Slightly off the thread, a question for you, Yeti. It is clear that you are able to do work for Predictor. I have had some 3 months of being unable to get through to the server on Predictor. I can't even get through to www.scripps.edu. What's the secret? Is there a new URL or something? As I can't get thru to the website I can't enquire in the Predictor message board. The address should be: http://predictor.scripps.edu/ Hope this helps Supporting BOINC, a great concept ! |
dgnuff Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
rosetta@home 12/12/2005 12:37:03 Requesting 6653 seconds of new work
rosetta@home 12/12/2005 12:37:08 Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
rosetta@home 12/12/2005 12:37:08 No work from project
rosetta@home 12/12/2005 12:41:13 Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
rosetta@home 12/12/2005 12:41:13 Reason: To fetch work
rosetta@home 12/12/2005 12:41:13 Requesting 6425 seconds of new work
rosetta@home 12/12/2005 12:41:18 Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
rosetta@home 12/12/2005 12:41:18 No work from project
rosetta@home 12/12/2005 12:53:43 Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
rosetta@home 12/12/2005 12:53:43 Reason: To fetch work
rosetta@home 12/12/2005 12:53:43 Requesting 12749 seconds of new work
rosetta@home 12/12/2005 12:53:48 Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
--- 12/12/2005 12:53:50 request_reschedule_cpus: files downloaded
rosetta@home 12/12/2005 12:57:53 Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
rosetta@home 12/12/2005 12:57:53 Reason: To fetch work
rosetta@home 12/12/2005 12:57:53 Requesting 101 seconds of new work
rosetta@home 12/12/2005 12:57:58 Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
rosetta@home 12/12/2005 12:57:58 No work from project
rosetta@home 12/12/2005 13:02:03 Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
rosetta@home 12/12/2005 13:02:03 Reason: To fetch work
rosetta@home 12/12/2005 13:02:03 Requesting 814 seconds of new work
rosetta@home 12/12/2005 13:02:08 Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
rosetta@home 12/12/2005 13:02:08 No work from project
The problem is somewhat intermittent. This is from the log of one of my systems. As you can see, it has successfully collected work once in recent history.
Also as noted by Scribe in other threads, the Rosetta team are aware of the problem, and are working to fix it. At the risk of incurring the wrath of Bill Michael (very big grin here) I will point out that the system whose log I have just quoted above is still merrily crunching away, and should do so for another 24 to 48 hours due to having a 4 day cache. |
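For anyone who wants to tally a log like the one quoted above rather than read it by eye, here is a minimal sketch. It assumes the message log has been saved with one entry per line in the `project date time message` layout shown in the post. The name `summarize_fetches` is hypothetical and is not part of BOINC. It also assumes that a request not followed by "No work from project" did receive work, which matches the 12:53 entry above but is a simplification.

```python
import re

# One entry per line: "<project> <MM/DD/YYYY> <HH:MM:SS> <message>".
ENTRY = re.compile(r"^(\S+) (\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}:\d{2}) (.*)$")
REQUEST = re.compile(r"^Requesting (\d+) seconds of new work$")

# A few entries copied from the log quoted in the post.
SAMPLE = """
rosetta@home 12/12/2005 12:53:43 Requesting 12749 seconds of new work
rosetta@home 12/12/2005 12:53:48 Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
--- 12/12/2005 12:53:50 request_reschedule_cpus: files downloaded
rosetta@home 12/12/2005 12:57:53 Requesting 101 seconds of new work
rosetta@home 12/12/2005 12:57:58 Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
rosetta@home 12/12/2005 12:57:58 No work from project
"""


def summarize_fetches(lines, project="rosetta@home"):
    """Return [time, seconds_requested, got_no_work] for each work request."""
    attempts = []
    for line in lines:
        m = ENTRY.match(line.strip())
        if not m or m.group(1) != project:
            continue  # blank lines, other projects, and "---" client entries
        time, message = m.group(3), m.group(4)
        req = REQUEST.match(message)
        if req:
            attempts.append([time, int(req.group(1)), False])
        elif message == "No work from project" and attempts:
            # Attribute the refusal to the most recent request.
            attempts[-1][2] = True
    return attempts


if __name__ == "__main__":
    for time, seconds, refused in summarize_fetches(SAMPLE.splitlines()):
        print(time, seconds, "no work" if refused else "work assumed received")
```

Run over the full log, this would list five requests, four of them refused.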
John Price Joined: 4 Dec 05 Posts: 4 Credit: 6,142 RAC: 0 |
Slightly off the thread, a question for you, Yeti. It is clear that you are able to do work for Predictor. I have had some 3 months of being unable to get through to the server on Predictor. I can't even get through to www.scripps.edu. What's the secret? Is there a new URL or something? As I can't get thru to the website I can't enquire in the Predictor message board. That's what I have been trying and this is all I get. The page cannot be displayed The page you are looking for is currently unavailable. The Web site might be experiencing technical difficulties, or you may need to adjust your browser settings. -------------------------------------------------------------------------------- Please try the following: Click the Refresh button, or try again later. If you typed the page address in the Address bar, make sure that it is spelled correctly. To check your connection settings, click the Tools menu, and then click Internet Options. On the Connections tab, click Settings. The settings should match those provided by your local area network (LAN) administrator or Internet service provider (ISP). See if your Internet connection settings are being detected. You can set Microsoft Windows to examine your network and automatically discover network connection settings (if your network administrator has enabled this setting). Click the Tools menu, and then click Internet Options. On the Connections tab, click LAN Settings. Select Automatically detect settings, and then click OK. Some sites require 128-bit connection security. Click the Help menu and then click About Internet Explorer to determine what strength security you have installed. If you are trying to reach a secure site, make sure your Security settings can support it. Click the Tools menu, and then click Internet Options. On the Advanced tab, scroll to the Security section and check settings for SSL 2.0, SSL 3.0, TLS 1.0, PCT 1.0. Click the Back button to try another link.
Cannot find server or DNS Error Internet Explorer Standard answer for DNS failure or page not responding??? I have tried disabling firewall and lowered all WinXP security settings but still no success!!!! Can I ask whether you can see that page right now? |
Yeti Joined: 2 Nov 05 Posts: 45 Credit: 14,945,062 RAC: 1 |
Can I ask whether you can see that page right now? Yeah, I have no problem accessing the site. I would guess you have a problem with your network and/or with your ISP, or you are blocked/banned from Predictor ... I remember that there was an ISP that blocked transfers to other ISPs, so that data packets couldn't get through to a BOINC project, but I don't remember which ISP and which BOINC project were involved. Supporting BOINC, a great concept ! |
AnRM Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
John, you have to put a 1 after 'predictor'. The full URL for Predictor is predictor1.scripps.edu/ Hope this helps. Cheers, Rog. |
bruce boytler Joined: 17 Sep 05 Posts: 68 Credit: 3,565,442 RAC: 0 |
I have a regular Pac-Man type cruncher, an Athlon 64 X2, and I can't keep any work for it. It crunches 50 WUs a day and I constantly get the NO WORK FROM PROJECT message in my message log. It is to the point now that I have a great big blank space in my work tab area. I read the entire thread and am hoping whatever happens at 3PM gives me a download. This box crunches two at a time so I also did what one of the posters recommended and increased the network cache from .15 days to 3 days. Cheers....hope the problem clears soon, my 'puter is just spinning its wheels right now. |
John Price Joined: 4 Dec 05 Posts: 4 Credit: 6,142 RAC: 0 |
Thanks but this did not work either. Can someone look on the Predictor site for an email address for the webmaster there, and I will see if I can contact them that way. Also, those that can get thru, can you let me know the IP address of your DNS server, as maybe my one in NZ can't resolve it? Can't understand why not but got to try everything. I will also approach my ISP to see if for some strange reason they are blocking the site. |
John Price Joined: 4 Dec 05 Posts: 4 Credit: 6,142 RAC: 0 |
How bizarre!! Armed with the information that others are able to get thru to Predictor, I tried dialling up on a 56k modem. This allowed me to get thru no problem. So then I had to look at what was wrong with my usual connection using ADSL. I closed down the 56k and reopened the ADSL and using the same browser I was able to get straight thru!!!!! A 3 month drought was over! I can't explain that unless it is some unusual caching at my provider Xtra.co.nz. (P.S. the dial-up and ADSL accounts are both with the same ISP.) However, just like Rosetta at the moment, I can't get work from Predictor either. It says there is work available but not for my platform. I tried on a Linux box and a WinXP box; there can't be too many platforms left after that.... However, the problem of the missing site is solved, and thanks all for your help. |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,177 RAC: 17 |
At the risk of incurring the wrath of Bill Michael (very big grin here) I will point out that the system whose log I have just quoted above is still merrily crunching away, and should do so for another 24 to 48 hours due to having a 4 day cache. I don't have any problems with a 4-day cache for anyone who is familiar with the way the scheduler works... I just have a problem with a 10-day cache and four or five short-deadline projects, and people who then complain that BOINC goes into EDF mode trying to keep up! :-) |
dgnuff Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
At the risk of incurring the wrath of Bill Michael (very big grin here) I will point out that the system whose log I have just quoted above is still merrily crunching away, and should do so for another 24 to 48 hours due to having a 4 day cache. For what it's worth, BOINC INVARIABLY goes EDF when it pulls Predictor work for me. I suspect that this is because the scheduler code isn't properly taking into account the project resource shares I have. I'm 99% Rosetta, 1% Predictor, and Predictor typically D/L's work that requires 1/2 a day to a day to complete. By my estimation, with those project resource shares it should not D/L more than one, or at most two WU's - not the 10 to 12 that it typically pulls. Still, it's not a major concern, I wind up with LTD's measured in the tens of thousands, it forgets about Predictor for three or four weeks while it does Rosetta, before starting the whole thing over again. IMHO, it looks like the scheduler is assuming something closer to a 25% share when it D/L's work. |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,177 RAC: 17 |
For what it's worth, BOINC INVARIABLY goes EDF when it pulls Predictor work for me. ... I'm 99% Rosetta, 1% Predictor I'm not totally happy with the way the scheduler decides on work fetch, but I do kind of understand what's happening here. No work for Predictor will be fetched until the LTD has gotten "beyond" your cache-size setting - so before it gets even 1 second of work, the Predictor LTD must be at least four days. Your expectation is that it would then get work equal to 1% of four days, or about an hour's worth, or 1 result. For whatever reason, that's not how the scheduler "thinks" when it's satisfying a debt. (JMVII would be the only one who could explain the details on this...) With a smaller cache (grin) the difference on the LTD is smaller, and it will really get only 1 result. I can't explain or excuse it getting 10-12, since that sounds like it's more like 15% of your cache setting than 1%, but the "do nothing, then grab x of the 1% project and get it out of the way in EDF mode" is what is normally seen with a 99:1 split. All I can think of is that by getting "more now", the delay before getting any more works out to be about the same regardless of cache size. It goes into EDF because even with fewer results, if it truly gave only 1%, it possibly wouldn't meet deadline before the next "connect every" setting before that deadline - and that's a binary decision, it can't decide to give, say, 50% to Predictor, if that's what it would really take to meet deadline; it's either 1% or 100%. If you watch it closely enough, and hit the "update" button at the right moment to cause it to recalculate, when Predictor is down to where it _could_ do the remainder in 1% of 4 days, it'll trip back to round-robin! That of course would only leave less than an hour of Predictor to go, but that little bit might not be done for a couple of days.
Just as an experiment, if Predictor is really there only for when Rosetta is down or out of work, you might try making it 999:1 instead of 99:1 and see what happens... if the delay is longer but the amount loaded is the same, then I'm totally lost. After months of "juggling" resource shares, trying to find the point where everything would progress as I wanted it to, I've finally simplified the heck out of mine. They're all at 100. Since different machines have different projects attached, I just suspend any projects on one or more machines that are "getting too far ahead" of my current goals. For example, I'm not too concerned with SETI right now, so it's suspended on all but the two machines that are too slow to effectively run anything else. I gave up and just detached SZTAKI altogether, and since I hit my 10000 goal for Predictor, it's suspended on all but one slow machine. Whatever comes in for SETI and Predictor is just "maintain position a bit" credit. I only have two boxes that can handle Rosetta, but it's suspended on the Mac Mini for right now, pending a new Mac app; I ran it "heavy" on the PC to make up for that and to reach my 10000 goal for Rosetta, and now it's been backed off to give more to Einstein to get that project "above" SETI in credits, and CPDN because it's new and I want to get it above SZTAKI... But I want to test the new caching SETI app so I'll probably soon run it _only_ for a few days on that PC... ARRRGGH! It's so much easier to be a single-project cruncher! :-) |
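The "either 1% or 100%" decision described earlier in this post can be written down as a small sketch. This is only an illustration of the reasoning in the post, not the BOINC client's actual code: the function names are hypothetical, and the real client also has to consider the "connect every" interval and every other result in the queue.

```python
def would_miss_deadline(remaining_cpu_hours, hours_to_deadline, share_fraction):
    """True if the work cannot finish in time at the project's normal share."""
    cpu_hours_available = hours_to_deadline * share_fraction
    return remaining_cpu_hours > cpu_hours_available


def cpu_fraction(remaining_cpu_hours, hours_to_deadline, share_fraction):
    """Binary choice: the normal share (round-robin) or the whole CPU (EDF)."""
    if would_miss_deadline(remaining_cpu_hours, hours_to_deadline, share_fraction):
        return 1.0
    return share_fraction


if __name__ == "__main__":
    # 12 hours of Predictor work, an assumed 7-day deadline, 1% share:
    # only 1.68 CPU hours would be available, so the client goes EDF.
    print(cpu_fraction(12.0, 168.0, 0.01))
    # Under an hour left with 4 days to go: 0.96 CPU hours are available
    # at 1%, so the project drops back to its normal share.
    print(cpu_fraction(0.9, 96.0, 0.01))
```

The second call corresponds to the point where hitting "update" trips the client back to round-robin.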
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
The page cannot be displayed John, The URL I have for predictor is: http://predictor.scripps.edu/ The IP for that address is: 137.131.252.96 The actual addresses for the project's data and scheduling servers are embedded in the home page, so, if you cannot get there ... well ... nothing will work. You *MAY* need to use a proxy to get there ... Can you see the other sites? Try the list in the Wiki or on the BOINC main page. This is an odd one ... |
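A quick way to tell a name-resolution failure from a blocked connection is to resolve the host name directly and compare it with the address quoted in this post. The sketch below uses Python's standard `socket.gethostbyname`. The helper name `check_host` and its `resolver` parameter are hypothetical conveniences, and the expected address is simply the one given above, which may well have changed since.

```python
import socket


def check_host(hostname, expected_ip=None, resolver=socket.gethostbyname):
    """Resolve hostname and describe the outcome as a short string."""
    try:
        ip = resolver(hostname)
    except OSError as err:  # socket.gaierror is a subclass of OSError
        return f"{hostname}: lookup failed ({err})"
    if expected_ip is not None and ip != expected_ip:
        return f"{hostname}: resolved to {ip}, expected {expected_ip}"
    return f"{hostname}: resolved to {ip}"


if __name__ == "__main__":
    # Address as quoted in the post; treat it as historical.
    print(check_host("predictor.scripps.edu", "137.131.252.96"))
```

If the lookup fails here but works from another connection, the DNS server is the likely culprit. If it resolves but the page still will not load, the problem is further along the route.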
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
For what it's worth, BOINC INVARIABLY goes EDF when it pulls Predictor work for me. I suspect that this is because the scheduler code isn't properly taking into account the project resource shares I have. I'm 99% Rosetta, 1% Predictor, and Predictor typically D/L's work that requires 1/2 a day to a day to complete. By my estimation, with those project resource shares it should not D/L more than one, or at most two WU's - not the 10 to 12 that it typically pulls. dgnuff, That is the trouble with the lop-sided allocations, you are asking for something too, um, hard. But, over time it will work out as Bill said. I will add, like him, I sometimes long for the days of simplicity when I could only get Classic to run on my machines (I tried CPDN back then but the 2 year run times and the crashes were too much). But, that usually only lasts till I see how much more I can do... Anyway, I use a scheme a little like his, where I set basic parameters on the allocations to projects and then adjust from there with shutdowns on specific machines. For example, my PowerMac is running primarily Einstein@Home and SETI@Home ... because I can run optimized applications that really rip ... But, on the bulk of my machines I run a standard set of projects with an overall tendency towards those projects that are doing science. Which, unfortunately, just earned Rosetta@Home a demotion to 20% from a prior 30%. It is not that I think the work is unimportant. It is just that this is like LHC@Home, engineering testing ... when they start doing actual runs on some of the diseases I will likely increase the allocations ... In the mean time, I am working to try to get Einstein@Home "over" SETI@Home which is why it is running 30-60% on all my machines. That is the good news, the bad news is that it is going to take several long months to get the "inversion" done, at which point I hope Rosetta@Home is starting to do "real" and direct research. |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,177 RAC: 17 |
In the mean time, I am working to try to get Einstein@Home "over" SETI@Home which is why it is running 30-60% on all my machines. Thanks to the SETI outage, Einstein passed SETI for me late yesterday evening. I'm playing with Crunch3r's SETI app though, so I may pick up a couple thousand SETI credits in this next couple of days just from the one PC. I'm guessing it should do 40 credits/hour easily, first 15 results averaged 32.5 minutes each! Mac Mini is _all_ Einstein at the moment, so maybe it won't fall too far behind and make it hard to get them 'swapped' again. As for others... I gave up on SZTAKI completely, and have cut Predictor to "maintenance". Too many troubles and too little communication from those two. Putting my "windowbox" (not even a garden, much less a farm) mainly on Rosetta and CPDN, other than whatever it takes to keep Einstein ahead of SETI. (Have SETI running on daughter's iBook and such, can't micromanage that, it keeps a trickle flowing...) We'll see how long it takes me to get Rosetta caught up with SETI! |
AnRM Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
[quote]The page cannot be displayed John, The URL I have for predictor is: http://predictor.scripps.edu/ The IP for that address is: 137.131.252.96[/quote] I think he has found his problem and is OK. I was curious about the discrepancy in the URL though. I attached using both 'predictor1.scripps.edu' and yours above (without the '1') and they both attached to the project. Maybe they changed the URL but still allow the use of an older URL?? |
dgnuff Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
I'm not totally happy with the way the scheduler decides on work fetch, but I do kind of understand what's happening here. No work for Predictor will be fetched until the LTD has gotten "beyond" your cache-size setting - so before it gets even 1 second of work, the Predictor LTD must be at least four days. Your expectation is that it would then get work equal to 1% of four days, or about an hour's worth, or 1 result. For whatever reason, that's not how the scheduler "thinks" when it's satisfying a debt. (JMVII would be the only one who could explain the details on this...)
Here's some food for thought. I took a look at the source, and found the following comment in client/cs_scheduler.c:
// determine work requests for each project
// NOTE: don't need to divide by active_frac etc.;
// the scheduler does that (see sched/sched_send.C)
Aha, I think, let's take a look in sched/sched_send.C. Nowhere in that file can I find anything that references a project's individual resource share. Could this be a bug?
So here's what I have in mind. I presume that I can determine the amount of work by the change in LTD when a system gets Predictor work. If that's the case (and you have the patience), I'll wait for the next of my systems to do this. I'll try to catch the LTD before and after, although they may only be approximate. Once I have this, I'll multiply Rosetta's share by 10, and see what happens. |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,177 RAC: 17 |
Here's some food for thought. I took a look at the source, and found the following comment in client/cs_scheduler.c: Have you talked with JMVII? If he hasn't been by here, that's something he should see, and something only he could answer... if you can't contact him and he doesn't comment here, I'll email him, but I imagine he's pretty busy. |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
In the mean time, I am working to try to get Einstein@Home "over" SETI@Home which is why it is running 30-60% on all my machines. I think I agree with you. Again. I gotta stop doing this ... :) Predictor@Home is at it again, they are sending out work that cannot be completed by the deadline and there is no response from the project on the issues ... So, as a reward for their efforts, I cut them down to 1% for the moment ... CPDN kinda did this to me with one Sulfur work unit which is causing EDF on one computer to get the work done by Feb. I have a 1,000,000 plus debt built up on that machine. I am tempted to let it run the course to see if it is graceful in the recovery. Hmm, just noticed I have two CPDN work units on that machine ... also have a "slab" model too ... well, that should be done by the end of the month (5 more days). I also gave SETI@Home a boost as a reward for their efforts this last week. I think they did a heck of a job ... and contrary to rumor, BOINC worked as it should ... |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,177 RAC: 17 |
Here's some food for thought. I took a look at the source, and found the following comment in client/cs_scheduler.c: More info... there are two parts to the "how much to get" issue. The client says "send me x seconds". The scheduler decides how many results that is, and "rounds up" to the next highest number, then sends that many. The active_frac "etc" part _is_ considered by the scheduler in figuring out how many "seconds" make up a result. However, NONE of this addresses the resource share, and that is definitely _not_ a bug on the server side. The resource share _must_ be calculated, if it is going to be, on the client. The server has no idea what other projects you are running, at what resource shares, and especially not about any that might be suspended or no-new-worked, etc. And this is where I think we're looking at this "wrong", at least from the programming perspective. Logic from the user perspective says "I'm running Predictor at 1% with a cache size of 4 days, it should get .01*96 hours, or about 1 hour worth of work". And if "cache size" was ONLY that, an actual "cache size", instead of the kludge connect-every/cache/whatever-else it currently is, the user perspective would be absolutely right - for anyone with an always-on connection, it SHOULD get 1 hour's work. However, since the program's "assignment" is to get "4 DAYS worth of work", no matter what, the simple, logical answer won't work. Let's take the extreme case; you have _nothing_ in your cache, for any project - totally dry, except for the one Rosetta result that is running. You are on dial-up. You connect to the internet. The work fetch algorithm kicks in for Predictor and says "hey, I need to get work!" - so how much should it get? 1 hour's worth? What if after it's gotten your 1 hour's worth, you disconnect? Or what if you _try_ to get Rosetta work, but Rosetta is down/out-of-work/whatever? Suddenly it's going to be 4 days before you connect again, and you only have 1 hour's work!
So, it gets significantly more. (Logic from _this_ side says it should get 4 days worth...) I'm hoping JMVII will visit here and tell us what the work fetch algorithm actually DOES do with the resource share, if anything. |
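The "rounds up to the next highest number" step can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions, not the scheduler's code: `results_to_send` and `est_seconds_per_result` are hypothetical names, and the real scheduler adjusts its per-result estimate using active_frac and other host figures before dividing.

```python
import math


def results_to_send(requested_seconds, est_seconds_per_result):
    """Whole results needed to cover the request, rounding up."""
    if requested_seconds <= 0:
        return 0
    return math.ceil(requested_seconds / est_seconds_per_result)


if __name__ == "__main__":
    half_day = 12 * 3600  # a Predictor result estimated at half a day
    # 1% of a 4-day cache is 0.01 * 345600 = 3456 seconds (about 0.96 hours),
    # which still rounds up to one whole half-day result.
    print(results_to_send(3456, half_day))
    # A request just over one day of work rounds up to three results.
    print(results_to_send(86401, half_day))
```

Because of the rounding, even a request for a handful of seconds yields a full result, so a small share can never receive less than one result's worth of work.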
John McLeod VII Joined: 17 Sep 05 Posts: 108 Credit: 195,137 RAC: 0 |
Here's some food for thought. I took a look at the source, and found the following comment in client/cs_scheduler.c: The scheduler USED to do the resource fraction calculation, and clients do not do it. Now that the scheduler no longer seems to be doing the resource fraction calculation, the client code should be updated to do it. BOINC WIKI |
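A client-side resource fraction calculation of the kind suggested here might look roughly like the following. This is purely a sketch of the idea: none of these names come from the BOINC source, and the thread does not say how the client would actually implement it. It does reflect the earlier point that only the client knows which projects are suspended or set to no new work.

```python
def resource_fraction(shares, project, inactive=()):
    """Share of the host for `project`, counting only active projects."""
    active = {name: s for name, s in shares.items() if name not in inactive}
    total = sum(active.values())
    if project not in active or total == 0:
        return 0.0
    return active[project] / total


def scaled_request(cache_seconds, shares, project, inactive=()):
    """Seconds of work to request: the cache size scaled by the share."""
    return cache_seconds * resource_fraction(shares, project, inactive)


if __name__ == "__main__":
    shares = {"rosetta": 99, "predictor": 1}
    four_days = 4 * 86400
    # Both projects active: Predictor gets 1% of the cache.
    print(scaled_request(four_days, shares, "predictor"))
    # Rosetta suspended: Predictor is the only active project.
    print(scaled_request(four_days, shares, "predictor", inactive=("rosetta",)))
```

With Rosetta suspended, Predictor's fraction becomes 1.0 and the request grows to the full four days, which is exactly the kind of information the server cannot see.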
©2024 University of Washington
https://www.bakerlab.org