DNS Problems and Late Work Units

SuperSluether
Joined: 7 Jul 14
Posts: 10
Credit: 1,357,990
RAC: 0
Message 81267 - Posted: 8 Mar 2017, 21:46:52 UTC

I saw that Rosetta has been having DNS problems lately. While I can connect directly via IP in my browser, I don't know how to do this in BOINC, and now I have some work units that will be late.

1 unit was due yesterday, and 5 more are due on March 10th. Should I abort the late work unit?

Or, another way of asking, what happens to Rosetta work units when they are late?

shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81270 - Posted: 9 Mar 2017, 2:45:49 UTC - in response to Message 81267.  

I saw that Rosetta has been having DNS problems lately. While I can connect directly via IP in my browser, I don't know how to do this in BOINC, and now I have some work units that will be late.

1 unit was due yesterday, and 5 more are due on March 10th. Should I abort the late work unit?

Or, another way of asking, what happens to Rosetta work units when they are late?


You can use your hosts file, but you need to know each server name and possibly its aliases, as well as the IP addresses.

I just used the hosts file method to get here with this line:

128.95.160.140 boinc.bakerlab.org

However, that is not sufficient to get the BOINC client working again. Fortunately, I don't care that much about this rather amateurish project, so I'm not planning to spend more time on it. Already spent too much time being frustrated and annoyed, and even wasted the time sending email to the professor, who didn't bother to reply. Maybe he's sick or something.

Anyway, this website should at least be modified to include a mention of the status on the front page. It just has some old news about a media article. (Obviously didn't make much of an impact on me.)
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

BlisteringSheep
Joined: 15 Sep 06
Posts: 5
Credit: 25,807,655
RAC: 7,908
Message 81273 - Posted: 9 Mar 2017, 3:17:56 UTC

A static list:

128.95.160.140 boinc.bakerlab.org
128.95.160.141 ralph.bakerlab.org
128.95.160.142 srv1.bakerlab.org
128.95.160.143 srv2.bakerlab.org
128.95.160.144 srv3.bakerlab.org
128.95.160.145 srv4.bakerlab.org
128.95.160.146 srv5.bakerlab.org

This covers all the names and IPs I needed at least, for both Rosetta & Ralph.
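
For anyone scripting this across several machines, here is a minimal Python sketch of merging the list above into existing hosts-file text without adding duplicate names. The `ENTRIES` mapping and the `merge_hosts` helper are illustrative names, not part of BOINC, and the sketch only builds the new text; writing it back to the real file is left to you.

```python
# Hypothetical helper: merge the static entries into hosts-file text.
ENTRIES = {
    "boinc.bakerlab.org": "128.95.160.140",
    "ralph.bakerlab.org": "128.95.160.141",
    "srv1.bakerlab.org": "128.95.160.142",
    "srv2.bakerlab.org": "128.95.160.143",
    "srv3.bakerlab.org": "128.95.160.144",
    "srv4.bakerlab.org": "128.95.160.145",
    "srv5.bakerlab.org": "128.95.160.146",
}


def merge_hosts(existing: str, entries: dict[str, str]) -> str:
    """Return hosts-file text with one line appended per name not already mapped."""
    mapped = set()
    for line in existing.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        # fields[0] is the IP address; everything after it is a host name or alias
        mapped.update(name.lower() for name in fields[1:])
    lines = existing.splitlines()
    lines += [f"{ip} {name}" for name, ip in entries.items() if name.lower() not in mapped]
    return "\n".join(lines) + "\n"
```

The hosts file is usually /etc/hosts on Linux and macOS and C:\Windows\System32\drivers\etc\hosts on Windows, and editing it needs administrator rights. Remove the entries again once normal DNS is back, or the machine will keep using these addresses even if the servers move.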

SuperSluether
Joined: 7 Jul 14
Posts: 10
Credit: 1,357,990
RAC: 0
Message 81275 - Posted: 9 Mar 2017, 14:28:19 UTC - in response to Message 81273.  
Last modified: 9 Mar 2017, 14:30:03 UTC

Fortunately, I don't care that much about this rather amateurish project, so I'm not planning to spend more time on it. Already spent too much time being frustrated and annoyed


We have an impatient one here... It's not like Rosetta wanted this to happen; their registrar (Dotster) stretched a 10-minute process into 2+ days.

A static list:

128.95.160.140 boinc.bakerlab.org
128.95.160.141 ralph.bakerlab.org
128.95.160.142 srv1.bakerlab.org
128.95.160.143 srv2.bakerlab.org
128.95.160.144 srv3.bakerlab.org
128.95.160.145 srv4.bakerlab.org
128.95.160.146 srv5.bakerlab.org

This covers all the names and IPs I needed at least, for both Rosetta & Ralph.


Thanks! Just out of curiosity, how did you find these IPs?

BlisteringSheep
Joined: 15 Sep 06
Posts: 5
Credit: 25,807,655
RAC: 7,908
Message 81276 - Posted: 9 Mar 2017, 14:35:01 UTC - in response to Message 81275.  

Fortunately, I don't care that much about this rather amateurish project, so I'm not planning to spend more time on it. Already spent too much time being frustrated and annoyed


We have an impatient one here... It's not like Rosetta wanted this to happen; their registrar (Dotster) stretched a 10-minute process into 2+ days.

A static list:

128.95.160.140 boinc.bakerlab.org
128.95.160.141 ralph.bakerlab.org
128.95.160.142 srv1.bakerlab.org
128.95.160.143 srv2.bakerlab.org
128.95.160.144 srv3.bakerlab.org
128.95.160.145 srv4.bakerlab.org
128.95.160.146 srv5.bakerlab.org

This covers all the names and IPs I needed at least, for both Rosetta & Ralph.


Thanks! Just out of curiosity, how did you find these IPs?


I looked at the BOINC files to find out which hosts were referenced, got whois information from InterNIC, then used their authoritative nameservers (ns5.bakerlab.org) to resolve them. Their nameservers aren't part of the same IP block, and global DNS still knows their addresses.
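
The last step, asking an authoritative nameserver directly instead of going through normal resolution, can be sketched in Python by shelling out to `dig`. This assumes `dig` is installed and the network is reachable; `build_dig_command` and `resolve_via` are illustrative names, and the nameserver is the one mentioned above.

```python
import subprocess


def build_dig_command(name: str, nameserver: str) -> list[str]:
    """Build a dig invocation that asks `nameserver` directly for the A record of `name`."""
    return ["dig", "+short", f"@{nameserver}", name, "A"]


def resolve_via(name: str, nameserver: str) -> list[str]:
    """Run dig and return the non-empty answer lines (needs dig and network access)."""
    result = subprocess.run(
        build_dig_command(name, nameserver),
        capture_output=True, text=True, check=True, timeout=15,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]
```

For example, `resolve_via("boinc.bakerlab.org", "ns5.bakerlab.org")` would query that server for the project host. If the nameserver's own name does not resolve on your machine, you would need to pass its IP address instead.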

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,988,274
RAC: 8,636
Message 81278 - Posted: 9 Mar 2017, 18:22:32 UTC - in response to Message 81273.  

A static list:

128.95.160.140 boinc.bakerlab.org
128.95.160.141 ralph.bakerlab.org
128.95.160.142 srv1.bakerlab.org
128.95.160.143 srv2.bakerlab.org
128.95.160.144 srv3.bakerlab.org
128.95.160.145 srv4.bakerlab.org
128.95.160.146 srv5.bakerlab.org

This covers all the names and IPs I needed at least, for both Rosetta & Ralph.


It works!!!


LarryMajor
Joined: 1 Apr 16
Posts: 22
Credit: 31,533,212
RAC: 0
Message 81279 - Posted: 10 Mar 2017, 1:09:03 UTC - in response to Message 81273.  

A static list:

128.95.160.140 boinc.bakerlab.org
128.95.160.141 ralph.bakerlab.org
128.95.160.142 srv1.bakerlab.org
128.95.160.143 srv2.bakerlab.org
128.95.160.144 srv3.bakerlab.org
128.95.160.145 srv4.bakerlab.org
128.95.160.146 srv5.bakerlab.org

This covers all the names and IPs I needed at least, for both Rosetta & Ralph.


LarryMajor
Joined: 1 Apr 16
Posts: 22
Credit: 31,533,212
RAC: 0
Message 81280 - Posted: 10 Mar 2017, 1:13:28 UTC - in response to Message 81273.  

A static list:

128.95.160.140 boinc.bakerlab.org
128.95.160.141 ralph.bakerlab.org
128.95.160.142 srv1.bakerlab.org
128.95.160.143 srv2.bakerlab.org
128.95.160.144 srv3.bakerlab.org
128.95.160.145 srv4.bakerlab.org
128.95.160.146 srv5.bakerlab.org

This covers all the names and IPs I needed at least, for both Rosetta & Ralph.


Thank you! I've been poking at this for a couple days and your list had the one server I missed.
Just reported a couple dozen completed WUs just under deadline.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,988,274
RAC: 8,636
Message 81281 - Posted: 10 Mar 2017, 8:19:10 UTC

We can help the "return of DNS" by writing to the ICANN President on Twitter
@Icaan, @Icaan_presindent, @IcannOmbudsman
Please, #RescueRosettaathome, #RescueBakerlabdotorg

krypton
Volunteer moderator
Project developer
Project scientist
Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 81286 - Posted: 10 Mar 2017, 16:52:05 UTC - in response to Message 81281.  

It looks like we are back!!

We can help the "return of DNS" by writing to the ICANN President on Twitter
@Icaan, @Icaan_presindent, @IcannOmbudsman
Please, #RescueRosettaathome, #RescueBakerlabdotorg


dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 114,371,266
RAC: 53,072
Message 81287 - Posted: 10 Mar 2017, 17:15:54 UTC

What happened? Some DNS issue, I know, but how come they took so long to fix it?

SuperSluether
Joined: 7 Jul 14
Posts: 10
Credit: 1,357,990
RAC: 0
Message 81288 - Posted: 10 Mar 2017, 17:48:18 UTC - in response to Message 81287.  

What happened? Some DNS issue, I know, but how come they took so long to fix it?


Somebody was late on verifying the registration, and Dotster took much longer than they should have to re-verify it.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 81290 - Posted: 10 Mar 2017, 19:27:10 UTC - in response to Message 81287.  

What happened? Some DNS issue, I know, but how come they took so long to fix it?


There is now a link on the homepage to details on the DNS issues.
Rosetta Moderator: Mod.Sense

shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81292 - Posted: 10 Mar 2017, 21:17:31 UTC - in response to Message 81267.  

I saw that Rosetta has been having DNS problems lately. While I can connect directly via IP in my browser, I don't know how to do this in BOINC, and now I have some work units that will be late.

1 unit was due yesterday, and 5 more are due on March 10th. Should I abort the late work unit?

Or, another way of asking, what happens to Rosetta work units when they are late?


There have been various messages about receiving credit for late work units, but my observations suggest it happens sometimes. I'm also pretty sure that downloaded units will not start if they are past their deadline (so those data downloads were obviously completely wasted bandwidth).

I mostly blame the DNS problems on the black-hat hackers and Al Gore, sort of. However, I think it's more of a topical issue for the "Cafe Rosetta" than "Number crunching", so I'll comment there.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 81295 - Posted: 11 Mar 2017, 0:38:24 UTC

I added a 3-day grace period to the server configuration (<grace_period_hours>). Hopefully this will help; otherwise, I'm open to suggestions and feedback.
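
As a rough illustration of what a 3-day grace period means for a late result, here is a small Python sketch. `<grace_period_hours>` takes a value in hours, so 3 days is 72. The `accepted_with_grace` function is a hypothetical toy model that assumes the grace period simply extends the deadline; it is not the actual server logic.

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD_HOURS = 3 * 24  # 3 days expressed in hours, i.e. 72


def accepted_with_grace(deadline: datetime, reported: datetime,
                        grace_hours: int = GRACE_PERIOD_HOURS) -> bool:
    """Toy model: True if the result was reported no later than deadline + grace."""
    return reported <= deadline + timedelta(hours=grace_hours)
```

Under this assumption, a work unit that was due on 7 March and reported on 9 March would still fall inside the window, while one reported more than 72 hours after its deadline would not.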

shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81299 - Posted: 11 Mar 2017, 1:34:55 UTC - in response to Message 81295.  

I added a 3-day grace period to the server configuration (<grace_period_hours>). Hopefully this will help; otherwise, I'm open to suggestions and feedback.


My suggestion would be simply to avoid downloading data to each client unless the normal usage pattern of that specific client computer makes it reasonably likely the data will be processed quickly enough to satisfy your schedule concerns. Sometimes that might require downloading small units of work for computers that are not running so much.

The obvious problem is that the BOINC client may not be capable of providing the projects with the information they need to do that sort of intelligent scheduling. Obviously the client software is positioned to track the usage patterns of each client computer it is running on, but I've seen no evidence that it does so. Also, the API would need calls for the projects to query that history-based information, preferably each time the server is contacted. For intelligent scheduling you basically need to know how is this computer used and how much work does it have queued now. Only then can you make a sound decision about what additional work to send and what the deadlines should be for that work.

Haven't we been over all of this several times? I feel like you [an administrator or possibly even the director of the project] should be well positioned to see exactly how many of your downloads are not returned with results before their deadlines have elapsed. All I can do is try to purge (abort) old units that I am reasonably sure will not meet their deadlines--but obviously I do have privileged information about how I use my computers and I don't need to track their usage histories to make those predictions.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

Darrell
Joined: 28 Sep 06
Posts: 25
Credit: 51,934,631
RAC: 0
Message 81302 - Posted: 11 Mar 2017, 3:36:02 UTC - in response to Message 81299.  
Last modified: 11 Mar 2017, 3:37:19 UTC

[...snip...]

The obvious problem is that the BOINC client may not be capable of providing the projects with the information they need to do that sort of intelligent scheduling. Obviously the client software is positioned to track the usage patterns of each client computer it is running on, but I've seen no evidence that it does so.


IIRC, the BOINC Manager makes the decisions on how much work/how many WUs to download and it is based on
1) how much total work is already in process,
2) how much total work is downloaded but not yet in process,
3) how large the WUs are for the project that wants work,
4) the resource share for the project in relation to the other project(s) in the recent past (history), and
5) how large the user requested work queue is.

On Rosetta, the user -may- control item 3 by:

BOINC Manager -> Rosetta@home -> Your preferences -> [login if needed here] edit preferences for the venue(s) your computer(s) use -> Target CPU run time = {x hour} -> Update preferences

and item 5 by:

BOINC Manager -> Options -> Computing preferences -> Computing -> Store at least {m} days of work -> Store up to an additional {n} days of work -> OK

If any of these parameters are changed more often than every few days, the history data won't fit, and too much or too little work may be downloaded. If the computer usage varies widely over a few days, the same thing may happen (e.g., running 24 hours a day for 5 days, then off for 3 days).

Using a smaller queue and smaller WU size with a consistent daily use pattern on the computer(s) reduces the risk of lost bandwidth. Assigning backup project(s) reduces the risk of idle computers. These are under user control and choice.
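
To make the interaction between queue size and WU size concrete, here is a toy Python model of items 1 to 5 above. It is emphatically not BOINC's real work-fetch algorithm; the function name, the parameters, and the formula are all assumptions chosen only to show the direction of each effect.

```python
import math


def estimate_wus_to_request(queued_seconds: float, min_days: float, extra_days: float,
                            resource_share_fraction: float, wu_runtime_hours: float,
                            ncpus: int = 1) -> int:
    """Toy estimate of how many work units would fill the requested queue.

    queued_seconds: estimated run time of work already in process or downloaded (items 1, 2)
    wu_runtime_hours: target CPU run time per work unit (item 3)
    resource_share_fraction: this project's share of the machine, 0.0 to 1.0 (item 4)
    min_days, extra_days: the "store at least" and "additional" days settings (item 5)
    """
    target_seconds = (min_days + extra_days) * 86400 * ncpus * resource_share_fraction
    shortfall = max(0.0, target_seconds - queued_seconds)
    return math.ceil(shortfall / (wu_runtime_hours * 3600))
```

With an empty queue, 1 + 0.5 days of work, a full resource share, and a 6-hour target run time, this model asks for 6 work units; halving the run time to 3 hours doubles that to 12, and a queue that already holds 1.5 days of work asks for none.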

I feel like you [an administrator or possibly even the director of the project] should be well positioned to see exactly how many of your downloads are not returned with results before their deadlines have elapsed. All I can do is try to purge (abort) old units that I am reasonably sure will not meet their deadlines--but obviously I do have privileged information about how I use my computers and I don't need to track their usage histories to make those predictions.

I agree the project could or possibly does track such data, but I am guessing the payback is too small to be worth the effort. After all, how many non-advanced users (those who never touch tuning parameters) are there in relation to those of us who do? The project (and David E K) do address some of the things over which we have no control, and the other things that we can control, we should adjust as best we can.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 81303 - Posted: 11 Mar 2017, 5:20:41 UTC

Yes, the BOINC Manager decides when to request work, how much work to request, and from which projects. It also tries to avoid over-committing a given machine, that is, pulling down more work than can reasonably be expected to be completed.

For the Rosetta server to make any further refinement on that existing system would require additional disk I/O and additional CPU for every client scheduler request. Other projects have built more complex scheduling systems, but many have discovered, the hard way, that they do not scale well.

In a nutshell, the bandwidth is cheaper than the additional database, disk and CPU load. And the BOINC Manager does a fairly good job of minimizing the potential problem of requesting too much work for the machine to process.
Rosetta Moderator: Mod.Sense

shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81304 - Posted: 11 Mar 2017, 5:27:06 UTC - in response to Message 81302.  

Thank you [Darrell] for the informative parts of your post, even if you had to IIRC disclaim them. The short summary appears to be that the BOINC client (AKA BOINC Manager on each client computer) does, in theory, collect most of the appropriate information to do better scheduling.

My short response (though I feel like I am basically repeating myself from a slightly different perspective) is that the work-scheduling results are not very good. In addition, I can report that I spent a lot of time tweaking the various settings that are subject to my control, and either the client ignores my suggestions or I have been unable to figure out how to set them "properly". I definitely feel that I wasted too much time and effort, but this was not limited to the Rosetta@home project (though I was already running R@h when I noticed the pattern of discarding downloaded units). There seems to be a deep assumption in there somewhere that most of the clients are supposed to be running continuously for many hours at a time. (Some of mine do, and others don't.)

IIRC the Rosetta@home people have raised or at least mentioned their bandwidth concerns on several occasions. Perhaps I am the only participant who has noticed, but I frequently notice wasted bandwidth, notwithstanding my efforts to avoid such waste (while still 'earning' the points). As I have stated a couple of times, everything keeps coming back to deadlines that are difficult or impossible to satisfy.

I don't see any solution to the fundamental problem of bandwidth. They have a lot of data to analyze, and I'm sure they have already explored the obvious efficiencies such as arranging for the same data to receive multiple analyses on a single client computer. (Not sure if I've actually seen some evidence of such patterns.)

What is clear (at least to me) is that downloading data that never gets processed at all is not useful. That bandwidth could have been conserved.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81305 - Posted: 11 Mar 2017, 5:34:46 UTC - in response to Message 81303.  

Yes, the BOINC Manager decides when to request work, how much work to request, and from which projects. It also tries to avoid over-committing a given machine, that is, pulling down more work than can reasonably be expected to be completed.

For the Rosetta server to make any further refinement on that existing system would require additional disk I/O and additional CPU for every client scheduler request. Other projects have built more complex scheduling systems, but many have discovered, the hard way, that they do not scale well.

In a nutshell, the bandwidth is cheaper than the additional database, disk and CPU load. And the BOINC Manager does a fairly good job of minimizing the potential problem of requesting too much work for the machine to process.


Now that's a deep and insightful reply, though surprising. If I understand you [Mod.Sense] correctly, and if I am not oversimplifying, then you are saying that your own CPU resources are more limited than your bandwidth, and there is no easy way to transfer the CPU load to the clients where there is an abundance of cycles.

If this is an accurate assessment, then it seems you should ask the BOINC-side people if they can improve the client's capabilities. It may also explain some of the capabilities attributed to the multi-project management extensions that I had researched a while back. (When I was studying them, I was sometimes left with the question of "Now why would anyone want to do that?")
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)

©2024 University of Washington
https://www.bakerlab.org