Posts by deesy58

1) Message boards : Number crunching : Project File Upload Handler Is Missing (Message 69170)
Posted 11 Jan 2011 by deesy58
Post:
Michael espoused ...

Do we really have to have these same complaints every time there is a server problem?


Michael, which complaints are you referring to? I think that the comments here run in two basic categories - the first being the fact that the project is down, and the second is that once again, as in the past, there is almost no status being passed back to the community.

You are right, it is what it is, and when it comes to balancing hardware resources against a finite pile of money, compromises have to be made. In my career I have worked in more than a few data centers whose equipment is so old the place could have been designated a historic landmark.

I am familiar with the challenges of keeping outdated equipment up and running.

However, there is no excuse for the lack of communications with the community providing computing resources to the project. The last published update was days ago.

The community of Rosetta users is a diverse group - it ranges from the casual user who puts Rosetta up on his home PC to those who convert their back rooms into a mini-data center.

We all crunch Rosetta because we believe that what the project is trying to accomplish is important. However, how can we continue to believe that what we are doing is important if the project leadership continues to treat the community like an expendable resource, not worthy of the few minutes required to keep us informed?

Well, I'll step down off my soap box now with just one further thought: keep crunching brother - we are all family, even when we disagree.

CH





Well said.

deesy

2) Message boards : Number crunching : Project File Upload Handler Is Missing (Message 69136)
Posted 10 Jan 2011 by deesy58
Post:
[rant]
You know, I left the Seti project and attached to this one because of server-based problems. Now it's looking to me like the same shoestring mentality is ruling this project too, as concerns the servers and database. I don't know much about setting up the hardware to run a project such as this, but I would think that primary consideration would be given to building a robust data-handling system with redundancies. It looks to me like academia in general doesn't give much thought to making sure the infrastructure is robust enough to keep an equipment failure from bringing an entire project to a screeching halt. Spend more money on your project infrastructure before you go on dreaming up new methods to crunch numbers or whatever, you all. I have seen this with this project, Seti, and also the Folding@Home project run by Stanford, although theirs seems to be more a problem with dreaming up new clients and never getting the bugs worked out of their older DC clients.

I hope that you all can get this back up and running soon. Otherwise, I guess I will move on to someone else's project, since it ticks me off to donate my time (and pay for the electricity) to run their project on my computers while they fail to take precautions - like redundancy in the project's infrastructure - to keep the project going through a hardware failure.
[/rant]


Muddocktor calls this a rant, but it looks a lot more like an objective and cogent analysis of the problems that ensue when arrogance finds its way into project management.

Strange attitudes to have when "donating" your time (by donating time I mean letting your computer stay on). A catastrophic failure is no fun. It takes a lot of work, time, and stress. I believe there are only one or two guys that maintain the servers. I think their time is better spent writing a line or two, as they did, about the status, rather than writing an essay on causes and effects and having everyone second-guess their methods, actions, and infrastructure on the forums - which is exactly what would happen.

If you can't upload, just keep crunching until you can, what's wrong with that?


Riiight. Just keep wasting energy for an indeterminate period waiting for some sort of communication from somebody.

The traffic light at the intersection has suffered a catastrophic failure. It is stuck on "red." We should just sit at the intersection with our vehicle engines running until somebody decides to make repairs (or we run out of fuel).

...I think it would be useful if the forum were hosted elsewhere from the other equipment, but I don't have any idea if that's practical.


Sure it would be practical. Why wouldn't it be? Web sites are remotely hosted all the time.

Anybody can suffer a catastrophic failure of equipment. What is difficult to understand is the paucity and vagueness of communication from the Project's managers.

deesy
3) Message boards : Number crunching : Servers? (Message 68164)
Posted 21 Oct 2010 by deesy58
Post:

The problem might have been different, but the symptoms appear to have been similar:

My system has been out of Rosetta work for 8 hours but finally managed to get a new task about 30 minutes ago.


This isn’t your OP, but it is a part of a post in your thread.

Although this is part of but a single post, I am fairly certain that I recall other posts on other threads that have similarly reported a complete exhaustion of buffered work, followed by an idle period lasting for a number of hours before new work was received after Rosetta server outages. I will leave it to others, if they are sufficiently interested, to seek the specific posts that report idle time waiting for work.

Why, for example, did this particular poster run out of work while nobody else appeared to do so at the same time? My machine was busy during the period that posts were made to your thread.

Are you sure that you are not describing a distinction without a difference?


As you are quoting me there I might as well respond. Though the "symptoms" may sound similar, it is indeed a completely different problem. The issues you have had relate to problems connecting to the server. The issue described in the other thread was that the work queue ran out of tasks for a short period of time.

The Project team are usually quite good at refilling the work queue on time, so I would guess that they were just caught by surprise by the jump in project speed from the normal 100 Teraflops to the current 121 Teraflops. A 1/5 increase in speed probably emptied the queue a lot faster than they were expecting.

You have implied that my system was idle while Rosetta was short of work, but that wasn't really the case. When my system couldn't get work from Rosetta it just downloaded an extra task from Poem@home and crunched on that for a while. Both projects aim to improve our understanding of proteins, so it doesn't matter to me which one is running.

In answer to your question of why I ran out of Rosetta work and you didn't, it is simply a matter of buffer sizes. You mentioned above that you have a 2 day buffer so an 8 hour shortage of work would probably not have been noticed. However my system is powered down quite often, which can confuse BOINC's calculations on the size of buffer to maintain and lead to missed deadlines, so I keep the buffer at minimal levels.


Okay, I understand that you were able to process for a different project. The point is, from the perspective of the Rosetta Project, your machine was idle. Any of us who use BOINC and process for any of the projects that are managed by BOINC can process for a different project if the server systems of our primary project go down. It is not very logical to say that, because we were able to process for some other project, we were therefore contributing to Rosetta.

Folding@Home is also a protein research project. Could we say that our machines were contributing to the same type of research if we switched to FAH while the Rosetta servers were unable to supply work? Aren't the projects dissimilar in many respects? Is Poem@Home working on solutions to the very same problems as Rosetta? If so, wouldn't that be an unnecessary duplication of efforts and waste of resources that would make it difficult for Project management to obtain grant monies?

I am truly sorry that I appear to be unable to make my point with sufficient clarity that everybody can understand it. I think I'll give up trying ... :(

How big is your buffer?

deesy
4) Message boards : Number crunching : Servers? (Message 68156)
Posted 21 Oct 2010 by deesy58
Post:
Deesy58 -

I think that it is safe to say that while the end result was the same - a problem with getting new work units out to the community - the problem described in this thread and the "Houston we have a problem" thread were completely different in nature.

Otherwise I would not have started a new thread.


The problem might have been different, but the symptoms appear to have been similar:

My system has been out of Rosetta work for 8 hours but finally managed to get a new task about 30 minutes ago.


This isn’t your OP, but it is a part of a post in your thread.

Although this is part of but a single post, I am fairly certain that I recall other posts on other threads that have similarly reported a complete exhaustion of buffered work, followed by an idle period lasting for a number of hours before new work was received after Rosetta server outages. I will leave it to others, if they are sufficiently interested, to seek the specific posts that report idle time waiting for work.

Why, for example, did this particular poster run out of work while nobody else appeared to do so at the same time? My machine was busy during the period that posts were made to your thread.

Are you sure that you are not describing a distinction without a difference?

The problem described in the "Houston we have a problem" thread centered completely on the fact that the reservoir of available work units had dropped to zero for a period of 6 to 8 hours or so. Communications between the BOINC client and the project server were up and functional.

The problem described in this thread seemed to center around a network issue in the project facility. Judging from the fact that, when things came back up, my browser could hit bakerlab.org but the BOINC client could not, it appeared that maybe the BOINC client cached the "old" IP address when it was brought up, and that some time during all this that address changed.


I am sure that your description is technically accurate. What portion of Rosetta contributors, however, do we imagine understand or care about specific server or IP protocol issues when they simply visit a “Server Status Page” that tells them all servers are up and running, but their computers are idle, waiting for work?

The results of the ping, nslookup, and traceroute commands seem to support this. Was it a change in network configuration, or did DHCP get in the middle? I don't know for sure, and likely never will.


It appears that, regardless of whether boinc.org responds to the ping or nslookup commands, rosetta.org does not always respond without an error. When you were unable to resolve the name, I had no problems. Earlier this evening I had no problems. Now, however, as I write this message (11:45 PM PDT on 10/20/2010) I receive the following message:

*** cdns2.cox.net can't find rosetta.org: Server failed

What conclusions can be drawn from this intermittent error message?

But since, once I could hit the name server and resolve bakerlab.org again, BOINC still would not connect until after a restart, it is logical to assume that BOINC does indeed cache the address instead of doing a lookup each time.

I admit that it is speculation, and that I don't have the facts to conclusively state that this is the exact scenario. However, I have had my hands deep in the bowels of many an IP stack, and I feel comfortable that this was a logical conclusion to draw.


I’m not sure your speculation fully describes the nature of the problem[s]. If you are correct, why wouldn’t this issue affect all users during the time of the connection failure[s]?
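The address-caching theory itself is easy to illustrate with a toy model (nothing here is real BOINC code; the "DNS" is just a dictionary, and the hostname and addresses are invented):

```python
# Toy model of the stale-address theory discussed above.
# The "DNS" here is just a dict; hostname and addresses are made up.
dns = {"bakerlab.example": "10.0.0.1"}

class CachingClient:
    """Resolves once at startup and reuses the cached address."""
    def __init__(self, host):
        self.host = host
        self.cached = dns[host]          # looked up once, then kept
    def connect_address(self):
        return self.cached

class FreshClient:
    """Resolves the hostname on every connection attempt."""
    def __init__(self, host):
        self.host = host
    def connect_address(self):
        return dns[self.host]            # fresh lookup each time

boinc = CachingClient("bakerlab.example")
browser = FreshClient("bakerlab.example")

dns["bakerlab.example"] = "10.0.0.2"     # the server's IP changes mid-outage

# The browser follows the move; the caching client keeps hitting the
# old address until it is restarted (re-resolving on startup).
print(browser.connect_address())         # 10.0.0.2
print(boinc.connect_address())           # 10.0.0.1 (stale)
boinc = CachingClient("bakerlab.example")
print(boinc.connect_address())           # 10.0.0.2 after restart
```

If the real client behaves anything like `CachingClient`, that would explain why a browser could reach the site while BOINC could not until restarted.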

So because the problems were different in nature, your references to the other thread seem a little weak and unrelated to me. But that is just my two cents worth and I don't think either of us were made privy to the technical details by the project.


I know what Mod.Sense explained. It fits with my experience, and it seems to fit with the experiences of some others. If, however, it is accurate, then it seems to me that the method of distributing work during the first – say 24 hours – after a server outage is really not optimal. It makes little sense that User "A" receives sufficient work to fill a ten-day buffer while User "B" sits idle for nine, ten or more hours waiting for work.

Have a good night and don't stress so much over past problems - I think in both these cases it is clear the problem was on the "project side" and not with your system, my system, or Sid's system.


I don’t think I am stressing very much at all. Also, if past problems are not solved, they have a penchant for becoming future problems - no?

I have been trained throughout my career to identify problems, analyze them, and offer solutions or prompt others to offer solutions. Does anybody believe that this is a bad thing?

I perceive a problem with the way work is distributed after a server outage. I cannot be certain that there is a viable solution to that problem. Perhaps it is not solvable. But if it is possible to find a solution, then somebody should be looking for and proposing it. If nothing changes, nothing can improve.

Thanks for your lucid and helpful contribution to the thread.

deesy
5) Message boards : Number crunching : Servers? (Message 68154)
Posted 20 Oct 2010 by deesy58
Post:
Sid Celery's post is so inane that it doesn't deserve any more of a reply than it has already been given, or that I include here.

BTW, my machine repeatedly asked for more work and was told that none was available, and that, perhaps, the servers were down. I am satisfied that Mod.Sense has adequately explained the way the system works. I think that it is not a very efficient system but, apparently, we must live with it.

Interestingly, while the posters on the "Houston, we have a problem ..." thread have been experiencing a shortage of work very recently, I have not.

Just lucky, I guess ... ;)

deesy
6) Message boards : Number crunching : Servers? (Message 68153)
Posted 20 Oct 2010 by deesy58
Post:
I guess you assume (incorrectly) that EVERYBODY who experiences work outages posts on THIS thread.


No, but you can list all the other threads reporting the same issue if you like, or the individual posts. Perhaps I missed them.


Hmm. You really see only what you want to see, don't you? Why don't you take a look at the posts in the current "Houston, we have a problem ..." thread? There are, of course, other posts in other threads, but you probably don't want to look at them because they would not support your preconceived notions. Lame!

deesy
7) Message boards : Number crunching : Servers? (Message 68142)
Posted 20 Oct 2010 by deesy58
Post:

Do you actually think that there might be multiple versions of BOINC, and that my version might be different from yours?

Of course there are and running on a wide range of platforms too.


Q.E.D.

Don't we all run the same basic version of BOINC and Rosetta that are automatically updated when necessary?

Rosetta yes, Boinc no, it certainly isn't updated when necessary unless you manually download a new version.


Okay. I am running Version 5.10.28. Is there a newer/better version available for Windows?


From the first link in the quote above you will see the currently recommended version of BOINC for Windows 2000/XP/Vista/7 is 6.10.58. It comes in 32 bit and 64 bit versions.

I cannot provide either positive or negative commentary on the latest version as my system is stable with version 6.4.5. I expect that I won't be upgrading until either my system starts to become unstable or I hear of a new feature that may be useful to me.


Oops! Typo! It is version 6.10.58.

Sorry!

deesy
8) Message boards : Number crunching : Servers? (Message 68137)
Posted 19 Oct 2010 by deesy58
Post:
In msg 1 you said you had 14 tasks ready to upload, 5 of which had failed. Where did you get 20 from? If you'd actually had 20 it might well have been enough. Personally I keep 2 days' worth of 8 hour tasks too, which on my unattended Vista quad machine running 24/7 is around 24 tasks. I don't know how many cores you run (2?), but it doesn't seem to add up.


Well, this is simple first grade Arithmetic. What I said was: "I have 14 completed tasks waiting. Of the 14, 5 failed on "Computation Error[s]", and the others are "Ready to Report"." I have two tasks running, and four more "Ready to Start." I believe that if you add 14 + 2 + 4 you will get a total of 20. That is, unless you do Arithmetic differently on your planet ...


Looking through this thread I see, in my absence, only one other person confirmed your problem and they found they had to restart Boinc locally to restore connectivity. There's a hint.


I guess you assume (incorrectly) that EVERYBODY who experiences work outages posts on THIS thread. Not only is Arithmetic invalid on your planet, but it appears that Logic is also invalid.

Do you actually think that there might be multiple versions of BOINC, and that my version might be different from yours?

Of course there are and running on a wide range of platforms too.


Q.E.D.

Don't we all run the same basic version of BOINC and Rosetta that are automatically updated when necessary?

Rosetta yes, Boinc no, it certainly isn't updated when necessary unless you manually download a new version.


Okay. I am running Version 5.10.28. Is there a newer/better version available for Windows?

If you believe that I have somehow acquired a defective version of BOINC, you should say so, and you should tell me (and everybody else) how to correct it.

Just as well I didn't say so then. There are more issues than just the software and platform to consider, like with any software or connectivity issue.


Well, you seem to be focused exclusively on software and platforms in this thread. What else (of any value) do you have to contribute?

It appears that you are making unwarranted assumptions, again.

I don't think so. You're saying the servers were up but you couldn't connect and you're assuming that's because the servers weren't actually up. But the only person reporting the same as you restarted Boinc locally and the problem went away. I know the servers were up because I connected several times when you were still suffering a problem and I happen to know I don't have a magic key that lets me in and keeps you out. If it really was a lottery I'd expect occasional failures to connect for me and occasional successes for you. That didn't happen for me (I'm just not that lucky, unfortunately) and you never did quite detail answers to my question about what errors you had and at what times...


Maybe you should read the moderator's posts. It is clear from posts made by Mod.Sense that the time interval during which any user might not receive work after recovery from a Rosetta server outage is purely a matter of chance. Do you dispute that assertion?

Thanks for that. First, as I pointed out before, these are Boinc defaults, not Rosetta defaults (though I doubt that makes a lot of difference in itself tbh). More importantly, defaults in this situation are lowest common denominator across all projects, not tailored to meet every eventuality of this one - especially as a 3 hour runtime and a 0.25 buffer would hardly be recommended as a panacea for all Boinc projects by anyone. That's enough, but 3rd, as has been pointed out already, no project can guarantee uptime, so putting all eggs in one project's basket is going to result in a problem eventually. If Rosetta goes down for a month, we're all running out of Rosetta tasks, though I'll still be running 24/7 here.


Your penchant for avoiding specific questions is reminiscent of a politician. Perhaps you should consider taking up politics (if you haven't already).

Let's boil this down to basics. How do you explain a user with a two-day buffer running out of work for more than nine hours after Rosetta's servers fail for a period of only nine or ten hours? Mod.Sense explained it quite clearly by pointing out that it is a matter of CHANCE/LUCK/FORTUNE/KISMET/HAPPENSTANCE when additional work might be received, and my experience bears out that assertion. If you have an alternative answer that makes any sense at all, perhaps you would like to share it?

If I have a two-day buffer for a dual-core processor, then I have 24 tasks in my buffer. This is in addition to the two tasks that are currently being processed. These two tasks will complete anywhere between one minute and four hours after the inception of the server outage, but never in less time. That means that I will have a MINIMUM of 48 hours of processing to complete before running out of work. With a two-day buffer, my machine should never even notice a 9-10 hour server outage.
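Assuming four-hour tasks (a hypothetical figure, but one consistent with the buffer sizes being quoted in this thread), the arithmetic above can be checked with a quick sketch:

```python
# Back-of-the-envelope check of the buffer arithmetic above.
# Assumed figures (hypothetical, matching the post): a dual-core
# machine, 4-hour tasks, and a 2-day work buffer.
CORES = 2
TASK_HOURS = 4
BUFFER_DAYS = 2

# Tasks needed to keep both cores busy for the whole buffer window.
buffered_tasks = BUFFER_DAYS * 24 * CORES // TASK_HOURS

# Hours of processing those buffered tasks represent, cores in parallel.
hours_of_work = buffered_tasks * TASK_HOURS / CORES

print(buffered_tasks)   # 24 tasks queued
print(hours_of_work)    # 48.0 hours of work, dwarfing a 9-10 hour outage
```

On these assumptions a correctly filled two-day buffer should indeed ride out a 9-10 hour outage without the machine ever going idle.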

Would it be an improvement to limit the number of work units that were distributed to each user for some period of time after a server outage - say 24 hours? Why not ration the work until the system has completely recovered from the outage? It's difficult to see how such an approach would not be more efficient.


Actually no. If it was a case of the server struggling to meet the demands on it (which I don't accept at all btw, but just say mod.sense is right) a user would get insufficient tasks on an eventual successful connection, so it'll just come back again and again to fill the rest of its buffer, resulting in more hits even after connection was successful, not less.

Where this strategy would help is if there were only a few tasks to grab and it was better if everyone got something to get them started. That wasn't the case from what I can recall. (Just seen your convoluted analogy. Very good, but typically it's the one that didn't apply in this situation).


Well, isn't it the whole point of grid computing that a large number of computers should each be able to perform a small portion of the total work? Tell me how this goal is met if a number of computers receive no work at all while others receive more than they can possibly process within a short time? Users can select a buffer of as many as ten days. What is accomplished for the benefit of the project as a whole when some contributors sit idle waiting for work while other users have accumulated enough tasks to keep their machines busy for an additional ten days? Whether it happens to be the current BOINC/Rosetta design, or whether it is difficult to accomplish, is not the point. The point is that it is INEFFICIENT, and it does not speak well for the architecture of the software.

Stop looking at the servers in a microcosm, and look at the entire grid as a whole. It might change your perspective.

For the record, these weren't 'surplus' tasks - simply the rate my unattended quad machine completed tasks over that 36 hour period (plus refilling the buffer). No doubt I had some of the early 2.16 tasks that crashed on start-up among those.


Well, if they're sitting in your buffer waiting to be processed, then they are (for some period of time) surplus. Your quad core is not infinite in its capacity. It can process only a certain number of tasks simultaneously. Any tasks that are waiting for current tasks to complete are, by definition, surplus. How is it sane for some users to have extra tasks waiting in a buffer while other users are not able to obtain any tasks at all?

This is the issue that nobody has been able to explain to me, or to anybody else who has experienced the same problem.


It's sounding more like a local connectivity issue the more I read. Did you try restarting Boinc and/or your computer at any point when you saw the servers were reporting as up, or is this thread really all down to you blaming someone else before checking at your end? That would explain everything, wouldn't it?


Read my earlier posts. I have a broadband connection that is always available, and I have no access problems with any other Internet sources.

You seem determined to rationalize your position that there is something wrong with my system and its settings, but you are unable to explain just exactly what that might be. A little more analysis, and a little less emotion might be useful. Ranting is not.

Anyway, I'm sure everything's solved now. Until the inevitable next time... ;)



If nothing has changed, why would you believe that everything is solved now? If there was a problem, and no repair was made, wouldn't it be irrational to expect that the problem was solved? How can you be so sure that, the next time Rosetta's servers fail, it won't be you who has to wait for nine or ten hours with no work?

deesy
9) Message boards : Number crunching : Servers? (Message 68091)
Posted 14 Oct 2010 by deesy58
Post:
I believe the implication was that perhaps you were having a problem more specific to your machine than the project in general (such as being handed a small pile of tasks that fail immediately).


Not exactly. I believe that my observed problem began before the introduction of minirosetta Version 2.16 tasks. Even though a number of 2.16 tasks failed immediately on initiation after receipt, I still received at least two tasks that began running successfully within one minute, and I had several more that were entered into my buffer.

My issue is, and always has been, that it takes as long as two days or more after an outage for my machine to acquire new work. As a result, all of the tasks in my buffer are completed and my machine stands idle for a number of hours waiting for work. This has occurred on more than one occasion, and I was at a loss to explain it, especially since some other users were asserting that it was because of something I was doing incorrectly.

Now that you have explained clearly that it is purely a matter of chance that determines when a user might receive new work after a server outage, these misleading assertions by others are effectively debunked, and I realize that there is probably nothing at all wrong with my machine or its settings. It seems clear that those other users were, for whatever reasons, "blowing smoke" in order to spread FUD.

It seems clear that a good analogy might be playing the lottery. One might win a jackpot with the very first ticket purchased, but one also might purchase tickets religiously for years without winning anything at all.

Thanks.

deesy
10) Message boards : Number crunching : Servers? (Message 68088)
Posted 14 Oct 2010 by deesy58
Post:
no queue at all -- solely a matter of luck

Yes, the host will be short of work very quickly in such a case and then be vying for a scheduler request with everyone else. Odds being no better nor worse at getting new tasks.


Thanks. That explains a lot.

It never ceases to amaze me how so many people confuse simple luck with superior abilities. :)

deesy

11) Message boards : Number crunching : Servers? (Message 68082)
Posted 13 Oct 2010 by deesy58
Post:
In rereading your original posts, I believe the extended backlog on the servers was due to the tasks that had a new optional parameter specified, which was no longer supported by the v2.16 version (the failures you mentioned). So a lot of the work that was being sent out did not keep the machines busy. The tasks immediately failed. The machine immediately required more work and the cycle continued.


Although I am not positive of this, I thought we were still working with minirosetta 2.14 when the outage occurred. Was this not really the case? When was the switch from 2.14 to 2.15 made?

I do not believe that my machine's lack of work had anything to do with Version 2.16, although the initially large number of computation errors clearly did.

Do I understand correctly that if a machine is sent a number of tasks that fail shortly after being downloaded and processing has begun, then that machine can almost immediately request additional work and enter the queue ahead of all the other machines that are still waiting for work? Or is it that there really is no queue at all -- that it is solely a matter of luck whether a server is available and idle during the very few milliseconds during which a request for work is received?
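The second possibility (no queue at all, with success a matter of whose request happens to land when a scheduler slot is free) can be illustrated with a toy simulation. Every number in it is invented purely for illustration:

```python
import random

# Toy simulation of the "no queue, pure chance" model described above:
# work-starved hosts retry at random moments, and only the requests
# that happen to arrive while a scheduler slot is free get served.
# All parameters are made up for illustration.
random.seed(1)

HOSTS = 50           # hosts competing for work after an outage
SLOTS_PER_TICK = 3   # requests the server can satisfy per time step
RETRY_PROB = 0.2     # chance a waiting host happens to retry this step
TICKS = 200          # time steps simulated

waiting = set(range(HOSTS))
served_at = {}                       # host -> tick it finally got work
for tick in range(TICKS):
    requesters = [h for h in waiting if random.random() < RETRY_PROB]
    random.shuffle(requesters)       # no ordering: pure timing luck
    for h in requesters[:SLOTS_PER_TICK]:
        served_at[h] = tick
        waiting.discard(h)

waits = sorted(served_at.values())
print("first host served at tick", waits[0], "- last at tick", waits[-1])
```

Even in this crude model the spread between the luckiest and unluckiest host is large, which is consistent with some machines refilling immediately while others sit idle for hours.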

deesy
12) Message boards : Number crunching : Servers? (Message 68080)
Posted 13 Oct 2010 by deesy58
Post:
Because Rosetta does not guarantee to provide you with work at all times. BOINC includes the option to subscribe to other useful projects to keep your computer busy when your main project is experiencing difficulties. That you have chosen to work exclusively with Rosetta is admirable, but you do have to accept the pitfall that you have no backup and that you will likely be one of the first to experience any difficulties and see them last longer than other users.


Whether Rosetta makes any specific guarantees is not the issue. The issue is why some users might receive additional work immediately after a server outage while other users must wait for an additional (sometimes prolonged) period before receiving new tasks.

People who have been able to select more than one project encounter problems more rarely as their systems adjust automatically to compensate.


I began participating in the Seti@home project shortly after its inception. I switched to Predictor@home for the period during which it was running in California. Then I switched back to Seti for a while, before joining the Folding@home project under a couple of different user and team names. I accumulated more than a million points under my most recent user ID before switching to Rosetta@home. I am processing for Rosetta for a very specific reason. I believe that their project is more nearly an "applied science" than it is a "theoretical science." I understand that the Project is searching for the causes, cures and treatments for specific diseases. If my understanding of Rosetta's ultimate goals were to change, I would probably look for a different grid computing project to join. As of now, there are no others of which I am aware that would persuade me to apply any of my computing resources.


Because a one size fits all approach doesn't work. One example is that a default setting of a 10 day buffer would be problematic for people with intermittent connections. If you have a set of tasks with a 12 day turn around but only connect to the network once every 7 days, you will return some of your tasks on the 7th day but the rest will expire on day 12; when you reconnect to the network to upload your tasks on day 14 you will find that all work between day 7 and day 10 has expired and is effectively wasted.

For some people a 10 day work buffer may be an ideal solution, which is why you can customise your settings to match your own circumstances.


I take your point. Thanks for the clarification.

deesy
13) Message boards : Number crunching : Servers? (Message 68079)
Posted 13 Oct 2010 by deesy58
Post:

I am really unclear why you presume other users were requesting, and being granted piles of surplus tasks while you were without work.

The whole site was down for about 10 hours on Saturday 9th. When it came back up, my logs show no problem uploading or downloading straight away (sending 6, receiving 8 WUs). I have no idea why your machine didn't dial in for a further 24 hours, but Boinc gets funny sometimes. Your buffer is inadequate, as we discussed before, but apparently you know best so I assume that was intentional. No problems on an unattended machine here because I don't fail to plan. Funny how that keeps on working for me.


Before you posted for the first time, my unattended machine had connected/ul'd/dl'd 5 times successfully over the previous 3.5 hours and, it seems, did so again 11 more times by the time your machine got its first tasks.


Seems clear to me.

deesy

14) Message boards : Number crunching : Servers? (Message 68075)
Posted 13 Oct 2010 by deesy58
Post:

...ah! NOW you are coming around to where some of the original conversation was focused. If you can crunch 2 days worth of work in 9 hours, it clearly indicates that you did not truly have 2 days of work; right? Simple as that. And that means the BOINC client did not request enough to truly maintain a 2 day cache. And that is exactly the sort of thing other people were observing throughout the BOINC community, and changes were made to BOINC to refine the work-fetch rules to avoid idle CPUs.


Not really! I never said that I could crunch two days of work in nine hours. The servers might have been down for 9 hours, but my machine received no new work for more than 24 hours. The result was that all of the work in my buffer was completed, and I had more than twenty tasks waiting to report. During the first 48 hours after the onset of the outage, my machine continued to work through the contents of its buffer. When that was all completed, it was forced to wait for additional work from Rosetta's servers, which was not forthcoming for an additional nine hours. This is the issue that nobody has been able to explain to me, or to anybody else who has experienced the same problem. For additional clarification, please go back and re-read my first two posts on this thread. Note that the interval between them was almost 22 hours. Note also that the servers had already been down for some number of hours before I noticed it and made my first post.

If I/O on the servers really occurs the way you explain, then it would take no longer at all to supply my dual-core, broadband-connected machine with four new tasks than it would to supply User X with 300 tasks over a dial-up connection. This is illogical on its face, despite the capabilities of computer multi-tasking. It assumes that a single disk access could supply all 300 records to be sent, and that the communications I/O time is essentially zero. It also assumes an infinite number of simultaneous client connections.

The Rosetta hardware is, indeed, powerful, and the use of a fiber SAN to connect the servers to the storage system is a great system design feature. The disks are quite fast, but they are still rotating storage, and the volume of I/O is still quite significant. Even though the disks are organized as a single LUN, aren't access times going to vary? Could anybody be confident that all of the data required for a large download would be co-located on the same track[s], or on adjacent tracks of the disk subsystem? Does the caching system completely eliminate any access latency? Is the cache "hit rate" 100%?

All this might be the case, but I remain skeptical. No offense.

I'm still not at all clear why, with a two-day buffer, my machine becomes idle, and remains idle, after brief temporary server outages. If the resource "cost" for a large buffer is insignificant, why doesn't BOINC default to a much larger one - say 10 days or so?

deesy
15) Message boards : Number crunching : Servers? (Message 68055)
Posted 12 Oct 2010 by deesy58
Post:
If I understand you right, you are suggesting that the backlog will be cleared sooner if the bank limits withdrawals to $250 per customer per some time limit. But to do so, each teller now has to verify the time limit, one which varies for each specific customer, before completing a transaction, and the customer has to make multiple transactions just to get the $1,000 they came for. Won't this make each transaction take slightly longer? And wouldn't that make it take longer to get on top of the backlog?


I don't think your example is exactly on point. It takes the same amount of time to withdraw $1 as it does to withdraw $1,000. Perhaps a better example might be the situation where you are standing in line at the bank because you only need to cash a check. Five places in front of you in the line is a person who is purchasing seven cashier's checks, certifying two additional checks, depositing the children's piggy banks, making a mortgage payment, making a car payment, and paying all of his/her utility bills.

Supermarkets have solved this type of problem (at least in the U.S.) by implementing "Express Lanes" where only a limited number of items can be checked out. Before the establishment of such conveniences in virtually all supermarkets, it was possible to stand in line for 15 or 20 minutes (or more) just to pay for a single carton of milk.

The approach, and this is from the Berkeley server code, is instead to try to make best use of each contact with the client. "Best" here means send them everything they need. You don't know when they will be able to connect again. This might be the only work request the machine is allowed to make all week. Any client requesting "more work than it could possibly process" is already refused. In general, if everything is running well on the client side, no such requests are ever made.


Hmm. If I understand correctly, a user can request as much as ten days worth of additional work to be loaded into a buffer. If that user has a quad-core processor, and is using the default 3-hour run time, how many work units would be downloaded to that user during a single connection, and how much time would that process take? Assume that the user is, as you point out as an example, using a dial-up connection, and the server must wait for the completion of the transaction before responding to a request from another user.

I believe what you are suggesting though is to only send one task per CPU for example.


Actually, no. Since the number of processors in use by the average user is one, two or four, I would suggest that the number of tasks to be downloaded during the first connection after a server outage be limited to four.

The idea has some merit, indeed I've had thoughts along those lines myself, but the code changes to determine when to enter this server conservation mode add additional overhead and are quite complex, and the potential benefits are fairly minimal.


I agree that the task would be complex. This might, however, be one of those situations where Occam's Razor might not apply. Without some sort of cost/benefit analysis, we'll never know. The question is, what would be the overall effect on the productivity of the project as a whole?

There are a certain number of database hits that occur for each scheduler request, regardless of the amount of work being sent. So if you are doing 20 IOs already, to send out 1 task, why not do 5 more and send out 6 tasks and fulfill the entire request? Your alternative is to process multiple server hits, doing the 20 IOs multiple times to send out 6 tasks.
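The quoted overhead argument can be made concrete with a toy cost model. The 20-IO fixed cost and the "5 more IOs for 5 more tasks" figure are from the quote; the model is a simplification, not the actual server accounting:

```python
# Toy cost model for the quoted argument: every scheduler request
# costs ~20 database IOs (which already covers sending one task),
# plus roughly one extra IO per additional task in the same reply.
def request_cost(tasks_in_request):
    # figures taken from the quoted post
    return 20 + (tasks_in_request - 1)

def total_ios(tasks_needed, tasks_per_request):
    # ceil division: number of scheduler contacts required
    requests = -(-tasks_needed // tasks_per_request)
    # for this sketch, assume requests divide the work evenly
    return requests * request_cost(tasks_per_request)

# Filling a 6-task request in one contact vs. one task per contact:
print(total_ios(6, 6))  # one contact:   25 IOs
print(total_ios(6, 1))  # six contacts: 120 IOs
```

Under this model, fulfilling the whole request in one contact is roughly five times cheaper for the server than trickling out one task at a time, which is the point the quote is making.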


Could you expand on this a little?

You could argue that there are multiple server hits occurring now as well, and that is certainly true. But when a web server is backlogged and requests are timing out, the server basically never sees the ones it couldn't get to in time, so they bear no cost to the server's performance. So the approach taken by Berkeley yields the least demand for resources on the server overall. The number of requests actually processed is minimized by this approach.


Well, yes, if one focuses only on the loading of the server, and not on the production of the entire grid. Let me try a different analogy:

Suppose we have a network of pumps that remove water from an area that is prone to flooding (New Orleans, The Netherlands, etc.) and the supply of diesel fuel used to power the pumps has been temporarily interrupted. When a shipment of fuel arrives, would it be better to start with the first pump and completely fill its fuel tank, giving it enough fuel to run for several days (or more), but at the expense of having insufficient fuel to run other pumps? Or might it be better to ration the fuel amongst all of the pumps, ensuring that all of them are brought back on line as quickly as possible? Perhaps all pumps could be kept running until additional shipments of fuel arrive at the levees/dikes.

No project will have work available all of the time. With a two day buffer, you have already mitigated your risk of not having work during an outage. Unfortunately for you, it would seem your two days was up (i.e. your need for more work occurred) during an outage. If the average outage is 6 hours and the average server recovery time is 2 hours, your 2 day buffer already reduced your odds of encountering a server backlog to 1 in 6. Given those (roughly historical, yet guesstimated) numbers, 5 out of 6 times you would be completely unaware of a 6hr outage when carrying a 2 day buffer of work.
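The 1-in-6 figure quoted above can be reproduced with a quick back-of-envelope calculation. The durations are from the quote; the interpretation (a client only notices an outage if its buffer runs dry during the outage-plus-recovery window) is an assumption:

```python
# Back-of-envelope check of the odds quoted above.
OUTAGE_H = 6       # average outage length, hours (from the quote)
RECOVERY_H = 2     # average server recovery time, hours (from the quote)
BUFFER_H = 48      # two-day work buffer

window = OUTAGE_H + RECOVERY_H   # hours during which requests would fail
odds = window / BUFFER_H         # fraction of clients whose buffer
                                 # empties inside that window
print(f"{odds:.3f}")             # about 0.167, i.e. roughly 1 in 6
```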


This sounds completely reasonable and logical, but it does not reflect my experiences during two server outages since I have been participating in this project. If I have a two-day buffer (plus the tasks that are currently being processed), and if the servers are only out of service for about nine hours, how is it that my machine works its way through the entire buffer and runs out of work for (in the most recent case) more than nine hours? The numbers do not seem to add up.

I don't think I am the only one who has thought about this question, so any additional light you might be able to shed would probably be appreciated by others, also.

Thanks for sharing your knowledge.

deesy


16) Message boards : Number crunching : Servers? (Message 68052)
Posted 12 Oct 2010 by deesy58
Post:
If you picture a bank, with 1000 customers demanding to withdraw their savings immediately, only with no lines... just a free-for-all on how and when a customer reaches a teller, that is what the world is like for an internet server (of any kind, not just BOINC). So while the bank is open (i.e. servers are running), your request may time out before it gets a reply. So, in addition to cases where BOINC clients sit idle, not requesting more work when you would think they should, the response times from an overloaded server are always highly variable. It just boils down to random chance. When a request is received, it is fulfilled the same way as when the servers are not busy (i.e. the request, if it is completed, gets assigned all of the work necessary to fill it, assuming enough work is available). So the small percentage of requests that complete in the chaotic bank lobby are handled with the same care and attention to detail as when the bank is not busy.

Updating to the project only puts your hat in the ring more times (assuming the updates are not so frequent that you get "last request too recent" messages). If the server is still busy, you still only have a 1% chance of getting lucky. The good news is that the servers are generally able to work through such backlogs in just a few hours, and so many people don't even notice there was an outage.


I understand your analogy to a bank. Wouldn't this be similar to the way that Web servers and Database servers function?

Would it be an acceptable design for a Database server to fill a request for multiple records (many more than the client could possibly process at the time) while other requests go unfilled for as long as several hours? I can imagine the havoc that would be generated in a large client-server or Web-based ERP system (for example) if such a strategy for recovery after an outage were employed. Would it be an improvement to limit the number of work units that were distributed to each user for some period of time after a server outage - say 24 hours? Why not ration the work until such time as the system has completely recovered from the outage? It's difficult to see how such an approach would not be more efficient.

Food for thought.

deesy
17) Message boards : Number crunching : Servers? (Message 68042)
Posted 11 Oct 2010 by deesy58
Post:
I, also, was unable to successfully ping BOINC during the outage. Since it appears that there really was a server outage, this would not be surprising.

deesy
18) Message boards : Number crunching : Servers? (Message 68040)
Posted 11 Oct 2010 by deesy58
Post:
deesy58, in the future, please do not respond to posts you find not particularly helpful. Often when server outages occur, various users have different observations over time as things recover. One user stating their observations after you have stated yours should not be taken as any contradiction nor expectation on what you observe. If you string together 5 or 6 such factual observations, you can often see progress with time on getting things back to normal. And so it is commonplace to make such posts in threads such as this one.

When facts about errors, and retries are omitted from problem descriptions the reader is left to presume many things; especially in a project where you can specify a preference for tasks to run anywhere from an hour to 24hrs.

If the reader frequents a number of BOINC project message boards, they often fill in missing details with similar problems they are familiar with. There are a number of issues where BOINC's core client is not requesting work from projects even when cores are idle.

BOINC versions do not automatically update themselves, so every client machine can be different. Indeed there are scores of versions possible.

When you have a number of rapid failures in a row, the BOINC core client can get confused about how long to expect tasks to take to complete and has trouble requesting a proper amount of work to match the desired network preferences. It appears a batch of v2.16 tasks that were sent out failed on startup. These have since been removed. When such pervasive problems are encountered, the servers get bottlenecked trying to replace and reissue the failing tasks.

No, project servers do not contact the attached machines when they recover from an outage... in order or otherwise. Project servers never contact the attached machines; the architecture is always client-pull, not server-push. Depending upon how many times your machine tried to contact the project, the delay time until the next request grows increasingly large, which may explain any time gap where your machine did not attempt to contact the project for a few hours (if that occurred).
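The growing retry delay described above can be sketched as a standard exponential backoff with jitter. This is a generic sketch of the pattern, not BOINC's exact schedule or constants:

```python
import random

# Illustrative client-side retry backoff, similar in spirit to what the
# post describes: each failed contact roughly doubles the wait, up to a
# cap. The constants here are made-up examples, not BOINC's real values.
MIN_DELAY_S = 60          # first retry after about a minute
MAX_DELAY_S = 4 * 3600    # never wait longer than a few hours

def retry_delay(failures):
    """Delay before the next scheduler request after `failures` failed tries."""
    base = min(MIN_DELAY_S * (2 ** failures), MAX_DELAY_S)
    # jitter spreads clients out so they don't all retry at the same instant
    return base * random.uniform(0.5, 1.0)

for n in range(8):
    print(n, round(retry_delay(n)))
```

With a schedule like this, a machine that failed to connect several times in a row can plausibly go hours between attempts, which would account for the gaps users report.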


Thanks for the advice, Mod.Sense. I suppose I shouldn't be responding to a Troll. It's just that the implications in his post are that NOBODY ELSE experiences any idle time, and that it is because of something I am doing wrong, but which can't be explained by anybody. Actually, I have received e-mail from other participants that contradict his assertions, so I shouldn't really pay any attention to his sarcastic rants.

Your explanation could, however, be a little more clear:

Why, for example, might it be the case that some users appear to have their work buffers replenished very quickly after an outage, while others are forced to wait a number of hours for additional work? Why is this case even if BOINC is exited and restarted, and why would it be the case even if "Update" is selected? Is there anything at all that can prevent this idle time on a user's computer (other than selecting a massive 10-day buffer)?

I understand that the system uses a "pull" technique to distribute work. What I don't understand is why my machine repeatedly reports that the servers might be down even when the "Server Status Page" reports all servers up and running.

Not that this is a big thing, but the Troll seems to be implying that I am doing something wrong if I accept BOINC defaults. If the settings are wrong, why would they be defaults? Do you have any thoughts on this specific matter?

BTW, my tasks are set to run for 4 hours with a two-day buffer. I have a broadband connection that is always available, and I have selected no restrictions in my preferences (that I am aware of). I allow 100% of the processors on my machine to be used 100% of the time. I switch between applications every 50,000 minutes. (Is this wrong?)

I allow a maximum of 100 Gigabytes of disk space or 50% of total available disk space. I restrict swap space to 75% of the page file. Memory usage is limited to 70% when the computer is in use, and 90% when the computer is idle. Applications remain in memory when suspended.

Which of these settings would you recommend be changed to ensure that my machine will not be idled for prolonged periods after a Rosetta server outage?

Thanks!

deesy


19) Message boards : Number crunching : Servers? (Message 68036)
Posted 11 Oct 2010 by deesy58
Post:

# of tasks in your buffer isn't sufficient information. But you know that, so I leave it to you to fill in the blanks.


My buffer is large enough to hold a two-day supply of work. If Rosetta's servers are out of service for only nine or ten hours, a two-day buffer should be adequate, shouldn't it?

The rest isn't relevant if Boinc doesn't call for tasks. Before you posted for the first time my unattended machine had connected/ul'd/dl'd 5 times successfully over the previous 3.5 hours and, it seems, did so again 11 more times by the time your machine got its first tasks. How many times in the 14hr period (not more) before you received your first tasks did you get the reply "Internet access OK - project servers may be temporarily down" or "Scheduler request completed: got 0 new tasks"?


There you go again. I suppose you believe that if it is raining at your house, it must, necessarily, be raining at everybody else's house, too. How simplistic!


The first answer I think will be 'none' because the task server was up when the website returned and you posted after that. The second answer, from what you've said, might be once, but seeing as you received many tasks as soon as your machine reported it may also be 'none'. In which case it's a different problem that may be entirely local.


You think wrong! My machine attempts repeatedly to connect to Rosetta's servers to acquire new work, and receives repeated error messages. I guess you really don't have a very good understanding of how computers and their software work. Do you really believe that all users can be magically and instantly supplied with adequate work when previously-down servers come back on-line? I suppose so! As Arthur C. Clarke once said: "Any sufficiently advanced technology is indistinguishable from magic."

I can't see where I've said no-one else should have a problem because I didn't. What I'm seeking to clarify is whether Boinc on your machine didn't ul/dl tasks because it didn't actually ask for any. Why that should be, I can't say, but Boinc does get its task requests wrong a lot (something I see regularly, if not often) so it wouldn't surprise me. Again, that would be a local problem.


Do you actually think that there might be multiple versions of BOINC, and that my version might be different from yours? Don't we all run the same basic version of BOINC and Rosetta that are automatically updated when necessary? If you believe that I have somehow acquired a defective version of BOINC, you should say so, and you should tell me (and everybody else) how to correct it.

What do you mean by "a local problem?" Could you be more specific? It appears that you are making unwarranted assumptions, again. Your posts are not particularly helpful, regardless of how knowledgeable you think you might be.

deesy



20) Message boards : Number crunching : Servers? (Message 68023)
Posted 11 Oct 2010 by deesy58
Post:
The whole site was down for about 10 hours on Saturday 9th. When it came back up, my logs show no problem uploading or downloading straight away (sending 6, receiving 8 WUs). I have no idea why your machine didn't dial in for a further 24 hours, but Boinc gets funny sometimes. Your buffer is inadequate, as we discussed before, but apparently you know best so I assume that was intentional. No problems on an unattended machine here because I don't fail to plan. Funny how that keeps on working for me.

On your computation error issues, all those mem_widd tasks show the same failure and Yifan has confirmed the problem is at Rosetta's end - see the 2.16 thread.


My buffer is large enough to hold about 20 tasks. Are you saying that 20 is insufficient?

My machine received no new work for at least 24 hours, even though you say that the site was down for only ten hours. Do you know the algorithm for reconnecting after an outage? Perhaps the servers connect to one user at a time, and completely load their buffers with work before moving on to the next user, instead of giving each contributor two or four tasks so that everybody is able to contribute again as soon as possible. Although such an algorithm would be simpler than ensuring that all contributors have work as quickly as possible, it would be less efficient, wouldn't you agree?

There you go again. Assuming that, just because you have a given experience, therefore everybody must have the same experience is a little simplistic, and it is an example of defective reasoning.

A lot of faulty thinkers believe (mistakenly) that fortune is actually the result of their superior reasoning/planning/business skills. I suppose if you won the lottery you would believe that it was because of your exceptional Math skills.

deesy





©2024 University of Washington
https://www.bakerlab.org