Problems and Technical Issues with Rosetta@home

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 77253 - Posted: 3 Aug 2014, 15:06:47 UTC
Last modified: 3 Aug 2014, 15:13:05 UTC

So what is being done to future-proof this project and prevent overloads like this from happening again?

Is there a way to allow new accounts to be added without crashing the system like this?

What, if anything, can you do differently to prevent this from happening again? Is there some automation you could write, or more modern hardware you could install, that would let such a big wave of new users enter the system without crashing it?

Credits don't matter to me; it's just that my BOINC Manager was getting clogged up with Rosie's problems, and the other projects that use my system also fill the screen, so it was turning into a mess. Besides, since the tasks could not be reported or uploaded, BOINC Manager seemed unable to allocate resources properly.

Also, we have been asking for years for someone to keep the main page up to date with information about problems and other news. Right now, the only main-page post about this problem is some technical material by KEL; nothing says it has been solved. Though I guess you can get all that from here.

Communication with the outside, non-scientific world has always been a challenge for this project, and I have suggested before that hiring, or recruiting, a volunteer from the communications department's student pool would be a plus. They could be the project's PR person/spokesperson. Internal communication also seems to be a problem.
Yes, one has the right to dump one's phone and computer and return to the basics, but it would be nice if there were a backup person who knew about things like this charity organization and could then tell the others that a big clump of new users might be coming online.

Anyway, I hope you learned something from this and will improve things for the future.

Happy Crunching...
ID: 77253
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 77254 - Posted: 3 Aug 2014, 15:30:07 UTC

Greg_BE wrote:
but it would be nice if there were a backup person who knew about things like this charity organization and could then tell the others that a big clump of new users might be coming online


You can only have a backup person who knows about a new surge of users if the primary contact knows about it themselves. From David's comments, I surmise that this surge came entirely without warning.

David E K wrote:
We were not warned of the spike, do not know the cause yet, and are not prepared to serve the large executable and database files currently.


David E K wrote:
I was told by Matthew Blumberg at Gridrepublic that the new users are real crunchers and that they "started a new marketing campaign via charityengine.com." So I re-enabled the account creation for these users. Our servers may get sluggish again but hopefully things will settle down as the new user rates decrease. And hopefully optimizing the connections on our servers will help. In the future, we hope to get more servers.



Greg_BE wrote:
What, if anything, can you do differently to prevent this from happening again? Is there some automation you could write, or more modern hardware you could install, that would let such a big wave of new users enter the system without crashing it?


...
...

Also, we have been asking for years for someone to keep the main page up to date with information about problems and other news. Right now, the only main-page post about this problem is some technical material by KEL; nothing says it has been solved. Though I guess you can get all that from here.


Partially answered by David E K and krypton above...

krypton wrote:
We will be getting more servers, to prevent this from happening in the future.

Once we know who these new users are, we'll post something on the front page.


In terms of the main page update, the source of the problem was identified late Saturday/early Sunday, depending on the time zone. Hopefully the promised main page update will occur during normal working hours on Monday.
ID: 77254
Gallstone
Joined: 31 May 12
Posts: 3
Credit: 416,986
RAC: 2,615
Message 77255 - Posted: 3 Aug 2014, 15:42:23 UTC

Phew, my four overdue tasks have uploaded now. It dragged, it really dragged, but it finally worked.

In three of the four cases I still got points because I uploaded the task before my successor did. Only in one case did my successor overtake me, so I got no points.

Interesting.

OK, hopefully everybody learned something from this.

One piece of advice to the technical staff: if possible, please treat incoming data with higher priority than outgoing data. Just as in normal life, give higher priority to older unfinished processes/tasks/jobs/duties before taking on newer ones. Alternatively, reserve a certain number of network connections for incoming (result) data.
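
A rough sketch of the reservation idea, in case it helps. This is purely illustrative; the numbers, handler names, and the choice of Python are all mine, and a real server would enforce this in the web server or OS rather than in application code:

    # Illustrative sketch only; not the project's actual server code.
    # Idea: of MAX_CONN total connections, keep RESERVED slots that only
    # incoming (result upload) traffic may use, so a flood of database
    # downloads can never starve uploads completely.
    import threading

    MAX_CONN = 200    # hypothetical total connection limit
    RESERVED = 50     # slots held back for uploads only

    total_slots = threading.Semaphore(MAX_CONN)
    download_cap = threading.Semaphore(MAX_CONN - RESERVED)

    def receive_result(conn):      # stub standing in for the real upload handler
        pass

    def send_database_file(conn):  # stub standing in for the real download handler
        pass

    def handle_upload(conn):
        with total_slots:              # uploads may take any free slot
            receive_result(conn)

    def handle_download(conn):
        with download_cap:             # downloads are capped lower...
            with total_slots:          # ...and still count against the total
                send_database_file(conn)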

Now, I want to apologize for my heated statement earlier; I was just a little pissed off yesterday.
ID: 77255
Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,641,236
RAC: 111
Message 77258 - Posted: 3 Aug 2014, 22:29:19 UTC

The homepage puts R@H at 291 TFLOPS. Considering this is all x86 horsepower (no GPU clients), that is incredibly impressive! It would be nice if this became the new norm, as I'm sure it would have a tangible impact on experiment turnaround time and scientific progress.
ID: 77258
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 77259 - Posted: 3 Aug 2014, 23:05:56 UTC

Unfortunately, time will reveal the 200+ TFLOPS to be a blip, caused by so many people being unable to send back their results for several days. Several days of results were uploaded, and had credit issued, here in a single day.

But if there were about 65,000 active hosts returning work last week and there are now close to 80,000, then, if they all keep crunching, it would be reasonable to hope to see the project's TFLOPS increase by over 20%. And since many of those 15,000 new hosts have a better-than-average chance of being newer machines, their average capacity to do work may be a bit ahead of the previous average as well.
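
A quick back-of-the-envelope check on that figure, using the approximate host counts above (nothing official):

    # Rough check of the "over 20%" hope (approximate inputs).
    old_hosts = 65_000   # active hosts returning work last week
    new_hosts = 80_000   # active hosts now
    growth = new_hosts / old_hosts - 1
    print(f"{growth:.1%}")   # -> 23.1%, so over 20% is plausible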

Hopefully this temporary logjam was the price of bringing on a sustainable, larger base of crunching machines.
Rosetta Moderator: Mod.Sense
ID: 77259
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 77260 - Posted: 3 Aug 2014, 23:14:07 UTC - in response to Message 77258.  
Last modified: 3 Aug 2014, 23:29:47 UTC

The homepage puts R@H at 291 TFLOPS. Considering this is all x86 horsepower (no GPU clients), that is incredibly impressive! It would be nice if this became the new norm, as I'm sure it would have a tangible impact on experiment turnaround time and scientific progress.


It is up to 330 TFLOPS now. Of course, that also includes clearing the backlog of uploads, so the normal figure will settle somewhere lower.

Based on the graphs at BOINCstats, there are almost twice as many active users at the moment as in June, when Rosetta was running at 130 TFLOPS, so 260 TFLOPS would be a reasonable estimate. There are a couple of other factors that will affect things: how many Charity Engine clients still need to connect to Rosetta, and how often will those clients be connected? The CE site says the clients only run BOINC projects when there is no paid computation work available.

I expect the next hurdle for the scientists will be having enough work units ready to issue.


Edit:

Looking at the host graphs, the proportionate increase in active hosts is lower than the proportionate increase in active users. That would suggest Charity Engine has close to a 1:1 ratio of users to hosts, while many native BOINC users run multiple clients.

Based on the current data on new hosts, perhaps an increase of 25%? That would put the new speed at around 162 TFLOPS.
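
Putting my two estimates side by side (a rough sketch; every input here is approximate):

    # Two rough projections from the figures above.
    june_tflops = 130
    by_users = june_tflops * 2.00   # active users nearly doubled -> 260 TFLOPS
    by_hosts = june_tflops * 1.25   # active hosts up ~25% -> ~162 TFLOPS
    print(by_users, by_hosts)       # reality will likely land in between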

It will be interesting to see how it plays out in reality.
ID: 77260
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 77262 - Posted: 4 Aug 2014, 0:38:15 UTC - in response to Message 77255.  

<snip>

OK, hopefully everybody learned something from this.

One piece of advice to the technical staff: if possible, please treat incoming data with higher priority than outgoing data. Just as in normal life, give higher priority to older unfinished processes/tasks/jobs/duties before taking on newer ones. Alternatively, reserve a certain number of network connections for incoming (result) data.

<snip>


Well, I stopped by to learn something about what went wrong and how it was fixed, but I certainly didn't. Maybe I just failed to find the right comment in this thread, but we already knew there was a problem. We already knew that it was mostly fixed over the weekend, though I didn't see anything about the ongoing problem, which fortunately seems to be relatively minor. I'd describe it in detail, but it's really hard for me to believe they (the project managers) will suddenly develop improved communication skills to explain what is still wrong.

I'm basically willing to assume they will eventually get the last wrinkles worked out. I'm even willing to believe that the new server behaviors might be closer to the proper ones than before. There is one apparent change I've noticed that might represent a reasonable optimization... However, I'm not sure where I stand on the rumor of the cause being an NSA (or CIA or Mossad) intrusion, except that the random-incompetence theory militates against it. Too much risk of a clumsy monkey stumbling over something.

I'm replying to this particular comment because I strongly disagree with the second paragraph quoted above. As a donor to the project, I strongly prefer to continue donating, and therefore I think downloads of tasks should have priority. I see no problem with storing the pending work on my machine, provided the upload delays don't kill donations for the sake of deadlines that remain meaningless to the donors. (If it isn't an absolute deadline but just a discount time, then that's another communication failure on the part of the project managers...)

Back to the communications topic: the "News" entry on the main project webpage has already been mentioned, though negatively, as in not being used effectively. There are also two other existing communication channels that should be considered. One is the "Notices" tab of the BOINC Manager. If anyone attempted to use it, they certainly didn't get any message out.

The second poorly used communication channel is the Server Status page, which was not helpful. Specifically, I think the Server Status page needs to be mirrored on an external server so that it can also report on the status of communications to the servers from an outside perspective. The obvious solution is a reciprocal arrangement with other BOINC projects. That would only be a minor back-scratching mechanism, but hopefully it would lead to stronger back-scratching. It seems pretty unlikely that Rosetta is the first BOINC project to encounter and fix this particular problem, whatever it was.
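
At its simplest, the external check I have in mind is just an outside-in reachability probe. A minimal sketch (purely illustrative; I'm assuming the standard BOINC status-page path on boinc.bakerlab.org, and a real monitor would alert someone rather than print):

    # Minimal outside-in probe of a project's status page (illustrative only).
    import datetime
    import urllib.request

    STATUS_URL = "https://boinc.bakerlab.org/rosetta/server_status.php"  # assumed standard BOINC path

    def probe():
        stamp = datetime.datetime.utcnow().isoformat()
        try:
            with urllib.request.urlopen(STATUS_URL, timeout=15) as resp:
                print(stamp, "reachable, HTTP", resp.status)
        except Exception as exc:  # timeout, DNS failure, HTTP 5xx, ...
            print(stamp, "UNREACHABLE:", exc)

    probe()  # run periodically (e.g. from cron) on a host outside the project's network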

By the way, an earlier reply "jumped on me" about the exact timing of the bandwidth-wasting 80 MB "Computation Error" tasks. That criticism was apparently based on my comment posted here several months after I noticed and started investigating those tasks. The critic was also confused about when they stopped, but mostly I just consider it more evidence of highly amateurish project management that the bandwidth was wasted for so many months. Not exactly a defense, but I'm not proud of the quality of all of my own work from graduate school either, so I feel like cutting them some slack on that point. I'm rather more concerned about whether the sloppiness extends to the research results derived from the Rosetta calculations... You have a rather large supercomputer here, and you seem to be taking it for granted, so to speak. (I might start looking for another project that appreciates my donations more, except that I've already participated in a couple of projects and discovered that none of them were perfect... Also, some of the researchers I support are collaborating with another department of your university.)
ID: 77262
Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,641,236
RAC: 111
Message 77263 - Posted: 4 Aug 2014, 1:08:14 UTC - in response to Message 77262.  
Last modified: 4 Aug 2014, 1:10:42 UTC

A quick recap of what happened, for anyone like shanen who missed it because it was buried in this thread:

- It appears that it was not a network/switch setting, nor any kind of hack/NSA intrusion (that rumor was started by someone as a joke).
- The cause appears to be a very large spike (20k+) of new users joining the project all at once and, as is necessary when first attaching to Rosetta@home, all requesting to download the main Rosetta database file (~250 MB if memory serves), which would mean transferring roughly 4,880 GB of data all at once from the R@H servers (see the arithmetic after this list).
- This large swath of users looks to be attributable to the Charity Engine project, which is built on the BOINC platform and attaches to a couple of BOINC projects (Rosetta being one of them) to keep its workers busy when there is no CE work to do. The arrival of this large pool of users was not communicated to Rosetta staff/management, so they had no forewarning and could not take measures to prepare for it.
- Incredibly bad timing compounded the issue: most of the Rosetta team was out of town attending a conference, while another key person was on a camping trip without any phone reception or internet access.
- Most of this logjam is now cleared and work has resumed as normal.
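
For what it's worth, here is the bandwidth arithmetic behind the second bullet, using the approximate figures above:

    # Download volume implied by the user spike (approximate inputs).
    new_users = 20_000      # the "20k+" new users, each needing the database
    db_file_mb = 250        # main Rosetta database file, ~250 MB
    total_mb = new_users * db_file_mb
    print(total_mb / 1024)  # -> ~4882.8, i.e. the ~4,880 GB quoted above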
ID: 77263
JimWOC
Joined: 27 Dec 05
Posts: 2
Credit: 6,179,797
RAC: 0
Message 77264 - Posted: 4 Aug 2014, 2:02:02 UTC - in response to Message 77263.  
Last modified: 4 Aug 2014, 2:05:33 UTC

My backlog of uploads has cleared, but I am still getting a lot of Computation Errors. I have 32 shown in just a few minutes and the list is growing.
ID: 77264
krypton
Volunteer moderator
Project developer
Project scientist
Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77265 - Posted: 4 Aug 2014, 2:44:10 UTC - in response to Message 77264.  

Can you post a log?

My backlog of uploads has cleared, but I am still getting a lot of Computation Errors. I have 32 shown in just a few minutes and the list is growing.

ID: 77265
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 77266 - Posted: 4 Aug 2014, 2:56:17 UTC
Last modified: 4 Aug 2014, 3:00:07 UTC

Jim apparently has a host that is throwing everything back:
https://boinc.bakerlab.org/rosetta/results.php?hostid=1801946
https://boinc.bakerlab.org/rosetta/result.php?resultid=678868361
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
finish file present too long
</message>
...the rest of the log is rather extensive, but includes

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x760D3226

Engaging BOINC Windows Runtime Debugger...

And it has been doing this for days, while the reassigned task gets completed OK:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=614155164
Rosetta Moderator: Mod.Sense
ID: 77266
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 77269 - Posted: 4 Aug 2014, 12:51:39 UTC

Jim, it seems as though one of the core files used for crunching may have been corrupted on that machine. A simple way to reset things (especially since it does not appear you have any work in progress) is to go to the Projects tab, select R@h, and click the button to reset the project. This will abort all work in progress, remove all of the project's programs and files, and start from scratch by downloading new copies of everything.

One way files may get corrupted is by anti-virus software. So if the problem persists after a project reset, that would be another thing to check.
Rosetta Moderator: Mod.Sense
ID: 77269
TJ
Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 77270 - Posted: 4 Aug 2014, 12:57:17 UTC - in response to Message 77226.  

Yep, I'm currently optimizing the number of connections on all our servers. Looks like they can keep up without too much load/memory usage so far. These servers are pretty old and I'm sure we'll upgrade soon hopefully.

Not only new servers but also new server code. The code running at the moment is very outdated.
Greetings,
TJ.
ID: 77270
Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,641,236
RAC: 111
Message 77271 - Posted: 4 Aug 2014, 15:05:07 UTC

Just in case no one else mentions this: although work cleared and new work came down last night, I'm now getting this when requesting new work:

  • 8/4/2014 11:01:53 AM | rosetta@home | Requesting new tasks for CPU and ATI
  • 8/4/2014 11:02:15 AM | rosetta@home | Scheduler request completed: got 0 new tasks
  • 8/4/2014 11:02:15 AM | rosetta@home | Server can't open database


All database servers show green on the server status page. Hmm...

ID: 77271
krypton
Volunteer moderator
Project developer
Project scientist
Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77272 - Posted: 4 Aug 2014, 16:04:49 UTC - in response to Message 77270.  

Can you be more specific about which code you are referring to?

Yep, I'm currently optimizing the number of connections on all our servers. Looks like they can keep up without too much load/memory usage so far. These servers are pretty old and I'm sure we'll upgrade soon hopefully.

Not only new servers but also new server code. The code running at the moment is very outdated.

ID: 77272
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 77274 - Posted: 4 Aug 2014, 18:22:21 UTC

They are referring to the actual BOINC server code; R@h has not done a refresh for many years. Newer versions have reformatted the web pages for hosts and tasks, and include other feature additions that people have grown to expect, but these are not available on R@h.
Rosetta Moderator: Mod.Sense
ID: 77274
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 77275 - Posted: 4 Aug 2014, 19:52:23 UTC - in response to Message 77272.  

Can you be more specific about which code you are referring to?


On other BOINC projects, the server status page shows the server software version (for example, on Poem@home it is 24848), but not on Rosetta, so we don't know how old the server code is. Some volunteers speculate that Rosetta's admins don't update the server because of the deep customization of the code, but no admin has confirmed this...

ID: 77275
googloo
Joined: 15 Sep 06
Posts: 133
Credit: 21,590,659
RAC: 6,272
Message 77276 - Posted: 4 Aug 2014, 20:55:58 UTC
Last modified: 4 Aug 2014, 20:56:50 UTC

8/4/2014 4:50:35 PM | rosetta@home | Reporting 9 completed tasks
8/4/2014 4:50:35 PM | rosetta@home | Requesting new tasks for CPU and NVIDIA
8/4/2014 4:50:57 PM | rosetta@home | Scheduler request failed: Couldn't connect to server
8/4/2014 4:51:01 PM | | Project communication failed: attempting access to reference site
8/4/2014 4:51:03 PM | | Internet access OK - project servers may be temporarily down.
ID: 77276
googloo
Joined: 15 Sep 06
Posts: 133
Credit: 21,590,659
RAC: 6,272
Message 77277 - Posted: 4 Aug 2014, 21:31:59 UTC - in response to Message 77271.  

Just in case no one else mentions this: although work cleared and new work came down last night, I'm now getting this when requesting new work:

  • 8/4/2014 11:01:53 AM | rosetta@home | Requesting new tasks for CPU and ATI
  • 8/4/2014 11:02:15 AM | rosetta@home | Scheduler request completed: got 0 new tasks
  • 8/4/2014 11:02:15 AM | rosetta@home | Server can't open database


All database servers show green on the server status page. Hmm...



Just got the same message.
ID: 77277
krypton
Volunteer moderator
Project developer
Project scientist
Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77278 - Posted: 4 Aug 2014, 21:32:02 UTC

I just disabled new user creation from Charity Engine until our servers can catch up with download demand.
The number of downloads we saw in all of last week nearly doubled today alone.
ID: 77278