Website status report incorrect

Questions and Answers : Web site : Website status report incorrect

shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,654,673
RAC: 0
Message 77075 - Posted: 29 Jul 2014, 6:28:43 UTC

According to the website status page at https://boinc.bakerlab.org/rosetta/rah_status.php, everything is hunky dory, but it's quite clear the server is not accepting completed work (from some hours ago). Perhaps this is part of some less focused strangeness that has been going on over the last few days, but if so, then there should be some kind of announcement about the problem, and I can't find that anywhere, either. On the top page, the latest news is more than a month old, and the last-listed tweet is about 4 months old.
ID: 77075 · Rating: 0
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,654,673
RAC: 0
Message 77096 - Posted: 29 Jul 2014, 19:33:01 UTC
Last modified: 29 Jul 2014, 19:37:29 UTC

Well, the system is still clearly out of order, and the website is still clearly incorrect in its status reports. No finished work being accepted by the Baker Lab side, and no fresh work units coming down. At least 12 hours since my original report here, and it was already several hours of brokenness at that time...

Allo? Anyone there?

P.S. Even the Twitter account appears to be dead as of March?
ID: 77096 · Rating: 0
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 77099 - Posted: 29 Jul 2014, 20:39:59 UTC

You may want to keep an eye on the Problems and Technical Issues with Rosetta@home thread. Several users have reported the same problem there and one of the project team members has replied to say they are investigating.

krypton wrote:
Thanks for the reports!! We are looking at this now.
ID: 77099 · Rating: 0
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,654,673
RAC: 0
Message 77116 - Posted: 30 Jul 2014, 0:26:41 UTC
Last modified: 30 Jul 2014, 1:08:55 UTC

Okay, now about that website Server Status Page... It is still showing that the project is all green and that is clearly all wrong. Whatever is wrong should be detected and indicated there. It's also nice if you include an estimated repair time. Communication is good, eh?

I have to repeat my point, even including the bad joke: If you don't communicate more effectively, you are liable to cause all sorts of rumors to appear.

The bad joke was my proposed rumor about an NSA trainee bollixing your website. This is NOT really a funny idea, because a rogue task (AKA work unit) could do ALL sorts of bad things, probably including hijacking your computer's camera to take embarrassing photos.

The people running these BOINC projects need to take some responsibility for providing accurate and timely information about the status of their projects. My own observations actually suggest that something has been going south from around the 22nd of July, but it clearly fell off the edge of the earth yesterday (or maybe the day before that).

For what it is worth, I suspected the change around the 22nd may have involved some attempt to fix the "Computation Error" tasks. My theory was that they nipped some of the sub-projects that were causing those errors, and the change was causing a significant drop in the statistics.

However, because of how poorly the project managers communicate, I wasn't really expecting any clarification from them. Hello, people? We know you aren't NSA professionals, but still...

P.S. Perhaps I should feel some personal culpability here, insofar as I may have been an inadvertent contributor to the design of BOINC. However, if I had been asked MUCH more politely to contribute more, then I hope that proper resource security of the client would have been one of my major concerns. Therefore, I disclaim and proclaim "It ain't my fault!"

P.P.S. If you delete the joke again, then maybe I'll stop taking the rumor as a joke. We seem to be back again to the need for improved communications skills, eh?

P.P.P.S. I'm not really Canadian, though it was an increasingly attractive optional rumor during the Dubya years (of the big dick Cheney).
ID: 77116 · Rating: 0
Jim J
Joined: 21 Feb 14
Posts: 4
Credit: 429,101
RAC: 0
Message 77141 - Posted: 30 Jul 2014, 17:38:51 UTC

I have 19 uploads pending like this:

... Upload: retry in 01:10:53 (project backoff 00:26:40)
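
The retry delay in that line comes from the client backing off after failed transfers. As a rough illustration only (this is not BOINC's actual algorithm, and the function name, base delay, and cap below are assumptions), a capped exponential backoff with jitter might look like this:

```python
import random


def backoff_seconds(attempt, base=60.0, cap=4 * 3600.0, rng=random.random):
    """Return how long to wait before the next retry.

    attempt: number of consecutive failures so far (0 for the first retry).
    base:    hypothetical starting delay in seconds.
    cap:     hypothetical upper bound on the delay in seconds.
    rng:     callable returning a float in [0, 1]; injectable for testing.
    """
    # The ceiling doubles with every failed attempt, but never exceeds the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Jitter: pick a random delay up to the ceiling so that many clients
    # do not all retry at the same moment.
    return rng() * ceiling
```

With a scheme of this kind, each failed upload pushes the next attempt further out, which would be consistent with waits of an hour or more once a server has been unreachable for a while.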

As shanen says, the status page still reports what appears to be misinformation.

The Problems and Technical Issues page reports issues from 2011.
Any idea yet on what is going on today?
ID: 77141 · Rating: 0
Ralph
Joined: 10 Jan 12
Posts: 2
Credit: 397,191
RAC: 0
Message 77190 - Posted: 1 Aug 2014, 7:57:12 UTC

I have had the same problem since July 28.
Thanks for any help.
ID: 77190 · Rating: 0
Jim Mowbray
Joined: 22 Oct 06
Posts: 1
Credit: 13,730,686
RAC: 28
Message 77201 - Posted: 1 Aug 2014, 12:32:51 UTC

Same problem for me also. No work units being uploaded and no units received since July 28.
ID: 77201 · Rating: 0
ThrowerGB
Joined: 4 Dec 05
Posts: 3
Credit: 12,259,708
RAC: 29
Message 77202 - Posted: 1 Aug 2014, 15:46:52 UTC - in response to Message 77141.  

Jim J wrote:
I have 19 uploads pending like this:

... Upload: retry in 01:10:53 (project backoff 00:26:40)

I have the same problem. It's been going on for at least a week now. I have 24 uploads pending and no downloads in my queue.
ID: 77202 · Rating: 0
Jim J
Joined: 21 Feb 14
Posts: 4
Credit: 429,101
RAC: 0
Message 77210 - Posted: 2 Aug 2014, 6:28:45 UTC - in response to Message 77202.  

I have 19 uploads pending like this:
... Upload: retry in 01:10:53 (project backoff 00:26:40)

I have the same problem. It's been going on for at least a week now. I have 24 uploads pending and no downloads in my queue.


Today 13 downloads arrived for me. Rosetta worked through them and now it is idle again. I have 32 uploads pending.
ID: 77210 · Rating: 0
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 77212 - Posted: 2 Aug 2014, 12:09:40 UTC - in response to Message 77141.  
Last modified: 2 Aug 2014, 12:11:14 UTC

The Problems and Technical Issues page reports issues from 2011.
Any idea yet on what is going on today?


Check the link I gave earlier. It is a sticky thread from another part of the forum, so it has information from different time periods. The early posts are from 2011 while the latest posts are about this issue.

This Q&A section of the forum isn't visited much so you are unlikely to get answers in the short term. Most of the discussion is taking place in the Number Crunching section.
ID: 77212 · Rating: 0
Warped
Joined: 15 Jan 06
Posts: 47
Credit: 1,586,400
RAC: 181
Message 77215 - Posted: 2 Aug 2014, 15:15:10 UTC

Strange as it may seem, the Server Status Page is actually correct. If your computer were inside the firewall at the University of Washington, you would not be aware of any issue. The problem is that the internet connection from the campus is being throttled to the point where our uploads and downloads are timing out. This is not monitored on the Server Status Page.
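
To illustrate the gap described here: a status check that only looks at daemons running inside the network can be green while outside clients time out. The sketch below is hypothetical (the daemon names, the `external_probe` callable, and the status strings are assumptions, not how the Rosetta@home status page is implemented) and shows one way a page could fold in a reachability check made from outside the campus network.

```python
from typing import Callable, Dict


def overall_status(daemons: Dict[str, bool],
                   external_probe: Callable[[], bool]) -> str:
    """Combine internal daemon state with an outside reachability check.

    daemons:        maps a daemon name to True if it is running.
    external_probe: returns True if a client outside the firewall
                    can complete a transfer (hypothetical check).
    """
    # Internal view: any stopped daemon is a hard failure.
    if not all(daemons.values()):
        return "red: daemon down"
    # External view: everything runs, but volunteers cannot get through.
    if not external_probe():
        return "amber: daemons up, unreachable from outside"
    return "green"


# Example: all daemons running, but the outside probe times out.
print(overall_status({"feeder": True, "transitioner": True}, lambda: False))
```

A page built only on the first check would report all green in the situation described in this thread.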

Based on what I see in the Number Crunching section of the Message Boards, I do not expect any resolution until Monday since the Rosetta staff have been away and I expect the UW IT staff will only be back on Monday. In addition, that's Pacific Time so it will likely only be about 15h00 UTC before resolution can be expected. On top of this, when resolved, the routers and switches will get hammered with data.
Warped

ID: 77215 · Rating: 0
Jim J
Joined: 21 Feb 14
Posts: 4
Credit: 429,101
RAC: 0
Message 77240 - Posted: 3 Aug 2014, 6:29:42 UTC - in response to Message 77212.  

...The early posts are from 2011 while the latest posts are about this issue.


I later saw that - thanks!
ID: 77240 · Rating: 0
Jim J
Joined: 21 Feb 14
Posts: 4
Credit: 429,101
RAC: 0
Message 77241 - Posted: 3 Aug 2014, 6:30:09 UTC

Well, my CPU usage shot up a while ago and I found Rosetta was working through 19 new downloads.

The 32 completed tasks had been uploaded!
Maybe things are settling down...
ID: 77241 · Rating: 0
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 77250 - Posted: 3 Aug 2014, 12:44:58 UTC

Problem now diagnosed:

krypton wrote:
Some good news and bad news:

Good news: The servers and UW network are working normally. We got a
HUGE spike in new users/computers connected to the R@H project.

Bad news: We didn't get a heads-up and so were not prepared to
handle so much traffic at once. These computers are still in the
process of downloading Rosetta/Database. As Ananas suggested, it's an
issue with the number of allowed concurrent connections per server.

Good news: Once all these new computers get a copy of rosetta/database
(which is a large single download), everything will go back to normal.
We will be getting more servers, to prevent this from happening in the
future.

Once we know who these new users are, we'll post something on the front page.

Once again, thank you for all the feedback, these were very helpful in
debugging the issue.

-Krypton.
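
The concurrent-connection limit mentioned above can be pictured with a small sketch. It assumes a hypothetical `ConnectionLimiter` class and an arbitrary limit; the real servers' settings are not given in this thread.

```python
import threading


class ConnectionLimiter:
    """Hypothetical cap on how many connections a server handles at once."""

    def __init__(self, max_connections: int):
        # One semaphore slot per allowed concurrent connection.
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_accept(self) -> bool:
        # Non-blocking: returns False immediately when every slot is taken,
        # which a client would experience as a refused or timed-out transfer.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        # Called when a transfer finishes, freeing the slot for another client.
        self._slots.release()


limiter = ConnectionLimiter(max_connections=2)
results = [limiter.try_accept() for _ in range(3)]
print(results)
```

When many new machines each hold a slot for a long download, short uploads from existing volunteers find no free slot, which would match the symptoms reported earlier in the thread.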


David E K wrote:
Yep, I'm currently optimizing the number of connections on all our servers. Looks like they can keep up without too much load/memory usage so far. These servers are pretty old, and hopefully we'll upgrade soon.


Polian wrote:
Looks like a DoS attack to me, to be honest:

I just picked a random new user ID, 680000 and went up by one from there

Example new users:

https://boinc.bakerlab.org/rosetta/show_user.php?userid=680000
https://boinc.bakerlab.org/rosetta/show_user.php?userid=680001
https://boinc.bakerlab.org/rosetta/show_user.php?userid=680002
https://boinc.bakerlab.org/rosetta/show_user.php?userid=680003
https://boinc.bakerlab.org/rosetta/show_user.php?userid=680004
https://boinc.bakerlab.org/rosetta/show_user.php?userid=680005
https://boinc.bakerlab.org/rosetta/show_user.php?userid=680006


David E K wrote:
Hmm, I also suspected this but the IPs from the logs were coming from various places. Maybe I'll have to disable new users for now until we figure things out.


David E K wrote:
I was told by Matthew Blumberg at Gridrepublic that the new users are real crunchers and that they "started a new marketing campaign via charityengine.com." So I re-enabled the account creation for these users. Our servers may get sluggish again but hopefully things will settle down as the new user rates decrease. And hopefully optimizing the connections on our servers will help. In the future, we hope to get more servers.

This issue coincided with the annual RosettaCon meeting (so most of the Baker lab members were out of town), a final ramp up in CASP targets, and me going on a family camping vacation where I had no phone reception. Normally, I would have been able to react faster to debug and help defuse the situation.

I am sorry for any inconvenience and the fact that it took a few days to finally make some progress.
ID: 77250 · Rating: 0

©2021 University of Washington
https://www.bakerlab.org