Problems and Technical Issues with Rosetta@home

krypton
Volunteer moderator
Project developer
Project scientist

Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77198 - Posted: 1 Aug 2014, 10:48:39 UTC - in response to Message 77197.  

Some good news: KEL (aka "IT staff") has gotten a hold of the UW network engineer, who is looking at it.

3) Most of the Rosetta Community is out of town for a conference... Won't be back at the university till this weekend. I was not able to repair it myself, and have to wait till the experts are back (or at least till they have access to the internet). =[

Obviously this isn't the news we wanted, but it's important you've said it because we can adjust our expectations (and processing) accordingly.

It's disappointing you've been put in this position and the IT staff haven't supported you by calling over expert help from elsewhere in the faculty. Thanks for trying.

ID: 77198
Sid Celery

Joined: 11 Feb 08
Posts: 1965
Credit: 38,160,504
RAC: 9,210
Message 77200 - Posted: 1 Aug 2014, 12:24:36 UTC - in response to Message 77198.  

Some good news: KEL (aka "IT staff") has gotten a hold of the UW network engineer, who is looking at it.

And I just popped in to see if anyone else had reported the odd task getting back and the odd few coming in.

I've had one get uploaded (and credited) and four come down successfully on another machine.

About 120 still to go over 4 machines, mind...
ID: 77200
Miklos M

Joined: 8 Dec 13
Posts: 29
Credit: 5,277,251
RAC: 0
Message 77203 - Posted: 1 Aug 2014, 18:58:16 UTC - in response to Message 77200.  

Some good news: KEL (aka "IT staff") has gotten a hold of the UW network engineer, who is looking at it.

And I just popped in to see if anyone else had reported the odd task getting back and the odd few coming in.

I've had one get uploaded (and credited) and four come down successfully on another machine.

About 120 still to go over 4 machines, mind...

A tiny trickle here, so the machines are working away, but there are still over 100 waiting to upload, and often a message that there are too many uploads waiting, so no new tasks are being sent. I wonder how many more days...
ID: 77203
Ananas

Joined: 1 Jan 06
Posts: 232
Credit: 752,471
RAC: 0
Message 77204 - Posted: 1 Aug 2014, 21:01:58 UTC
Last modified: 1 Aug 2014, 21:09:48 UTC

I just monitored a working download and it was surprisingly fast (~250 KBit/s). This was not a tiny file, where the speed display is more or less random; it was a 200k minirosetta_database ZIP file, so the speed value is relevant.

So it might not be the server speed that causes the trouble but a way too low number of allowed concurrent connections.

Another indicator that it is probably not a speed problem is the message. From what I have seen, it never said that something had been interrupted or timed out; it says "system connect" only a few seconds after the attempt, just as if it got a physical connection immediately but was then rejected.

OTOH, upload and download might of course behave differently - I just thought I'd mention it as it might help with the analysis.
ID: 77204
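
One way to tell apart the two failure modes Ananas describes is to look at how a single connection attempt ends: an immediate refusal versus a timeout. The following is a minimal Python sketch, not anything the BOINC client itself does. The function name `classify_connect` and the host and port you pass in are assumptions for illustration. A server that has hit a connection limit may refuse, reset, or silently drop attempts depending on how the limit is enforced, so a quick "refused" is a hint rather than proof.

```python
import socket


def classify_connect(host: str, port: int, timeout: float = 10.0) -> str:
    """Try one TCP connection and report how the attempt ended.

    Returns "connected", "refused", "timeout", or "error: <reason>".
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The remote end actively rejected the attempt (fails fast).
        return "refused"
    except socket.timeout:
        # No answer within `timeout` seconds (fails slowly).
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar problems.
        return f"error: {exc}"
```

If attempts come back as "refused" within a second or two while successful transfers run at full speed, that would be consistent with a connection limit rather than a slow line; raising client timeouts would not change the outcome in that case.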
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 77206 - Posted: 1 Aug 2014, 23:27:00 UTC

Great points Ananas. If the number of concurrent allowed connections were the issue, that would also explain why the config adjustments to timeout values that were suggested did not seem to help. You can only survive longer on a timeout if you get a connection. If your connection is refused, then it would behave more like there is an internet problem (which is exactly what the BOINC Manager is reporting).
Rosetta Moderator: Mod.Sense
ID: 77206
krypton
Volunteer moderator
Project developer
Project scientist

Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77207 - Posted: 2 Aug 2014, 0:21:04 UTC - in response to Message 77206.  

Indeed. UW may have limited the number of concurrent connections. The "faster" servers (ralph etc) don't get as much traffic which would explain why there is a fast response.

Thanks! This is helpful.

Great points Ananas. If the number of concurrent allowed connections were the issue, that would also explain why the config adjustments to timeout values that were suggested did not seem to help. You can only survive longer on a timeout if you get a connection. If your connection is refused, then it would behave more like there is an internet problem (which is exactly what the BOINC Manager is reporting).

ID: 77207
Sid Celery

Joined: 11 Feb 08
Posts: 1965
Credit: 38,160,504
RAC: 9,210
Message 77208 - Posted: 2 Aug 2014, 1:46:13 UTC

An interesting page showing daily credits for the whole of Rosetta

Rosetta daily credits

It seems credits are being awarded at 8-18% of the daily pre-problem level. Which is odd as I've barely had 2% go back in total, myself. Nothing at all since that little burst I reported earlier.

Oh well... <sigh>
ID: 77208
robertmiles

Joined: 16 Jun 08
Posts: 1223
Credit: 13,824,497
RAC: 2,340
Message 77209 - Posted: 2 Aug 2014, 2:48:08 UTC
Last modified: 2 Aug 2014, 2:49:46 UTC

Sid, if you read the explanation on the home page, you should see that the on-campus computers are affected in a rather different way - they still have a fast connection to the Rosetta@Home server, but rather slow to most of the rest of the internet. Therefore, it is likely that on-campus computers are now contributing most of the uploads that get credits.
ID: 77209
jareeq

Joined: 28 Apr 12
Posts: 2
Credit: 4,149,828
RAC: 0
Message 77211 - Posted: 2 Aug 2014, 8:24:05 UTC - in response to Message 77188.  
Last modified: 2 Aug 2014, 8:28:34 UTC

Guys, you suck. I have never seen a network failure that couldn't be repaired within 12 hours or less (I manage large networks). It's the 4th day without the ability to upload/download anything. Come on guys, I have been supporting you since 2005, and I always thought of R@H as the best of the best projects. But for some time I have been considering leaving it because:

1. You do not want to share source code - how the hell could I be sure I am not part of a Bitcoin botnet or other strange project?
2. There is a large number of errors in WUs
3. Current project status for me is DOWN.

Guys, do something or you will lose a lot of compute power.


It's true, although I have never experienced such problems. Project status is indeed DOWN. Errors in WUs never dropped to an accidental level.
...source code is available
ID: 77211
Sid Celery

Joined: 11 Feb 08
Posts: 1965
Credit: 38,160,504
RAC: 9,210
Message 77213 - Posted: 2 Aug 2014, 12:42:17 UTC - in response to Message 77209.  

Sid, if you read the explanation on the home page, you should see that the on-campus computers are affected in a rather different way - they still have a fast connection to the Rosetta@Home server, but rather slow to most of the rest of the internet. Therefore, it is likely that on-campus computers are now contributing most of the uploads that get credits.

But up to 18% of all credits? If that were true it'd hardly be worth putting it out worldwide. It must be more than just that, surely.
ID: 77213
Jorge Flores

Joined: 24 Jun 13
Posts: 1
Credit: 347,811
RAC: 0
Message 77214 - Posted: 2 Aug 2014, 14:35:02 UTC

Good morning
Like many users, I have noticed that the Project is "hung up". This has lasted for several days. I believe you should have the courtesy to publish on your Home page the nature of the problem and the estimated time to fix it; the least you could do is to inform all of us users so we know what to expect.

Sincerely,
J Flores


ID: 77214
Gallstone

Joined: 31 May 12
Posts: 3
Credit: 411,647
RAC: 2,491
Message 77216 - Posted: 2 Aug 2014, 16:57:54 UTC

There is no other way to say it but operators don't care about us users on rosetta!

The tasks I have been trying to upload for half a week are now officially overdue.

The crazy thing is: these tasks have been reissued to other volunteer crunchers, which is completely senseless, because those users will also crunch and may also be unable to upload their tasks. It is absolutely useless to deliver new tasks or reissue "unreplied" tasks while not "accepting" older, completed tasks. This is like pouring water into a bathtub and not caring whether the water is able to drain. That's nothing other than a programmed catastrophe. It may even be better to shut down the project completely for a few days rather than permanently sending but not receiving tasks. If, as I read, the network configuration limits the number of network connections, it may also be that outgoing transmissions block incoming transmissions.

And you know what? Project leaders don't care. They are out of town at a conference? What? In a project as large as this, capable technical staff should be at their workplace during usual office hours at minimum.

DAMNED!

I'm not usually using expletives in a forum but it is absolutely necessary to do so now to get into the heads of the project staff.

And again DAMNED!

This is a project with low credits and ultra-high data transmission rates, which may cripple someone's transmission limits, and it uses a lot of electric power paid for by the crunchers. To then come up with that kind of nonchalance is disgusting!
ID: 77216
Greg_BE

Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 77217 - Posted: 2 Aug 2014, 20:47:35 UTC

I have stopped accepting new work because of this problem.
I am surprised Network guys at the UW have not identified and fixed this problem already!
Rosie should have its own outside connection to the internet to avoid these problems in the future. This is the worst I have seen this project behave in all the time I have been online with it.

I am used to the occasional fall out of servers and other things, but this failure is on the top of the list and making me wonder if there is anything worthwhile to keep going with it.

I don't care about credits, but I do care about BOINC manager getting clogged with tasks that have no way of uploading.

I am extremely disappointed in the way Rosie's caretakers have handled this and the way the UW's network operations guys seem to be putting this problem on a back burner. So in short, BOOO to the UW tech guys and Rosie's caretakers! BOOOO!

Thumbs down!
ID: 77217
Justicar

Joined: 16 Nov 11
Posts: 2
Credit: 2,532,984
RAC: 0
Message 77218 - Posted: 2 Aug 2014, 23:35:38 UTC - in response to Message 77216.  

There is no other way to say it but operators don't care about us users on rosetta!


Duh. They never have, and they never will.


And again DAMNED!


One of the moderators is a teetotaler; that useless, but hypersensitive, person will, after finishing wiping the tears from his eyes, likely delete your message.


This is a project with low credits and ultra-high data transmission rates, which may cripple someone's transmission limits, and it uses a lot of electric power paid for by the crunchers. To then come up with that kind of nonchalance is disgusting!


lol. They don't give a flying flip about you, or anyone else; so long as a few people remain crunching their data to spare them their research funds (before complaining that they need more money not to upgrade their servers), they'll get along fine.
ID: 77218
BadThad

Joined: 8 Nov 05
Posts: 30
Credit: 71,834,523
RAC: 0
Message 77219 - Posted: 2 Aug 2014, 23:39:20 UTC

8/2/2014 3:32:30 AM | rosetta@home | Computation for task hc_centroids_1gou_34_0.25_06-01-14_SAVE_ALL_OUT_168124_4296_0 finished
8/2/2014 3:32:30 AM | rosetta@home | Starting task rb_07_28_48622_95033_ab_stage0_h003___robetta_IGNORE_THE_REST_04_05_179922_11_0
8/2/2014 3:32:33 AM | rosetta@home | Started upload of hc_centroids_1gou_34_0.25_06-01-14_SAVE_ALL_OUT_168124_4296_0_0
8/2/2014 3:32:55 AM | rosetta@home | Temporarily failed upload of hc_centroids_1gou_34_0.25_06-01-14_SAVE_ALL_OUT_168124_4296_0_0: connect() failed
8/2/2014 3:32:55 AM | rosetta@home | Backing off 00:02:41 on upload of hc_centroids_1gou_34_0.25_06-01-14_SAVE_ALL_OUT_168124_4296_0_0

ID: 77219
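
For anyone wanting to summarise a longer event log like the excerpt BadThad posted, here is a rough Python sketch that counts failed uploads and collects the back-off intervals per file. It assumes every line follows the `timestamp | project | message` layout and the exact message wording seen in that excerpt; other client versions or locales may format things differently. The names `parse_line` and `summarize_uploads` are made up for this example.

```python
import re
from datetime import datetime, timedelta

# Layout assumed from the posted excerpt: "<timestamp> | <project> | <message>".
LINE_RE = re.compile(
    r"^(?P<ts>[^|]+?)\s*\|\s*(?P<project>[^|]+?)\s*\|\s*(?P<msg>.*)$"
)
FAILED_RE = re.compile(r"^Temporarily failed upload of (\S+): (.+)$")
BACKOFF_RE = re.compile(r"^Backing off (\d+):(\d+):(\d+) on upload of (\S+)$")


def parse_line(line):
    """Split one log line into (timestamp, project, message), or return None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    timestamp = datetime.strptime(match.group("ts"), "%m/%d/%Y %I:%M:%S %p")
    return timestamp, match.group("project"), match.group("msg")


def summarize_uploads(lines):
    """Return (failures, backoffs), both keyed by upload file name."""
    failures = {}  # file name -> list of failure reasons
    backoffs = {}  # file name -> list of timedelta back-off intervals
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip blank or unrecognised lines
        _, _, msg = parsed
        failed = FAILED_RE.match(msg)
        if failed:
            failures.setdefault(failed.group(1), []).append(failed.group(2))
            continue
        backoff = BACKOFF_RE.match(msg)
        if backoff:
            hours, minutes, seconds = (int(x) for x in backoff.group(1, 2, 3))
            interval = timedelta(hours=hours, minutes=minutes, seconds=seconds)
            backoffs.setdefault(backoff.group(4), []).append(interval)
    return failures, backoffs
```

Feeding it the lines of a saved log file, for example `summarize_uploads(open(path))` with `path` pointing at wherever you saved the log, gives a quick view of which uploads keep failing with `connect() failed` and how long the client waits between retries.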
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 77220 - Posted: 3 Aug 2014, 0:11:46 UTC

Sorry, I've been on a camping vacation and have been out of communication. Sergey and Keith have been working hard at diagnosing the problem but some issues are just not possible to fix in short order unfortunately.

We now suspect that our servers have been sluggish the last few days due to a large spike in new users/hosts. We were not warned of the spike, do not know the cause yet, and are not prepared to serve the large executable and database files currently.

Hopefully as the executables and database files get served to most new hosts, the project will slowly go back to normal, and we will look at increasing the number of web servers in the near future.
ID: 77220
krypton
Volunteer moderator
Project developer
Project scientist

Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77221 - Posted: 3 Aug 2014, 0:26:18 UTC

Some good news and bad news:

Good news: The servers and the UW network are working normally. We got a HUGE spike in new users/computers connected to the R@H project.

Bad news: We didn't get a heads-up and so were not prepared to handle so much traffic at once. These computers are still in the process of downloading Rosetta/Database. As Ananas suggested, it's an issue with the number of allowed concurrent connections per server.

Good news: Once all these new computers get a copy of rosetta/database (which is a large single download), everything will go back to normal. We will be getting more servers to prevent this from happening in the future.

Once we know who these new users are, we'll post something on the front page.

Once again, thank you for all the feedback; it was very helpful in debugging the issue.

-Krypton.
ID: 77221
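
To illustrate why a burst of long downloads can starve short uploads when a server caps concurrent connections, here is a toy, fully deterministic simulation in Python. Every number in it (slot count, arrival rates, durations) is hypothetical and not taken from the actual R@H servers, and `simulate_slots` is a made-up name. It only shows the mechanism described above: long-lived connections fill the slots, so later attempts are refused no matter how fast the line is.

```python
def simulate_slots(max_conn, ticks, downloads_per_tick, download_ticks,
                   uploads_per_tick, upload_ticks=1):
    """Count accepted and refused connections on a server with a slot cap.

    Each tick, new downloads arrive first, then new uploads. A connection
    that finds no free slot is refused immediately; it never waits.
    """
    active = []  # remaining ticks for every currently open connection
    stats = {"download_ok": 0, "download_refused": 0,
             "upload_ok": 0, "upload_refused": 0}
    arrivals = (("download", downloads_per_tick, download_ticks),
                ("upload", uploads_per_tick, upload_ticks))
    for _ in range(ticks):
        # Advance time: finished connections free their slots.
        active = [left - 1 for left in active if left > 1]
        for kind, count, duration in arrivals:
            for _ in range(count):
                if len(active) < max_conn:
                    active.append(duration)
                    stats[kind + "_ok"] += 1
                else:
                    stats[kind + "_refused"] += 1
    return stats


if __name__ == "__main__":
    # Normal load: only short uploads, plenty of free slots.
    print(simulate_slots(max_conn=10, ticks=100, downloads_per_tick=0,
                         download_ticks=20, uploads_per_tick=5))
    # Spike: new hosts each hold a slot for a long download.
    print(simulate_slots(max_conn=10, ticks=100, downloads_per_tick=2,
                         download_ticks=20, uploads_per_tick=5))
```

In the second run the downloads occupy all slots after a few ticks and most uploads are refused, even though each accepted transfer still completes at full speed. Adding servers (more slots) or waiting for the one-off downloads to finish both relieve the pressure in this model.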
Charles Dennett

Joined: 27 Sep 05
Posts: 102
Credit: 2,070,914
RAC: 0
Message 77222 - Posted: 3 Aug 2014, 0:47:09 UTC

Thanks for the update. Since I crunch for other life science projects, I always have work. Hopefully things will settle out shortly.

I am curious about one thing. You said there was a large increase in users/computers. Do you mean regular crunchers like us or others who use the results we produce?

-Charlie
ID: 77222
krypton
Volunteer moderator
Project developer
Project scientist

Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77223 - Posted: 3 Aug 2014, 1:26:19 UTC - in response to Message 77222.  

I mean more crunchers. =]

Thanks for the update. Since I crunch for other life science projects, I always have work. Hopefully things will settle out shortly.

I am curious about one thing. You said there was a large increase in users/computers. Do you mean regular crunchers like us or others who use the results we produce?

ID: 77223
Ananas

Joined: 1 Jan 06
Posts: 232
Credit: 752,471
RAC: 0
Message 77224 - Posted: 3 Aug 2014, 1:45:31 UTC
Last modified: 3 Aug 2014, 1:46:37 UTC

Once it had a connection, an upload of 1.5MB took just 10 seconds, another indicator that neither the server itself nor the line speed is lagging. So I guess we can assume that the number of connections was the limiting factor.

While I'm typing this, a few more of my uploads went through without retries, so someone must have fixed it :-)
ID: 77224
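
As a quick check on the figure above, the transfer rate implied by "1.5MB in 10 seconds" can be worked out as below. This is only back-of-the-envelope arithmetic in Python, assuming 1 MB means 1,000,000 bytes and 1 kbit means 1,000 bits; the helper name `throughput_kbit_per_s` is made up for the example.

```python
def throughput_kbit_per_s(size_bytes: int, seconds: float) -> float:
    """Average transfer rate in kilobits per second (1 kbit = 1000 bits)."""
    return size_bytes * 8 / 1000 / seconds


# 1.5 MB uploaded in 10 seconds, taking 1 MB as 1,000,000 bytes (assumption).
upload_rate = throughput_kbit_per_s(1_500_000, 10)
print(f"{upload_rate:.0f} kbit/s")
```

Under those assumptions the upload ran at roughly 1200 kbit/s once connected, which fits the view that getting a connection, not moving the data, was the hard part.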



©2024 University of Washington
https://www.bakerlab.org