Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
krypton Volunteer moderator Project developer Project scientist Send message Joined: 16 Nov 11 Posts: 108 Credit: 2,164,309 RAC: 0 |
Some good news: KEL (aka "IT staff") has gotten a hold of the UW network engineer, who is looking at it. 3) Most of the Rosetta Community is out of town for a conference... Won't be back at the university till this weekend. I was not able to repair it myself, and have to wait till the experts are back (or at least till they have access to the internet). =[ |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2073 Credit: 40,607,442 RAC: 5,149 |
Some good news: KEL (aka "IT staff") has gotten a hold of the UW network engineer, who is looking at it. And I just popped in to see if anyone else had reported the odd task getting back and the odd few coming in. I've had one get uploaded (and credited) and four come down successfully on another machine. About 120 still to go over 4 machines, mind... |
Miklos M Send message Joined: 8 Dec 13 Posts: 29 Credit: 5,277,251 RAC: 0 |
Some good news: KEL (aka "IT staff") has gotten a hold of the UW network engineer, who is looking at it. A tiny trickle here, so the machines are working away, but still over 100 waiting to upload and often a message that there are too many uploads waiting, so no new tasks are being sent. I wonder how many more days... |
Ananas Send message Joined: 1 Jan 06 Posts: 232 Credit: 752,471 RAC: 0 |
I just monitored a working download and it was surprisingly fast (~250 KBit/s). This was not a tiny file, where the speed display is more or less random; it was a 200k minirosetta_database ZIP file, so the speed value is relevant. So it might not be the server speed that causes the trouble but a far too low number of allowed concurrent connections. Another indicator that it is probably not a speed problem is the message. From what I have seen, it never said that something had been interrupted or timed out; it says "system connect" only a few seconds after the attempt, just as if it got a physical connection immediately but was then rejected. OTOH, upload and download might of course behave differently - I just thought I'd mention it as it might help with the analysis. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Great points Ananas. If the number of concurrent allowed connections were the issue, that would also explain why the config adjustments to timeout values that were suggested did not seem to help. You can only survive longer on a timeout if you get a connection. If your connection is refused, then it would behave more like there is an internet problem (which is exactly what the BOINC Manager is reporting). Rosetta Moderator: Mod.Sense |
krypton Volunteer moderator Project developer Project scientist Send message Joined: 16 Nov 11 Posts: 108 Credit: 2,164,309 RAC: 0 |
Indeed. UW may have limited the number of concurrent connections. The "faster" servers (ralph etc) don't get as much traffic which would explain why there is a fast response. Thanks! This is helpful. Great points Ananas. If the number of concurrent allowed connections were the issue, that would also explain why the config adjustments to timeout values that were suggested did not seem to help. You can only survive longer on a timeout if you get a connection. If your connection is refused, then it would behave more like there is an internet problem (which is exactly what the BOINC Manager is reporting). |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2073 Credit: 40,607,442 RAC: 5,149 |
An interesting page showing daily credits for the whole of Rosetta: Rosetta daily credits. It seems credits are being awarded at 8-18% of the daily pre-problem level. Which is odd as I've barely had 2% go back in total, myself. Nothing at all since that little burst I reported earlier. Oh well... <sigh> |
robertmiles Send message Joined: 16 Jun 08 Posts: 1229 Credit: 14,172,067 RAC: 1,095 |
Sid, if you read the explanation on the home page, you should see that the on-campus computers are affected in a rather different way - they still have a fast connection to the Rosetta@Home server, but rather slow to most of the rest of the internet. Therefore, it is likely that on-campus computers are now contributing most of the uploads that get credits. |
jareeq Send message Joined: 28 Apr 12 Posts: 2 Credit: 4,149,828 RAC: 0 |
Guys, you suck. I have never seen a network failure that can't be repaired within 12 hours or less (I manage large networks). It's the 4th day without the ability to upload/download anything. Come on guys, I have been supporting you since 2005, and I always thought of R@H as the best of the best projects. But for some time I have been considering leaving it because: it's true, although I have never experienced such problems project status is indeed DOWN errors in WU never dropped to accidental level ...source code is available |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2073 Credit: 40,607,442 RAC: 5,149 |
Sid, if you read the explanation on the home page, you should see that the on-campus computers are affected in a rather different way - they still have a fast connection to the Rosetta@Home server, but rather slow to most of the rest of the internet. Therefore, it is likely that on-campus computers are now contributing most of the uploads that get credits. But up to 18% of all credits? If that were true it'd hardly be worth putting it out worldwide. More than just that, surely |
Jorge Flores Send message Joined: 24 Jun 13 Posts: 1 Credit: 347,811 RAC: 0 |
Good morning. Like many users, I have noticed that the project is "hung up". This has lasted for several days. I believe you should have the courtesy to publish on your home page the nature of the problem and the estimated time to fix it; the least you could do is to inform all of us users so we know what to expect. Sincerely, J Flores |
Gallstone Send message Joined: 31 May 12 Posts: 3 Credit: 443,740 RAC: 0 |
There is no other way to say it: the operators don't care about us users on Rosetta! The tasks that I have been trying to upload for half a week are now officially overdue. The crazy thing is that these tasks have been reissued to other volunteer crunchers, which is completely senseless, because these users will also crunch and may also not be able to upload their tasks. It is absolutely useless to deliver new tasks or reissue "unreplied" tasks while not "accepting" older, completed tasks. This is like pouring water into a bathtub and not caring whether the water is able to drain. That's nothing else than a programmed catastrophe. It may even be better to completely shut down the project for a few days rather than permanently sending but not receiving tasks. If, as I read, the network configuration limits the number of network connections, it may also be that outgoing transmissions block incoming transmissions. And you know what? The project leaders do not care. They are out of town at a conference? What? In a project as large as this, capable technical staff should be at their workplace during usual office hours at a minimum. DAMNED! I don't usually use expletives in a forum, but it is absolutely necessary to do so now to get into the heads of the project staff. And again DAMNED! This is a project with low credits and ultra-high data transmission rates, which may cripple someone's transmission limits, and which also uses a lot of electric power paid for by the crunchers, and then it comes up with that kind of nonchalance. It is disgusting! |
Greg_BE Send message Joined: 30 May 06 Posts: 5690 Credit: 5,859,226 RAC: 12 |
I have stopped accepting new work because of this problem. I am surprised Network guys at the UW have not identified and fixed this problem already! Rosie should have its own outside connection to the internet to avoid these problems in the future. This is the worst I have seen this project behave in all the time I have been online with it. I am used to the occasional fall out of servers and other things, but this failure is on the top of the list and making me wonder if there is anything worthwhile to keep going with it. I don't care about credits, but I do care about BOINC manager getting clogged with tasks that have no way of uploading. I am extremely disappointed in the way Rosie's caretakers have handled this and the way the UW's network operations guys seem to be putting this problem on a back burner. so in short BOOO to the UW tech guys and Rosie's caretakers! BOOOO! Thumbs down! |
Justicar Send message Joined: 16 Nov 11 Posts: 2 Credit: 2,532,984 RAC: 0 |
There is no other way to say it but operators don't care about us users on rosetta! Duh. They never have, and they never will.
One of the moderators is a teetotaler; that useless, but hypersensitive, person will, after finishing wiping the tears from his eyes, likely delete your message.
lol. They don't give a flying flip about you, or anyone else; so long as a few people remain crunching their data to spare them their research funds (before complaining that they need more money not to upgrade their servers), they'll get along fine. |
BadThad Send message Joined: 8 Nov 05 Posts: 30 Credit: 71,834,523 RAC: 0 |
8/2/2014 3:32:30 AM | rosetta@home | Computation for task hc_centroids_1gou_34_0.25_06-01-14_SAVE_ALL_OUT_168124_4296_0 finished
8/2/2014 3:32:30 AM | rosetta@home | Starting task rb_07_28_48622_95033_ab_stage0_h003___robetta_IGNORE_THE_REST_04_05_179922_11_0
8/2/2014 3:32:33 AM | rosetta@home | Started upload of hc_centroids_1gou_34_0.25_06-01-14_SAVE_ALL_OUT_168124_4296_0_0
8/2/2014 3:32:55 AM | rosetta@home | Temporarily failed upload of hc_centroids_1gou_34_0.25_06-01-14_SAVE_ALL_OUT_168124_4296_0_0: connect() failed
8/2/2014 3:32:55 AM | rosetta@home | Backing off 00:02:41 on upload of hc_centroids_1gou_34_0.25_06-01-14_SAVE_ALL_OUT_168124_4296_0_0 |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Sorry, I've been on a camping vacation and have been out of communication. Sergey and Keith have been working hard at diagnosing the problem but some issues are just not possible to fix in short order unfortunately. We now suspect that our servers have been sluggish the last few days due to a large spike in new users/hosts. We were not warned of the spike, do not know the cause yet, and are not prepared to serve the large executable and database files currently. Hopefully as the executables and database files get served to most new hosts, the project will slowly go back to normal, and we will look at increasing the number of web servers in the near future. |
krypton Volunteer moderator Project developer Project scientist Send message Joined: 16 Nov 11 Posts: 108 Credit: 2,164,309 RAC: 0 |
Some good news and bad news: Good news: The servers and UW network are working normally. We got a HUGE spike in new users/computers connected to the R@H project. Bad news: We didn't get a heads-up and so were not prepared to handle so much traffic at once. These computers are still in the process of downloading Rosetta/Database. As Ananas suggested, it's an issue with the number of allowed concurrent connections per server. Good news: Once all these new computers get a copy of rosetta/database (which is a large single download), everything will go back to normal. We will be getting more servers, to prevent this from happening in the future. Once we know who these new users are, we'll post something on the front page. Once again, thank you for all the feedback; it was very helpful in debugging the issue. -Krypton. |
Charles Dennett Send message Joined: 27 Sep 05 Posts: 102 Credit: 2,071,286 RAC: 0 |
Thanks for the update. Since I crunch for other life science projects, I always have work. Hopefully things will settle out shortly. I am curious about one thing. You said there was a large increase in users/computers. Do you mean regular crunchers like us or others who use the results we produce? -Charlie |
krypton Volunteer moderator Project developer Project scientist Send message Joined: 16 Nov 11 Posts: 108 Credit: 2,164,309 RAC: 0 |
I mean more crunchers. =] Thanks for the update. Since I crunch for other life science projects, I always have work. Hopefully things will settle out shortly. |
Ananas Send message Joined: 1 Jan 06 Posts: 232 Credit: 752,471 RAC: 0 |
Once it had a connection, an upload of 1.5MB took just 10 seconds, another indicator that neither the server itself nor the line speed lag. So I guess we can assume, that the number of connections While I'm typing this, a few more of my uploads went through without retries, so someone must have fixed it :-) |
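
The diagnosis in this thread rests on how a failed transfer fails: an attempt that errors out within a second or two suggests the connection was actively rejected, while one that hangs until a timeout suggests dropped packets or a congested path. The Python sketch below is one hypothetical way to observe that difference for a single TCP connect. `probe_connect` is an illustrative name, not part of BOINC or Rosetta@home, and it does not reproduce the BOINC client's own retry logic. A server at its connection limit may either refuse or silently drop new connections depending on its configuration, so the result is a hint rather than proof.

```python
import socket
import time


def probe_connect(host, port, timeout=5.0):
    """Attempt one TCP connect and return (outcome, seconds_elapsed).

    outcome is "connected", "refused", "timed out", or "error: <reason>".
    Hypothetical helper for illustration only.
    """
    start = time.monotonic()
    try:
        # create_connection resolves the host and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except ConnectionRefusedError:
        # The peer answered immediately and rejected the connection.
        outcome = "refused"
    except socket.timeout:
        # Nothing came back before the deadline.
        outcome = "timed out"
    except OSError as exc:
        # Any other network failure (unreachable host, DNS error, ...).
        outcome = f"error: {exc}"
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    # Demo on loopback only: open a listener, probe it, close it, probe again.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    print(probe_connect("127.0.0.1", port))
    listener.close()
    print(probe_connect("127.0.0.1", port))
```

Pointing the same probe at a real server (the thread does not give a host or port, so none is assumed here) and comparing the elapsed time against the timeout would show whether failures look like the quick rejections Ananas described or like genuine timeouts.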
©2024 University of Washington
https://www.bakerlab.org