Message boards : Number crunching : Problems with web site
Author | Message |
---|---|
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
I have not been able to download more tasks for about a day now. The scheduler says that communication is deferred. Any ideas? read this message. |
Bill G Send message Joined: 28 Dec 07 Posts: 6 Credit: 11,148,949 RAC: 16,715 |
While that may be part of the problem, I went back to an earlier version of BOINC as you suggested on two Vista computers and I am still not downloading any Rosetta work on them. Seti continues to download. My Windows 7 computer is downloading Rosetta and Seti equally as it should and is working just fine. |
Bill G Send message Joined: 28 Dec 07 Posts: 6 Credit: 11,148,949 RAC: 16,715 |
After switching back to 6.6.36 the two Vista computers started to download just fine...they seem to be having a problem uploading now but I did get at least two days worth of downloads yesterday evening. |
Rabenherz85 Send message Joined: 25 Jun 09 Posts: 3 Credit: 9,089 RAC: 0 |
I can't upload completed WUs either... Rosetta 1.87 |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Just server troubles yet again. Computers, you know, are temperamental, and this one upload server is being a pain in the backside. Just let the program sort it out: when the server comes back online, your queue of uploads and reports will clear. |
Eugene Send message Joined: 24 Nov 06 Posts: 4 Credit: 252,135 RAC: 0 |
There seem to be problems with the servers that distribute and collect WUs. I've been having problems receiving and returning WUs for about 2 weeks now. However, from time to time I was able to receive and return some WUs, so my guess is that the problem is a HUGE LOAD on the servers. Can somebody from the Rosetta staff explain what is going on, namely: 1) have they identified the problem, and what is it? 2) what fixes are being developed, and when do they expect a fix to be implemented? 3) after the fix is done, can you please report it so everyone knows they can return to Rosetta? (Me, I have switched to other projects temporarily to ease your load troubles.) |
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,265,269 RAC: 4,483 |
One way to control the problem was set up a long time ago - adjust your workunit settings so that you get workunits with a longer expected run time. Then, those that don't error out or reach the 99 decoys limit will run longer and you'll need fewer of them. Gives less load on the server, too. |
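To make the arithmetic behind that suggestion concrete, here is a rough Python sketch. The 4-core host and the run times are made-up illustration values, not Rosetta figures, and `tasks_per_day` is a hypothetical helper. It only shows that the number of tasks a host has to download, upload and report per day falls in inverse proportion to the target run time.

```python
def tasks_per_day(cores: int, runtime_hours: float) -> float:
    """Approximate number of workunits a host completes per day.

    Assumes every core is busy all day and each task runs for the full
    target run time (tasks that error out early are ignored).
    """
    if cores <= 0 or runtime_hours <= 0:
        raise ValueError("cores and runtime_hours must be positive")
    return cores * 24.0 / runtime_hours


if __name__ == "__main__":
    # Hypothetical 4-core host at several target run times (in hours).
    for hours in (1, 3, 6, 12, 24):
        print(f"{hours:>2} h run time -> {tasks_per_day(4, hours):5.1f} tasks/day")
```

Under these assumptions, going from a 3 hour to a 24 hour target run time cuts that host's transfers and scheduler reports by a factor of eight.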
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
"there seems to be problems with servers that distribute and collect WUs" Yep. Some simple errors by the project team combined with a few unforeseen bugs in a recent mini-rosetta version caused normal service to break down for a couple of days. Due to high server load it has taken about a week to get back to normal levels. "Can somebody from Rosetta staff explain what is going on, namely" I am not a member of the Rosetta staff, but here is what one of them said: "A developer/scientist in the lab accidentally updated the R@h application using the wrong signature file for the database which is unfortunately our largest input file. The update happened during the weekend and no one was around to fix the problem (I personally was on a backpacking trip with my family otherwise I would have immediately dealt with the problem). This caused all jobs to fail and hammered our servers. Our servers are still struggling to keep up with scheduler requests and download/uploads. Coincidentally, a very large code checkin was made to introduce symmetric folding to our minirosetta application and unfortunately there was a bug that caused a 10-fold slow down. Before catching this bug, the R@h app was updated so we had to revert to the previous application version as a quick fix." https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5011&nowrap=true#62640 "2) what are the ways developed to fix it and when they expect the fix to be implemented" "To make sure this doesn't happen again we are planning to implement a quick benchmark test on Ralph for every application update that will test various protocols for performance and speed. We are still in debug mode for our minirosetta application. There is a small memory leak and a 2 fold slow down in performance. The slow down was caused by a recent refactoring of the hydrogen bond energy code." and later: "1. we will make it a point never to do an update during the weekend or end of the week. 2. we do have a pre production environment - Ralph@home. But this problem was caused by user error. The signature file was accidentally copied over from Ralph when the standard protocol should automatically create the correct signature file. The 10x slow-down wasn't caught by our internal unit tests and benchmark tests but we are going to modify the tests to make sure it will get caught in the future." https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5011&nowrap=true#62656 From my perspective things seem to be getting back almost to normal. There are continued reports of bugged WUs but they seem to be at around the same levels you would get with previous Rosetta versions. We have now returned to about 78 TFLOPS compared to a rough average of between 80 & 95 TFLOPS (I did spot a low point of 28 TFLOPS one day so we are climbing back to where we should be). "3) after the fix is done can you please report it so everyone know they can return to rosetta (me, i have switched to other projects temporarily to ease your load troubles)" The project team have made several posts on this forum and made a note of the situation in the news section of the Rosetta homepage. From comments made by other crunchers I believe that an email newsletter is not an option right now, so choices of communication are limited. |
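The "quick benchmark test" the staff describe in the quoted posts could look roughly like the sketch below. This is not the project's actual test. `time_call`, `is_regression` and the 1.5x tolerance are hypothetical names and values picked for illustration; the idea is simply to time a protocol run and compare it with a stored baseline before an update goes out.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the best-of-N wall-clock time (seconds) for calling fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def is_regression(baseline_s: float, measured_s: float,
                  tolerance: float = 1.5) -> bool:
    """True if the measured time exceeds the baseline by more than
    the allowed factor (the tolerance is an assumed value)."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return measured_s > baseline_s * tolerance


if __name__ == "__main__":
    # Stand-in workload; a real check would run a short protocol instead.
    measured = time_call(lambda: sum(i * i for i in range(100_000)))
    print(f"measured {measured:.4f} s")
    # With a 10 s baseline, both a 2x and a 10x slowdown would be flagged.
    print(is_regression(10.0, 20.0), is_regression(10.0, 100.0))
```

With a tolerance of 1.5, both the 2-fold and the 10-fold slowdowns mentioned in the quotes would trip the check, while ordinary run-to-run noise of a few percent would not.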
Neil Send message Joined: 7 Mar 07 Posts: 25 Credit: 135,539 RAC: 0 |
Temporarily failed upload 8/6/2009 7:12:30 PM|rosetta@home|Started upload of lr5_seq_score12_ss5.0_rlbd_2cbm_IGNORE_THE_REST_DECOY_14613_1313_1_0 8/6/2009 7:13:21 PM||Project communication failed: attempting access to reference site 8/6/2009 7:13:21 PM|rosetta@home|Temporarily failed upload of lr5_seq_score12_ss5.0_rlbd_2cbm_IGNORE_THE_REST_DECOY_14613_1313_1_0: connect() failed 8/6/2009 7:13:21 PM|rosetta@home|Backing off 3 hr 55 min 53 sec on upload of lr5_seq_score12_ss5.0_rlbd_2cbm_IGNORE_THE_REST_DECOY_14613_1313_1_0 8/6/2009 7:13:22 PM||Internet access OK - project servers may be temporarily down. The suspense builds. First upload failed today about 5:20 PM EST. |
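The "Backing off 3 hr 55 min 53 sec" line in that log is the client delaying its next retry. A common way to do this is randomized exponential backoff, sketched below in Python. This is a generic illustration, not the BOINC client's code: `backoff_seconds`, the 60 second base and the 4 hour cap are all assumptions made for the example.

```python
import random
from typing import Optional


def backoff_seconds(failures: int, base: float = 60.0,
                    cap: float = 4 * 3600.0,
                    rng: Optional[random.Random] = None) -> float:
    """Delay before the next retry after `failures` consecutive failures.

    The window doubles with each failure up to `cap` (assumed values).
    """
    if failures < 1:
        return 0.0
    rng = rng or random.Random()
    # Limit the exponent so very long failure streaks cannot overflow.
    ceiling = min(cap, base * 2 ** min(failures - 1, 30))
    # Pick a random delay in the upper half of the window so that clients
    # which failed at the same moment do not all retry at the same moment.
    return rng.uniform(ceiling / 2, ceiling)


if __name__ == "__main__":
    for n in range(1, 11):
        print(f"failure {n:>2}: wait about {backoff_seconds(n) / 60:6.1f} min")
```

The random spread matters when a server comes back after an outage: without it, every waiting client would hit the upload server at once and could knock it over again.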
rochester new york Send message Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
What's going on? According to the server status page nothing is wrong, but we are down to 32 TFLOPS: https://boinc.bakerlab.org/rosetta/rah_status.php |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
"whats going on ?? according to the servers nothing but we are down to 32 t flops" I forget whether TFLOPS is based on granted credit or not, but if it is, there is a problem with the credit system, and a lot of people are reporting tasks stuck in pending credit. |
JollySwagman Send message Joined: 30 Aug 08 Posts: 3 Credit: 478,187 RAC: 0 |
Got about 16 WUs waiting to upload and no new work yet. The servers say all is OK, yet when you ping srv4.bakerlab.org it times out: C:\Program Files\Support Tools>ping srv4.bakerlab.org Pinging srv4.bakerlab.org [140.142.20.112] with 32 bytes of data: Request timed out. Request timed out. Request timed out. Request timed out. Ping statistics for 140.142.20.112: Packets: Sent = 4, Received = 0, Lost = 4 (100% loss), |
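A ping timeout on its own does not prove the server is down, since many sites drop ICMP echo requests at the firewall. A check closer to what the client actually needs is whether a TCP connection can be opened. The Python sketch below does that; `can_connect` is a hypothetical helper, and port 80 is an assumption based on uploads going over plain HTTP.

```python
import socket


def can_connect(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    DNS failures, refused connections and timeouts all count as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example use (host taken from the post above, port 80 assumed):
#     print(can_connect("srv4.bakerlab.org"))
```

If this returns True while ping still times out, the server is reachable and only ICMP is being filtered. If it returns False as well, that matches the "connect() failed" messages reported earlier in the thread.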
rochester new york Send message Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
"whats going on ?? according to the servers nothing but we are down to 32 t flops" The graph on the BOINCstats site: http://boincstats.com/stats/project_graph.php?pr=rosetta |
John Hunt Send message Joined: 18 Sep 05 Posts: 446 Credit: 200,755 RAC: 0 |
See this post + reply from admin - https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5054&nowrap=true#63249 |
QueueNut Send message Joined: 14 Jan 08 Posts: 9 Credit: 1,465,266 RAC: 0 |
About 24 hours ago I brought a new Core i7 system online with BOINC/Rosetta@home (6.6.36 for windows_intelx86). Message log shows a number of work units downloaded, computed and uploaded. No changes in individual user average or total credit scores. Another Core2 system was down since middle of last week. Brought it up at the same time with 6.6.36, ~24 hours ago. It, too, has been computing work unit results. No change of score from it, either. |
JP Bedard Send message Joined: 14 Aug 09 Posts: 2 Credit: 23,953 RAC: 0 |
I'm a recent user, and my results since September 11th have still not been acknowledged. Will I lose them? Thanks. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
"I'm a recent user and since September 11th, have my results still not acknowledged." Not acknowledged? What do you mean exactly? Can you post the message or a link to the tasks? |
LizzieBarry Send message Joined: 25 Feb 08 Posts: 76 Credit: 201,862 RAC: 0 |
These issues all seem to be related to the current delayed awarding of credit. Leave it one more day and most of it will be resolved - validation is rapidly ploughing through the massive backlog. Nothing to worry about at the user end and the r@h end is dealing with it now. |
JP Bedard Send message Joined: 14 Aug 09 Posts: 2 Credit: 23,953 RAC: 0 |
Hi, thanks for the reply. Those were all pending. Many seem to be OK today. Thanks. |
macko Send message Joined: 25 Jun 09 Posts: 32 Credit: 153,495 RAC: 0 |
|
©2024 University of Washington
https://www.bakerlab.org