Problems with web site

Message boards : Number crunching : Problems with web site

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 62565 - Posted: 28 Jul 2009, 19:11:14 UTC - in response to Message 62534.  
Last modified: 28 Jul 2009, 19:11:28 UTC

I have not been able to download more tasks for about a day now. The scheduler says that communication is deferred. Any ideas?



read this message.
ID: 62565 · Rating: 0
Bill G
Joined: 28 Dec 07
Posts: 6
Credit: 11,475,753
RAC: 8,104
Message 62593 - Posted: 29 Jul 2009, 11:57:23 UTC
Last modified: 29 Jul 2009, 11:57:57 UTC

While that may be part of the problem, I went back to an earlier version of BOINC as you suggested on two Vista computers and I am still not downloading any Rosetta work on them. Seti continues to download. My Windows 7 computer is downloading Rosetta and Seti equally as it should and is working just fine.
ID: 62593 · Rating: 0
Bill G
Joined: 28 Dec 07
Posts: 6
Credit: 11,475,753
RAC: 8,104
Message 62627 - Posted: 30 Jul 2009, 11:57:06 UTC

After switching back to 6.6.36, the two Vista computers started to download just fine. They seem to be having a problem uploading now, but I did get at least two days' worth of downloads yesterday evening.
ID: 62627 · Rating: 0
Rabenherz85
Joined: 25 Jun 09
Posts: 3
Credit: 9,089
RAC: 0
Message 62632 - Posted: 30 Jul 2009, 13:29:24 UTC

I can't upload completed WUs either... Rosetta 1.87
ID: 62632 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 62651 - Posted: 30 Jul 2009, 19:47:34 UTC

Just server troubles yet again... computers, you know, are temperamental. This one upload server is being a pain in the backside. Just let the program sort it out: when the server comes back online, your queue of uploads and tasks waiting to report will clear.
ID: 62651 · Rating: 0
Eugene
Joined: 24 Nov 06
Posts: 4
Credit: 252,135
RAC: 0
Message 62744 - Posted: 3 Aug 2009, 14:29:07 UTC

There seem to be problems with the servers that distribute and collect WUs.
I've been having problems receiving and returning WUs for about 2 weeks now.
However, from time to time I was able to receive and return some WUs, so my guess is that the problem is a HUGE LOAD on the servers.

Can somebody from the Rosetta staff explain what is going on, namely:
1) have they identified the problem, and what is it?
2) what fixes have been developed, and when do they expect them to be implemented?
3) after the fix is done, can you please report it so everyone knows they can return to Rosetta? (I have switched to other projects temporarily to ease your load troubles.)
ID: 62744 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,338,560
RAC: 2,014
Message 62746 - Posted: 3 Aug 2009, 15:35:05 UTC - in response to Message 62744.  

One way to control the problem was set up a long time ago - adjust your workunit settings so that you get workunits with a longer expected run time. Then, those that don't error out or reach the 99 decoys limit will run longer and you'll need fewer of them.

Gives less load on the server, too.
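
As a rough sketch of why a longer target run time lowers server load: assuming each core runs one task at a time and every task runs for exactly the target run time, the number of workunits a host needs per day falls in proportion. The function name and the 3-hour and 12-hour values below are illustrative assumptions, not project defaults.

```python
def tasks_per_day(cores: int, target_runtime_hours: float) -> float:
    """Approximate number of workunits a host needs per day.

    Assumes one task per core and that every task runs for exactly the
    target run time. Tasks that error out early or stop at the decoy
    limit would push the real number higher.
    """
    if cores <= 0 or target_runtime_hours <= 0:
        raise ValueError("cores and target_runtime_hours must be positive")
    return cores * 24 / target_runtime_hours


# Hypothetical 4-core host with two different run-time preferences.
short_runs = tasks_per_day(4, 3)   # 3-hour target
long_runs = tasks_per_day(4, 12)   # 12-hour target
print(f"3 h target: {short_runs:.0f} tasks/day, 12 h target: {long_runs:.0f} tasks/day")
```

Every task involves at least a download, an upload, and a report to the scheduler, so under these assumptions a host with the longer setting contacts the servers about a quarter as often.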
ID: 62746 · Rating: 0
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 62750 - Posted: 3 Aug 2009, 20:54:34 UTC - in response to Message 62744.  

There seem to be problems with the servers that distribute and collect WUs.
I've been having problems receiving and returning WUs for about 2 weeks now.
However, from time to time I was able to receive and return some WUs, so my guess is that the problem is a HUGE LOAD on the servers.


Yep. Some simple errors by the project team combined with a few unforeseen bugs in a recent mini-rosetta version caused normal service to break down for a couple of days. Due to high server load it has taken about a week to get back to normal levels.

Can somebody from the Rosetta staff explain what is going on, namely:
1) have they identified the problem, and what is it?


I am not a member of the Rosetta staff, but here is what one of them said:

"A developer/scientist in the lab accidentally updated the R@h application using the wrong signature file for the database which is unfortunately our largest input file. The update happened during the weekend and no one was around to fix the problem (I personally was on a backpacking trip with my family otherwise I would have immediately dealt with the problem). This caused all jobs to fail and hammered our servers. Our servers are still struggling to keep up with scheduler requests and download/uploads.

Coincidentally, a very large code checkin was made to introduce symmetric folding to our minirosetta application and unfortunately there was a bug that caused a 10-fold slow down. Before catching this bug, the R@h app was updated so we had to revert to the previous application version as a quick fix."
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5011&nowrap=true#62640

2) what fixes have been developed, and when do they expect them to be implemented?


"To make sure this doesn't happen again we are planning to implement a quick benchmark test on Ralph for every application update that will test various protocols for performance and speed.

We are still in debug mode for our minirosetta application. There is a small memory leak and a 2 fold slow down in performance. The slow down was caused by a recent refactoring of the hydrogen bond energy code."

and later:

"1. we will make it a point never to do an update during the weekend or end of the week.
2. we do have a pre production environment - Ralph@home. But this problem was caused by user error. The signature file was accidentally copied over from Ralph when the standard protocol should automatically create the correct signature file. The 10x slow-down wasn't caught by our internal unit tests and benchmark tests but we are going to modify the tests to make sure it will get caught in the future."
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5011&nowrap=true#62656

From my perspective things seem to be getting back almost to normal. There are continued reports of bugged WUs but they seem to be at around the same levels you would get with previous Rosetta versions. We have now returned to about 78 TFLOPS compared to a rough average of between 80 & 95 TFLOPS (I did spot a low point of 28 TFLOPS one day so we are climbing back to where we should be).

3) after the fix is done, can you please report it so everyone knows they can return to Rosetta? (I have switched to other projects temporarily to ease your load troubles.)


The project team have made several posts on this forum and made a note of the situation in the news section of the Rosetta homepage. From comments made by other crunchers I believe that an email newsletter is not an option right now, so choices of communication are limited.
ID: 62750 · Rating: 0
Neil
Joined: 7 Mar 07
Posts: 25
Credit: 135,539
RAC: 0
Message 62811 - Posted: 6 Aug 2009, 23:43:02 UTC
Last modified: 6 Aug 2009, 23:44:03 UTC

Temporarily failed upload

8/6/2009 7:12:30 PM|rosetta@home|Started upload of lr5_seq_score12_ss5.0_rlbd_2cbm_IGNORE_THE_REST_DECOY_14613_1313_1_0
8/6/2009 7:13:21 PM||Project communication failed: attempting access to reference site
8/6/2009 7:13:21 PM|rosetta@home|Temporarily failed upload of lr5_seq_score12_ss5.0_rlbd_2cbm_IGNORE_THE_REST_DECOY_14613_1313_1_0: connect() failed
8/6/2009 7:13:21 PM|rosetta@home|Backing off 3 hr 55 min 53 sec on upload of lr5_seq_score12_ss5.0_rlbd_2cbm_IGNORE_THE_REST_DECOY_14613_1313_1_0
8/6/2009 7:13:22 PM||Internet access OK - project servers may be temporarily down.

The suspense builds. First upload failed today about 5:20 PM EST.
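
For anyone who wants to track these retries, here is a minimal sketch that splits message-log lines like the ones above into fields and converts the back-off interval to seconds. It assumes the pipe-delimited `timestamp|project|message` layout seen in this log and a "Backing off N hr N min N sec" wording where the hour and minute parts may be absent; the function names are made up for illustration and are not part of BOINC.

```python
import re
from typing import Optional, Tuple

# Assumed wording, based on the log excerpt above; the "hr" and "min"
# parts are treated as optional.
BACKOFF_RE = re.compile(r"Backing off (?:(\d+) hr )?(?:(\d+) min )?(\d+) sec")


def parse_log_line(line: str) -> Tuple[str, str, str]:
    """Split one log line into (timestamp, project, message).

    The project field is empty for client-wide messages.
    """
    timestamp, project, message = line.split("|", 2)
    return timestamp, project, message


def backoff_seconds(message: str) -> Optional[int]:
    """Return the back-off interval in seconds, or None if there is none."""
    match = BACKOFF_RE.search(message)
    if match is None:
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


line = ("8/6/2009 7:13:21 PM|rosetta@home|Backing off 3 hr 55 min 53 sec "
        "on upload of lr5_seq_score12_ss5.0_rlbd_2cbm_IGNORE_THE_REST_DECOY_14613_1313_1_0")
timestamp, project, message = parse_log_line(line)
print(project, backoff_seconds(message))  # rosetta@home 14153
```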

.
ID: 62811 · Rating: 0
rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 63237 - Posted: 10 Sep 2009, 10:15:31 UTC

What's going on?? According to the servers, nothing, but we are down to 32 TFLOPS.

https://boinc.bakerlab.org/rosetta/rah_status.php
ID: 63237 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 63243 - Posted: 10 Sep 2009, 12:33:22 UTC - in response to Message 63237.  

What's going on?? According to the servers, nothing, but we are down to 32 TFLOPS.

https://boinc.bakerlab.org/rosetta/rah_status.php


I forget whether TFLOPS is based on granted credit or not.
But if it is, there is a problem with the credit system, and a lot of people are reporting tasks stuck in pending credit.
ID: 63243 · Rating: 0
JollySwagman
Joined: 30 Aug 08
Posts: 3
Credit: 478,187
RAC: 0
Message 63246 - Posted: 10 Sep 2009, 16:21:29 UTC

I've got about 16 WUs waiting to upload and no new work, yet the servers say all is OK.
Yet when you ping srv4.bakerlab.org, it times out:

C:\Program Files\Support Tools>ping srv4.bakerlab.org

Pinging srv4.bakerlab.org [140.142.20.112] with 32 bytes of data:

Request timed out.
Request timed out.
Request timed out.
Request timed out.

Ping statistics for 140.142.20.112:
Packets: Sent = 4, Received = 0, Lost = 4 (100% loss),
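
A ping timeout on its own does not prove the server is down, because some hosts and firewalls drop ICMP echo requests. A complementary check is to try opening a TCP connection to the port the service listens on. The sketch below assumes the server accepts HTTP on port 80, which is a guess and not something confirmed in this thread; `can_connect` is a made-up helper name.

```python
import socket


def can_connect(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Any failure (DNS error, refused connection, timeout) returns False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("srv4.bakerlab.org", 80)` returns `False` when the name does not resolve, the connection is refused, or it times out. A `True` result only shows that the port is reachable, not that uploads will succeed.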

ID: 63246 · Rating: 0
rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 63251 - Posted: 10 Sep 2009, 18:41:39 UTC - in response to Message 63243.  

What's going on?? According to the servers, nothing, but we are down to 32 TFLOPS.

https://boinc.bakerlab.org/rosetta/rah_status.php


I forget whether TFLOPS is based on granted credit or not.
But if it is, there is a problem with the credit system, and a lot of people are reporting tasks stuck in pending credit.


BOINCstats graph: http://boincstats.com/stats/project_graph.php?pr=rosetta
ID: 63251 · Rating: 0
John Hunt
Joined: 18 Sep 05
Posts: 446
Credit: 200,755
RAC: 0
Message 63253 - Posted: 10 Sep 2009, 19:06:45 UTC

ID: 63253 · Rating: 0
QueueNut
Joined: 14 Jan 08
Posts: 9
Credit: 1,465,266
RAC: 0
Message 63354 - Posted: 15 Sep 2009, 1:44:05 UTC
Last modified: 15 Sep 2009, 1:51:14 UTC

About 24 hours ago I brought a new Core i7 system online with BOINC/Rosetta@home (6.6.36 for windows_intelx86). Message log shows a number of work units downloaded, computed and uploaded. No changes in individual user average or total credit scores.

Another Core2 system had been down since the middle of last week. I brought it up at the same time with 6.6.36, ~24 hours ago. It, too, has been computing work unit results. No change in score from it, either.
ID: 63354 · Rating: 0
JP Bedard
Joined: 14 Aug 09
Posts: 2
Credit: 23,953
RAC: 0
Message 63357 - Posted: 15 Sep 2009, 4:25:51 UTC

I'm a recent user, and my results since September 11th have still not been acknowledged.
Will I lose them?
Thanks.
ID: 63357 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 63359 - Posted: 15 Sep 2009, 9:00:54 UTC - in response to Message 63357.  

I'm a recent user, and my results since September 11th have still not been acknowledged.
Will I lose them?
Thanks.


Not acknowledged?
What do you mean exactly?
Can you post the message or a link to the tasks?
ID: 63359 · Rating: 0
LizzieBarry
Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 63362 - Posted: 15 Sep 2009, 10:02:15 UTC

These issues all seem to be related to the current delayed awarding of credit.

Leave it one more day and most of it will be resolved - validation is rapidly ploughing through the massive backlog. Nothing to worry about at the user end and the r@h end is dealing with it now.
ID: 63362 · Rating: 0
JP Bedard
Joined: 14 Aug 09
Posts: 2
Credit: 23,953
RAC: 0
Message 63374 - Posted: 15 Sep 2009, 18:58:42 UTC

Hi, thanks for the reply.
Those were all pending.
Many seem to be OK today.
Thanks.
ID: 63374 · Rating: 0
macko
Joined: 25 Jun 09
Posts: 32
Credit: 153,495
RAC: 0
Message 63473 - Posted: 27 Sep 2009, 6:14:04 UTC

Hi all

There has been no update on the "Results" pages for more than 10 days (since 16.09.09).

With regards
ID: 63473 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org