Problems and Technical Issues with Rosetta@home

googloo
Joined: 15 Sep 06
Posts: 133
Credit: 22,658,735
RAC: 6,622
Message 77279 - Posted: 4 Aug 2014, 21:40:39 UTC - in response to Message 77278.  

I just disabled new users from Charity Engine until our servers can catch up with download demand.
Today alone, we had nearly double the number of downloads that happened last week.


Thanks. Server just responded correctly.
ID: 77279 · Rating: 0
Miklos M
Joined: 8 Dec 13
Posts: 29
Credit: 5,277,251
RAC: 0
Message 77280 - Posted: 4 Aug 2014, 22:38:03 UTC - in response to Message 77265.  

Can you post a log?

My backlog of uploads has cleared, but I am still getting a lot of Computation Errors. I have 32 shown in just a few minutes and the list is growing.


Computation errors here too.
ID: 77280 · Rating: 0
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 77281 - Posted: 4 Aug 2014, 23:34:41 UTC - in response to Message 77280.  

Miklos M wrote:
Computation errors here too.


Your errors are mainly listed as tasks aborted by the user. Did you notice any unusual behaviour? Of the ones I checked, the reassigned tasks are either still in progress or have been completed successfully by the next user.

I noticed one error for a pd1 graftsheet task that had been reassigned to you but that batch of tasks was failing for almost everyone.
ID: 77281 · Rating: 0
krypton
Volunteer moderator
Project developer
Project scientist
Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77282 - Posted: 5 Aug 2014, 0:07:51 UTC

We increased the number of allowed concurrent users, but max_connections (for the MySQL database) remains at 800.

Looking at the processlist, I noticed that most of the users are in "sleep" status, waiting for the default 8-hour wait_timeout before being killed. I set wait_timeout to 30 minutes (which still seems rather high, but may be required for BOINC Manager?).

If anyone sees any database errors in the BOINC Manager logs, please alert me!
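
For anyone following along, the check described above might look roughly like the sketch below. It is only an illustration: the rows are invented sample data shaped like the output of MySQL's SHOW PROCESSLIST, and count_idle is a hypothetical helper name, not anything taken from the project's server. The 28800-second figure is MySQL's default wait_timeout (8 hours), and 1800 seconds is the 30-minute value mentioned in the post.

```python
# Sketch only: the rows below are invented sample data shaped like the output
# of MySQL's SHOW PROCESSLIST; count_idle is a hypothetical helper name.

DEFAULT_WAIT_TIMEOUT = 8 * 60 * 60   # MySQL default wait_timeout: 28800 s
NEW_WAIT_TIMEOUT = 30 * 60           # the 30-minute setting from the post


def count_idle(rows, timeout):
    """Count 'Sleep' connections that have been idle for at least `timeout` seconds."""
    return sum(1 for r in rows if r["Command"] == "Sleep" and r["Time"] >= timeout)


sample_rows = [
    {"Id": 1, "Command": "Query", "Time": 0},
    {"Id": 2, "Command": "Sleep", "Time": 120},
    {"Id": 3, "Command": "Sleep", "Time": 2400},
    {"Id": 4, "Command": "Sleep", "Time": 20000},
]

# All connections currently doing nothing.
sleeping = sum(1 for r in sample_rows if r["Command"] == "Sleep")
# Idle connections the 30-minute timeout would already have closed.
reclaimed = count_idle(sample_rows, NEW_WAIT_TIMEOUT)
# Idle connections the 8-hour default would already have closed.
reclaimed_default = count_idle(sample_rows, DEFAULT_WAIT_TIMEOUT)

# The statement an administrator could run to apply the shorter timeout.
set_statement = f"SET GLOBAL wait_timeout = {NEW_WAIT_TIMEOUT};"

print(sleeping, reclaimed, reclaimed_default)
print(set_statement)
```

In this made-up sample, three connections are sleeping; the shorter timeout would have freed two of them, while the default would have freed none. Note that a global wait_timeout change applies to connections opened after it is set, so existing idle connections keep their old value.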
ID: 77282 · Rating: 0
googloo
Joined: 15 Sep 06
Posts: 133
Credit: 22,658,735
RAC: 6,622
Message 77283 - Posted: 5 Aug 2014, 0:29:37 UTC

8/4/2014 7:50:26 PM | rosetta@home | Started upload of rb_05_25_47349_92839__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_167016_2278_0_0
8/4/2014 7:50:49 PM | rosetta@home | Temporarily failed upload of rb_05_25_47349_92839__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_167016_2278_0_0: connect() failed
8/4/2014 7:50:49 PM | rosetta@home | Backing off 01:25:31 on upload of rb_05_25_47349_92839__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_167016_2278_0_0
8/4/2014 7:50:52 PM | | Project communication failed: attempting access to reference site
8/4/2014 7:50:54 PM | | Internet access OK - project servers may be temporarily down.
8/4/2014 7:57:51 PM | rosetta@home | Sending scheduler request: To fetch work.
8/4/2014 7:57:51 PM | rosetta@home | Requesting new tasks for CPU and NVIDIA
8/4/2014 7:58:14 PM | rosetta@home | Scheduler request failed: Couldn't connect to server
8/4/2014 7:58:17 PM | | Project communication failed: attempting access to reference site
8/4/2014 7:58:19 PM | | Internet access OK - project servers may be temporarily down.
8/4/2014 7:59:59 PM | rosetta@home | Sending scheduler request: To fetch work.
8/4/2014 7:59:59 PM | rosetta@home | Requesting new tasks for CPU and NVIDIA
8/4/2014 8:00:21 PM | rosetta@home | Scheduler request failed: Couldn't connect to server
8/4/2014 8:00:24 PM | | Project communication failed: attempting access to reference site
8/4/2014 8:00:25 PM | | Internet access OK - project servers may be temporarily down.
8/4/2014 8:02:41 PM | rosetta@home | Sending scheduler request: To fetch work.
8/4/2014 8:02:41 PM | rosetta@home | Requesting new tasks for CPU and NVIDIA
8/4/2014 8:03:04 PM | rosetta@home | Scheduler request completed: got 0 new tasks
8/4/2014 8:03:04 PM | rosetta@home | Server can't open database
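
A log like this can be summarized with a few lines of Python, which may help when reporting problems. This is a minimal sketch: the parsing rules are guessed from how the lines above look rather than from any documented BOINC log format, and parse_line and classify are hypothetical helper names.

```python
from collections import Counter
from datetime import datetime

# Five lines copied from the log above. The "timestamp | project | message"
# layout is inferred from these lines, not from a documented format.
LOG = """\
8/4/2014 7:50:49 PM | rosetta@home | Temporarily failed upload of rb_05_25_47349_92839__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_167016_2278_0_0: connect() failed
8/4/2014 7:58:14 PM | rosetta@home | Scheduler request failed: Couldn't connect to server
8/4/2014 8:00:21 PM | rosetta@home | Scheduler request failed: Couldn't connect to server
8/4/2014 8:03:04 PM | rosetta@home | Scheduler request completed: got 0 new tasks
8/4/2014 8:03:04 PM | rosetta@home | Server can't open database
"""


def parse_line(line):
    """Split 'timestamp | project | message' into its three fields."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, message


def classify(message):
    """Very rough bucketing of messages by the kind of failure they suggest."""
    if "can't open database" in message:
        return "database"
    if "connect() failed" in message or "Couldn't connect" in message:
        return "connection"
    return "other"


entries = [parse_line(line) for line in LOG.splitlines() if line.strip()]
counts = Counter(classify(message) for _, _, message in entries)
print(dict(counts))
```

Separating connection failures from database errors matters here because the two point at different parts of the server: the first suggests the web or upload server is unreachable, the second that the scheduler answered but could not reach its database.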
ID: 77283 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 77284 - Posted: 5 Aug 2014, 0:48:12 UTC

With 30,000 ADDITIONAL new users (after 15,000 the day before), I've been trying to leave the server alone. But I just ran a scheduler request for more work and got this:
8/4/2014 7:46:25 PM | rosetta@home | Scheduler request failed: Failure when receiving data from the peer

Rosetta Moderator: Mod.Sense
ID: 77284 · Rating: 0
googloo
Joined: 15 Sep 06
Posts: 133
Credit: 22,658,735
RAC: 6,622
Message 77285 - Posted: 5 Aug 2014, 1:27:53 UTC
Last modified: 5 Aug 2014, 1:28:24 UTC

All is well at my end, for now. Finally finished an upload and got new tasks.
ID: 77285 · Rating: 0
krypton
Volunteer moderator
Project developer
Project scientist
Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77286 - Posted: 5 Aug 2014, 1:30:17 UTC - in response to Message 77285.  

Thanks googloo and Mod.Sense!

googloo wrote:
All is well at my end, for now. Finally finished an upload and got new tasks.

ID: 77286 · Rating: 0
Daedalus
Joined: 1 Aug 08
Posts: 39
Credit: 10,102,272
RAC: 1,270
Message 77287 - Posted: 5 Aug 2014, 14:47:09 UTC

Since 3 August I have not had any problems.
ID: 77287 · Rating: 0
Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,649,459
RAC: 0
Message 77288 - Posted: 5 Aug 2014, 15:36:25 UTC

All working well over here now. I did two things to hopefully help.

1) I set the 'Target CPU run time' in the online Rosetta Preferences page to 6 hours (double the default of 3 hours) which should mean my machines will bug the server for work less often.

2) I also set my local clients to cache a bit more work in case things go crazy again.

I encourage others to take similar actions to help ration server resources.
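
To put a rough number on point 1, here is a back-of-the-envelope sketch. It assumes one core running around the clock and every task running for exactly the target runtime, which is a simplification; tasks_per_core_per_day is a hypothetical helper name.

```python
# Back-of-the-envelope sketch. Assumes one core running around the clock and
# each task running for exactly the target runtime (a simplification).

HOURS_PER_DAY = 24


def tasks_per_core_per_day(target_runtime_hours):
    """Tasks one continuously busy core finishes per day at a given runtime."""
    return HOURS_PER_DAY / target_runtime_hours


default_rate = tasks_per_core_per_day(3)   # the default 3-hour target
doubled_rate = tasks_per_core_per_day(6)   # the 6-hour setting described above

print(default_rate, doubled_rate)
```

Under these assumptions a core goes from 8 tasks a day to 4, so the number of task downloads and result uploads that core generates is roughly halved.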

ID: 77288 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 77289 - Posted: 5 Aug 2014, 19:11:44 UTC - in response to Message 77288.  

Timo wrote:
All working well over here now. I did two things to hopefully help.

1) I set the 'Target CPU run time' in the online Rosetta Preferences page to 6 hours (double the default of 3 hours) which should mean my machines will bug the server for work less often.

2) I also set my local clients to cache a bit more work in case things go crazy again.

I encourage others to take similar actions to help ration server resources.



I would encourage it as well. Just be aware that BOINC Manager will take some time to get used to the new target runtime. I generally suggest that you start with only a small cache of work, and bump the target runtime only a notch or two each day. I'd also suggest going to 12hrs or beyond if it suits the way you use your machine.

A gradual change in runtime preference helps avoid BOINC Manager downloading more work than you can complete. The change WILL affect tasks that you've already downloaded once BOINC Manager completes an update to the project with the new preference setting. Hence the suggestion to start when your cache of existing work is low. Do not increase your number of days of work to request until AFTER you have run at your final runtime setting for a day or so.
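
A rough illustration of why the order matters. The numbers and the simple model are assumptions made for the sake of the example, not a description of how BOINC Manager actually schedules work; days_to_clear is a hypothetical helper name.

```python
# Illustration only: the numbers and the simple model below are assumptions,
# not a description of how BOINC Manager actually schedules work.


def days_to_clear(cache_days, estimated_hours, actual_hours):
    """Days one core needs to finish a cache sized with a stale runtime estimate."""
    tasks_fetched = cache_days * 24 / estimated_hours
    return tasks_fetched * actual_hours / 24


# Cache sized for 2 days while tasks were still expected to take 3 hours,
# then the runtime preference jumps straight to 12 hours.
big_jump = days_to_clear(cache_days=2, estimated_hours=3, actual_hours=12)

# Same cache, but the preference moves by a hypothetical small step, 3 to 4 hours.
small_step = days_to_clear(cache_days=2, estimated_hours=3, actual_hours=4)

print(big_jump, small_step)
```

In this toy model the big jump turns a 2-day cache into 8 days of work for one core, while the small step stretches it to under 3 days, which is why a small cache and gradual steps are the safer combination.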
Rosetta Moderator: Mod.Sense
ID: 77289 · Rating: 0
googloo
Joined: 15 Sep 06
Posts: 133
Credit: 22,658,735
RAC: 6,622
Message 77294 - Posted: 6 Aug 2014, 18:56:41 UTC

Not getting work, even though BOINC Manager requests tasks. Server status shows only 32 ready to send.
ID: 77294 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 77296 - Posted: 6 Aug 2014, 22:06:45 UTC

Right, the available work seems to be getting consumed about as quickly as it is being generated. The project is still adjusting to all of the new hosts that have arrived at once, which is a great problem to have! But I've seen on the server status page that the actual number of tasks ready to send has been swinging rapidly as new work is generated and then assigned to hungry hosts.

The BOINC Manager will do retries for work and pull some down when work units are available.
Rosetta Moderator: Mod.Sense
ID: 77296 · Rating: 0
googloo
Joined: 15 Sep 06
Posts: 133
Credit: 22,658,735
RAC: 6,622
Message 77297 - Posted: 7 Aug 2014, 0:47:01 UTC - in response to Message 77296.  

Mod.Sense wrote:
Right, the available work seems to be getting consumed about as quickly as it is being generated. The project is still adjusting to all of the new hosts that have arrived at once, which is a great problem to have! But I've seen on the server status page that the actual number of tasks ready to send has been swinging rapidly as new work is generated and then assigned to hungry hosts.

The BOINC Manager will do retries for work and pull some down when work units are available.


Yes, it's been alternating between "no work sent" and actually getting new tasks.
ID: 77297 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2113
Credit: 41,060,649
RAC: 21,432
Message 77298 - Posted: 7 Aug 2014, 2:13:09 UTC

I've been out of the country for 3 days and things seem to have sorted themselves out in that time (a watched pot never boils), save for some connectivity issues at my end, now resolved.

I've just snuck a few tasks for 3 of my 4 machines. The last one should pinch some tomorrow, then I'm back to normal. I doubt WCG will see many further calls for work over the next month with the priorities I've got set.
ID: 77298 · Rating: 0
TPCBF
Joined: 29 Nov 10
Posts: 111
Credit: 5,064,073
RAC: 3,742
Message 77311 - Posted: 9 Aug 2014, 22:30:35 UTC

Just keep getting "no work sent"... :-(
ID: 77311 · Rating: 0
krypton
Volunteer moderator
Project developer
Project scientist
Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77312 - Posted: 9 Aug 2014, 23:46:04 UTC - in response to Message 77311.  
Last modified: 9 Aug 2014, 23:50:51 UTC

I've noticed the same message on my machine. Looking into it. But it's the same pattern as googloo reported. It seems to alternate...

TPCBF wrote:
Just keep getting "no work sent"... :-(
ID: 77312 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2113
Credit: 41,060,649
RAC: 21,432
Message 77314 - Posted: 10 Aug 2014, 4:20:47 UTC

Arghh!! Just as I get home to clear a network issue holding up the upload of 48 tasks, the scheduler's taken offline and I'm full up with 55 WCG tasks instead to plough through.

Not having any luck right now :(
ID: 77314 · Rating: 0
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 77315 - Posted: 10 Aug 2014, 4:37:44 UTC
Last modified: 10 Aug 2014, 4:39:23 UTC

Just checking to see if there is any useful information about the latest problems. Didn't expect to find any, but I could have been surprised.

Perhaps they could just reissue old tasks for double-checking?

Overall it seems to be another example of the supply of people who want to be helpful being larger than the supply of helpful things for them to do. I feel like starting a list...
ID: 77315 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 77316 - Posted: 10 Aug 2014, 5:11:14 UTC

Now what is the problem? Everything is disabled! And again, no news.
ID: 77316 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org