Cannot retrieve new work

Message boards : Number crunching : Cannot retrieve new work

Brian Priebe

Joined: 27 Nov 09
Posts: 16
Credit: 33,020,247
RAC: 0
Message 77256 - Posted: 3 Aug 2014, 19:39:49 UTC

BOINC event log reports this afternoon:

"03-Aug-2014 15:08:34 | rosetta@home | Server can't open database".
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 77257 - Posted: 3 Aug 2014, 19:51:55 UTC - in response to Message 77256.  

There has been a massive surge in new users recently (see the graphs at BOINC stats) and the servers are struggling to keep up with demand. Things should settle down once the surge slows and the new users have downloaded the core database files.
krypton
Volunteer moderator
Project developer
Project scientist

Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77261 - Posted: 4 Aug 2014, 0:19:35 UTC

Hi Brian,

Are you still seeing the error? I checked, and we still have workunits in the queue (so you should be getting some).
Polian
Joined: 21 Sep 05
Posts: 152
Credit: 10,141,266
RAC: 0
Message 77268 - Posted: 4 Aug 2014, 5:08:14 UTC

I saw this one earlier as well, but on the next pass I had no problems. The heavy load must still be settling.
Brian Priebe

Joined: 27 Nov 09
Posts: 16
Credit: 33,020,247
RAC: 0
Message 77291 - Posted: 6 Aug 2014, 4:20:44 UTC - in response to Message 77261.  

Are you still seeing the error? I checked, and we still have workunits in the queue (so you should be getting some).


My machines are getting new work again.
Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,081,660
RAC: 24
Message 77292 - Posted: 6 Aug 2014, 10:44:57 UTC

I'm seeing a slightly different problem. From my BOINC logs:

06-Aug-2014 04:30:57 [rosetta@home] Sending scheduler request: To fetch work.
06-Aug-2014 04:30:57 [rosetta@home] Requesting new tasks for CPU
06-Aug-2014 04:31:00 [rosetta@home] Scheduler request completed: got 0 new tasks
06-Aug-2014 04:31:00 [rosetta@home] No work sent
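To put a number on "more often than not", the log can be tallied automatically. The following is only a minimal sketch: it assumes the bracketed `[project]` line format shown in the excerpt above (other client versions use a different separator), and the names `SAMPLE_LOG`, `COMPLETED` and `tally_fetches` are made up for illustration.

```python
import re

# Sample lines copied from the log excerpt above.
SAMPLE_LOG = """\
06-Aug-2014 04:30:57 [rosetta@home] Sending scheduler request: To fetch work.
06-Aug-2014 04:30:57 [rosetta@home] Requesting new tasks for CPU
06-Aug-2014 04:31:00 [rosetta@home] Scheduler request completed: got 0 new tasks
06-Aug-2014 04:31:00 [rosetta@home] No work sent
"""

# Matches the line that reports how many tasks a scheduler request returned.
COMPLETED = re.compile(
    r"\[(?P<project>[^\]]+)\] Scheduler request completed: got (?P<count>\d+) new tasks"
)


def tally_fetches(log_text):
    """Summarise scheduler replies per project: requests, empty replies, tasks received."""
    summary = {}
    for line in log_text.splitlines():
        match = COMPLETED.search(line)
        if match is None:
            continue  # not a "request completed" line
        stats = summary.setdefault(
            match.group("project"), {"requests": 0, "empty": 0, "tasks": 0}
        )
        count = int(match.group("count"))
        stats["requests"] += 1
        stats["tasks"] += count
        if count == 0:
            stats["empty"] += 1
    return summary


if __name__ == "__main__":
    for project, stats in tally_fetches(SAMPLE_LOG).items():
        print(project, stats)
```

Feeding it a full day's log instead of the four sample lines would give the share of requests that came back empty.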


This is happening more often than not on all 4 of my systems. They try to get new work but get nothing. I also crunch for WCG and POEM (on my one 64-bit Linux box; all my systems run Linux), so I have work, just not from Rosetta. Once in a while I'll get a Rosetta workunit, but not often.

Charlie
-Charlie
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 77295 - Posted: 6 Aug 2014, 22:05:26 UTC

Right, the available work seems to be getting consumed about as quickly as it is being generated. The project is still adjusting to all of the new hosts that have arrived at once, which is a great problem to have! But I've seen on the server status page that the number of tasks ready to send has been swinging rapidly as new work is generated and then assigned to hungry hosts.

The BOINC Manager will retry its requests for work and pull some down when work units are available.
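
For anyone curious what "retrying" usually looks like, here is a rough sketch of a capped exponential backoff schedule. This is not the BOINC client's actual code; the function name `backoff_delays` and the default values for `base`, `cap` and `jitter` are assumptions chosen purely for illustration.

```python
import random


def backoff_delays(attempts, base=60.0, cap=4 * 3600.0, jitter=0.0, rng=None):
    """Return the wait in seconds before each retry.

    The wait doubles after every failed attempt (base * 2**i), is capped at
    `cap`, and can be randomised by +/- `jitter` (a fraction such as 0.1) so
    that many hosts do not all retry at the same moment.
    """
    rng = rng or random.Random()
    delays = []
    for i in range(attempts):
        delay = min(cap, base * (2 ** i))
        if jitter:
            delay *= 1 + rng.uniform(-jitter, jitter)
        delays.append(delay)
    return delays


if __name__ == "__main__":
    # With no jitter the first five waits are 1, 2, 4, 8 and 16 minutes.
    print(backoff_delays(5))
```

The point of spacing retries out like this is that a host that keeps getting "No work sent" asks less and less often, which eases the load on an already busy scheduler.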
Rosetta Moderator: Mod.Sense
Trotador

Joined: 30 May 09
Posts: 108
Credit: 291,214,977
RAC: 0
Message 77299 - Posted: 7 Aug 2014, 13:49:37 UTC

Lol! Over 1.1 million tasks in progress. I cannot get any new tasks for my hosts.
Polian
Joined: 21 Sep 05
Posts: 152
Credit: 10,141,266
RAC: 0
Message 77300 - Posted: 7 Aug 2014, 14:49:05 UTC

I increased my task run time preference from 6 hours to 12 to keep my PCs busy and reduce server load.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 77301 - Posted: 7 Aug 2014, 15:13:10 UTC - in response to Message 77299.  

Lol! Over 1.1 million tasks in progress. I cannot get any new tasks for my hosts.


2 million in progress now. I just hope most of them come back completed rather than hitting the 10-day expiration.
Rosetta Moderator: Mod.Sense
Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,081,660
RAC: 24
Message 77302 - Posted: 7 Aug 2014, 17:40:57 UTC - in response to Message 77301.  

Lol! Over 1.1 million tasks in progress. I cannot get any new tasks for my hosts.


2 million in progress now. I just hope most of them come back completed rather than hitting the 10-day expiration.


Curious as to how you get those numbers. (of course, you're on the inside so may have access to better info than we mere mortals do :-)

From the home page in the upper right corner I see this:


Server Status as of 7 Aug 2014 16:07:41 UTC
[ Scheduler running ]
Total queued jobs: 378,664
In progress: 916,437

Then from the server status page, I see this:

As of 7 Aug 2014 17:31:52 UTC
State Approximate #results
Ready to send 15,330
In progress 691,956

Are "Total queued jobs" and "Ready to send" supposed to be the same? I realize these numbers were generated at different times, but I would expect them to be close.

Thanks for any insight.

Charlie


-Charlie
Trotador

Joined: 30 May 09
Posts: 108
Credit: 291,214,977
RAC: 0
Message 77303 - Posted: 7 Aug 2014, 18:49:28 UTC - in response to Message 77302.  


Values on the Server Status page sometimes change dramatically from one update to the next; I'm not sure whether it is providing real status data.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 77304 - Posted: 7 Aug 2014, 23:03:41 UTC - in response to Message 77303.  


Values on the Server Status page sometimes change dramatically from one update to the next; I'm not sure whether it is providing real status data.


Yes, it's been incredible! I'm referring to the server status page linked from the homepage. There were 2 million in progress when I posted, now there are 1,760,000, and at some point in between there were about 1 million. So the progression was:

2 million in progress.
Then over a million reported back as completed; that is, there was a period where about a million more tasks were reported back than were assigned.
Then you caught it with about 1 million in progress.
Now we're back to over 1.7 million, so over 700,000 more were assigned out to hosts than were reported back.

There have been about 100,000 new hosts added this past week. At this point they are all considered "active". Previously the project had about 60,000 active hosts, meaning hosts that had recently returned results. So even if the new hosts each work on just a single task at a time, and run for the default 3-hour runtime, that would be 800,000 tasks per day just on the new hosts. Then you multiply by some average number of CPUs and resource share per host, and it's a dramatic whole lotta work getting done!
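
The 800,000 figure can be checked with a quick back-of-the-envelope calculation. This is only a sketch of that arithmetic: the function name `tasks_per_day` and the extra parameters `concurrent_tasks` and `hours_per_day` are assumptions added for illustration, and the defaults model the simplest case of one task at a time on a host that runs around the clock.

```python
def tasks_per_day(hosts, runtime_hours, concurrent_tasks=1, hours_per_day=24):
    """Estimate how many tasks a group of hosts completes per day.

    Each host finishes hours_per_day / runtime_hours tasks per day on each
    task slot it runs concurrently.
    """
    if runtime_hours <= 0:
        raise ValueError("runtime_hours must be positive")
    return hosts * concurrent_tasks * hours_per_day / runtime_hours


if __name__ == "__main__":
    # 100,000 new hosts, one task at a time, default 3-hour runtime.
    print(tasks_per_day(100_000, 3))  # 800000.0
```

Raising `concurrent_tasks` or lowering `hours_per_day` shows how sensitive the estimate is to the number of CPUs per host and to how long each machine is actually switched on.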

Try to keep in mind that the servers are keeping up fairly well, and that the scale of the project has more than doubled in less than a week. That is a tough feat. So, there will be some growing pains. There will be points in time where all of the WUs have already been assigned to other hosts. Even when work is queued up it takes the server time to generate BOINC WUs out of it so they can be assigned. The underlying databases, networks and file systems are all seeing a dramatic change in workload as well. It will take some time to find and resolve bottlenecks that did not previously exist.
Rosetta Moderator: Mod.Sense
Sid Celery

Joined: 11 Feb 08
Posts: 2145
Credit: 41,550,899
RAC: 8,846
Message 77305 - Posted: 8 Aug 2014, 0:18:50 UTC

At such an early stage there are bound to be many questions, but I found at least one answer:

Aug 07, 2014
Predictor of the day: Congratulations to ce223411 for predicting the lowest energy structure for workunit gr071414_2h5_2h5_697_fold_SAVE_ALL_OUT_175228_0 !
krypton
Volunteer moderator
Project developer
Project scientist

Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77306 - Posted: 8 Aug 2014, 4:39:20 UTC

We've encouraged the scientists to submit more jobs. Hopefully that helps.

Sid Celery

Joined: 11 Feb 08
Posts: 2145
Credit: 41,550,899
RAC: 8,846
Message 77309 - Posted: 9 Aug 2014, 2:46:12 UTC - in response to Message 77304.  

There have been about 100,000 new hosts added this past week. At this point they are all considered "active". Previously the project had about 60,000 active hosts, meaning hosts that had recently returned results. So even if the new hosts each work on just a single task at a time, and run for the default 3-hour runtime, that would be 800,000 tasks per day just on the new hosts. Then you multiply by some average number of CPUs and resource share per host, and it's a dramatic whole lotta work getting done!

Another thing that crossed my mind is that when a new host arrives, the tasks it gets are a bit flaky until a pattern establishes itself with how much work gets done. It's quite probable that they've pulled down a mass of new tasks, some of which will be returned speedily, while others may even get timed out due to inactivity. On that basis I'm expecting it to be a further 10 days before things settle down.

I also note that new users are still being added rapidly, albeit at a rate that is slowing each day.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 77310 - Posted: 9 Aug 2014, 3:16:55 UTC

I checked a few hosts at random over the past few days. The new ones typically pulled down one or two tasks, and if they returned them they got 3 or 4 more. One of the things BOINC has to try to get a handle on early is how many hours per day the computer is likely to be running BOINC. Many hosts I saw that were added on the third still had not returned their first completed WU. Either the machine hasn't been running, there wasn't network access, the server was choked at the time they tried to report, or they've turned off the Charity Engine without aborting their current task. But for the most part it looked like new hosts were beginning to return results and settle into a regular workflow. It also appeared to me that most new hosts were not running more than a few hours per day; otherwise they would have completed more work over the course of several days.

It appears that while there are over 6 million tasks in the queue, the dedicated tasks that churn those into BOINC WUs are having trouble keeping ahead of the demand for tasks. There were over 2 million tasks in progress earlier in the day. Now it shows half that. Yet there were still only 426,000 successfully completed tasks in the past 24 hours. If those figures are accurate and consistent snapshots in time, it implies that half of the reported WUs were not successes.
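
The inference in the last sentence can be spelled out as a small calculation. This is a rough sketch under stated assumptions: the drop in the "in progress" count is treated as the number of tasks reported back, newly assigned tasks are ignored unless passed in, and the 24-hour success window is assumed to cover the same period. The name `implied_success_fraction` and its parameters are made up for illustration.

```python
def implied_success_fraction(in_progress_before, in_progress_after,
                             successes, newly_assigned=0):
    """Estimate the share of reported tasks that completed successfully.

    Tasks reported back = drop in the in-progress count + tasks newly
    assigned over the same period (assumed 0 when unknown, which makes the
    result an upper bound on the true success fraction).
    """
    reported = in_progress_before - in_progress_after + newly_assigned
    if reported <= 0:
        raise ValueError("no tasks appear to have been reported back")
    return successes / reported


if __name__ == "__main__":
    # Roughly 2 million in progress earlier, about half that now,
    # and 426,000 successful completions in the past 24 hours.
    fraction = implied_success_fraction(2_000_000, 1_000_000, 426_000)
    print(f"{fraction:.1%} of reported tasks were successes")
```

Because any tasks assigned in the meantime would raise the reported count, the real success share could be lower still, which is consistent with the "half were not successes" reading.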
Rosetta Moderator: Mod.Sense
Sid Celery

Joined: 11 Feb 08
Posts: 2145
Credit: 41,550,899
RAC: 8,846
Message 77313 - Posted: 10 Aug 2014, 1:28:07 UTC - in response to Message 77310.  

I checked a few hosts at random over the past few days. The new ones typically pulled down one or two tasks, and if they returned them they got 3 or 4 more. One of the things BOINC has to try to get a handle on early is how many hours per day the computer is likely to be running BOINC. Many hosts I saw that were added on the third still had not returned their first completed WU. Either the machine hasn't been running, there wasn't network access, the server was choked at the time they tried to report, or they've turned off the Charity Engine without aborting their current task. But for the most part it looked like new hosts were beginning to return results and settle into a regular workflow. It also appeared to me that most new hosts were not running more than a few hours per day; otherwise they would have completed more work over the course of several days.

It appears that while there are over 6 million tasks in the queue, the dedicated tasks that churn those into BOINC WUs are having trouble keeping ahead of the demand for tasks. There were over 2 million tasks in progress earlier in the day. Now it shows half that. Yet there were still only 426,000 successfully completed tasks in the past 24 hours. If those figures are accurate and consistent snapshots in time, it implies that half of the reported WUs were not successes.

That's where I got the 10 days from: failure to meet deadlines, followed by reissue to (hopefully) active crunchers.




©2024 University of Washington
https://www.bakerlab.org