no work units

Message boards : Number crunching : no work units



AuthorMessage
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 67412 - Posted: 28 Aug 2010, 6:56:31 UTC

Well, I'm apparently receiving Work Units now, and my machine is happily crunching away.

The thing I'm not sure I agree with is having a Web page indicate that all servers are up and running when that doesn't really seem to be the case. As a systems guy, myself (for the past thirty plus years), I can't remember a time when my management would accept I.T. Department explanations that everything was really OK when they could see for themselves that work was not being done, and tasks were not being completed.

Perhaps it was because Manufacturing is a less tolerant environment for I.T. imprecision ... ??

deesy
ID: 67412
mikey
Joined: 5 Jan 06
Posts: 1893
Credit: 8,126,471
RAC: 5,094
Message 67431 - Posted: 29 Aug 2010, 14:41:19 UTC - in response to Message 67412.  

Well, I'm apparently receiving Work Units now, and my machine is happily crunching away.

The thing I'm not sure I agree with is having a Web page indicate that all servers are up and running when that doesn't really seem to be the case. As a systems guy, myself (for the past thirty plus years), I can't remember a time when my management would accept I.T. Department explanations that everything was really OK when they could see for themselves that work was not being done, and tasks were not being completed.

Perhaps it was because Manufacturing is a less tolerant environment for I.T. imprecision ... ??

deesy


It also doesn't help when the updates are not being done but the web pages say they are! The problem seems to be that the page showing the status is working, but the actual updating of the status is not, so it is showing old results that are far from accurate. This often happens when one web page depends on another for its data: the first is showing bad data, so the second also shows bad data. The old 'GIGO' has taken over: 'garbage in, garbage out'! My wife's workplace is currently having the same problem. The WHOLE STATE had a hard drive crash and the backup did not kick in, yet the status web page said all was okay. The problem is that the web page was not getting updated and just kept sending out its prior data saying all was okay, when in fact it should have said 'I have no current data', thus indicating a problem. It will be down through the weekend, and many people are working on end-of-month deadlines!! Early next week will be very hectic!!!!!! Rosetta, and probably every project, seems to need a more accurate status reporting system. It is probably a BOINC server-side thing, IMO.
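The 'I have no current data' idea can be sketched in a few lines of Python. This is only an illustration: the snapshot layout, the 15-minute threshold and the function name are assumptions made up for the example, not anything taken from the BOINC server code.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: how old a status snapshot may be before it is distrusted.
MAX_AGE = timedelta(minutes=15)


def render_status(snapshot, now, max_age=MAX_AGE):
    """Return the text a status page could show for one snapshot.

    `snapshot` is a dict with a 'collected_at' datetime and a 'servers'
    mapping of name -> bool (True means running). Both keys are invented
    for this sketch.
    """
    if snapshot is None:
        return "I have no current data"

    age = now - snapshot["collected_at"]
    if age > max_age:
        # The page itself works, but the data behind it has stopped updating.
        minutes = int(age.total_seconds() // 60)
        return f"I have no current data (last update {minutes} minutes ago)"

    down = sorted(name for name, up in snapshot["servers"].items() if not up)
    if down:
        return "Down: " + ", ".join(down)
    return "All servers running"


now = datetime(2010, 8, 29, 14, 0, tzinfo=timezone.utc)
old = {"collected_at": now - timedelta(hours=3), "servers": {"scheduler": True}}
print(render_status(old, now))
```

The point of the check is that an old "all okay" is never repeated as if it were fresh; the page admits that it does not know.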
ID: 67431
nusbaumc

Joined: 4 Jun 10
Posts: 11
Credit: 3,747,410
RAC: 0
Message 67434 - Posted: 29 Aug 2010, 15:05:07 UTC - in response to Message 67402.  

Do I understand this correctly? The problems are so big that only grid computing can solve them, but they are not big enough to keep the grid busy?

deesy

No, there is a problem with the servers. Look on the server status page and you will see they have taken nearly all of them off line while they make a proper fix.


I hope so, this is the 3rd time in a month this has happened.

A good time to install all those queued updates on my auxiliary machines though. Windows without a mouse is painful...
ID: 67434
Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 67441 - Posted: 29 Aug 2010, 19:56:30 UTC

Nusbaumc said:

Windows without a mouse is painful...


Hummm. Windows. Mouse. Painful.

Seems like a redundant statement going on there...

ID: 67441
Heidi1
Joined: 11 Aug 07
Posts: 49
Credit: 1,786,248
RAC: 0
Message 67445 - Posted: 29 Aug 2010, 23:44:41 UTC
Last modified: 29 Aug 2010, 23:46:09 UTC

I haven't been able to get any WUs for about a half month now, and still can't. Einstein@Home should be grateful for the attention I've given it recently. Rosetta's servers are all reading green, but that teraflops number on the home page is reading around 45, which I just read is much lower than it's supposed to be. I just want a couple of units!

Edit: I just got one! Don't know if that will be all or if more are on the horizon.
ID: 67445
Teck7

Joined: 23 Aug 10
Posts: 2
Credit: 198,527
RAC: 0
Message 67446 - Posted: 30 Aug 2010, 0:25:38 UTC

Does Rosetta, when it is running normally, hand out work units for GPUs? I just joined recently, and though I am having trouble keeping my 52 cores' worth of crunching power fed, they are all more or less well fed (I increased the default run time to 6 hrs and added a 1-day buffer for work).
ID: 67446
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 67447 - Posted: 30 Aug 2010, 0:42:37 UTC - in response to Message 67446.  

Does Rosetta, when it is running normally, hand out work units for GPUs? I just joined recently, and though I am having trouble keeping my 52 cores' worth of crunching power fed, they are all more or less well fed (I increased the default run time to 6 hrs and added a 1-day buffer for work).


Nope. No GPU.
ID: 67447
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 67449 - Posted: 30 Aug 2010, 1:24:58 UTC - in response to Message 67431.  

Well, I'm apparently receiving Work Units now, and my machine is happily crunching away.

The thing I'm not sure I agree with is having a Web page indicate that all servers are up and running when that doesn't really seem to be the case. As a systems guy, myself (for the past thirty plus years), I can't remember a time when my management would accept I.T. Department explanations that everything was really OK when they could see for themselves that work was not being done, and tasks were not being completed.

Perhaps it was because Manufacturing is a less tolerant environment for I.T. imprecision ... ??

deesy


It also doesn't help when the updates are not being done but the web pages say they are! The problem seems to be that the page showing the status is working, but the actual updating of the status is not, so it is showing old results that are far from accurate. This often happens when one web page depends on another for its data: the first is showing bad data, so the second also shows bad data. The old 'GIGO' has taken over: 'garbage in, garbage out'! My wife's workplace is currently having the same problem. The WHOLE STATE had a hard drive crash and the backup did not kick in, yet the status web page said all was okay. The problem is that the web page was not getting updated and just kept sending out its prior data saying all was okay, when in fact it should have said 'I have no current data', thus indicating a problem. It will be down through the weekend, and many people are working on end-of-month deadlines!! Early next week will be very hectic!!!!!! Rosetta, and probably every project, seems to need a more accurate status reporting system. It is probably a BOINC server-side thing, IMO.


Hmm. IMO, if a Web page can't be made and kept accurate, it should be taken down. It serves no useful purpose if it can't be believed. Either ensure that the Server Status page is accurate, or remove it from the Web site because it helps nobody.

deesy
ID: 67449
LizzieBarry

Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 67450 - Posted: 30 Aug 2010, 1:31:40 UTC - in response to Message 67449.  

IMO, if a Web page can't be made and kept accurate, it should be taken down. It serves no useful purpose if it can't be believed. Either ensure that the Server Status page is accurate, or remove it from the Web site because it helps nobody.

Earlier this evening the page was taken down and seems to be a little more representative now. Still no available work yet, but WiP is rising, which is promising. Fingers crossed.
ID: 67450
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 67454 - Posted: 30 Aug 2010, 6:58:27 UTC - in response to Message 67449.  



Hmm. IMO, if a Web page can't be made and kept accurate, it should be taken down. It serves no useful purpose if it can't be believed. Either ensure that the Server Status page is accurate, or remove it from the Web site because it helps nobody.

deesy


To be fair to the project team, the page is probably completely accurate. All the servers seem to be running and work is being issued. The problem appears to be that the 2,028,571 jobs on the queue server aren't being passed quickly enough to the distribution server(s), which means only a small proportion of tasks are ready to send. I saw several thousand tasks ready to send at times yesterday, but they were gobbled up by hungry crunchers within a few minutes.

Even after the cause of the go-slow is identified and fixed it may be a week before we are back up to normal operations. The servers normally make work available in the tens of thousands of tasks per hour; right now idle crunchers are probably requesting at least 100,000 tasks. It will take time to clear that backlog even when running at full capacity.
ID: 67454
Warped

Joined: 15 Jan 06
Posts: 48
Credit: 1,788,185
RAC: 0
Message 67457 - Posted: 30 Aug 2010, 11:48:39 UTC

The West Coast of the USA should be waking up and getting back to work soon. Hopefully the issue can be resolved quickly.

I suspect the make_work servers need some attention.
ID: 67457
goraxan

Joined: 18 Jul 10
Posts: 6
Credit: 1,143,926
RAC: 0
Message 67461 - Posted: 30 Aug 2010, 13:55:48 UTC - in response to Message 67457.  

The West Coast of the USA should be waking up and getting back to work soon. Hopefully the issue can be resolved quickly.

I suspect the make_work servers need some attention.

How do you know that?
ID: 67461
tzpmrz

Joined: 20 Apr 09
Posts: 2
Credit: 4,356,912
RAC: 0
Message 67462 - Posted: 30 Aug 2010, 15:09:11 UTC - in response to Message 67352.  

I have two PCs, and both stopped getting work at the same time. It's been almost a week now.
ID: 67462
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 67463 - Posted: 30 Aug 2010, 17:41:06 UTC - in response to Message 67454.  



Hmm. IMO, if a Web page can't be made and kept accurate, it should be taken down. It serves no useful purpose if it can't be believed. Either ensure that the Server Status page is accurate, or remove it from the Web site because it helps nobody.

deesy


To be fair to the project team, the page is probably completely accurate. All the servers seem to be running and work is being issued. The problem appears to be that the 2,028,571 jobs on the queue server aren't being passed quickly enough to the distribution server(s), which means only a small proportion of tasks are ready to send. I saw several thousand tasks ready to send at times yesterday, but they were gobbled up by hungry crunchers within a few minutes.

Even after the cause of the go-slow is identified and fixed it may be a week before we are back up to normal operations. The servers normally make work available in the tens of thousands of tasks per hour; right now idle crunchers are probably requesting at least 100,000 tasks. It will take time to clear that backlog even when running at full capacity.

Murasaki


The servers are going up and down. There is obviously a problem that the team are trying to fix. Problems like this occur once every six to twelve months but are just temporary glitches that last a few days or a couple of weeks at most.


Murasaki


Well, which is it? Is the Web page "probably completely accurate," or are the servers really going "up and down"? Aren't these assertions inconsistent?

deesy
ID: 67463
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 67466 - Posted: 30 Aug 2010, 18:29:04 UTC - in response to Message 67463.  


Well, which is it? Is the Web page "probably completely accurate," or are the servers really going "up and down"? Aren't these assertions inconsistent?

deesy


Nope. Because at the time I said they were going up and down, the page was reporting the servers were up and then the same page at a different time was saying they were down. That seems fairly consistent with the purpose of a server status page.
ID: 67466
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 67467 - Posted: 30 Aug 2010, 18:37:52 UTC - in response to Message 67466.  


Well, which is it? Is the Web page "probably completely accurate," or are the servers really going "up and down"? Aren't these assertions inconsistent?

deesy


Nope. Because at the time I said they were going up and down, the page was reporting the servers were up and then the same page at a different time was saying they were down. That seems fairly consistent with the purpose of a server status page.


Sorry! Beg to differ. I checked the Server Status page numerous times (as did others), and all servers were reported by the page to be up and running. Yet, we were receiving no work to process.

I stand by my position that the Server Status page is of little or no use to contributors because the information being displayed is inaccurate. I also believe that your assertions are not consistent with observable facts. If the servers were really down, and the Server Status page indicated that they were up and running, one wonders how you might know that the servers were really down (or up and down). Are you the System Administrator, or are you just speculating?

deesy
ID: 67467
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 67468 - Posted: 30 Aug 2010, 18:46:10 UTC - in response to Message 67467.  
Last modified: 30 Aug 2010, 18:49:54 UTC

Sorry! Beg to differ. I checked the Server Status page numerous times...


Hmmm?


Are you the System Administrator, or are you just speculating?


Neither. I live almost halfway round the world from the servers, so my comments are based on observable evidence only. Where I am speculating I am careful to include qualifications like "perhaps", "maybe" or "probably".
ID: 67468
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 67469 - Posted: 30 Aug 2010, 19:23:23 UTC

I don't see how it is worth discussing further. Obviously one person's observation at one time of day and another person's observation at another time of day will possibly result in different observations.

Technically the server is up. It is responding to client requests, updating preferences, and updating the databases to reflect the last contact with each host, etc. It is also redistributing tasks that are passing their deadlines or being reported back as aborted or in error.

The server being "up" does not mean it has any work to send out at the instant your host tries to get some. This is why many are reporting sporadically seeing tasks come down to their hosts. Rosetta has a queue of work that feeds into the BOINC server, and so there are two different counts of available work units. The backend queue apparently has 2 million available, but for some reason they have not been processed and fed into the BOINC server.
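As a rough illustration of the two counts described above, here is a toy Python model. It is a sketch under stated assumptions only: the class name, the method names and the feed rate are hypothetical and are not how the BOINC server is actually implemented.

```python
class WorkPipeline:
    """Toy model: a backend queue that feeds a 'ready to send' pool."""

    def __init__(self, backend_jobs, feed_per_tick):
        self.backend_jobs = backend_jobs    # jobs waiting in the backend queue
        self.ready_to_send = 0              # tasks the server can hand out right now
        self.feed_per_tick = feed_per_tick  # hypothetical feeder throughput

    def tick(self):
        """Move up to feed_per_tick jobs from the backend into the ready pool."""
        moved = min(self.feed_per_tick, self.backend_jobs)
        self.backend_jobs -= moved
        self.ready_to_send += moved
        return moved

    def request_work(self, wanted):
        """Answer a host's request. The server is 'up' even if it grants nothing."""
        granted = min(wanted, self.ready_to_send)
        self.ready_to_send -= granted
        return granted


# A stalled feeder: 2 million jobs in the backend, none ready to send.
stalled = WorkPipeline(backend_jobs=2_000_000, feed_per_tick=0)
stalled.tick()
print(stalled.request_work(4))  # 0

# A slow feeder: whichever host asks first gets the work, the next gets none.
slow = WorkPipeline(backend_jobs=2_000_000, feed_per_tick=3)
slow.tick()
print(slow.request_work(4))  # 3
print(slow.request_work(4))  # 0
```

In both cases every request is answered, so the server counts as "up", yet a given host may still come away with nothing.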

Yes, the wording, presentation and accuracy of the server status information on the homepage and on the server status page could be improved. I'm sure it doesn't rank very high on the "to do" list when compared to improving the science code and integrating the work flow system that has been discussed.
Rosetta Moderator: Mod.Sense
ID: 67469
Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 67470 - Posted: 30 Aug 2010, 21:22:20 UTC

Hey guys - I think that it's time to cool it a little bit - I'm sure the folks up in Washington are doing their best and unless you are an absolute credit whore what is this slow down really costing you?

Are you losing data, or are important results you absolutely must have right now being delayed?

I'm going to guess the answer to that one is no.

The only folks being hurt by the slow down are the researchers and the credit whores - and to be honest the slow down has not been all that bad - except for on the 13th I have been running within 10% of my normal daily average.

If I did things right you should be able to see a graph of that below - fresh off the free-dc site.

I'm not doing anything special - my systems are set for a six hour run time and a half day work queue - I am not doing anything to hoard work units in an attempt to bridge the gap. I'm just an average guy with systems set up to crunch, just like everyone else.

Like everyone else, I have had a few cores idle during the past week, but is it worth the stress some of you seem to be dealing with?

I don't think so.

About the only thing I can think of that might make my systems a little different than those run by others is that I don't have a backup project - I am 100% dedicated to Rosetta@home - I do not know the internals of BOINC that well but could it be that if I were busy crunching on a backup project I might not notice it right away when Rosetta work units did become available?

Maybe, I don't know - but I can tell you from what I have seen this whole thing amounts to nothing more than a minor slowdown - and I suspect that a lot of the credit for it being a slowdown instead of an outage goes to the Admins for holding things together.

I'll climb down off my soapbox now, thank you.




ID: 67470
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 67472 - Posted: 30 Aug 2010, 22:20:50 UTC

Hey guys - I think that it's time to cool it a little bit - I'm sure the folks up in Washington are doing their best and unless you are an absolute credit whore what is this slow down really costing you?

Are you losing data, or are important results you absolutely must have right now being delayed?

I'm going to guess the answer to that one is no.

The only folks being hurt by the slow down are the researchers and the credit whores - and to be honest the slow down has not been all that bad - except for on the 13th I have been running within 10% of my normal daily average.

If I did things right you should be able to see a graph of that below - fresh off the free-dc site.

I'm not doing anything special - my systems are set for a six hour run time and a half day work queue - I am not doing anything to hoard work units in an attempt to bridge the gap. I'm just an average guy with systems set up to crunch, just like everyone else.

Like everyone else, I have had a few cores idle during the past week, but is it worth the stress some of you seem to be dealing with?

I don't think so.

About the only thing I can think of that might make my systems a little different than those run by others is that I don't have a backup project - I am 100% dedicated to Rosetta@home - I do not know the internals of BOINC that well but could it be that if I were busy crunching on a backup project I might not notice it right away when Rosetta work units did become available?

Maybe, I don't know - but I can tell you from what I have seen this whole thing amounts to nothing more than a minor slowdown - and I suspect that a lot of the credit for it being a slowdown instead of an outage goes to the Admins for holding things together.

I'll climb down off my soapbox now, thank you.


So let's see, then. You're saying that, because you didn't see an interruption in WU processing, then nobody did! That makes perfect sense . . . in some other universe. Calling people "credit whores" and becoming an apologist for a system that appears to need attention doesn't accomplish much.

We who are trying to help in the search for the causes and cures for deadly diseases do a lot of things to increase our levels of contribution. Some of us purchase more powerful hardware than we might otherwise need. Many of us leave our computers on and running 24/7, paying for the electrical energy required to operate the machines, and the electricity needed to remove the heat byproduct from our homes and places of business. When those who request our assistance in the search for solutions then fail to hold up their end of our partnership, do we not have a right to express our concern?

To attempt to stifle any criticism of the appearance of questionable performance is simply arrogant. The Project should communicate clearly and accurately to its contributors, regardless of responsibility and blame. Not doing so is, IMO, irresponsible and unprofessional. If the servers are down, report them as down. If they are up and running, then tell us why the flow of Work Units has been interrupted. Is that so difficult?

deesy
ID: 67472



©2024 University of Washington
https://www.bakerlab.org