No Work Units

Message boards : Number crunching : No Work Units

Sid Celery
Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 58479 - Posted: 4 Jan 2009, 18:05:37 UTC - in response to Message 58353.  

Sid, I hold 3 days of reserve just for occasions like this.

I don't like to hog WU's and was reluctant to increase from the default 0.25, but maybe it's a wise move after all. I've been away a few days and seen 13 tasks download, all of which got sent back straight away with checksum errors like others have reported. Then nothing, then 7 more WUs with the same problem. Some came through ok, but my last 2 finished within 30 minutes of me getting home (how I'd appreciate some long-running models right now!)

On the plus side, now that my lockfile errors have disappeared, it may be that I should increase my runtime further to get more out of the few WUs that make it here successfully. That way, maybe I'll call for fewer WUs and help give everyone else access to what's left.

Do people think that's the best plan at the current time - so that even if only one server is running it'll stand a better chance of keeping us busy?

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 58488 - Posted: 4 Jan 2009, 18:25:36 UTC - in response to Message 58479.  

Sid, I hold 3 days of reserve just for occasions like this.

I don't like to hog WU's and was reluctant to increase from the default 0.25, but maybe it's a wise move after all. I've been away a few days and seen 13 tasks download, all of which got sent back straight away with checksum errors like others have reported. Then nothing, then 7 more WUs with the same problem. Some came through ok, but my last 2 finished within 30 minutes of me getting home (how I'd appreciate some long-running models right now!)

On the plus side, now that my lockfile errors have disappeared, it may be that I should increase my runtime further to get more out of the few WUs that make it here successfully. That way, maybe I'll call for fewer WUs and help give everyone else access to what's left.

Do people think that's the best plan at the current time - so that even if only one server is running it'll stand a better chance of keeping us busy?



I would set up for 3 days of extra work and set a longer run time. I would expect a ton of error messages once the system comes back online. Every computer is going to be asking for work and I bet that the server won't be able to handle the crush. But when it is your turn, grab a bunch to keep your system busy for a few days while everyone is getting work for their system and the server is overloaded. I would think that there is more than enough work to go around when the system is running correctly, so I don't think grabbing 3 days extra work is hogging tasks by any means.

Sid Celery
Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 58495 - Posted: 4 Jan 2009, 18:48:31 UTC - in response to Message 58488.  

I would set up for 3 days of extra work and set a longer run time. I would expect a ton of error messages once the system comes back online. Every computer is going to be asking for work and I bet that the server won't be able to handle the crush. But when it is your turn, grab a bunch to keep your system busy for a few days while everyone is getting work for their system and the server is overloaded. I would think that there is more than enough work to go around when the system is running correctly, so I don't think grabbing 3 days extra work is hogging tasks by any means.

I take your point, but I've increased runtimes from 3 to 4 hours for the moment and kept my buffer to one day. Once the servers are back running, everyone will be trawling to fill up their backlogs and the available WUs will get swallowed up by the few, leaving others short. Only when the backlog has been taken up will I consider increasing the buffer.

It makes sense for everyone to reduce their buffer just to allow everyone to get something, then up it again once the rush is over. The rush will be shorter if that happens.

However, I understand human nature and fully realise that people have the attitude "all for one and I hope it's me" and sit on a pile of unused WUs for days while others remain out of work, so my idea will fall on deaf ears.
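
To put rough numbers on the buffer and run-time trade-off discussed above, here is a minimal Python sketch. It assumes a host needs roughly cores x buffer days x 24 / run time tasks to fill its buffer. That formula, the function name `tasks_to_fill_buffer` and the quad-core example are illustrative assumptions, not how the BOINC scheduler actually sizes a work request.

```python
import math


def tasks_to_fill_buffer(cores: int, buffer_days: float, runtime_hours: float) -> int:
    """Rough number of tasks needed to keep `cores` busy for `buffer_days`.

    Hypothetical helper: assumes every task runs for exactly `runtime_hours`
    and that each core runs one task at a time.
    """
    if runtime_hours <= 0:
        raise ValueError("runtime_hours must be positive")
    core_hours = cores * buffer_days * 24
    return math.ceil(core_hours / runtime_hours)


if __name__ == "__main__":
    # Compare the run times and buffer sizes mentioned in the thread
    # for an assumed quad-core host.
    for runtime in (3, 4):
        for buffer_days in (0.25, 1, 3):
            n = tasks_to_fill_buffer(cores=4, buffer_days=buffer_days, runtime_hours=runtime)
            print(f"run time {runtime}h, buffer {buffer_days}d: about {n} tasks")
```

Under these assumptions, moving from a 3-hour to a 4-hour run time cuts a one-day request for a quad from about 32 tasks to about 24, which is the "call for fewer WUs" effect described earlier in the thread.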

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58496 - Posted: 4 Jan 2009, 18:59:04 UTC
Last modified: 4 Jan 2009, 19:02:13 UTC

It makes sense for everyone to reduce their buffer just to allow everyone to get something, then up it again once the rush is over. The rush will be shorter if that happens.


If I had a magic wand, that would be what I'd do. And when work becomes available, I'd give as much as requested to machines with rare internet connections, and nothing to machines that still have work to do. Then I'd catch up on the machines I shorted earlier.

...but the scheduler isn't that sophisticated, and 98% of the people don't read the message boards, so the server is going to be pounded no matter what. But Sid's got the right idea. Just take what you need. Then when work becomes plentiful again, take on a reserve.
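
Purely as an illustration of the "magic wand" policy described above, here is a small Python sketch of one possible reading of it. The `Host` fields, the `allocate` function and the ordering rules are all hypothetical; nothing here reflects the real Rosetta or BOINC scheduler, which, as noted, is not this sophisticated.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    requested: int               # tasks the host asked for
    queued: int                  # tasks it still has waiting locally
    rare_connection: bool = False


def allocate(hosts, available):
    """Hand out `available` tasks in one round.

    Assumed rules (an interpretation, not a real scheduler):
      1. Hosts with rare connections get as much as they asked for.
      2. Other hosts that still have work get nothing and are deferred.
      3. Other idle hosts share whatever is left, in list order.
    Returns (grants, deferred), where deferred lists hosts to catch up later.
    """
    grants = {h.name: 0 for h in hosts}
    deferred = []

    def give(host):
        nonlocal available
        n = min(host.requested, available)
        grants[host.name] = n
        available -= n

    for h in hosts:
        if h.rare_connection:
            give(h)
    for h in hosts:
        if h.rare_connection:
            continue
        if h.queued > 0:
            deferred.append(h.name)
        else:
            give(h)
    return grants, deferred


if __name__ == "__main__":
    hosts = [
        Host("dialup", requested=10, queued=2, rare_connection=True),
        Host("idle_quad", requested=8, queued=0),
        Host("stocked", requested=20, queued=5),
    ]
    print(allocate(hosts, available=15))
```

The deferred list stands in for "catch up on the machines I shorted earlier": a later round would serve those hosts once the rush has passed.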

Sid, in general, I'd say as long as you complete the work before the deadlines, no one is going to accuse you of hoarding. Especially since there is usually plenty of work to go around on Rosetta. On the other hand, the team likes to see results as soon as possible. So the 2 to 3 days buffer is a good compromise. It gives you enough work to ride through almost all outages, and gives the completed results back in a timely manner.

[edit]
Having the buffer of work also gives you some room to suspend your network connection after you see outages, and avoid hitting the server at its busiest times. The problem I always have is remembering to set it back on again the next day.
Rosetta Moderator: Mod.Sense

Sid Celery
Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 58498 - Posted: 4 Jan 2009, 19:32:42 UTC - in response to Message 58496.  

If I had a magic wand, that would be what I'd do. And when work becomes available, I'd give as much as requested to machines with rare internet connections, and nothing to machines that still have work to do. Then I'd catch up on the machines I shorted earlier.

...but the scheduler isn't that sophisticated, and 98% of the people don't read the message boards, so the server is going to be pounded no matter what. But Sid's got the right idea. Just take what you need. Then when work becomes plentiful again, take on a reserve.

There'd be a neat academic exercise for someone here relating to the number of cores available, then RAC etc, but it's not going to come to anything, like you say.

On the plus side, those who don't read the message boards are likely to stay at the default 0.25 buffer anyway and it's only the active crunchers who'll dial in too.

Sid, in general, I'd say as long as you complete the work before the deadlines, no-one is going to accuse you of hoarding. Especially since there is usually plenty of work to go around on Rosetta. On the other hand, the team likes to see results as soon as possible. So the 2 to 3 days buffer is a good compromise. It gives you enough work to ride through almost all outages, and gives the completed results back in a timely manner.

I'm sure that's right, but I'd likely accuse myself of it. I'm a pretty screwed-up individual on that kind of thing! Having lots of WUs hanging round makes people feel warm and fluffy, until something goes wrong and it's several days before they get to the top of the pile and get reported, by which time everyone else is stacked with them too and it becomes a bigger issue.

Yes, that's right. I'm in manufacturing.... ;)

Having the buffer of work also gives you some room to suspend your network connection after you see outages, and avoid hitting the server at its busiest times. The problem I always have is remembering to set it back on again the next day.

Good point, which I hadn't thought about.

These last few days have made me rethink my view on the thread about increasing default run-times site-wide. I still support the view that it should be done step by step (default 4hrs, 2hr minimum first etc) but the urgency of the issue has been highlighted for everyone now.

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 114,430,969
RAC: 55,979
Message 58506 - Posted: 4 Jan 2009, 22:40:40 UTC - in response to Message 58498.  
Last modified: 4 Jan 2009, 22:41:15 UTC

Yes, that's right. I'm in manufacturing.... ;)

Maybe there should be some TPS implementation! Principle 5 might be a good place to start...

http://en.wikipedia.org/wiki/The_Toyota_Way

Sid Celery
Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 58520 - Posted: 5 Jan 2009, 13:08:39 UTC - in response to Message 58506.  

Yes, that's right. I'm in manufacturing.... ;)

Maybe there should be some TPS implementation! Principle 5 might be a good place to start...

http://en.wikipedia.org/wiki/The_Toyota_Way

Yes, you've worked me out very quickly. It was 8-10 years ago I qualified in principles of world class manufacturing at UCE Birmingham, which only formalised what I'd been doing by second nature for the previous 20 years.

Some of these principles can work side by side, but I'd start with 1 and 12, 13 & 14, otherwise 5 becomes another problem rather than a route to a solution.

Far too big a subject to talk about here, but it's possible to see some aspects in action already.

All I'd add is that while the principles are always correct, as users here we need to bear in mind the resources available to put them into effect. We see lots of posts reinforcing the symptoms without giving realistic time to the solution coming through. Sometimes the solution is just temporary and doesn't go to the root cause because of money, time or available expertise. When I see people threatening to abandon the project due to a temporary situation it smacks of impatience, a lack of understanding and even a lack of respect.

Let's give the guys a break occasionally. Not every problem can be solved by flicking the appropriate switch.

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 58524 - Posted: 5 Jan 2009, 14:55:24 UTC - in response to Message 58520.  

Yes, that's right. I'm in manufacturing.... ;)

Maybe there should be some TPS implementation! Principle 5 might be a good place to start...

http://en.wikipedia.org/wiki/The_Toyota_Way

Yes, you've worked me out very quickly. It was 8-10 years ago I qualified in principles of world class manufacturing at UCE Birmingham, which only formalised what I'd been doing by second nature for the previous 20 years.

Some of these principles can work side by side, but I'd start with 1 and 12, 13 & 14, otherwise 5 becomes another problem rather than a route to a solution.

Far too big a subject to talk about here, but it's possible to see some aspects in action already.

All I'd add is that while the principles are always correct, as users here we need to bear in mind the resources available to put them into effect. We see lots of posts reinforcing the symptoms without giving realistic time to the solution coming through. Sometimes the solution is just temporary and doesn't go to the root cause because of money, time or available expertise. When I see people threatening to abandon the project due to a temporary situation it smacks of impatience, a lack of understanding and even a lack of respect.

Let's give the guys a break occasionally. Not every problem can be solved by flicking the appropriate switch.



You've got graduates, students and full-time staff working, but this is still a university project at the moment and still in its infancy. Some of us, including myself, get caught up in expecting perfection from the team. I wonder just how large this "team" is? I think the biggest irritation among the user group is the lack of any news. It seems that Mod is the only one at times who has a vague idea as to what happened. The same goes for the rash of m5 errors, or whatever that was: no news on that. But it being the holidays, I suspect no one is around to explain that error.

Heidi1
Joined: 11 Aug 07
Posts: 49
Credit: 1,786,248
RAC: 0
Message 58531 - Posted: 5 Jan 2009, 18:16:01 UTC

I agree with both of the last two posts: it is a combination of lack of patience (and/or people's backup plans if something like this happens) and the lack of news whenever something like this happens. Also, keep in mind, right now, the public schools (and maybe the UofW) are resuming classes today, and that could be "forcing" the R@H project to briefly go on the backburner as everyone is settling back in with new classes, teachers, etc.

Another thing to keep in mind, also for this particular crisis, the states of Washington and Oregon have been in a weather crisis for the last three weeks, the likes of which we've not seen in over 40 years. Yes, the national news has been covering the storms more from the midwest and New England, but our snowfall has so far seen about 10 times our normal snow amounts. Don't know about Seattle, where R@H is based, but Portland practically shut down for about 5 days leading up to Christmas because we couldn't handle the amount of snow and ice we got. I only mention this, because there could have been server problems during that time that no one could fix because no one could physically drive to the server to fix it. Add to that two holidays a week apart, and that may have made the problems worse.

I'm just trying to alleviate at least a bit of the impatience that's going around. However, more news would be appreciated (and more than just Mod saying something vague).

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58534 - Posted: 5 Jan 2009, 18:25:47 UTC

Yes, unfortunately, I've not heard anything either. So I can only infer from what we all observe and a bit of experience seeing similar holiday symptoms in the past.
Rosetta Moderator: Mod.Sense

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 58535 - Posted: 5 Jan 2009, 19:33:03 UTC

As of 5 Jan 2009 19:35:31 UTC (updated every 10 minutes)

They have taken the whole system down with the exception of the scheduler and the web server! That looks serious enough.

bono_vox
Joined: 5 Dec 05
Posts: 8
Credit: 371,092
RAC: 0
Message 58536 - Posted: 5 Jan 2009, 19:35:46 UTC

Right now (As of 5 Jan 2009 19:35:31 UTC), with the exception of "Data-driven web pages" and the "Scheduler", all programs are "Not Running".


Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 58537 - Posted: 5 Jan 2009, 19:43:51 UTC

Looks like your worries are at an end. I have just had 15 work units downloaded and there are about 19000 in the queue.

Sid Celery
Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 58538 - Posted: 5 Jan 2009, 20:03:40 UTC - in response to Message 58536.  

Right now (As of 5 Jan 2009 19:35:31 UTC), with the exception of "Data-driven web pages" and the "Scheduler", all programs are "Not Running".

I saw the same thing earlier today - maybe 8 hours ago.

But like Evan says, both make_work servers are running now, though not fully delivering requests. I've got 2 WUs already for my quad, which weren't failures from other users, so that's a start. Second requests aren't being filled yet though.

Don't get greedy, guys. Take the first few to get you running, then hold off for a couple of hours until everyone gets working before filling up again.

bono_vox
Joined: 5 Dec 05
Posts: 8
Credit: 371,092
RAC: 0
Message 58539 - Posted: 5 Jan 2009, 20:17:26 UTC
Last modified: 5 Jan 2009, 20:42:08 UTC

Well, now I'm downloading 12 big files (from 2.13 to 12.46MB) named "homfragments_????.zip". Hopefully I'll have enough wu's until everything goes back to normal.

EDIT: Done! 12 new WU's and network activity set to "suspended". My other computer will be doing some WCG for now.

FoldingSolutions
Joined: 2 Apr 06
Posts: 129
Credit: 3,506,690
RAC: 0
Message 58541 - Posted: 5 Jan 2009, 20:54:21 UTC - in response to Message 58539.  
Last modified: 5 Jan 2009, 20:55:11 UTC

Well, now I'm downloading 12 big files (from 2.13 to 12.46MB) named "homfragments_????.zip". Hopefully I'll have enough wu's until everything goes back to normal.

EDIT: Done! 12 new WU's and network activity set to "suspended". My other computer will be doing some WCG for now.


I got 5 WU's now. Just keep pressing update, eventually you get something :)

EDIT: slow downloads though!!

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 58542 - Posted: 5 Jan 2009, 21:05:49 UTC
Last modified: 5 Jan 2009, 21:16:37 UTC

It is still showing no work for me at the moment... I suppose that's due to system overload. Well, it's just going to cycle for a while until it does get work.

Whoa... just 20 mins later I get 21 tasks!

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 58544 - Posted: 5 Jan 2009, 21:17:03 UTC

Morning all.

She's all GREEN now, if I can just get some!

pete.


Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 58545 - Posted: 5 Jan 2009, 21:20:45 UTC - in response to Message 58544.  

Morning all.

She's all GREEN now, if I can just get some!

pete.



Hang in there... I got assigned 21 tasks, but getting them downloaded is another problem at the moment. Probably comm overload.

LizzieBarry
Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 58546 - Posted: 5 Jan 2009, 21:54:10 UTC - in response to Message 58541.  

However, I understand human nature and fully realise that people have the attitude "all for one and I hope it's me"...

Having the buffer of work also gives you some room to suspend your network connection after you see outages, and avoid hitting the server at its busiest times.

EDIT: Done! 12 new WU's and network activity set to "suspended". My other computer will be doing some WCG for now.

Not all human nature, then!

I got 5 WU's now. Just keep pressing update, eventually you get something :)

Ok, maybe some. :)

©2024 University of Washington
https://www.bakerlab.org