Need help fixing problems or avoiding Rosetta Mini

Message boards : Number crunching : Need help fixing problems or avoiding Rosetta Mini

Alan Roberts

Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 54195 - Posted: 5 Jul 2008, 21:16:37 UTC

I've reached the end of my ability to keep up with failing Rosetta Mini work units.

I'm allowed the off-hours use of four dual-CPU Xeon servers (261560, 262128, 261547, and 262119), which crunch Beta jobs without problem, but every time I see their stats fall off and go look, a Mini job is "hung." Hung in this case means CPU time is being consumed well beyond the limit, with no increase in percentage complete. Far worse, once a Mini job is in this state it stops obeying BOINC's time-of-day suspend rules: BOINC shows the job as suspended, but Windows shows it under load, churning away. This is unacceptable; I crunch on these machines off-hours with the permission of a business!
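[Editor's sketch] The "hung" condition described above (CPU time past the limit and still climbing, with no change in percentage complete) can be written down as a small check over two observations of the same task. This is only a minimal Python sketch: the Sample type, the looks_hung name, and the min_progress threshold are made up for illustration, and how the two samples are collected from BOINC is left open.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One observation of a task (hypothetical structure, not a BOINC type)."""
    cpu_seconds: float    # CPU time consumed so far
    fraction_done: float  # progress in the range 0.0 to 1.0


def looks_hung(earlier, later, cpu_limit_seconds, min_progress=1e-4):
    """Return True if the task matches the 'hung' description in the post.

    A task is flagged only when all three conditions hold:
    it is past its CPU limit, it is still consuming CPU,
    and its progress has not moved between the two samples.
    """
    over_limit = later.cpu_seconds > cpu_limit_seconds
    still_burning = later.cpu_seconds > earlier.cpu_seconds
    no_progress = (later.fraction_done - earlier.fraction_done) < min_progress
    return over_limit and still_burning and no_progress
```

Requiring all three conditions keeps a slow but healthy task (still making progress) from being flagged just because it ran long.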

There are multiple Pentium D desktop systems that are seeing large numbers of Mini failures (260104, 259209, and 259666 are examples). I've only been able to spot-check results on these, but the failures all seem to be Mini jobs, while Beta jobs seem to complete. Some of these machines have had so many failures that their daily WU quota has dropped, and they are going idle.

All of this gear is stock Dell hardware ... No overclocking or other pushing the envelope going on. In fact I have them configured to only utilize a single core/CPU, trying to make sure I avoid high system temperatures (and Beta jobs don't seem to be failing).

In order from most to least desirable:

  • If this is an understood problem and there is a work-around or fix (e.g., changing run times, scraping off the project and BOINC and then reinstalling, etc.), can someone tell me what the drill is?

  • If the symptoms described above are not understood, how can I get somebody more information so they can work the problem(s)?

  • If no one has time to work on fixes, is there any way I can tell BOINC to only download Beta jobs, avoiding Mini?


I think I know enough to build an external script that would execute: If job is Mini, then abort job; but that seems likely to just have all these machines lowering their daily WU quota and (if there aren't enough Beta tasks pending) going idle.
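[Editor's sketch] The external "if job is Mini, then abort job" script mentioned above could look roughly like the following Python. It only builds the abort commands and runs nothing. Several things here are assumptions to check against the installed client: that the command-line tool is called boinccmd (older clients may name it differently), that its task listing prints numbered records with "name:" and "project URL:" lines, and that Mini tasks can be recognised by a marker string in the task name. As the post notes, aborting tasks this way may still lower the daily WU quota.

```python
def parse_tasks(text):
    """Parse a task listing into a list of dicts.

    Assumes records are introduced by a separator line such as
    '1) -----------' and followed by 'key: value' lines.
    """
    tasks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if ") ---" in line:
            # Start of a new task record
            current = {}
            tasks.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return tasks


def abort_commands(tasks, marker="mini"):
    """Build one abort command (as an argv list) per matching task.

    'marker' is a hypothetical substring used to spot Mini tasks by name.
    """
    return [
        ["boinccmd", "--task", t["project URL"], t["name"], "abort"]
        for t in tasks
        if "name" in t and "project URL" in t and marker in t["name"].lower()
    ]
```

Keeping the parsing and command-building separate from execution makes it possible to review the list of commands before anything is actually aborted.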

Please reply if there is anything known I can do to get back into stable, "set and forget" Rosetta operation. I hate to drop out during CASP, but the customer who volunteered these machines expects no-impact, off-hours operations.


ID: 54195 · Rating: 0
Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 54215 - Posted: 6 Jul 2008, 23:47:29 UTC - in response to Message 54195.  

I've reached the end of my ability to keep up with failing Rosetta Mini work units. [...]

  • If no one has time to work on fixes, is there any way I can tell BOINC to only download Beta jobs, avoiding Mini?


I think I know enough to build an external script that would execute: If job is Mini, then abort job; but that seems likely to just have all these machines lowering their daily WU quota and (if there aren't enough Beta tasks pending) going idle.


For a quick (temporary) help, take a look at these two messages.

Peter
ID: 54215 · Rating: 0
funkydude

Joined: 15 Jun 08
Posts: 28
Credit: 397,934
RAC: 0
Message 54217 - Posted: 7 Jul 2008, 4:36:53 UTC - in response to Message 54215.  
Last modified: 7 Jul 2008, 4:37:56 UTC

Hey, out of curiosity, what BOINC client are you using? I had severe problems with failures on another project using the normal client; after installing the beta to take advantage of "protected mode," I've never had a problem since.

Link to betas https://boinc.berkeley.edu/download_all.php
ID: 54217 · Rating: 0
Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 54223 - Posted: 7 Jul 2008, 8:50:28 UTC - in response to Message 54217.  

Hey, out of curiosity, what BOINC client are you using?

At the moment, 6.2.4 and 6.2.11. If you're curious, take a look at any user's recently returned result: the client version (for that particular computer) should be shown in the stderr.out section as <core_client_version>N.N.N</core_client_version>. (Unless it has been faked, which also happens with some private builds.)
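[Editor's sketch] Pulling the version out of a saved result page can be done with a short regular expression. This Python sketch assumes only the tag format quoted above; the function name is made up, and fetching the result text is left out.

```python
import re

# Matches the tag quoted in the post, e.g.
# <core_client_version>5.10.45</core_client_version>
_VERSION_RE = re.compile(
    r"<core_client_version>\s*([0-9]+(?:\.[0-9]+)*)\s*</core_client_version>"
)


def core_client_version(result_text):
    """Return the client version string found in the text, or None."""
    match = _VERSION_RE.search(result_text)
    return match.group(1) if match else None
```

As noted above, the reported value is whatever the client wrote, so a private build can report anything.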

I had severe problems with failures on another project using the normal client; after installing the beta to take advantage of "protected mode," I've never had a problem since.

If I may know, what type of problems and where? Might be worth tracking down and fixing.

Link to betas https://boinc.berkeley.edu/download_all.php

Beware that using test versions is often risky if you do not follow the BOINC alpha information channels...

Peter
ID: 54223 · Rating: 0
Alan Roberts

Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 54232 - Posted: 7 Jul 2008, 12:15:39 UTC

Thanks for the pointer. I've got the servers locked down to Beta jobs, and they are obeying BOINC's time-of-day suspends (helps with the geo-politics). I can deploy to the Pentium D desktops that are throwing all the errors on Mini jobs this evening.

Not sure if the previous poster was asking me about BOINC version, but I think everything is at 5.10.30 or higher. I've only been running BOINC updates when I happened to be visiting a machine and it was near a job boundary.

ID: 54232 · Rating: 0
funkydude

Joined: 15 Jun 08
Posts: 28
Credit: 397,934
RAC: 0
Message 54233 - Posted: 7 Jul 2008, 13:04:06 UTC - in response to Message 54232.  
Last modified: 7 Jul 2008, 13:05:16 UTC

It was with the climateprediction project, had constant error after error until using the beta (6.x) to get protected mode.

Perhaps your problems are from an old beta? I saw 6.2.4 somewhere there.
ID: 54233 · Rating: 0
Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 54243 - Posted: 7 Jul 2008, 15:14:39 UTC - in response to Message 54233.  

(Off-topic here):
It was with the climateprediction project, had constant error after error until using the beta (6.x) to get protected mode.

Have you ever asked for help there? You started crunching a few WUs in mid-June, but all with BOINC 5.10.45; there is no beta to be seen in your results.

Peter
ID: 54243 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,701,869
RAC: 2,154
Message 54248 - Posted: 7 Jul 2008, 16:54:39 UTC - in response to Message 54243.  

(Off-topic here):
It was with the climateprediction project, had constant error after error until using the beta (6.x) to get protected mode.

Have you ever asked for help there? You started crunching a few WUs in mid-June, but all with BOINC 5.10.45; there is no beta to be seen in your results.

Peter



what has climate prediction got to do with Rosetta?
you should be posting this over in climate or over in the cafe boards here at RAH
ID: 54248 · Rating: 0
Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 54256 - Posted: 7 Jul 2008, 21:54:19 UTC - in response to Message 54248.  

(Off-topic here):
It was with the climateprediction project, had constant error after error until using the beta (6.x) to get protected mode.

Have you ever asked for help there? You started crunching a few WUs in mid-June, but all with BOINC 5.10.45; there is no beta to be seen in your results.

what has climate prediction got to do with Rosetta?

If you have not noticed yet, I'll reveal it for you: BOINC ;-)

you should be posting this over in climate or over in the cafe boards here at RAH

Maybe you're right (hence my "(Off-topic here)" note), but still, the applications and clients depend strongly on one another, and each suffers from the other's flaws.
Understand it as a search for one more possible common source of errors.

Peter
ID: 54256 · Rating: 0
funkydude

Joined: 15 Jun 08
Posts: 28
Credit: 397,934
RAC: 0
Message 54259 - Posted: 7 Jul 2008, 23:02:54 UTC - in response to Message 54256.  

I gave up on climate prediction after changing to the beta; I didn't like how long it took for 1 result. Rosetta was my first project after the beta.
ID: 54259 · Rating: 0




©2024 University of Washington
https://www.bakerlab.org