Too many validate/compute errors

Message boards : Number crunching : Too many validate/compute errors

Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,711,666
RAC: 1,996
Message 53756 - Posted: 18 Jun 2008, 11:01:30 UTC

This is getting stupid.

A combined total of 9 compute or validate errors in 3 days, some within hours of each other.

I thought you guys would do a better job of filtering out bad tasks. I hate wasting CPU time and power on tasks that will not validate or that go down in flames at the end of their run.

Until CASP8 is over, I am tempted to set my run time to 2 hours instead of 4 so I don't waste time on 'junk'.

It is still another 6 hours before I get home and see what errors showed up this time.

You guys know that you are losing good computing power to a lot of hung tasks and validate/compute errors. The lack of anyone saying anything about these problems does not reflect well on RAH.
Jeremy

Joined: 15 May 08
Posts: 13
Credit: 2,636
RAC: 0
Message 53761 - Posted: 18 Jun 2008, 12:27:18 UTC

100% agreed
jaxom1
Joined: 5 Jun 06
Posts: 180
Credit: 1,586,889
RAC: 0
Message 53773 - Posted: 18 Jun 2008, 15:51:35 UTC - in response to Message 53756.  

Although I am not as angry as you seem to be, this is one of the reasons Poem has been getting more work from me recently.

Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,711,666
RAC: 1,996
Message 53777 - Posted: 18 Jun 2008, 17:35:11 UTC

I just aborted the rest of the rb06 tasks; I was getting about a 50% error rate, either compute or validate. Time for some new work.
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 53783 - Posted: 18 Jun 2008, 18:29:16 UTC

Sorry for all the errors. You should still be granted credit.

The rb_06 tasks had a high failure/invalid rate with minirosetta version 1.25 but should have run okay with version 1.28. A relatively small number of rb_06 jobs were issued with 1.25, and it seems you got a bunch of them.

The validate errors were actually not a waste of CPU time, though, and I'll try to explain why. There are a few filters in place to conserve CPU time. Basically, if a structure does not look protein-like at certain stages (that is, it does not pass the filters), the process moves on to the next model. The normal behavior, which is in place in version 1.28 and rosetta++, is for the failed structure to be tagged and written to the final result file. In version 1.25, which was pushed out too quickly in the rush for CASP, the handling of failed structures was not set up correctly, and failed structures were not written to the final result file. rb_06 was an unusually large CASP target for ab initio modeling, so the pass rate was low; with version 1.25, models were therefore often not sent back to our servers, which caused the validate errors.

rb_06 was a Robetta target. Robetta is completely automated and does not use Ralph. Instead, it initially sends out a small batch of jobs and makes sure the success rate is high before sending out more. Unfortunately, you got a bunch of the initial batch before we were able to update to version 1.28.
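
For readers who want the difference between the two versions spelled out, here is a minimal Python sketch of the behavior described in this post. It is not minirosetta or Robetta code. The names `run_task`, `validates`, `success_rate` and `should_release_more`, the record strings, the rule that an empty result file fails validation, and the 0.9 threshold are all assumptions made for illustration.

```python
from typing import List


def run_task(filter_outcomes: List[bool], write_failed: bool) -> List[str]:
    """Return the lines of a hypothetical result file for one task.

    filter_outcomes[i] is True when model i passed every filter.
    write_failed=True mimics the behavior described for 1.28 and rosetta++
    (failed structures are tagged and written out); write_failed=False
    mimics the 1.25 behavior (failed structures are dropped).
    """
    result_lines = []
    for model_id, passed in enumerate(filter_outcomes):
        if passed:
            result_lines.append(f"model {model_id} PASSED")
        elif write_failed:
            result_lines.append(f"model {model_id} FAILED_FILTER")
        # Otherwise the failed structure is skipped and nothing is recorded.
    return result_lines


def validates(result_lines: List[str]) -> bool:
    # Assumed validator rule: a result with no model records is invalid.
    return len(result_lines) > 0


def success_rate(tasks: List[List[bool]], write_failed: bool) -> float:
    """Fraction of tasks whose result file would validate."""
    valid = sum(validates(run_task(t, write_failed)) for t in tasks)
    return valid / len(tasks)


def should_release_more(initial_batch: List[List[bool]],
                        write_failed: bool,
                        threshold: float = 0.9) -> bool:
    # Staged release: only send more jobs if the small initial batch
    # validated at a high enough rate (threshold is an assumed value).
    return success_rate(initial_batch, write_failed) >= threshold


if __name__ == "__main__":
    # A hard target: most models fail the filters.
    batch = [[False, False], [False, True], [False, False, False], [True]]
    print("1.25-like:", success_rate(batch, write_failed=False))
    print("1.28-like:", success_rate(batch, write_failed=True))
```

With a low pass rate, the 1.25-like setting leaves some result files empty, so those tasks fail validation even though the CPU time went into real filtering work; the 1.28-like setting records every model, so the same tasks validate.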

Sorry for the late response. We are working on getting minirosetta more stable and I'll be talking to the developers about the current status of CASP jobs using minirosetta on R@h.
Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,711,666
RAC: 1,996
Message 53788 - Posted: 18 Jun 2008, 19:41:06 UTC

Thank you for taking the time to respond.
So I was just unlucky.

Well, on to better things then.



