Compute error

Message boards : Number crunching : Compute error

TJ
Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 69964 - Posted: 4 Apr 2011, 21:29:33 UTC

Hello,

I have a lot of compute errors today. A few this morning, 9 this afternoon and 5 in the evening (now).
My wingmen have them too, so it must have to do with the jobs.
Is this known by the project?

Thanks.
Greetings,
TJ.

sparkler99
Joined: 13 Mar 11
Posts: 7
Credit: 5,469
RAC: 0
Message 69965 - Posted: 4 Apr 2011, 22:34:59 UTC

I'm getting them too: client can't open cs_frags.9mers.gz

Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 69967 - Posted: 5 Apr 2011, 4:51:58 UTC - in response to Message 69965.  

Thanks guys for the post. I was getting ready to tear down my system on the bench to try and figure this out, but since you two are getting the same errors I guess I can hold off on that.


sparkler99
Joined: 13 Mar 11
Posts: 7
Credit: 5,469
RAC: 0
Message 69968 - Posted: 5 Apr 2011, 5:36:30 UTC

Lots of people are getting them. If you look at the WUs on your account page, you should see the other people failing the WU as well. I've had 10 fail so far and only 3 good units that don't contain cs_frags.9mers.gz.

bob
Joined: 23 Apr 09
Posts: 2
Credit: 8,854,738
RAC: 0
Message 69970 - Posted: 5 Apr 2011, 7:59:46 UTC

I am still getting the error:

client can't open cs_frags.9mers.gz

Guess some of the secret sauce leaked out?

TJ
Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 69973 - Posted: 5 Apr 2011, 11:33:27 UTC

I think the admins have to investigate this, as I have approx. 55% errors and 45% good ones. However, they error out after a few seconds.
Lots of other crunchers, if not all, have the same.
Greetings,
TJ.

Saenger
Joined: 19 Sep 05
Posts: 271
Credit: 824,883
RAC: 0
Message 69977 - Posted: 5 Apr 2011, 17:23:45 UTC

Count me among the victims ;)

Greetings from Saenger

sparkler99
Joined: 13 Mar 11
Posts: 7
Credit: 5,469
RAC: 0
Message 69978 - Posted: 5 Apr 2011, 18:54:47 UTC

Well, I've got another error after detaching/reattaching the project and redownloading the project files, so that doesn't help.

Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 69979 - Posted: 5 Apr 2011, 19:40:00 UTC

I'm sure the project devs are all over this like white on rice.

Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 69980 - Posted: 5 Apr 2011, 19:56:21 UTC - in response to Message 69979.  

I'm sure the project devs are all over this like white on rice.


But what about brown rice?

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,214,047
RAC: 1,450
Message 69981 - Posted: 6 Apr 2011, 9:16:02 UTC - in response to Message 69980.  

I'm sure the project devs are all over this like white on rice.


But what about brown rice?


Doesn't seem like it! I actually put some PCs back on Rosie and got lots of errors, so it's time for them to move on again! Credits are low enough already, and wasting time is just not worth it. Too many fish in the sea!

sparkler99
Joined: 13 Mar 11
Posts: 7
Credit: 5,469
RAC: 0
Message 69983 - Posted: 6 Apr 2011, 15:15:41 UTC

This guy's PC, ranked 3rd on top hosts, has had its WUs/day knocked down to just 30/day due to this: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1419881 And it doesn't seem like anyone knows there's a problem, as the news on the front page hasn't been updated.

Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 69984 - Posted: 6 Apr 2011, 19:01:49 UTC - in response to Message 69983.  
Last modified: 6 Apr 2011, 19:02:04 UTC

and it doesn't seem like anyone knows there's a problem, as the news on the front page hasn't been updated


Unfortunately, past experience shows that the Rosetta team often go quiet when things start going wrong. They may be aware of the problems but just not posting any updates.

The most likely cause of the problem is a badly designed batch of work units that they should have tested on Ralph@home (Rosetta Alpha) before launching here.

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,214,047
RAC: 1,450
Message 69988 - Posted: 7 Apr 2011, 10:18:31 UTC - in response to Message 69984.  

and it doesn't seem like anyone knows there's a problem, as the news on the front page hasn't been updated


Unfortunately, past experience shows that the Rosetta team often go quiet when things start going wrong. They may be aware of the problems but just not posting any updates.

The most likely cause of the problem is a badly designed batch of work units that they should have tested on Ralph@home (Rosetta Alpha) before launching here.


Confidence: what they are doing, i.e. going silent, does NOT inspire confidence!! Something along the lines of 'we see there is a problem and are working on it, please bear with us' would go a long way towards massaging the egos of those who are providing FREE computing power in a world where there are A TON of other BOINC projects that also need our PCs!! BUT it is THEIR project, and sink or swim, they can do it any way they want to!

Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 69996 - Posted: 9 Apr 2011, 5:37:10 UTC

Well, I am still getting a ** few ** of the tasks with client errors and validate errors (with matching wingman results), but their rate of arrival has greatly fallen off.

Sure wish someone on the project cared enough / respected us enough / or just simply had the "hardware" to communicate with us when things go south.

From my perspective this project has more value to society as a whole than the rest of the pack. The research done here has the potential to drive great leaps in medical science and may just be key to keeping us alive and kicking until "that other group" finally locates My Favorite Martian.

But the success of this project is and will continue to be limited by the lack of responsiveness on the part of the project managers.

If you doubt what I say take a look at the stagnant project TeraFLOPS estimate. It never really recovered after the big server outage we saw a few months back.

I vote we put mod.sense in charge of the whole damn thing. At least he (I assume mod.sense is a "he") makes a real attempt to communicate. Or at least give him a pay raise.


fatbozz
Joined: 10 Dec 05
Posts: 5
Credit: 1,762,734
RAC: 0
Message 69999 - Posted: 9 Apr 2011, 10:25:01 UTC
Last modified: 9 Apr 2011, 10:26:29 UTC

This error occurs only on WUs whose names start with T0xxx.
BOINC 64-bit on Sandy Bridge

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 70000 - Posted: 9 Apr 2011, 16:15:04 UTC
Last modified: 9 Apr 2011, 16:19:23 UTC

I vote we put mod.sense in charge of the whole damn thing. At least he (I assume mod.sense is a "he") makes a real attempt to communicate. Or at least give him a pay raise.


Ha ha. Perhaps you've struck upon the key points there. I am a volunteer, so your two cents is an infinite percentage raise as compared to zero :) ...and because I am a volunteer and otherwise have no expectations to perform a role for the team, I have the time to make the effort. ...and yes, you have your pronouns correct.

As you already suspect, not being on the team nor in their data center, I simply do not know the WU naming pattern, the cause, or the pervasiveness of the recently reported problems. I can only judge by the overall project TFLOPS, and the renewed supply of work, that the failure ratio is not extreme. But I know they watch all of the results and failure rates pretty closely. BOINC's abilities to remove a failing batch are fairly cumbersome and incomplete, and so sometimes it is probably just simpler to let them run their course when they are not consuming CPU time before failure, and not to generate more with the same flaw.

I see they've been testing a mini version 3 over on Ralph, so perhaps that effort has made it more difficult to pretest work units for 2.17 over there.
Rosetta Moderator: Mod.Sense

Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 70001 - Posted: 9 Apr 2011, 17:09:26 UTC

@fatbozz - you are correct - the "Client Error / Fail Almost Right Away" issue seems to be limited to work units whose name starts out with the prefix T0xxx. However, I am also getting "Validate Errors" (with matching wingman results thank you) on ** some ** of the work units starting out with ilv_ and IF3_.

@mod.sense - we know and appreciate that you are an unpaid volunteer. And while I don't know about the other crunchers here, I for one also understand just how much fun it is to try and tap dance around a problem when you are getting little or no input from those responsible for the platform that is struggling.

A couple of samples of work units with the Validate Errors I mentioned would be:

377197127
377200299
374781916

In response to your comment that the "failure rate is not extreme" - well, at least the facts seem to be converging with your statement. I just did a quick "eyeball survey" of my system and see about 5% of the jobs getting the Validate type of error and 10% to 15% getting the Client Error failure.

At the high point I was seeing about 40% of my work units failing with the Client Error issue.
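
For anyone who would rather count than eyeball, here is a minimal sketch of how such a survey could be tallied per work unit prefix. It assumes a hand-made list of (work unit name, outcome) pairs copied from the account's task page; the names, outcome labels, and the `failure_rates` helper below are hypothetical and made up for illustration, not real results or a BOINC API.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: (work unit name, outcome) pairs.
# These are invented for illustration, not actual task results.
tasks = [
    ("T0xxx_example_a", "client_error"),
    ("T0xxx_example_b", "success"),
    ("ilv_example_a", "validate_error"),
    ("ilv_example_b", "success"),
    ("IF3_example_a", "success"),
]

def failure_rates(tasks):
    """Return {prefix: {outcome: fraction}}, where prefix is the
    part of the work unit name before the first underscore."""
    counts = defaultdict(Counter)
    for name, outcome in tasks:
        prefix = name.split("_", 1)[0]
        counts[prefix][outcome] += 1
    rates = {}
    for prefix, outcome_counts in counts.items():
        total = sum(outcome_counts.values())
        rates[prefix] = {o: n / total for o, n in outcome_counts.items()}
    return rates

rates = failure_rates(tasks)
for prefix, by_outcome in sorted(rates.items()):
    # Print each prefix with its outcomes as whole percentages.
    print(prefix, {o: f"{frac:.0%}" for o, frac in by_outcome.items()})
```

Grouping by prefix makes it easy to check whether one batch (such as the T0xxx names) accounts for most of the failures.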

The Final Four is over and done with - gently put, I did not even see WU mentioned. So tell the staff to get back to work! ;)





Hank Barta
Joined: 6 Feb 11
Posts: 14
Credit: 3,943,460
RAC: 0
Message 70002 - Posted: 9 Apr 2011, 18:18:49 UTC
Last modified: 9 Apr 2011, 18:19:56 UTC

I suppose that when there are problems, the folks whose job it is to solve them spend time on that rather than getting into potentially endless conversations on the forum. ;)

I'm happy to know that they monitor the compute errors and will soon become aware of problems.

A new one I just saw is:


Setting up graphics native ...
Setting up folding (abrelax) ...
std::cerr: Exception was thrown:
Atom HE1 90 not found

(See https://boinc.bakerlab.org/rosetta/result.php?resultid=412361359 for further info.)

on Ross3X3_SAVE_ALL_OUT_k034_CS_frag_NOE_cst_005_23917_1066_0

TJ
Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 70003 - Posted: 9 Apr 2011, 22:31:14 UTC - in response to Message 70002.  

I suppose that when there are problems, the folks whose job it is to solve them spend time on that rather than getting into potentially endless conversations on the forum. ;)

I'm happy to know that they monitor the compute errors and will soon become aware of problems.

A new one I just saw is:


Setting up graphics native ...
Setting up folding (abrelax) ...
std::cerr: Exception was thrown:
Atom HE1 90 not found

(See https://boinc.bakerlab.org/rosetta/result.php?resultid=412361359 for further info.)

on Ross3X3_SAVE_ALL_OUT_k034_CS_frag_NOE_cst_005_23917_1066_0


But what mikey says is true. A few words on the main page are enough for us to know what is happening. It's for the sake of medicine that I am here, not for the credits; they are a laugh, but I don't care. I care about medicine for cancer.
And I (we) volunteer for that with power, PC time and wear, and such, so a few words (typed in a minute) are not a big deal.
Greetings,
TJ.



©2024 University of Washington
https://www.bakerlab.org