Message boards : Number crunching : Compute error
Author | Message |
---|---|
TJ Joined: 29 Mar 09 Posts: 127 Credit: 4,799,890 RAC: 0 |
Hello, I have had a lot of compute errors today: a few this morning, 9 this afternoon and 5 this evening (now). My wingmen have them too, so it must be down to the jobs. Is this known to the project? Thanks. Greetings, TJ. |
sparkler99 Joined: 13 Mar 11 Posts: 7 Credit: 5,469 RAC: 0 |
I'm getting them too: client can't open cs_frags.9mers.gz |
Chris Holvenstot Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Thanks guys for the post - I was getting ready to tear down my system on the bench to try and figure this out, but since you too are getting the same errors I guess I can hold off on that. |
sparkler99 Joined: 13 Mar 11 Posts: 7 Credit: 5,469 RAC: 0 |
Lots of people are getting them. If you look at the WUs on your account page, you should see the other people failing the WU as well. I've had 10 fail so far and only 3 good units, the ones that don't contain cs_frags.9mers.gz |
bob Joined: 23 Apr 09 Posts: 2 Credit: 8,854,738 RAC: 0 |
I am still getting the error: client can't open cs_frags.9mers.gz. Guess some of the secret sauce leaked out? |
TJ Joined: 29 Mar 09 Posts: 127 Credit: 4,799,890 RAC: 0 |
I think the admins have to investigate this, as I have approx. 55% errors and 45% good ones. However they error out after a few seconds. Lots of other crunchers, if not all, have the same. Greetings, TJ. |
Saenger Joined: 19 Sep 05 Posts: 271 Credit: 824,883 RAC: 0 |
|
sparkler99 Joined: 13 Mar 11 Posts: 7 Credit: 5,469 RAC: 0 |
Well, I've got another error after detaching/reattaching the project and redownloading the project files, so that doesn't help. |
Chris Holvenstot Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
I'm sure the project devs are all over this like white on rice. |
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
|
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,214,047 RAC: 1,450 |
I'm sure the project devs are all over this like white on rice. Doesn't seem like it! I actually put some PCs back on Rosie and have lots of errors; it's time for them to move on again! Credits are low enough, wasting time is just not worth it, too many fish in the sea! |
sparkler99 Joined: 13 Mar 11 Posts: 7 Credit: 5,469 RAC: 0 |
This guy's PC, ranked 3rd on top hosts, has been knocked down to just 30 WUs/day due to this: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1419881 and it doesn't seem like anyone knows there's a problem, as the news on the front page hasn't been updated. |
Murasaki Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
and it doesn't seem like anyone knows there's a problem, as the news on the front page hasn't been updated. Unfortunately, past experience shows that the Rosetta team often go quiet when things start going wrong. They may be aware of the problems but just not posting any updates. The most likely cause of the problem is a badly designed batch of work units that they should have tested on Ralph@home (Rosetta Alpha) before launching here. |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,214,047 RAC: 1,450 |
and it doesn't seem like anyone knows there's a problem, as the news on the front page hasn't been updated. Confidence: what they are doing, i.e. going silent, does NOT inspire confidence!! Something along the lines of 'we see there is a problem and are working on it, please bear with us' would go a long way towards massaging the egos of those that are providing FREE computing power in a world where there are A TON of other BOINC projects that also need our PCs!! BUT it is THEIR project, and sink or swim they can do it any way they want to! |
Chris Holvenstot Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Well, I am still getting a ** few ** of the tasks with client errors and validate errors (with matching wingman results), but their rate of arrival has greatly fallen off. Sure wish someone on the project cared enough / respected us enough / or just simply had the "hardware" to communicate with us when things go south. From my perspective this project has more value to society as a whole than the rest of the pack. The research done here has the potential to drive great leaps in medical science and may just be the key to keeping us alive and kicking until "that other group" finally locates My Favorite Martian. But the success of this project is and will continue to be limited by the lack of responsiveness on the part of the project managers. If you doubt what I say, take a look at the stagnant project TeraFLOPS estimate. It never really recovered after the big server outage we saw a few months back. I vote we put mod.sense in charge of the whole damn thing. At least he (I assume mod.sense is a "he") makes a real attempt to communicate. Or at least give him a pay raise. |
fatbozz Joined: 10 Dec 05 Posts: 5 Credit: 1,762,734 RAC: 0 |
This error occurs only on WUs whose names start with T0xxx. BOINC 64-bit @ Sandy Bridge |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I vote we put mod.sense in charge of the whole damn thing. At least he (I assume mod.sense is a "he") makes a real attempt to communicate. Or at least give him a pay raise. Ha ha. Perhaps you've struck upon the key points there. I am a volunteer, so your two cents is an infinite percentage raise as compared to zero :) ...and because I am a volunteer and otherwise have no expectations to perform a role for the team, I have the time to make the effort. ...and yes, you have your pronouns correct. As you already suspect, not being on the team nor in their data center, I simply do not know the WU naming pattern, nor the cause or pervasiveness of the recently reported problems. I can only judge by the overall project TFLOPS, and the renewed supply of work, that the failure ratio is not extreme. But I know they watch all of the results and failure rates pretty closely. BOINC's abilities to remove a failing batch are fairly cumbersome and incomplete, and so sometimes it is probably just simpler to let them run their course when they are not consuming CPU time before failure, and not to generate more with the same flaw. I see they've been testing a mini version 3 over on Ralph, so perhaps that effort has made it more difficult to pretest work units for 2.17 over there. Rosetta Moderator: Mod.Sense |
Chris Holvenstot Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
@fatbozz - you are correct - the "Client Error / Fail Almost Right Away" issue seems to be limited to work units whose name starts with the prefix T0xxx. However, I am also getting "Validate Errors" (with matching wingman results, thank you) on ** some ** of the work units starting with ilv_ and IF3_. @mod.sense - we know and appreciate that you are an unpaid volunteer. And while I don't know about the other crunchers here, I for one also understand just how much fun it is to try and tap dance around a problem when you are getting little or no input from those responsible for the platform that is struggling. A couple of samples of work units with the Validate Errors I mentioned would be: 377197127 377200299 374781916 In response to your comment that the "failure rate is not extreme" - well, at least the facts seem to be converging with your statement. I just did a quick "eyeball survey" of my system and see about 5% of the jobs getting the Validate type of error and 10% to 15% getting the Client Error failure. At the high point I was seeing about 40% of my work units failing with the Client Error issue. The Final Four is over and done with - gently put, I did not even see WU mentioned. So tell the staff to get back to work! ;) |
Hank Barta Joined: 6 Feb 11 Posts: 14 Credit: 3,943,460 RAC: 0 |
I suppose that when there are problems, the folks whose job it is to solve them spend time on that rather than getting into potentially endless conversations on the forum. ;) I'm happy to know that they monitor the compute errors and will soon become aware of problems. A new one I just saw is: Setting up graphics native ... (See https://boinc.bakerlab.org/rosetta/result.php?resultid=412361359 for further info.) on Ross3X3_SAVE_ALL_OUT_k034_CS_frag_NOE_cst_005_23917_1066_0 |
TJ Joined: 29 Mar 09 Posts: 127 Credit: 4,799,890 RAC: 0 |
I suppose that when there are problems, the folks whose job it is to solve them spend time on that rather than getting into potentially endless conversations on the forum. ;) But what mikey says is true. A few words on the main page would be enough for us to know what is happening. It's for the sake of medicine that I am here, not for the credits; they are a laugh, but I don't care. I care about medicine for cancer. And I (we) volunteer for that with power, PC time/wear and such, so a few words (typed in a minute) are not a big deal. Greetings, TJ. |
©2024 University of Washington
https://www.bakerlab.org