90% failure rate

Message boards : Rosetta@home Science : 90% failure rate

Pphalan
Joined: 5 Nov 05
Posts: 53
Credit: 291,580
RAC: 0
Message 7446 - Posted: 24 Dec 2005, 0:11:59 UTC

I am showing a 90% PLUS failure rate on all jobs done in the last three days on 9 machines. I am headed overseas and have no time for a program that's such high maintenance. My remotes are always shutting down the project. It is simply not worth doing when it becomes nothing but extra wear and tear on my machines running a CPU at 100% non-stop for NOTHING. Never mind the increase in the electric bill. Rosetta is one big disappointment and does not deserve my effort.
http://www.christianboards.org/forum.php
http://usalug.org/phpBB2/index.php
ID: 7446 · Rating: 0
Profile Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,507,064
RAC: 1,089
Message 7451 - Posted: 24 Dec 2005, 0:50:33 UTC - in response to Message 7446.  

It is simply not worth doing when it becomes nothing but extra wear and tear on my machines running a CPU at 100% non-stop for NOTHING.


From the project perspective: Continuing to run accelerates the removal of the "short WUs" from the system. Some of these appear (to me!) to even be returning valid data in the first few passes before they hit the random-number glitch that causes them to error out.

From the participant's perspective: If a "short" WU has taken some amount of CPU time without giving credit, unlike the majority, which fail in just a few seconds, there is the _possibility_ that these will receive credit, after the staff returns from the holidays. The staff has _already_ said that they would grant credit for any of the "DEFAULT_xxxxx_205" results, whether aborted by the participant, or allowed to run until they hit the "maximum CPU" limit. The "short WUs", if you are on a high-speed continuous connection, are really not much of a problem. They come, they run for a few seconds, they error out. It is possible that even _those_ may be granted credit, as the project staff feels pretty bad about having them slip out the door. If you are on dial-up, or pay for your bandwidth, then these _ARE_ a problem, and the simple solution is to suspend Rosetta until they are gone.

Maybe it's just me... but if I were _only_ interested in the credit, I would have my PC running CPDN non-stop right now, as that's the project that grants the highest "credit per hour". Instead, it's running its standard share of Rosetta, CPDN, and Einstein, because those are the three projects I am most interested in at the moment. I understand the "competitive urge", but I guess it's just not _that_ big a deal to me. The project is worthwhile, I'm already donating both my CPU time _and_ my personal time, so the fact that I'm unlikely to earn very many credits in the next few days for that time... who cares?

ID: 7451 · Rating: 0
Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 7454 - Posted: 24 Dec 2005, 0:59:40 UTC

We should be clear of the "bad" work units by now. There still is a 7% chance of getting a bad random number seed but it should in no way be at 90%. Batch 205 is most definitely done by now.

I am on holiday break, but when I and a few others get back, we will fix the seed problem and grant credit to those affected by the recent issues.
ID: 7454 · Rating: 0
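
For a rough sense of scale, here is a minimal Python sketch of the arithmetic behind the 7% figure quoted above. It assumes, purely for illustration, that each result fails independently with probability 0.07 and that a host returns 10 results. Neither assumption comes from the project, and `prob_at_least` is a hypothetical helper name.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Figure quoted in the thread: chance of a bad random number seed.
p_bad_seed = 0.07
# Hypothetical sample size, chosen only for illustration.
n_results = 10

# Mean number of failures under the independence assumption.
expected_failures = n_results * p_bad_seed
# Chance that at least 9 of the 10 results fail from bad seeds alone.
p_nine_or_more = prob_at_least(9, n_results, p_bad_seed)

print(f"expected failures out of {n_results}: {expected_failures:.2f}")
print(f"P(at least 9 of {n_results} fail): {p_nine_or_more:.3e}")
```

Under those assumptions a host would expect fewer than one failure in ten results, so a sustained failure rate near 90% would point to some cause other than the seed problem alone.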
Pphalan
Joined: 5 Nov 05
Posts: 53
Credit: 291,580
RAC: 0
Message 7476 - Posted: 24 Dec 2005, 2:57:42 UTC - in response to Message 7454.  

We should be clear of the "bad" work units by now. There still is a 7% chance of getting a bad random number seed but it should in no way be at 90%. Batch 205 is most definitely done by now.

I am on holiday break, but when I and a few others get back, we will fix the seed problem and grant credit to those affected by the recent issues.

I could care less about credit. I want to know that my efforts are doing something for a worthwhile project. I have no interest in projects that are not medical science based. Medical science should have the priority in any project of this type. Life should always be the first consideration. As a Battalion Commander going to Iraq soon that attitude is foremost on my mind. I want to know something of value is being done. If I find myself in my command throwing resources away on something that is not working I change what I am doing.

ID: 7476 · Rating: 0
Polian
Joined: 21 Sep 05
Posts: 152
Credit: 10,141,266
RAC: 0
Message 7478 - Posted: 24 Dec 2005, 3:21:16 UTC - in response to Message 7476.  

I want to know that my efforts are doing something for a worthwhile project.


Personally...I would say that Rosetta is, easily, the best maintained, smoothest running, most worthwhile distributed computing project currently available. Read the boards some more and the entire site and compare to the other projects and I believe you will soon agree.
ID: 7478 · Rating: 0
Profile Paul D. Buck

Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 7513 - Posted: 24 Dec 2005, 12:05:24 UTC - in response to Message 7476.  

I have no interest in projects that are not medical science based. Medical science should have the priority in any project of this type.

Boy are you in luck ...

You have 3 projects with many more in testing. So Rosetta@Home, Predictor@Home, and WCG are all available for you to run (I have run, or am running all three).

I want to know something of value is being done. If I find myself in my command throwing resources away on something that is not working I change what I am doing.

Paul is a little contrarian with respect to the way many feel about problems like what we see here at Rosetta@Home. I cannot, in my wildest imagination, understand why anyone would think that the project is not more upset by the loss of work effort than we could ever be ...

This is their ticket to fame as it were.

With that in mind, the staff has told us a number of times that even the failures are interesting. Same thing at CPDN. And, though some of the errors were from malformed work units, this has happened on other projects (SETI@Home created 40-60K zero length work units the last outage) and will happen again.

You do have a little bit of a unique situation with many remotes. In this case perhaps Rosetta@Home is not for you at this time. Though I will point out that Predictor@Home has issued work that will pop up a FORTRAN error dialog that stops the computer from running any other BOINC process until "Ok" is pressed (I lost several days processing time over that one). Again, no project is without flaw or will always issue computable work. Rosetta@Home just hit a bad spot, and we hope it is cleared up ...

ID: 7513 · Rating: 0
Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7520 - Posted: 24 Dec 2005, 13:59:00 UTC - in response to Message 7476.  


I could care less about credit. I want to know that my efforts are doing something for a worthwhile project. I have no interest in projects that are not medical science based. Medical science should have the priority in any project of this type. ...


absolutely agree with this.

The reason the project rushed out some new style work units just before the hols was to get some new interesting science crunched while they were away. Worthy motive.

As a rush job, they overlooked some vital points. They have apologised.

Meanwhile they are all off on vacation (or at least home with their families). This means that it will take longer to fix than if it had happened at another time. Except at another time it would not have been a rush job in the first place.

When you are out in Iraq, as a commander there will be times when you have some kind of target of opportunity and you will need to make a snap judgment: go for it unprepared or hang back till you are fully ready but by then the target may have gone. Sometimes you will get it right. Sometimes you will get it wrong and then you will have that awful feeling after the troops have committed to the action that it is all going wrong and it is too late to recall.

People who explain to you where things went wrong will be doing you a favour.

People who go on and on about it once you have understood and admitted the error will not be helping at all. To turn the analogy back to here, I am suggesting to you that you have crossed the line into the latter kind of unhelpful criticism, though clearly you mean to be helpful.

Please, you clearly are not happy crunching for Rosetta in the current situation. Please, suspend work fetch till (say) mid Jan and then check back. I am nothing to this project but a new member, so I am not speaking on behalf of Rosetta but just from myself.

We understand that it was a mistake, we understand that you and many others are upset about it. You are yearning to do something useful and it is all going pear shaped and it is not your fault it is the project's mistake.

The project has shown it understands that. The project plans to do better when they return after the break. They, like you, care about the medical science and with or without your further criticism they will already be regretting the lost science.

Whether we go elsewhere till they have a chance to get things sorted, or whether we stay here, please let's move on. We have had the repentance; it is time for the forgiveness -- especially some will feel at this time of year.

Please?

River~~
ID: 7520 · Rating: 0
Pphalan
Joined: 5 Nov 05
Posts: 53
Credit: 291,580
RAC: 0
Message 7549 - Posted: 24 Dec 2005, 20:21:24 UTC

River do this project a favour and stop talking to me. Now you are going to lecture a multiple combat vet on operations?

It is more than one incident with this project; I will wait and see until after the holidays. I can't believe some people. I have one guy telling me things do not always go right. 3 days ago we buried a man killed in a training accident, and this guy thinks I need a lecture on how things do not always go right. I reminded myself that I am not talking to people who run the project.
ID: 7549 · Rating: -2
Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7561 - Posted: 25 Dec 2005, 1:13:15 UTC - in response to Message 7549.  

River do this project a favour and stop talking to me. Now you are going to lecture a multiple combat vet on operations?

Sorry. You are right, I know nothing about being a soldier.

But it seems fair to me. You have been lecturing computer professionals in dozens of postings about how their job should be done, and what their priorities should be.

One little lecture back from me and you can't hack it.

I will do a deal: if you stop telling the programmers what their priorities should be, I will stop pretending I know anything about your job.

Fair?
ID: 7561 · Rating: 0
Pphalan
Joined: 5 Nov 05
Posts: 53
Credit: 291,580
RAC: 0
Message 7601 - Posted: 25 Dec 2005, 13:37:23 UTC

I am going to ignore your utter arrogance because at no time did I ever tell anyone how to do their job. I want to know why I am contributing less than 10% to this project even now. The machines I left on have done nothing but client errors. I mean right now a success is rare. Bad batch my posterior portion of my body......

One of my Captains called me up to tell me just before midnight he became a Daddy to a girl...3 minutes before midnight....WE....my team has a new member in our family.


SHUT UP RIVER....Every time you say something to me it's an arrogant comment I would not say to a child let alone to an adult.

ID: 7601 · Rating: -1
Profile rbpeake

Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 7623 - Posted: 25 Dec 2005, 17:52:45 UTC - in response to Message 7601.  

Merry Christmas, and peace and goodwill to all! :)
Regards,
Bob P.
ID: 7623 · Rating: 0
Deamiter

Joined: 9 Nov 05
Posts: 26
Credit: 3,793,650
RAC: 0
Message 7625 - Posted: 25 Dec 2005, 19:04:10 UTC

As a bit of a sanity check, do you have each of your BOINC projects set to "leave in memory" when they switch projects?

It's a known issue with the Rosetta alpha project that some machines will error out without this changed. Since you're obviously running multiple projects, you'd have to set it on each project or when you updated, it'd keep switching the setting back and forth.

It's not an ideal situation (they don't claim to be gold or anything) and they're working on the problem. It doesn't affect EVERY computer configuration (as I've never seen the problem on any of my four Dell computers) but it sounds like that could be your problem.

It's something to try if you're still erroring out all your WUs. I haven't seen any problems in the last couple days, so obviously I haven't gotten any of the longer problem WUs...

If that doesn't fix it, or if it's too much for you, I'd say don't hesitate to drop the project altogether. We'll certainly miss your cycles, but running pre-release alpha and beta tests isn't for everybody. I hope you won't hold it against Rosetta in general, and if you do end up dropping Rosetta for now, it'd be nice to see you back when a totally stable version is released.
ID: 7625 · Rating: 0
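
The "leave in memory" option discussed above is a BOINC general preference, normally set on a project's website. As a hedged sketch only: later BOINC clients can also read local overrides from a `global_prefs_override.xml` file in the BOINC data directory, using a `leave_apps_in_memory` element. Whether the client version in use at the time of this thread supported that file is an assumption, and `build_override` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_override(leave_in_memory: bool = True) -> str:
    """Build the text of a minimal preferences override document."""
    root = ET.Element("global_preferences")
    # 1 keeps suspended applications in memory, 0 removes them.
    setting = ET.SubElement(root, "leave_apps_in_memory")
    setting.text = "1" if leave_in_memory else "0"
    return ET.tostring(root, encoding="unicode")


override_xml = build_override()
print(override_xml)
```

The printed text would be saved as `global_prefs_override.xml` in the BOINC data directory, after which the client needs to re-read its preferences. Check the documentation for your own client version before relying on this.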
Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7870 - Posted: 29 Dec 2005, 9:33:39 UTC - in response to Message 7601.  

at no time did I ever tell anyone how to do their job
it seemed to me at the time that you were.

Looking back on your posts now I can't see how I thought that.

Maybe partly frustration from my own boxes' problems - several of my remote boxes crunched nothing useful over Christmas - but others were working fine and it wasn't clear why.

Anyway, however it happened I'm sorry.
Have a good 2006

River~~
ID: 7870 · Rating: 0
Pphalan
Joined: 5 Nov 05
Posts: 53
Credit: 291,580
RAC: 0
Message 7967 - Posted: 30 Dec 2005, 6:31:19 UTC - in response to Message 7870.  

at no time did I ever tell anyone how to do their job
it seemed to me at the time that you were.

Looking back on your posts now I can't see how I thought that.

Maybe partly frustration from my own boxes' problems - several of my remote boxes crunched nothing useful over Christmas - but others were working fine and it wasn't clear why.

Anyway, however it happened I'm sorry.
Have a good 2006

River~~

No problems, I am a little on edge here. Most of my job failures are not the 205 series and I have one machine in its second day on one job and it hasn't budged from 10%. Guess I should abort it but don't know.
ID: 7967 · Rating: 0
Deamiter

Joined: 9 Nov 05
Posts: 26
Credit: 3,793,650
RAC: 0
Message 7993 - Posted: 30 Dec 2005, 16:19:33 UTC - in response to Message 7967.  


No problems, I am a little on edge here. Most of my job failures are not the 205 series and I have one machine in its second day on one job and it hasn't budged from 10%. Guess I should abort it but don't know.

I don't mean to sound repetitive -- especially if you DIDN'T miss my last post. Have you made sure to set each project on the machine to "leave in memory when suspended"?

I haven't seen anybody post that this has been fixed yet -- it seems to be a long-term problem that affects some computers. It's something that'll have to be fixed before the project goes gold, but for now, to run the project you'll just have to leave it in memory.
ID: 7993 · Rating: 0
lats

Joined: 12 Feb 06
Posts: 1
Credit: 1,673,666
RAC: 0
Message 11303 - Posted: 24 Feb 2006, 9:51:33 UTC

This is all well and good but does it survive a reboot? A number of failures appear after rebooting. Is someone fixing the problem?
ID: 11303 · Rating: 0
Profile UsedBits
Joined: 18 Feb 06
Posts: 1
Credit: 650
RAC: 0
Message 11754 - Posted: 7 Mar 2006, 16:31:09 UTC

The four systems I have running Rosetta produce mostly errors. They will crunch hours, maybe days, then throw an error.

I'm going to start the process of removing Rosetta and running something else. This is in the hope that more work and fewer errors are produced.

Besides, Rosetta was chosen in error (through ignorance) - in the hopes that my contribution might benefit Parkinson's disease research. I had embarked on loading Seti@Home after a long absence from them and discovered Rosetta and the others. It just seemed more worthwhile to contribute to medicine than little green men.

However, given that my systems contribute nothing to Rosetta due to the near 100% failure rate, it loses nothing by my absence.

Regards,
UsedBits
ID: 11754 · Rating: 0
Profile Paul Smith

Joined: 2 Dec 05
Posts: 1
Credit: 7,954
RAC: 0
Message 11757 - Posted: 7 Mar 2006, 19:38:46 UTC

Hi, I'm having very similar problems. Hours of computer time then "unrecoverable error for result ......". What sticks in my craw though is that you receive no credit for all the time spent. The fault must lie with Rosetta as every time I look at the results of other participants crunching the same unit, same problem. Rosetta seems to just shrug and say it's not our problem, it's your computer. Well I hope that attitude will change soon or I'm taking my computer time elsewhere.
ID: 11757 · Rating: -1
Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 11758 - Posted: 7 Mar 2006, 20:24:49 UTC - in response to Message 11757.  

... Rosetta seems to just shrug and say it's not our problem, it's your computer. Well I hope that attitude will change soon or I'm taking my computer time elsewhere.


That is not quite a true picture of what the project is doing or has done. This project is about some very complex science. They cannot simply fix all problems for all systems attached to the project with a wave of the hand. Let's remember that the configuration of those 40,000-odd client systems out there is completely beyond the control of the project team, and that at least some of the problems are related to that issue.

The Rosetta project is very aware of the reliability issues. They have gone so far as to establish a test project (RALPH) just to solve the issues. This project is making progress. While the progress may not be fast enough to suit everyone in the user community, and it may not solve all problems for all users when the work is complete, they ARE trying.

If you wish to help solve the problems, please feel free to attach to the RALPH project and help find the bugs. If credit is your main concern, then RALPH may not be for you either. But as a Moderator on this forum, I see every post, and in a lot of cases, people refuse to read the instructions or follow the suggestions offered to solve problems they may be having. Not one member of the Rosetta project has abandoned one single user or avoided trying to help when asked. So to say that the Rosetta team has just shrugged anyone off is just not a fair comment or a correct picture of what is actually going on behind the scenes.

Moderator9
ID: 11758 · Rating: 1
Profile Carlos_Pfitzner
Joined: 22 Dec 05
Posts: 71
Credit: 138,867
RAC: 0
Message 11819 - Posted: 9 Mar 2006, 12:58:06 UTC
Last modified: 9 Mar 2006, 13:06:33 UTC

Seems to me that the error rate drops when I run only one project per PC.
*Thus no app suspend/resume, less swap usage, etc.

To get this, click on "no more work" for all projects
except the one u choose to run on this PC.
*After some time, u will be running only 1 project,
with the benefits of increased stability from less swap and no suspend/resume...

*If u have two PCs u can run either the same project on both PCs,
or run two projects, one project on one PC and the other project on the other PC.

btw: There is a new "Life sciences project", I believe is worth trying
http://qah.uni-muenster.de/scientific.php
http://qah.uni-muenster.de/create_account_form.php?teamid=37
visit our team forum
http://www.fadbeens.co.uk/phpBB2/

ID: 11819 · Rating: 0
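
The "no more work" step described above can also be scripted. This is a sketch assuming a current BOINC client whose command-line tool is `boinccmd` and which accepts the `--project URL nomorework` operation. The project URLs below are placeholders and `nomorework_commands` is a hypothetical helper name. The code only builds the command lines and does not run them.

```python
from typing import List


def nomorework_commands(project_urls: List[str], keep_url: str) -> List[List[str]]:
    """Build one 'no more work' command per project, skipping the one to keep."""
    return [
        ["boinccmd", "--project", url, "nomorework"]
        for url in project_urls
        if url != keep_url
    ]


# Placeholder URLs, not real project addresses.
projects = [
    "https://example.org/project_a/",
    "https://example.org/project_b/",
    "https://example.org/project_c/",
]

commands = nomorework_commands(projects, keep_url="https://example.org/project_a/")
for cmd in commands:
    print(" ".join(cmd))
```

On a machine where `boinccmd` is installed, each list could be passed to `subprocess.run`. Work already downloaded for the other projects would still finish before the host settles on a single project.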

©2024 University of Washington
https://www.bakerlab.org