90% failure rate

Message boards : Rosetta@home Science : 90% failure rate

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 12147 - Posted: 17 Mar 2006, 13:08:10 UTC - in response to Message 11758.  

... Rosetta seems to just shrug and say it's not our problem, it's your computer. Well I hope that attitude will change soon or I'm taking my computer time elsewhere.


That is not quite a true picture of what the project is/has done. [...] While the progress may not be fast enough to suit everyone in the user community, and it may not solve all problems for all users when the work is complete, they ARE trying.


Totally agree.

I'd like to add some perspective on this. When this thread was started around last Xmas, Rosetta was going through a very bad spot. While the programmers involved with Rosetta were experienced in building and running software for mainframes and/or small grids of computers, I think it is fair to say that none of the team had experience in the complexities that come in when running on a grid of tens of thousands of differing machines. It is also fair to say that the project was over-ambitious in the work it tried to put into the system around last Xmas.

In the three months between then and now, the team seem to have worked hard on addressing these issues. The Ralph project has been created from scratch as a first-filter to catch the more serious problems before they reach the production level project. That project is beginning to deliver, but not all the outstanding issues from last Xmas have been caught yet, if I am not mistaken.

It is still true that some machines are more vulnerable than others to the outstanding problems. In the short term the only advice that can be given to someone with such a machine is that it seems to be a machine specific issue - that does not mean that the solution is not being looked for, it means that in all fairness it is one of many outstanding issues and the project team do not advise you to try to hold your breath while they find that particular issue.

The project has always said it aims to be the best BOINC project. In terms of feedback on the scientific meaning of the work, I believe Rosetta has almost always met that target. In terms of delivering a robust app across diverse platforms they have not done so well. There are reasons for that. One reason last year was being over-ambitious - they've realised that and backtracked a bit, and quite rightly so.

There is another reason that will never go away.

Rosetta aims to develop and test a multiplicity of different approaches to the same protein folding problem. Rosetta therefore has one more degree of diversity than all the other projects - it is running diverse apps where they are running single apps, and to compare that diversity is part of the point.

Even if the Rosetta team were as experienced as the SETI crew, we'd therefore not expect the Rosetta code to cope with as many corner cases as the SETI code does, for example.

The team are on a learning curve. The first and most important thing is that they were willing to learn from users from day one, and they have done so, and (it seems to me) are continuing to do so. As m9 says, it takes time.

If it seems to take too long for your needs, then you are right: maybe this is not the project that best suits you. I think the folks at Rosetta would respect that choice, even though they'd like the benefit of your cpu power. But please, if you do leave, don't think that the team here are ignoring the users, they are working hard and trying to balance priorities.

There was a poster on the wall one place I worked, about how when you are up to your neck in alligators it is easy to forget that you came to drain the swamp. Well Rosetta found a fair few alligators last Xmas. The point for me is not that some of them are still there, but that some of them have gone, and that some drainage has been going on as well. That, for me, is far more important than quibbling over whether they have got the exact balance right.

River
ID: 12147 · Rating: 1
sharder8
Joined: 2 Feb 06
Posts: 7
Credit: 15,648,378
RAC: 0
Message 12345 - Posted: 20 Mar 2006, 19:24:04 UTC

Of the 20 computers that I'm running/have run Rosetta on, only one has had a 90%+ failure rate. That one is a dual Xeon 450 running @ 500MHz. Consequently, that one was moved to another project, as I thought/felt it was/is a machine problem. That box has crunched [FAD], DIMES, and RC5-72 without any problems. In this case, it probably isn't much of a loss to the Rosetta project.

Recently, though, another machine started having problems and would end up with an error containing the message "daily quota met". The only way I was able to recover was to do a complete uninstall, followed by a clean install. Unfortunately, it now continues to get the error regardless of what I do. That machine is a Mobile 2800+ Sempron. It's currently running DIMES and RC5 without any problems.

Finally, I've run into the 1% "stuck" problem. This one is getting really tiring, and I've stopped Rosetta on the 2 machines that accounted for by far the majority of the jobs I've had stuck at 1%. I understand that this problem is being worked on and will continue crunching Rosetta on the remainder of my machines.

Harder
ID: 12345 · Rating: 0
R/B
Joined: 8 Dec 05
Posts: 195
Credit: 28,095
RAC: 0
Message 12349 - Posted: 20 Mar 2006, 20:52:34 UTC

What is the RC-5 project? I've looked but didn't see anything about it. Thank you.
Founder of BOINC GROUP - Objectivists - Philosophically minded rational data crunchers.


ID: 12349 · Rating: 0
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 12351 - Posted: 20 Mar 2006, 21:44:29 UTC - in response to Message 12349.  

What is the RC-5 project? I've looked but didn't see anything about it. Thank you.


It's a project trying to crack an encryption algorithm.

RC5

Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 12351 · Rating: 0
Pphalan
Joined: 5 Nov 05
Posts: 53
Credit: 291,580
RAC: 0
Message 13022 - Posted: 4 Apr 2006, 3:24:34 UTC

Well, after coming home from overseas I still see a problem with the program and, looking at other threads, find no explanation. It's been a while since the winter holidays. I have shut down 7 of my machines. I have a machine turning in result after result with no CPU time shown and no points, while showing no errors on the client. That will be number 8 I am shutting down on this project.

Rosetta is about to lose not only me forever on this project but my whole team. I have talked to friends on other teams and you guys would not believe the real dislike that's brewing out there for this program. The attitude around here seems to be "so what?" Well, guess what happens when you get people out there calling Rosetta a lousy DC project in the forums?

Explain this to me?
https://boinc.bakerlab.org/rosetta/results.php?hostid=58422
Results for computer

This machine used to do a good job on every DC project on it.....It's doing nothing now worth anything.
http://www.christianboards.org/forum.php
http://usalug.org/phpBB2/index.php
ID: 13022 · Rating: 0
BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 13025 - Posted: 4 Apr 2006, 4:21:46 UTC

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1323#12948
is a thread where some others are discussing your problem showing up on their Win98 machines. (No time, no credit).

Which means we need more Win98 machines testing out Ralph; and monitored by those that keep track of their machines.

The 90% failure rate that happened prior to you leaving was described elsewhere as a batch of failing WUs.

For this problem.. do you have the option of upgrading to Win2k or WinXP or jumping to Linux? (To help prove that it's an OS issue, not hardware.)

Keep in mind that this client is undergoing the same types of problems that other medical apps had in their early days, which those of us lucky enough to have come in after the problems were ironed out never got to see. (This is my first time experiencing the "early stage".) But things are improving, although it looks like we'll need a 4.84 client update for the Win98 users.

David(s)/Rom, etc: How can we help the programmers track down this problem?
ID: 13025 · Rating: 0
Pphalan
Joined: 5 Nov 05
Posts: 53
Credit: 291,580
RAC: 0
Message 13030 - Posted: 4 Apr 2006, 5:30:23 UTC - in response to Message 13025.  
Last modified: 4 Apr 2006, 5:31:09 UTC

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1323#12948
is a thread where some others are discussing your problem showing up on their Win98 machines. (No time, no credit).

Which means we need more Win98 machines testing out Ralph; and monitored by those that keep track of their machines.

The 90% failure rate that happened prior to you leaving was described elsewhere as a batch of failing WUs.

For this problem.. do you have the option of upgrading to Win2k or WinXP or jumping to Linux? (To help prove that it's an OS issue, not hardware.)

Keep in mind that this client is undergoing the same types of problems that other medical apps had in their early days, which those of us lucky enough to have come in after the problems were ironed out never got to see. (This is my first time experiencing the "early stage".) But things are improving, although it looks like we'll need a 4.84 client update for the Win98 users.

David(s)/Rom, etc: How can we help the programmers track down this problem?

I want you to bear in mind that this machine ran Rosetta perfectly as it is set now. Then I had the high failure rate with all nine, and not one of those machines is identical. I do not have the option of using a newer Windows OS. It's a matter of what I use the money for: another cruncher, or buying licenses just for machines doing DC projects.

Thank you for your response....
http://www.christianboards.org/forum.php
http://usalug.org/phpBB2/index.php
ID: 13030 · Rating: 0
Whl.
Joined: 29 Dec 05
Posts: 203
Credit: 275,802
RAC: 0
Message 13032 - Posted: 4 Apr 2006, 5:38:54 UTC

I don't have time to attach and report back to Ralph right now, or babysit this thing anymore (too much else happening). My machines were working fine up till 4.83 was released. I will let the existing jobs in the cache run and empty, and try back here in a month or so. Hope you sort out all the bugs, guys. Good luck and all the best.
ID: 13032 · Rating: 0
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 13048 - Posted: 4 Apr 2006, 17:27:43 UTC
Last modified: 4 Apr 2006, 17:45:14 UTC

Pphalan wrote:

https://boinc.bakerlab.org/rosetta/results.php?hostid=58422

This machine used to do a good job on every DC project on it.....It's doing nothing now worth anything.


Just so we're all on the same page here, from what I understand, on the Win98 PCs in question the SCIENTIFIC computations work fine (from what I can tell by watching the results output of Pphalan's PC), but NO CREDITS are granted, because BOINC reports 0 seconds and claims 0 credits.

Also, AFAIK, everything credit-related (timing, claiming etc) is still done IN BOINC, not in the science application for ALL BOINC projects except SETI-Beta. Apparently the fixes for 4.83 had an effect on BOINC's timing under Win98.

I guess the project can run a script to correct the credits for WUs which complete correctly, yet due to Win98/BOINC/R interaction time spent is mis-reported.
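A rough sketch of what such a correction script might look like, in Python. Everything here is an assumption for illustration: the `Result` record, its field names, and the idea of substituting the user's preferred run time for the missing CPU time are hypothetical, not the project's actual database schema or credit policy.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical record for one returned result (not the real BOINC schema)."""
    result_id: int
    exit_status: int    # 0 means the science application exited successfully
    cpu_time: float     # seconds of CPU time as reported by BOINC
    decoys: int         # number of models the work unit completed
    run_time_pref: int  # the user's preferred run time per work unit, in seconds


def needs_credit_fix(result):
    """True when the science finished but BOINC reported zero CPU time."""
    return result.exit_status == 0 and result.decoys > 0 and result.cpu_time == 0


def estimate_cpu_time(result):
    """Assumed stand-in for the lost timing: the preferred run time."""
    return float(result.run_time_pref)


def results_to_correct(results):
    """Return (result_id, estimated_cpu_seconds) pairs for affected results."""
    return [
        (r.result_id, estimate_cpu_time(r))
        for r in results
        if needs_credit_fix(r)
    ]
```

How the estimated time would then be turned into credit is left out on purpose, since that depends on the project's own rules.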

So the big fuss is (again) about (temporary?) credits. Personally I'd be upset if my PCs spent the time without producing any useful results. I guess everyone is entitled to his priorities.
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 13048 · Rating: 0
Pphalan
Joined: 5 Nov 05
Posts: 53
Credit: 291,580
RAC: 0
Message 13051 - Posted: 4 Apr 2006, 18:44:52 UTC - in response to Message 13048.  

Pphalan wrote:

https://boinc.bakerlab.org/rosetta/results.php?hostid=58422

This machine used to do a good job on every DC project on it.....It's doing nothing now worth anything.


Just so we're all on the same page here, from what I understand, on the Win98 PCs in question the SCIENTIFIC computations work fine (from what I can tell by watching the results output of Pphalan's PC), but NO CREDITS are granted, because BOINC reports 0 seconds and claims 0 credits.

Also, AFAIK, everything credit-related (timing, claiming etc) is still done IN BOINC, not in the science application for ALL BOINC projects except SETI-Beta. Apparently the fixes for 4.83 had an effect on BOINC's timing under Win98.

I guess the project can run a script to correct the credits for WUs which complete correctly, yet due to Win98/BOINC/R interaction time spent is mis-reported.

So the big fuss is (again) about (temporary?) credits. Personally I'd be upset if my PCs spent the time without producing any useful results. I guess everyone is entitled to his priorities.

I said at the beginning of this thread what my priorities are. I have no measure of whether the machine is doing anything useful though...For all I know it's turned in nothing. LOL
http://www.christianboards.org/forum.php
http://usalug.org/phpBB2/index.php
ID: 13051 · Rating: 0
Pphalan
Joined: 5 Nov 05
Posts: 53
Credit: 291,580
RAC: 0
Message 13053 - Posted: 4 Apr 2006, 18:46:17 UTC - in response to Message 7476.  

We should be clear of the "bad" work units by now. There still is a 7% chance of getting a bad random number seed but it should in no way be at 90%. Batch 205 is most definitely done by now.

I am on holiday break, but when I and a few others get back, we will fix the seed problem and grant credit to those affected by the recent issues.
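To put the quoted 7% figure in perspective, here is a small worked calculation in Python. It assumes each work unit independently has a 7% chance of a bad seed, which is a simplification introduced here and not something the project stated; the function name is made up for the example.

```python
from math import comb


def prob_at_least(k, n, p):
    """Binomial tail: probability of at least k failures in n independent tries."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumed per-work-unit chance of a bad random seed, taken from the quote above.
p_bad_seed = 0.07

# Chance that 9 or more out of 10 work units fail from bad seeds alone.
chance = prob_at_least(9, 10, p_bad_seed)
```

Under that independence assumption the chance comes out well below one in a billion, so a sustained 90% failure rate on one machine would have to come from something other than the seed problem.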

I couldn't care less about credit. I want to know that my efforts are doing something for a worthwhile project. I have no interest in projects that are not medical science based. Medical science should have the priority in any project of this type. Life should always be the first consideration. As a Battalion Commander going to Iraq soon, that attitude is foremost on my mind. I want to know something of value is being done. If I find myself in my command throwing resources away on something that is not working, I change what I am doing.

My second post in this thread.
http://www.christianboards.org/forum.php
http://usalug.org/phpBB2/index.php
ID: 13053 · Rating: 0
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 13057 - Posted: 4 Apr 2006, 20:20:34 UTC
Last modified: 4 Apr 2006, 20:23:45 UTC

I have no measure of whether the machine is doing anything useful though...For all I know it's turned in nothing. LOL


Assuming you're not joking, it's rather easy to tell whether a machine is doing the scientific work or not: you can just click on the resultid URL, e.g.:

https://boinc.bakerlab.org/rosetta/result.php?resultid=15867586

Exit status 0 (0x0)
stderr out
<core_client_version>5.3.1</core_client_version>
<stderr_txt>
# random seed: 1822271
# cpu_run_time_pref: 7200
# DONE :: 1 starting structures built 11 (nstruct) times
# This process generated 11 decoys from 11 attempts

</stderr_txt>


So you can see that your PC computed 11 predicted protein structures, within the 2hrs (7200sec) it ran on this particular WorkUnit and exited with a status of 0 (success). On WUs/PCs with problems, there are lots of different error codes, which people report in the various specific error-reporting threads in "Number Crunching".
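For anyone who wants to check many results at once, the same reading of the output can be sketched in Python. This is only an illustration based on the four lines shown above; the function names are made up, and the assumption that other work units print lines in exactly this form is mine, not something the project documents.

```python
import re

# Text copied from the stderr output of the result quoted above.
SAMPLE_STDERR = """\
# random seed: 1822271
# cpu_run_time_pref: 7200
# DONE :: 1 starting structures built 11 (nstruct) times
# This process generated 11 decoys from 11 attempts
"""


def summarize_stderr(stderr_text):
    """Pull the seed, run-time preference and decoy counts out of stderr text.

    Returns a dict; a value is None when the matching line is missing.
    """
    patterns = {
        "random_seed": r"# random seed: (\d+)",
        "cpu_run_time_pref": r"# cpu_run_time_pref: (\d+)",
        "decoys": r"# This process generated (\d+) decoys",
        "attempts": r"decoys from (\d+) attempts",
    }
    summary = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, stderr_text)
        summary[key] = int(match.group(1)) if match else None
    return summary


def did_scientific_work(exit_status, stderr_text):
    """Heuristic (an assumption): exit status 0 and at least one decoy."""
    decoys = summarize_stderr(stderr_text)["decoys"]
    return exit_status == 0 and decoys is not None and decoys > 0
```

Calling `did_scientific_work(0, SAMPLE_STDERR)` on the sample above reports that the work unit produced models, regardless of what CPU time BOINC recorded.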

This particular issue is a glitch with how BOINC can track process time under Win98 and I've seen it discussed in various other BOINC projects.

My 2 cents...

Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 13057 · Rating: 0
Andrew
Joined: 19 Sep 05
Posts: 162
Credit: 105,512
RAC: 0
Message 13063 - Posted: 5 Apr 2006, 2:27:53 UTC - in response to Message 13057.  
Last modified: 5 Apr 2006, 2:29:14 UTC

This particular issue is a glitch with how BOINC can track process time under Win98 and I've seen it discussed in various other BOINC projects.


This is a known issue with BOINC, not Rosetta. It is one reason why the officially supported Windows platforms are only XP, 2000, and 2003 Server.
https://boinc.bakerlab.org/rosetta/rah_requirements.php

Some people don't have any issue running win98, others do... you unfortunately are one of the unlucky ones.
ID: 13063 · Rating: 0
BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 13065 - Posted: 5 Apr 2006, 5:11:02 UTC

Does the error show up in Win98SE, or just Win98? (Or the reverse?)


ID: 13065 · Rating: 0
Johnathon
Joined: 5 Nov 05
Posts: 120
Credit: 138,226
RAC: 0
Message 13070 - Posted: 5 Apr 2006, 6:57:11 UTC

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1177#13069
ID: 13070 · Rating: 0
Whl.
Joined: 29 Dec 05
Posts: 203
Credit: 275,802
RAC: 0
Message 13072 - Posted: 5 Apr 2006, 8:04:21 UTC

I see Dr Baker says the science is unaffected with the Win98 problem, so I will continue with those machines.
ID: 13072 · Rating: 0
dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 13252 - Posted: 8 Apr 2006, 17:38:00 UTC - in response to Message 13051.  

I said at the beginning of this thread what my priorities are. I have no measure of whether the machine is doing anything useful though...For all I know it's turned in nothing. LOL


I suppose that pointing out that error results are extremely useful, doesn't matter to you. Even if a WU errors out it helps to identify which WU's have bugs. As any programmer will tell you, it's impossible to fix a bug you can't find.

I've had my fair share of error WU's and as far as I'm concerned, they were useful. They did not return any scientific results, but they DID help the Rosetta team debug and improve the application.

ID: 13252 · Rating: 0
Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 13285 - Posted: 8 Apr 2006, 22:44:55 UTC - in response to Message 13252.  

I said at the beginning of this thread what my priorities are. I have no measure of whether the machine is doing anything useful though...For all I know it's turned in nothing. LOL


I suppose that pointing out that error results are extremely useful, doesn't matter to you. Even if a WU errors out it helps to identify which WU's have bugs. As any programmer will tell you, it's impossible to fix a bug you can't find.

I've had my fair share of error WU's and as far as I'm concerned, they were useful. They did not return any scientific results, but they DID help the Rosetta team debug and improve the application.


Your assessment is correct. Workunits that error before completing a model are very useful in finding errors in the software. BUT if they finish at least one model before they fail, they are also useful for the science.

Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 13285 · Rating: 0
Pphalan
Joined: 5 Nov 05
Posts: 53
Credit: 291,580
RAC: 0
Message 13318 - Posted: 9 Apr 2006, 14:13:35 UTC - in response to Message 13285.  

I said at the beginning of this thread what my priorities are. I have no measure of whether the machine is doing anything useful though...For all I know it's turned in nothing. LOL


I suppose that pointing out that error results are extremely useful, doesn't matter to you. Even if a WU errors out it helps to identify which WU's have bugs. As any programmer will tell you, it's impossible to fix a bug you can't find.

I've had my fair share of error WU's and as far as I'm concerned, they were useful. They did not return any scientific results, but they DID help the Rosetta team debug and improve the application.


Your assessment is correct. Workunits that error before completing a model are very useful in finding errors in the software. BUT if they finish at least one model before they fail, they are also useful for the science.

As I understand it now, the problem is with BOINC, not Rosetta. So how's an error with BOINC doing any good for Rosetta? Oh, my primary machine uploaded some more errors for you....it's XP Pro. And all my remotes that keep dropping the program are XP. They have not been added back; it's just too much of a pain.
http://www.christianboards.org/forum.php
http://usalug.org/phpBB2/index.php
ID: 13318 · Rating: 0
Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 13341 - Posted: 9 Apr 2006, 18:13:52 UTC - in response to Message 13318.  

I said at the beginning of this thread what my priorities are. I have no measure of whether the machine is doing anything useful though...For all I know it's turned in nothing. LOL


I suppose that pointing out that error results are extremely useful, doesn't matter to you. Even if a WU errors out it helps to identify which WU's have bugs. As any programmer will tell you, it's impossible to fix a bug you can't find.

I've had my fair share of error WU's and as far as I'm concerned, they were useful. They did not return any scientific results, but they DID help the Rosetta team debug and improve the application.


Your assessment is correct. Workunits that error before completing a model are very useful in finding errors in the software. BUT if they finish at least one model before they fail, they are also useful for the science.

As I understand it now, the problem is with BOINC, not Rosetta. So how's an error with BOINC doing any good for Rosetta? Oh, my primary machine uploaded some more errors for you....it's XP Pro. And all my remotes that keep dropping the program are XP. They have not been added back; it's just too much of a pain.


Some of the issues are BOINC related, but that does not mean that the models completed in a work unit that errors out are not useful. Moreover, ANY errors that are identified (BOINC or otherwise) help improve the application. If it is a BOINC issue, in some cases the application can be modified to work around the problem. But ONLY if the errors can be examined. That is why all of the returned results are useful. "Aborted by GUI" results are less useful than the ones that are allowed to crash on their own, but they are all useful.

In some cases the project asks people who are having errors to connect to the Ralph project. In Ralph the application returns more detailed error results which are used to improve the application. Basically the same code, but used to find and kill the bugs.

Try to remember that unlike most every other BOINC project, Rosetta is trying to find the correct computing approach to the protein problem while at the same time modeling the proteins. In other words they are researching the type of computing required to model proteins. This means that the application itself is part of what is being researched. In practical terms that means that the project feels more like a test environment than say SETI or Einstein. On most BOINC projects the application code required is very clear and stable. That is not the case where the research is focused on determining in part what processing must actually be performed to accomplish the goal.

That is why there is no such thing as "wasted" CPU time on Rosetta. Even the errors are valuable to the research. It does result in lost credit from time to time for some users. But that is why the Rosetta team (unlike most BOINC projects) will frequently go back to award credits. They view the errors as being important to the research. In a lot of cases these awards have been to single users for a problem unique to their situation. If you read the boards from the other projects, credit awards after the fact are a very rare thing, and I have never seen credits awarded to individual users for a unique problem. That is not the case here. While there is some delay in the awards due to the time demands placed on the project team, the credit is granted in almost every case where people have asked.

Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 13341 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org