Message boards : Rosetta@home Science : 90% failure rate
Author | Message |
---|---|
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
... Rosetta seems to just shrug and say it's not our problem, it's your computer. Well I hope that attitude will change soon or I'm taking my computer time elsewhere. Totally agree. I'd like to add some perspective on this. When this thread was started around last Xmas, Rosetta was going through a very bad spot. While the programmers involved with Rosetta were experienced in building and running software for mainframes and/or small grids of computers, I think it is fair to say that none of the team had experience in the complexities that come in when running on a grid of tens of thousands of differing machines. It is also fair to say that the project was over-ambitious in the work it tried to put into the system around last Xmas. In the three months between then and now, the team seem to have worked hard on addressing these issues. The Ralph project has been created from scratch as a first filter to catch the more serious problems before they reach the production-level project. That project is beginning to deliver, but not all the outstanding issues from last Xmas have been caught yet, if I am not mistaken. It is still true that some machines are more vulnerable than others to the outstanding problems. In the short term the only advice that can be given to someone with such a machine is that it seems to be a machine-specific issue - that does not mean that the solution is not being looked for, it means that in all fairness it is one of many outstanding issues and the project team do not advise you to try to hold your breath while they find that particular issue. The project has always said it aims to be the best BOINC project. In terms of feedback on the scientific meaning of the work, I believe Rosetta has almost always met that target. In terms of delivering a robust app across diverse platforms they have not done so well. There are reasons for that. One reason last year was being over-ambitious - they've realised that and backtracked a bit, and quite rightly so. 
There is another reason that will never go away. Rosetta aims to develop and test a multiplicity of different approaches to the same protein folding problem. Rosetta therefore has one more degree of diversity than all the other projects - it is running diverse apps where they are running single apps, and to compare that diversity is part of the point. Even if the Rosetta team were as experienced as the SETI crew, we'd therefore not expect the Rosetta code to cope with as many corner cases as the SETI code does, for example. The team are on a learning curve. The first and most important thing is that they were willing to learn from users from day one, and they have done so, and (it seems to me) are continuing to do so. As m9 says, it takes time. If it seems to take too long for your needs, then you are right: maybe this is not the project that best suits you. I think the folks at Rosetta would respect that choice, even though they'd like the benefit of your cpu power. But please, if you do leave, don't think that the team here are ignoring the users, they are working hard and trying to balance priorities. There was a poster on the wall one place I worked, about how when you are up to your neck in alligators it is easy to forget that you came to drain the swamp. Well Rosetta found a fair few alligators last Xmas. The point for me is not that some of them are still there, but that some of them have gone, and that some drainage has been going on as well. That, for me, is far more important than quibbling over whether they have got the exact balance right. River |
sharder8 Send message Joined: 2 Feb 06 Posts: 7 Credit: 15,648,378 RAC: 0 |
Of the 20 computers that I'm running/have run Rosetta on, only one has had a 90%+ failure rate. That one is a dual Xeon 450 running @ 500MHz. Consequently, that one was moved to another project, as I thought/felt it was/is a machine problem. That box has crunched [FAD], DIMES, and RC5-72 without any problems. In this case, it probably isn't much of a loss to the Rosetta project. Recently though, another machine started having problems and would end up with an error message containing the message "daily quota met". The only way I was able to recover was to do a complete un-install, followed by a clean install. Unfortunately, now it continues to get the error regardless of what I do. That machine is a Mobile 2800+ Sempron. It's currently running DIMES and RC5 without any problems. Finally, I've run into the 1% "stuck" problem. This one is starting to get really tiring, and I've stopped Rosetta on 2 machines that seemed to get by far the majority of the jobs I've had stuck at 1%. I understand that this problem is being worked on and will continue crunching Rosetta on the remainder of my machines. Harder |
R/B Send message Joined: 8 Dec 05 Posts: 195 Credit: 28,095 RAC: 0 |
What is the RC-5 project? I've looked but didn't see anything about it. Thank you. Founder of BOINC GROUP - Objectivists - Philosophically minded rational data crunchers. |
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
What is the RC-5 project? I've looked but didn't see anything about it. Thank you. It's a project trying to crack an encryption algorithm. RC5 Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
Pphalan Send message Joined: 5 Nov 05 Posts: 53 Credit: 291,580 RAC: 0 |
Well after coming home from overseas I still see a problem with the program, and looking at other threads I hear no explanation. It's been a while since the winter holidays. I have shut down 7 of my machines. I have a machine turning in result after result with no CPU time shown and no points, showing no errors on the client. That will be number 8 I am shutting down on this project. Rosetta is not only about to lose me forever on this project but my whole team. I have talked to friends on other teams and you guys would not believe the real dislike that's brewing out there for this program. The attitude around here seems to be so what? Well guess what happens when you get people out there calling Rosetta a lousy DC project in the forums? Explain this to me? https://boinc.bakerlab.org/rosetta/results.php?hostid=58422 Results for computer This machine used to do a good job on every DC project on it.....It's doing nothing worth anything now. http://www.christianboards.org/forum.php http://usalug.org/phpBB2/index.php |
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1323#12948 is a thread where some others are discussing your problem showing up on their Win98 machines. (No time, no credit). Which means we need more Win98 machines testing out Ralph; and monitored by those that keep track of their machines. The 90% failure rate that happened prior to you leaving was described elsewhere as a batch of failing WUs. For this problem.. do you have the option of upgrading to Win2k or WinXP or jumping to Linux? (To help prove that it's an OS issue, not hardware.) Keep in mind that this client is undergoing the same types of problems that other medical apps had in their early days, and those of us lucky to have come in after the problems were ironed out - never got to see. (This is my first time experiencing the "early stage".) But things are improving. Although it looks like we'll need a 4.84 client update for the Win98 users.. David(s)/Rom, etc: How can we help the programmers track down this problem? |
Pphalan Send message Joined: 5 Nov 05 Posts: 53 Credit: 291,580 RAC: 0 |
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1323#12948 I want you to bear in mind that this machine ran Rosetta perfectly as it is set now. Then I had the high failure rate with all nine, and not one of those machines is identical. I do not have the option of using a newer Windows OS. It's a matter of what I use the money for: another cruncher, or buying licenses just for machines doing DC projects. Thank you for your response.... http://www.christianboards.org/forum.php http://usalug.org/phpBB2/index.php |
Whl. Send message Joined: 29 Dec 05 Posts: 203 Credit: 275,802 RAC: 0 |
I don't have time to attach and report back to Ralph right now, or babysit this thing anymore (too much else happening). My machines were working fine up till 4.83 was released. I will let the existing jobs in the cache run and empty and try back here in a month or so. Hope you sort out all the bugs, guys. Good luck and all the best. |
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
Pphlan wrote:
Just so we're all on the same page here, from what I understand, on the Win98 PCs in question the SCIENTIFIC computations work fine (from what I can tell by watching the results output of Pphalan's PC), but NO CREDITS are granted, because BOINC reports 0 seconds and claims 0 credits. Also, AFAIK, everything credit-related (timing, claiming etc) is still done IN BOINC, not in the science application for ALL BOINC projects except SETI-Beta. Apparently the fixes for 4.83 had an effect on BOINC's timing under Win98. I guess the project can run a script to correct the credits for WUs which complete correctly, yet due to Win98/BOINC/R interaction time spent is mis-reported. So the big fuss is (again) about (temporary?) credits. Personally I'd be upset if my PCs spent the time without producing any useful results. I guess everyone is entitled to his priorities. Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
Pphalan Send message Joined: 5 Nov 05 Posts: 53 Credit: 291,580 RAC: 0 |
Pphlan wrote: I said at the beginning of this thread what my priorities are. I have no measure if the machine is doing anything useful though...For all I know it's turned in nothing. LOL http://www.christianboards.org/forum.php http://usalug.org/phpBB2/index.php |
Pphalan Send message Joined: 5 Nov 05 Posts: 53 Credit: 291,580 RAC: 0 |
We should be clear of the "bad" work units by now. There still is a 7% chance of getting a bad random number seed but it should in no way be at 90%. Batch 205 is most definitely done by now. My second post in this thread. http://www.christianboards.org/forum.php http://usalug.org/phpBB2/index.php |
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
I have no measure if the machine is doing anything useful though...For all I know it's turned in nothing. LOL Assuming you're not joking, it's rather easy to tell whether a machine is doing the scientific work or not: you can just click on the resultid URL, e.g.: https://boinc.bakerlab.org/rosetta/result.php?resultid=15867586 Exit status 0 (0x0) stderr out <core_client_version>5.3.1</core_client_version> <stderr_txt> # random seed: 1822271 # cpu_run_time_pref: 7200 # DONE :: 1 starting structures built 11 (nstruct) times # This process generated 11 decoys from 11 attempts </stderr_txt> So you can see that your PC computed 11 predicted protein structures, within the 2hrs (7200sec) it ran on this particular WorkUnit and exited with a status of 0 (success). On WUs/PCs with problems, there are lots of different error codes, which people report in the various specific error-reporting threads in "Number Crunching". This particular issue is a glitch with how BOINC can track process time under Win98 and I've seen it discussed in various other BOINC projects. My 2 cents... Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
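The check described in the post above can also be done by script rather than by eye. The following is only a minimal sketch, assuming the stderr text has already been copied from the result page into a string; the function name `summarize_result` and the returned field names are hypothetical and are not part of BOINC or Rosetta. It relies only on the "This process generated N decoys from M attempts" line shown in the quoted output.

```python
import re

# Sample text taken from the result page quoted in the post above.
SAMPLE_STDERR = """\
<core_client_version>5.3.1</core_client_version>
<stderr_txt>
# random seed: 1822271
# cpu_run_time_pref: 7200
# DONE :: 1 starting structures built 11 (nstruct) times
# This process generated 11 decoys from 11 attempts
</stderr_txt>
"""

# Matches the summary line regardless of how much whitespace pads the numbers.
DECOY_LINE = re.compile(
    r"This process generated\s+(\d+)\s+decoys from\s+(\d+)\s+attempts"
)


def summarize_result(exit_status: int, stderr_text: str) -> dict:
    """Hypothetical helper: report exit success and how many decoys were built."""
    match = DECOY_LINE.search(stderr_text)
    decoys = int(match.group(1)) if match else 0
    attempts = int(match.group(2)) if match else 0
    return {
        "exit_ok": exit_status == 0,  # exit status 0 means success
        "decoys": decoys,             # predicted structures produced
        "attempts": attempts,
    }


summary = summarize_result(0, SAMPLE_STDERR)
print(summary)
```

A result with `exit_ok` true and a non-zero `decoys` count did scientific work, even if the credit column on the results page shows zero.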
Andrew Send message Joined: 19 Sep 05 Posts: 162 Credit: 105,512 RAC: 0 |
This particular issue is a glitch with how BOINC can track process time under Win98 and I've seen it discussed in various other BOINC projects. This is a known issue with BOINC, not Rosetta. It is one reason why the officially supported Windows platforms are only XP, 2000, and 2003 Server. https://boinc.bakerlab.org/rosetta/rah_requirements.php Some people don't have any issue running Win98, others do... you unfortunately are one of the unlucky ones. |
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
Does the error show up in Win98SE, or just Win98? (Or the reverse?) |
Johnathon Send message Joined: 5 Nov 05 Posts: 120 Credit: 138,226 RAC: 0 |
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1177#13069 |
Whl. Send message Joined: 29 Dec 05 Posts: 203 Credit: 275,802 RAC: 0 |
I see Dr Baker says the science is unaffected with the Win98 problem, so I will continue with those machines. |
dgnuff Send message Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
I said at the beginning of this thread what my priorities are. I have no measure if the machine is doing anything useful though...For all I know it's turned in nothing. LOL I suppose that pointing out that error results are extremely useful doesn't matter to you. Even if a WU errors out it helps to identify which WUs have bugs. As any programmer will tell you, it's impossible to fix a bug you can't find. I've had my fair share of error WUs and as far as I'm concerned, they were useful. They did not return any scientific results, but they DID help the Rosetta team debug and improve the application. |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
I said at the beginning of this thread what my priorities are. I have no measure if the machine is doing anything useful though...For all I know it's turned in nothing. LOL Your assessment is correct. Workunits that error before completing a model are very useful in finding errors in the software. BUT if they finish at least one model before they fail, they are also useful for the science. Moderator9 ROSETTA@home FAQ Moderator Contact |
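The rule stated in the post above (errors help find software bugs, and any result with at least one completed model also helps the science) can be written down as a small sketch. The function name `classify_result` and the labels "debugging" and "science" are hypothetical, chosen only for illustration; this is not how the project itself processes results.

```python
from typing import List


def classify_result(exit_status: int, models_completed: int) -> List[str]:
    """Hypothetical helper: list what a returned result is useful for."""
    uses = []
    if exit_status != 0:
        # A workunit that errored out helps locate bugs in the software.
        uses.append("debugging")
    if models_completed >= 1:
        # At least one finished model is usable for the science,
        # even if the workunit failed afterwards.
        uses.append("science")
    return uses


print(classify_result(1, 0))   # errored before finishing any model
print(classify_result(1, 2))   # errored after finishing two models
print(classify_result(0, 11))  # completed normally
```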
Pphalan Send message Joined: 5 Nov 05 Posts: 53 Credit: 291,580 RAC: 0 |
I said at the beginning of this thread what my priorities are. I have no measure if the machine is doing anything useful though...For all I know it's turned in nothing. LOL As I understand it now the problem is with BOINC, not Rosetta. So how's an error with BOINC doing any good for Rosetta? Oh, my primary machine uploaded some more errors for you....it's XP Pro. And all my remotes that keep dropping the program are XP. They have not been added back, just too much of a pain. http://www.christianboards.org/forum.php http://usalug.org/phpBB2/index.php |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
I said at the beginning of this thread what my priorities are. I have no measure if the machine is doing anything useful though...For all I know it's turned in nothing. LOL Some of the issues are BOINC related, but that does not mean that the models completed in a work unit that errors out are not useful. Moreover, ANY errors that are identified (BOINC or otherwise) help improve the application. If it is a BOINC issue, in some cases the application can be modified to work around the problem. But ONLY if the errors can be examined. That is why all of the returned results are useful. Aborted by GUI results are less useful than the ones that are allowed to crash on their own, but they are all useful. In some cases the project asks people who are having errors to connect to the Ralph project. In Ralph the application returns more detailed error results which are used to improve the application. Basically the same code, but used to find and kill the bugs. Try to remember that unlike most every other BOINC project, Rosetta is trying to find the correct computing approach to the protein problem while at the same time modeling the proteins. In other words they are researching the type of computing required to model proteins. This means that the application itself is part of what is being researched. In practical terms that means that the project feels more like a test environment than, say, SETI or Einstein. On most BOINC projects the application code required is very clear and stable. That is not the case where the research is focused on determining in part what processing must actually be performed to accomplish the goal. That is why there is no such thing as "wasted" CPU time on Rosetta. Even the errors are valuable to the research. It does result in lost credit from time to time for some users. But that is why the Rosetta team (unlike most BOINC projects) will frequently go back to award credits. They view the errors as being important to the research. 
In a lot of cases these awards have been to single users for a problem unique to their situation. If you read the boards from the other projects, credit awards after the fact are a very rare thing, and I have never seen credits awarded to individual users for a unique problem. That is not the case here. While there is some delay in the awards due to the time demands placed on the project team, the credit is granted in almost every case where people have asked. Moderator9 ROSETTA@home FAQ Moderator Contact |
©2024 University of Washington
https://www.bakerlab.org