Information on Ver 4.97 errors

Author	Message
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0	Message 13290 - Posted: 8 Apr 2006, 23:17:32 UTC Last modified: 8 Apr 2006, 23:17:49 UTC I have just received this message from David Kim, who is working on the version 4.97 error issue as I write this message. I just reverted back to the previous app. You should notice a version 4.98 now, which is really version 4.83 for Windows and Mac, and 4.82 for Linux. You should all see some relief very soon. Your systems should update by themselves when the version change takes place, but if not please do a manual update. Moderator9 ROSETTA@home FAQ Moderator Contact ID: 13290 · Rating: -1 · rate: / Reply Quote

Dave Wilson Send message Joined: 8 Jan 06 Posts: 35 Credit: 379,049 RAC: 0	Message 13296 - Posted: 9 Apr 2006, 2:08:38 UTC Should we abort the work units that are going to use 4.97? ID: 13296 · Rating: 0 · rate: / Reply Quote

Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0	Message 13298 - Posted: 9 Apr 2006, 3:16:04 UTC Sounds like "reset project" from the projects tab. This basically aborts any WUs and reloads the application code. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ ID: 13298 · Rating: 0 · rate: / Reply Quote

AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 13300 - Posted: 9 Apr 2006, 3:29:38 UTC I notice that the HBLR_* WUs have been cancelled. That keeps them from being sent out again, but doesn't remove them from my computers. If my Linux machines successfully crunch and upload them, will the results be useful, or will they automatically be thrown away? ID: 13300 · Rating: 0 · rate: / Reply Quote

Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0	Message 13302 - Posted: 9 Apr 2006, 3:52:30 UTC - in response to Message 13300. Last modified: 9 Apr 2006, 4:33:07 UTC I notice that the HBLR_* WUs have been cancelled. That keeps them from being sent out again, but doesn't remove them from my computers. If my Linux machines successfully crunch and upload them, will the results be useful, or will they automatically be thrown away? They will be used. For what it is worth, the Mac computers are not having any of these problems, so resetting the project is not universally required. There are also some Windows and Linux systems that are not having trouble at this time. Moderator9 ROSETTA@home FAQ Moderator Contact ID: 13302 · Rating: 0 · rate: / Reply Quote

adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 28	Message 13309 - Posted: 9 Apr 2006, 8:51:47 UTC My machines both run Windows, (one NT4, the other XP), both have seen errors, but both have also run 4.97 to normal completion. Before I disabled Rosetta, I had 6 failures and 4 normal with 4.97. It's running again now with 4.98, good job team. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. ID: 13309 · Rating: 0 · rate: / Reply Quote

simpe73 Send message Joined: 20 Feb 06 Posts: 4 Credit: 438,570 RAC: 0	Message 13310 - Posted: 9 Apr 2006, 9:28:52 UTC What is the idea of RALPH? Shouldn't these applications be tested there? I'm running RALPH on 3 of the 40 computers I maintain. Is running RALPH only a waste of time? I'm not very happy about resetting the other 37 computers. You will find that you will never reach 150 teraflops if you keep delivering bugs. ID: 13310 · Rating: 0 · rate: / Reply Quote

Jimi@0wned.org.uk Send message Joined: 10 Mar 06 Posts: 29 Credit: 335,252 RAC: 0	Message 13314 - Posted: 9 Apr 2006, 12:02:53 UTC Tried a project reset; any new WU fails immediately with: <core_client_version>5.2.13</core_client_version> <message>CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20) </message> What's happening there? ID: 13314 · Rating: 0 · rate: / Reply Quote

Cureseekers~Kristof Send message Joined: 5 Nov 05 Posts: 80 Credit: 689,603 RAC: 0	Message 13315 - Posted: 9 Apr 2006, 12:03:13 UTC Last modified: 9 Apr 2006, 12:03:40 UTC What is the idea of RALPH? Shouldn't these applications be tested there? I'm running RALPH on 3 of the 40 computers I maintain. Is running RALPH only a waste of time? I'm not very happy about resetting the other 37 computers. You will find that you will never reach 150 teraflops if you keep delivering bugs. As I've read, these jobs and the engine are tested in the test environment (RALPH). But later, when they were moved to the normal Rosetta environment, the errors came up. So it was unforeseen ... Every application, every DC project, every environment has its problems. We can only thank David (and others?) for reacting so quickly and reverting to the previous version. This even during a weekend! I guess we'll get more comments from David on Monday in his weblog? Member of Dutch Power Cows ID: 13315 · Rating: 0 · rate: / Reply Quote

Betting Slip Send message Joined: 26 Sep 05 Posts: 71 Credit: 5,702,246 RAC: 0	Message 13317 - Posted: 9 Apr 2006, 12:10:52 UTC - in response to Message 13315. Last modified: 9 Apr 2006, 12:16:43 UTC As I've read, these jobs and the engine are tested in the test environment (RALPH). But later, when they were moved to the normal Rosetta environment, the errors came up. So it was unforeseen ... Every application, every DC project, every environment has its problems. We can only thank David (and others?) for reacting so quickly and reverting to the previous version. This even during a weekend! I guess we'll get more comments from David on Monday in his weblog? AMEN to that. ID: 13317 · Rating: 0 · rate: / Reply Quote

AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 13325 - Posted: 9 Apr 2006, 15:05:14 UTC - in response to Message 13315. As I've read, these jobs and the engine are tested in the test environment (RALPH). But later, when they were moved to the normal Rosetta environment, the errors came up. So it was unforeseen ... People crunching Ralph saw and reported the same high error rate that people crunching Rosetta are seeing. I have no idea why they went ahead and released this stuff on Rosetta. ID: 13325 · Rating: 0 · rate: / Reply Quote

adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 28	Message 13327 - Posted: 9 Apr 2006, 15:43:58 UTC What is the idea of RALPH? Shouldn't these applications be tested there? I'm running RALPH on 3 of the 40 computers I maintain. Is running RALPH only a waste of time? I'm not very happy about resetting the other 37 computers. You will find that you will never reach 150 teraflops if you keep delivering bugs. Reading the other thread, it would seem that the 4.97 app worked fine with the wu's it had been given. It was then released. It was not until a different set of wu's hit that code that the problems first appeared, both in RALPH, and sadly, in the production project. It is quite possible the new wu's hit a thread of code that had not been run before. These things happen in the best software; testing for absolutely every eventuality tends to add serious delays, and is really only justifiable in safety-critical applications, which this is not. We are here to help these guys with their science. If the new science app delivers better results, then we all win! I'm sure they'll fix this quickly. The suggestion to roll out application changes early in the week is a decent idea though. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. ID: 13327 · Rating: 0 · rate: / Reply Quote

IceQueen41 Send message Joined: 24 Jan 06 Posts: 1 Credit: 65,113 RAC: 0	Message 13328 - Posted: 9 Apr 2006, 16:07:35 UTC Not so sure that everything is working with 4.98... I've got 2 WUs going (both of the "7449_largescale..." type) that have been going for about an hour and a half, and are still only at 1.14% and 1.40% (my WU time is set to 2 hours). At this rate they won't finish even in a week. Anyone else having these problems or have any idea what's going on with these? ID: 13328 · Rating: 0 · rate: / Reply Quote

Buffalo Bill Send message Joined: 25 Mar 06 Posts: 71 Credit: 1,630,458 RAC: 0	Message 13334 - Posted: 9 Apr 2006, 17:05:27 UTC Last modified: 9 Apr 2006, 17:36:46 UTC I'm running one of those too. The protein is rather large. I believe that regardless of the time you have set for your target cpu time, it will complete one full model before it uploads. This seems to be a relax only model. I don't know why but hey, I don't have a PhD in microbiology either. :) Edit: The above post by Moderator9 is exactly why I will be staying with this project. Stuff happens with this kind of research and it's "all about the science". A little instability and a few lost credits are nothing compared to the big picture here. ID: 13334 · Rating: 0 · rate: / Reply Quote

Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0	Message 13335 - Posted: 9 Apr 2006, 17:05:36 UTC - in response to Message 13328. Last modified: 9 Apr 2006, 17:08:17 UTC Not so sure that everything is working with 4.98... I've got 2 WUs going (both of the "7449_largescale..." type) that have been going for about an hour and a half, and are still only at 1.14% and 1.40% (my WU time is set to 2 hours). At this rate they won't finish even in a week. Anyone else having these problems or have any idea what's going on with these? A large number of the errors are work unit related. As a result the application release will fix a lot of the issues, but there will be some time required for everything to settle out. David Kim is working the problem, and I would expect a statement from Dr. Baker on Monday with more details. The application was very stable in Ralph for a number of the original bug issues, and that is why they released it to the production environment. For some reason the problems have not affected all machines equally. For instance, Mac OS is not having any real problems, and the majority of Windows machines are working with some increase in error rate. The problem seems to be a mixed bag of issues with the new work unit types, and some issue with the application for particular systems. This kind of problem is why what Rosetta is trying to achieve has not been done before. Many BOINC projects are quite stable because the nature of what they are doing is well established, understood and remains the same across ALL of the work they do. Rosetta is not like that. This is a true research project, where everything from the approach to the work, to the actual work itself, and the design of the application is changing to accommodate new concepts and theories. While there are other protein research projects, the entire approach at Rosetta is different. Rosetta is trying to model whole proteins. 
The simple ones work fine, but the complex ones are tricky and that is where the problems come in. Last years CASP competition showed that Rosetta is on the right track. But there will always be issues that arise in pure research such as this. Thanks to those of you who contacted the project directly through the moderator e-mail, the project team was able to jump on this and implement a repair. Moderator9 ROSETTA@home FAQ Moderator Contact ID: 13335 · Rating: 0 · rate: / Reply Quote

BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0	Message 13359 - Posted: 9 Apr 2006, 20:42:57 UTC Moderator9: Last year's Casp CASP happens every 2 years. The last one finished in Oct of 2004. The results were released in December. Then they give the researchers a year to work on improvements, and they hold another competition. The DC project that I was involved in during CASP 5 and CASP 6 has been shut down since Oct 2004 while they work on improved energy scoring functions. And after all the HBLR failures on Windows client 4.97, I picked up HB_BARCODE_30_1aiu__351_20403_1 and it's worked fine for the last 19ish hours. So I haven't been upgraded to 4.98 (4.83) yet. ID: 13359 · Rating: 0 · rate: / Reply Quote

Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0	Message 13361 - Posted: 9 Apr 2006, 21:00:50 UTC - in response to Message 13359. Moderator9: Last year's Casp CASP happens every 2 years. The last one finished in Oct of 2004. The results were released in December. Then they give the researchers a year to work on improvements, and they hold another competition. The DC project that I was involved in during CASP 5 and CASP 6 has been shut down since Oct 2004 while they work on improved energy scoring functions. And after all the HBLR failures on Windows client 4.97, I picked up HB_BARCODE_30_1aiu__351_20403_1 and it's worked fine for the last 19ish hours. So I haven't been upgraded to 4.98 (4.83) yet. Not my first typo of the day. You are correct. I meant to say "the last CASP". Sorry. Moderator9 ROSETTA@home FAQ Moderator Contact ID: 13361 · Rating: 0 · rate: / Reply Quote

Bin Qian Send message Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0	Message 13367 - Posted: 9 Apr 2006, 22:28:58 UTC - in response to Message 13334. I'm running one of those too. The protein is rather large. I believe that regardless of the time you have set for your target cpu time, it will complete one full model before it uploads. This seems to be a relax only model. I don't know why but hey, I don't have a PhD in microbiology either. :) You are absolutely right - these 7447_largescale_** jobs are relax only jobs of some relatively larger proteins. Since these proteins are larger, each job will take longer to finish. According to our current statistics, the average CPU time to finish such a job can be anywhere from 2 to 4 hours. ID: 13367 · Rating: 0 · rate: / Reply Quote

ecafkid Send message Joined: 5 Oct 05 Posts: 40 Credit: 15,177,319 RAC: 0	Message 13389 - Posted: 10 Apr 2006, 13:18:09 UTC 4/9/2006 10:03:52 PM|rosetta@home|Unrecoverable error for result HBLR_1.0_1di2_425_4170_0 ( - exit code -1073741819 (0xc0000005)) 4/10/2006 12:42:42 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_2reb_426_3929_0 ( - exit code -1073741819 (0xc0000005)) These 2 errored on 4.97. I have graphics turned off and leave in memory on. This is the only DC project I run. Since turning off graphics these are the first errors I have encountered. Ecaf ID: 13389 · Rating: 0 · rate: / Reply Quote
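For anyone wondering how the two numbers in the log lines above relate: the decimal exit code -1073741819 and the hex value 0xc0000005 are the same 32-bit pattern, read once as a signed integer and once as an unsigned one. On Windows, 0xC0000005 is the status code for an access violation. The short Python sketch below shows the conversion; the function and table names (`to_unsigned32`, `KNOWN_STATUS`, `describe_exit_code`) are made up for illustration and are not part of BOINC or Rosetta, and the lookup table deliberately lists only the one code seen in this thread.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


# Hypothetical lookup table for illustration; only the code from this
# thread is listed. 0xC0000005 is the Windows access-violation status.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def describe_exit_code(code: int) -> str:
    """Format an exit code the way the log shows it, plus a name if known."""
    unsigned = to_unsigned32(code)
    name = KNOWN_STATUS.get(unsigned, "unknown")
    return f"0x{unsigned:08x} ({name})"


if __name__ == "__main__":
    # The exit code reported for both HBLR work units in the log.
    print(describe_exit_code(-1073741819))
```

The masking step works because Python integers are unbounded: `code & 0xFFFFFFFF` keeps only the low 32 bits, which turns the negative value into the bit pattern the log prints in hex.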

Jeff Gilchrist Send message Joined: 7 Oct 05 Posts: 33 Credit: 2,398,990 RAC: 0	Message 13390 - Posted: 10 Apr 2006, 13:33:55 UTC - in response to Message 13359. The DC project that I was involved in during CASP 5 and CASP 6 has been shut down since Oct 2004 while they work on improved energy scoring functions. Which one is that, distributed folding? I'm not sure if they are ever coming back... ID: 13390 · Rating: 0 · rate: / Reply Quote