Information on Ver 4.97 errors

Author	Message
Dave Wilson Joined: 8 Jan 06 Posts: 35 Credit: 379,049 RAC: 0	Message 13296 - Posted: 9 Apr 2006, 2:08:38 UTC Should we abort the work units that are going to use 4.97?

Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0	Message 13298 - Posted: 9 Apr 2006, 3:16:04 UTC Sounds like "reset project" from the projects tab. This basically aborts any WUs and reloads the application code. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/

AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 13300 - Posted: 9 Apr 2006, 3:29:38 UTC I notice that the HBLR_* WUs have been cancelled. That keeps them from being sent out again, but doesn't remove them from my computers. If my Linux machines successfully crunch and upload them, will the results be useful, or will they automatically be thrown away?

adrianxw Joined: 18 Sep 05 Posts: 655 Credit: 11,899,569 RAC: 2,569	Message 13309 - Posted: 9 Apr 2006, 8:51:47 UTC My machines both run Windows (one NT4, the other XP). Both have seen errors, but both have also run 4.97 to normal completion. Before I disabled Rosetta, I had 6 failures and 4 normal completions with 4.97. It's running again now with 4.98, good job team. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

simpe73 Joined: 20 Feb 06 Posts: 4 Credit: 438,570 RAC: 0	Message 13310 - Posted: 9 Apr 2006, 9:28:52 UTC What is the idea of RALPH? Should these applications be tested there? I'm running RALPH on 3/40 computers I maintain. Is it only a waste of time to run RALPH? I'm not very happy to reset all those 37 computers. You will find that you will never reach 150 teraflops if you keep delivering bugs.

Jimi@0wned.org.uk Joined: 10 Mar 06 Posts: 29 Credit: 335,252 RAC: 0	Message 13314 - Posted: 9 Apr 2006, 12:02:53 UTC Tried a project reset; any new WU fails immediately with: <core_client_version>5.2.13</core_client_version> <message>CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20) </message> What's happening there?
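
The `(0x20)` at the end of that message is a Win32 system error code. A minimal sketch of how one might decode it follows; the lookup table is a small hand-picked subset of the codes defined in `winerror.h`, and the helper name `decode_win32_error` is hypothetical, not part of BOINC. It only names the code; it does not say which process is holding the file.

```python
import re

# Hand-picked subset of Win32 system error codes (values as defined in winerror.h).
# This table is illustrative and far from complete.
WIN32_ERRORS = {
    0x02: "ERROR_FILE_NOT_FOUND",
    0x05: "ERROR_ACCESS_DENIED",
    0x20: "ERROR_SHARING_VIOLATION",
}


def decode_win32_error(message: str) -> str:
    """Pull a hex code such as '(0x20)' out of a message and name it if known.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"\(0x([0-9A-Fa-f]+)\)", message)
    if match is None:
        return "no error code found"
    code = int(match.group(1), 16)
    return f"{code} ({WIN32_ERRORS.get(code, 'unknown code')})"


msg = (
    "CreateProcess() failed - The process cannot access the file "
    "because it is being used by another process. (0x20)"
)
print(decode_win32_error(msg))
```

Code 32 (0x20) is the sharing violation that matches the message text: some other process still has the file open, which would be consistent with an old copy of the application still running after the reset, though the post itself does not confirm that.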

Cureseekers~Kristof Joined: 5 Nov 05 Posts: 80 Credit: 689,603 RAC: 0	Message 13315 - Posted: 9 Apr 2006, 12:03:13 UTC Last modified: 9 Apr 2006, 12:03:40 UTC "What is the idea of RALPH? Should these applications be tested there? I'm running RALPH on 3/40 computers I maintain. Is it only a waste of time to run RALPH? I'm not very happy to reset all those 37 computers. You will find that you will never reach 150 teraflops if you keep delivering bugs." As I've read, these jobs and the engine are tested on the test environment (RALPH). But later, when these were moved to the normal Rosetta environment, the errors came up. So it was unforeseen ... Every application, every DC project, every environment has its problems. We can only thank David (and others?) for reacting that quickly and resetting to the previous version. And this even during a weekend! I guess we'll get more comments from David on Monday in his weblog? Member of Dutch Power Cows

Betting Slip Joined: 26 Sep 05 Posts: 71 Credit: 5,702,246 RAC: 0	Message 13317 - Posted: 9 Apr 2006, 12:10:52 UTC - in response to Message 13315. Last modified: 9 Apr 2006, 12:16:43 UTC "As I've read, these jobs and the engine are tested on the test environment (RALPH). But later, when these were moved to the normal Rosetta environment, the errors came up. So it was unforeseen ... Every application, every DC project, every environment has its problems. We can only thank David (and others?) for reacting that quickly and resetting to the previous version. And this even during a weekend! I guess we'll get more comments from David on Monday in his weblog?" AMEN to that.

AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 13325 - Posted: 9 Apr 2006, 15:05:14 UTC - in response to Message 13315. "As I've read, these jobs and the engine are tested on the test environment (RALPH). But later, when these were moved to the normal Rosetta environment, the errors came up. So it was unforeseen ..." People crunching RALPH saw and reported the same high error rate that people crunching Rosetta are seeing. I have no idea why they went ahead and released this stuff on Rosetta.

adrianxw Joined: 18 Sep 05 Posts: 655 Credit: 11,899,569 RAC: 2,569	Message 13327 - Posted: 9 Apr 2006, 15:43:58 UTC "What is the idea of RALPH? Should these applications be tested there? I'm running RALPH on 3/40 computers I maintain. Is it only a waste of time to run RALPH? I'm not very happy to reset all those 37 computers. You will find that you will never reach 150 teraflops if you keep delivering bugs." Reading the other thread, it would seem that the 4.97 app worked fine with the WUs it had been given. It was then released. It was not until a different set of WUs hit that code that the problems first appeared, both in RALPH and, sadly, in the production project. It is quite possible the new WUs hit a path through the code that had not been run before. These things happen in the best software; testing for absolutely every eventuality tends to add serious delays, and is really only justifiable in safety-critical applications, which this is not. We are here to help these guys with their science. If the new science app delivers better results, then we all win! I'm sure they'll fix this quickly. The suggestion to roll out application changes early in the week is a decent idea though. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

IceQueen41 Joined: 24 Jan 06 Posts: 1 Credit: 65,113 RAC: 0	Message 13328 - Posted: 9 Apr 2006, 16:07:35 UTC Not so sure that everything is working with 4.98... I've got 2 WUs (both of the "7449_largescale..." type) that have been running for about an hour and a half, and are still only at 1.14% and 1.40% (my WU time is set to 2 hours). At this rate they won't finish even in a week. Is anyone else having these problems, or does anyone have any idea what's going on with these?

Buffalo Bill Joined: 25 Mar 06 Posts: 71 Credit: 1,630,458 RAC: 0	Message 13334 - Posted: 9 Apr 2006, 17:05:27 UTC Last modified: 9 Apr 2006, 17:36:46 UTC I'm running one of those too. The protein is rather large. I believe that regardless of the time you have set for your target CPU time, it will complete one full model before it uploads. This seems to be a relax only model. I don't know why but hey, I don't have a PhD in microbiology either. :) Edit: The above post by Moderator9 is exactly why I will be staying with this project. Stuff happens with this kind of research and it's "all about the science". A little instability and a few lost credits are nothing compared to the big picture here.

BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0	Message 13359 - Posted: 9 Apr 2006, 20:42:57 UTC Moderator9: "Last year's Casp" CASP happens every 2 years. The last one finished in Oct of 2004. The results were released in December. Then they give the researchers a year to work on improvements, and they hold another competition. The DC project that I was involved in during CASP 5 and CASP 6 has been shut down since Oct 2004 while they work on improved energy scoring functions. And after all the HBLR failures on Windows client 4.97, I picked up HB_BARCODE_30_1aiu__351_20403_1 and it's worked fine for the last 19ish hours. So I haven't been upgraded to 4.98 (4.83) yet.

Bin Qian Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0	Message 13367 - Posted: 9 Apr 2006, 22:28:58 UTC - in response to Message 13334. "I'm running one of those too. The protein is rather large. I believe that regardless of the time you have set for your target CPU time, it will complete one full model before it uploads. This seems to be a relax only model. I don't know why but hey, I don't have a PhD in microbiology either. :)" You are absolutely right - these 7447_largescale_** jobs are relax only jobs of some relatively larger proteins. Since these proteins are larger, each job will take longer to finish. According to our current statistics, the average CPU time to finish such a job can be anywhere from 2 to 4 hours.

ecafkid Joined: 5 Oct 05 Posts: 40 Credit: 15,177,319 RAC: 0	Message 13389 - Posted: 10 Apr 2006, 13:18:09 UTC
4/9/2006 10:03:52 PM|rosetta@home|Unrecoverable error for result HBLR_1.0_1di2_425_4170_0 ( - exit code -1073741819 (0xc0000005))
4/10/2006 12:42:42 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_2reb_426_3929_0 ( - exit code -1073741819 (0xc0000005))
These 2 errored on 4.97. I have graphics turned off and leave in memory on. This is the only DC project I run. Since turning off graphics these are the first errors I have encountered. Ecaf
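
The two numbers in those log lines are the same value written two ways: -1073741819 is the signed 32-bit reading of 0xc0000005. A minimal sketch of the conversion follows, assuming the exit code is a Windows NTSTATUS value; the one-entry lookup table and the function names are illustrative, not anything BOINC or Rosetta provides.

```python
def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


# One-entry illustrative table; Windows defines many more NTSTATUS values.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def describe_exit_code(exit_code: int) -> str:
    """Format an exit code as hex and attach a name if the table knows it."""
    status = to_unsigned32(exit_code)
    name = NTSTATUS_NAMES.get(status, "unknown status")
    return f"0x{status:08x} ({name})"


print(describe_exit_code(-1073741819))
```

0xc0000005 is the Windows access violation status, meaning the process touched memory it was not allowed to read or write. That identifies the kind of crash but not which part of the 4.97 code triggered it.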

Jeff Gilchrist Joined: 7 Oct 05 Posts: 33 Credit: 2,398,990 RAC: 0	Message 13390 - Posted: 10 Apr 2006, 13:33:55 UTC - in response to Message 13359. "The DC project that I was involved in during CASP 5 and CASP 6 has been shut down since Oct 2004 while they work on improved energy scoring functions." Which one is that, distributed folding? I'm not sure if they are ever coming back...

BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0	Message 13421 - Posted: 10 Apr 2006, 20:23:27 UTC Jeff Gilchrist spoketh: "distributed folding?" Yep.. that's the one.

David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0	Message 13437 - Posted: 11 Apr 2006, 4:38:28 UTC - in response to Message 13300. "I notice that the HBLR_* WUs have been cancelled. That keeps them from being sent out again, but doesn't remove them from my computers. If my Linux machines successfully crunch and upload them, will the results be useful, or will they automatically be thrown away?" Please don't throw them away if they run fine--I'm very curious about the results!

David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0	Message 13438 - Posted: 11 Apr 2006, 5:38:40 UTC I just got back into town an hour ago, and have not yet been able to pinpoint the source of the recent problems. But I want to apologize in any event; the scale of the problems certainly was my fault. Here is what happened: I wanted to test the effects of an improvement in sampling alternative sidechain conformations during the high resolution stage of the search. Tests on our in-house computers showed that this improvement resulted in consistently lower energy structures being found, and there were absolutely no signs of any run time problems. David K. sent out the new version of the code to RALPH on Thursday, and we submitted some test jobs. Friday afternoon we talked, and as there seemed to be no problems on RALPH, and the code change was relatively minor, David sent the new version out to Rosetta@home. I was very eager to see how the improvement in sampling would affect the searches I had been carrying out in the HBLR_1.0 series of runs you all had been doing over the past month, and as I was going out of town for a few days I submitted a large number of jobs Friday evening so that there would be a clear picture when I returned. You can imagine my horror on checking up on Rosetta and RALPH in the few minutes before leaving Saturday morning! It was clear by Saturday that the test jobs I had sent out on RALPH had a high error rate on Windows, and that I had totally jumped the gun by sending out the very large set of runs on Rosetta on Friday. I'm very sorry that I did this, and about the waste of resources and confusion this caused, and definitely learned my lesson--always make sure the RALPH tests are complete and 100% positive before submitting large scale on Rosetta.

Jose Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0	Message 13439 - Posted: 11 Apr 2006, 7:10:02 UTC - in response to Message 13438. All I know is that almost 2 days of my computer time have resulted in errors of the kind you describe. To wit:
16811046 13764140 9 Apr 2006 10:36:23 UTC 11 Apr 2006 7:01:19 UTC Over Client error Computing 12,238.19 37.94 ---
16697013 13665863 8 Apr 2006 21:09:49 UTC 9 Apr 2006 3:20:07 UTC Over Client error Computing 18,578.25 57.60 ---
16613497 13627278 8 Apr 2006 13:05:47 UTC 8 Apr 2006 22:54:25 UTC Over Client error Computing 25,537.47 79.18 ---
16564691 13587556 8 Apr 2006 5:48:01 UTC 8 Apr 2006 15:46:50 UTC Over Client error Computing 23,689.95 73.45
To say the least, it has been frustrating. "This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato