Information on Ver 4.97 errors

Message boards : Number crunching : Information on Ver 4.97 errors

Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 13290 - Posted: 8 Apr 2006, 23:17:32 UTC
Last modified: 8 Apr 2006, 23:17:49 UTC

I have just received this message from David Kim, who is working on the version 4.97 error issue as I write this.

I just reverted back to the previous app. You should notice a version
4.98 now, which is really version 4.83 for windows and mac, and 4.82
for linux.


You should all see some relief very soon. Your systems should update by themselves when the version change takes place, but if not, please do a manual update.


Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 13290 · Rating: -1
Dave Wilson
Joined: 8 Jan 06
Posts: 35
Credit: 379,049
RAC: 0
Message 13296 - Posted: 9 Apr 2006, 2:08:38 UTC

Should we abort the work units that are going to use 4.97?
ID: 13296 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 13298 - Posted: 9 Apr 2006, 3:16:04 UTC

Sounds like "reset project" from the projects tab. This basically aborts any WUs and reloads the application code.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 13298 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 13300 - Posted: 9 Apr 2006, 3:29:38 UTC

I notice that the HBLR_* WUs have been cancelled. That keeps them from being sent out again, but doesn't remove them from my computers. If my Linux machines successfully crunch and upload them, will the results be useful, or will they automatically be thrown away?
ID: 13300 · Rating: 0
Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 13302 - Posted: 9 Apr 2006, 3:52:30 UTC - in response to Message 13300.  
Last modified: 9 Apr 2006, 4:33:07 UTC

I notice that the HBLR_* WUs have been cancelled. That keeps them from being sent out again, but doesn't remove them from my computers. If my Linux machines successfully crunch and upload them, will the results be useful, or will they automatically be thrown away?


They will be used. For what it is worth, the Mac computers are not having any of these problems, so resetting the project is not universally required. There are also some Windows and Linux systems that are not having trouble at this time.

Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 13302 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 51
Message 13309 - Posted: 9 Apr 2006, 8:51:47 UTC

My machines both run Windows, (one NT4, the other XP), both have seen errors, but both have also run 4.97 to normal completion. Before I disabled Rosetta, I had 6 failures and 4 normal with 4.97.

It's running again now with 4.98, good job team.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 13309 · Rating: 0
simpe73
Joined: 20 Feb 06
Posts: 4
Credit: 438,570
RAC: 0
Message 13310 - Posted: 9 Apr 2006, 9:28:52 UTC

What is the idea of RALPH? Should these applications be tested there? I'm running RALPH on 3 of the 40 computers I maintain. Is it only a waste of time to run RALPH? I'm not very happy to reset all those 37 computers. You will find that you will never reach 150 teraflops if you keep delivering bugs.
ID: 13310 · Rating: 0
Jimi@0wned.org.uk
Joined: 10 Mar 06
Posts: 29
Credit: 335,252
RAC: 0
Message 13314 - Posted: 9 Apr 2006, 12:02:53 UTC

Tried a project reset, any new WU fails immediately with:

<core_client_version>5.2.13</core_client_version>
<message>CreateProcess() failed - The process cannot access the file because it is being used by another process. (0x20)
</message>

What's happening there?
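
A note on the code in that message: 0x20 is decimal 32, which is the Win32 error ERROR_SHARING_VIOLATION, meaning some other process still has the file open. A common workaround for that kind of error is to wait briefly and try again. Below is a rough Python sketch of that retry idea. It is not what the BOINC client actually does, and `open_with_retry` with its parameters is a made-up helper for illustration only.

```python
import time

ERROR_SHARING_VIOLATION = 0x20  # Win32 error code 32


def open_with_retry(path, attempts=5, delay=1.0, opener=open):
    """Try to open `path`, retrying while another process holds it.

    Hypothetical helper: `opener` is injectable so the retry logic
    can be exercised without a real locked file.
    """
    for attempt in range(attempts):
        try:
            return opener(path, "rb")
        except PermissionError:
            # On Windows, a sharing violation surfaces as PermissionError.
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller see the error
            time.sleep(delay)
```

If the file is still locked after the last attempt, the error is re-raised rather than hidden, so a persistent lock (for example, an old app process that never exited) stays visible.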
ID: 13314 · Rating: 0
Cureseekers~Kristof
Joined: 5 Nov 05
Posts: 80
Credit: 689,603
RAC: 0
Message 13315 - Posted: 9 Apr 2006, 12:03:13 UTC
Last modified: 9 Apr 2006, 12:03:40 UTC

What is the idea of RALPH? Should these applications be tested there? I'm running RALPH on 3 of the 40 computers I maintain. Is it only a waste of time to run RALPH? I'm not very happy to reset all those 37 computers. You will find that you will never reach 150 teraflops if you keep delivering bugs.

As I've read, these jobs and the engine were tested in the test environment (RALPH).
But later, when they were moved to the normal Rosetta environment, the errors came up.
So it was unforeseen ...

Every application, every DC project, every environment has its problems.
We can only thank David (and others?) for reacting so quickly and reverting to the previous version. And this during a weekend!

I guess we'll get more comments by David on Monday in his weblog?
Member of Dutch Power Cows
ID: 13315 · Rating: 0
Betting Slip
Joined: 26 Sep 05
Posts: 71
Credit: 5,702,246
RAC: 0
Message 13317 - Posted: 9 Apr 2006, 12:10:52 UTC - in response to Message 13315.  
Last modified: 9 Apr 2006, 12:16:43 UTC

As I've read, these jobs and the engine were tested in the test environment (RALPH).
But later, when they were moved to the normal Rosetta environment, the errors came up.
So it was unforeseen ...

Every application, every DC project, every environment has its problems.
We can only thank David (and others?) for reacting so quickly and reverting to the previous version. And this during a weekend!

I guess we'll get more comments by David on Monday in his weblog?




AMEN to that.
ID: 13317 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 13325 - Posted: 9 Apr 2006, 15:05:14 UTC - in response to Message 13315.  

As I've read, these jobs and the engine were tested in the test environment (RALPH).
But later, when they were moved to the normal Rosetta environment, the errors came up.
So it was unforeseen ...


People crunching Ralph saw and reported the same high error rate that people crunching Rosetta are seeing. I have no idea why they went ahead and released this stuff on Rosetta.
ID: 13325 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 51
Message 13327 - Posted: 9 Apr 2006, 15:43:58 UTC

What is the idea of RALPH? Should these applications be tested there? I'm running RALPH on 3 of the 40 computers I maintain. Is it only a waste of time to run RALPH? I'm not very happy to reset all those 37 computers. You will find that you will never reach 150 teraflops if you keep delivering bugs.

Reading the other thread, it would seem that the 4.97 app worked fine with the WUs it had been given. It was then released. It was not until a different set of WUs hit that code that the problems first appeared, both in RALPH and, sadly, in the production project.

It is quite possible the new WUs hit a code path that had not been run before. These things happen in the best software. Testing for absolutely every eventuality tends to add serious delays, and is really only justifiable in safety-critical applications, which this is not.

We are here to help these guys with their science. If the new science app delivers better results, then we all win! I'm sure they'll fix this quickly.

The suggestion to roll out application changes early in the week is a decent idea though.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 13327 · Rating: 0
IceQueen41
Joined: 24 Jan 06
Posts: 1
Credit: 65,113
RAC: 0
Message 13328 - Posted: 9 Apr 2006, 16:07:35 UTC

Not so sure that everything is working with 4.98... I've got 2 WUs going (both of the "7449_largescale..." type) that have been going for about an hour and a half, and are still only at 1.14% and 1.40% (my WU time is set to 2 hours). At this rate they won't finish even in a week. Anyone else having these problems or have any idea what's going on with these?
ID: 13328 · Rating: 0
Buffalo Bill
Joined: 25 Mar 06
Posts: 71
Credit: 1,630,458
RAC: 0
Message 13334 - Posted: 9 Apr 2006, 17:05:27 UTC
Last modified: 9 Apr 2006, 17:36:46 UTC

I'm running one of those too. The protein is rather large. I believe that regardless of the time you have set for your target cpu time, it will complete one full model before it uploads. This seems to be a relax only model. I don't know why but hey, I don't have a PhD in microbiology either. :)

Edit: The above post by Moderator9 is exactly why I will be staying with this project. Stuff happens with this kind of research and it's "all about the science". A little instability and a few lost credits are nothing compared to the big picture here.
ID: 13334 · Rating: 0
Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 13335 - Posted: 9 Apr 2006, 17:05:36 UTC - in response to Message 13328.  
Last modified: 9 Apr 2006, 17:08:17 UTC

Not so sure that everything is working with 4.98... I've got 2 WUs going (both of the "7449_largescale..." type) that have been going for about an hour and a half, and are still only at 1.14% and 1.40% (my WU time is set to 2 hours). At this rate they won't finish even in a week. Anyone else having these problems or have any idea what's going on with these?


A large number of the errors are work unit related. As a result, the application release will fix a lot of the issues, but there will be some time required for everything to settle out. David Kim is working on the problem, and I would expect a statement from Dr. Baker on Monday with more details.

The application was very stable in Ralph for a number of the original bug issues, and that is why they released it to the production environment. For some reason the problems have not affected all machines equally. For instance, Mac OS is not having any real problems, and the majority of Windows machines are working with some increase in error rate. The problem seems to be a mixed bag of issues with the new work unit types, and some issue with the application for particular systems.

This kind of problem is why what Rosetta is trying to achieve has not been done before. Many BOINC projects are quite stable because the nature of what they are doing is well established, understood, and remains the same across ALL of the work they do. Rosetta is not like that. This is a true research project, where everything from the approach to the work, to the actual work itself, and the design of the application is changing to accommodate new concepts and theories. While there are other protein research projects, the entire approach at Rosetta is different. Rosetta is trying to model whole proteins. The simple ones work fine, but the complex ones are tricky, and that is where the problems come in. Last year's CASP competition showed that Rosetta is on the right track. But there will always be issues that arise in pure research such as this.

Thanks to those of you who contacted the project directly through the moderator e-mail, the project team was able to jump on this and implement a repair.

Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 13335 · Rating: 0
BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 13359 - Posted: 9 Apr 2006, 20:42:57 UTC

Moderator9: Last year's Casp

CASP happens every 2 years. The last one finished in Oct of 2004. The results were released in December. Then they give the researchers a year to work on improvements, and they hold another competition. The DC project that I was involved in during CASP 5 and CASP 6 has been shut down since Oct 2004 while they work on improved energy scoring functions.

And after all the HBLR failures on Windows client 4.97, I picked up HB_BARCODE_30_1aiu__351_20403_1 and it's worked fine for the last 19ish hours. So I haven't been upgraded to 4.98 (4.83) yet.

ID: 13359 · Rating: 0
Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 13361 - Posted: 9 Apr 2006, 21:00:50 UTC - in response to Message 13359.  

Moderator9: Last year's Casp

CASP happens every 2 years. The last one finished in Oct of 2004. The results were released in December. Then they give the researchers a year to work on improvements, and they hold another competition. The DC project that I was involved in during CASP 5 and CASP 6 has been shut down since Oct 2004 while they work on improved energy scoring functions.

And after all the HBLR failures on Windows client 4.97, I picked up HB_BARCODE_30_1aiu__351_20403_1 and it's worked fine for the last 19ish hours. So I haven't been upgraded to 4.98 (4.83) yet.

Not my first typo of the day. You are correct. I meant to say "the last CASP". Sorry.
Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 13361 · Rating: 0
Bin Qian
Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 13367 - Posted: 9 Apr 2006, 22:28:58 UTC - in response to Message 13334.  

I'm running one of those too. The protein is rather large. I believe that regardless of the time you have set for your target cpu time, it will complete one full model before it uploads. This seems to be a relax only model. I don't know why but hey, I don't have a PhD in microbiology either. :)


You are absolutely right - these 7447_largescale_** jobs are relax only jobs of some relatively larger proteins. Since these proteins are larger, each job will take longer to finish. According to our current statistics, the average CPU time to finish such a job can be anywhere from 2 to 4 hours.
ID: 13367 · Rating: 0
ecafkid
Joined: 5 Oct 05
Posts: 40
Credit: 15,177,319
RAC: 0
Message 13389 - Posted: 10 Apr 2006, 13:18:09 UTC

4/9/2006 10:03:52 PM|rosetta@home|Unrecoverable error for result HBLR_1.0_1di2_425_4170_0 ( - exit code -1073741819 (0xc0000005))
4/10/2006 12:42:42 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_2reb_426_3929_0 ( - exit code -1073741819 (0xc0000005))
These 2 errored on 4.97. I have graphics turned off and "leave in memory" on. This is the only DC project I run. Since turning off graphics, these are the first errors I have encountered.

Ecaf
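
For anyone decoding log lines like the two above: the negative exit code and the hex value in parentheses are the same 32-bit number, and 0xC0000005 is the Windows NTSTATUS code for an access violation. A minimal Python sketch of the conversion follows. The helper name `to_unsigned32` and the tiny lookup table are made up for illustration and are not part of BOINC or Rosetta.

```python
def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return exit_code & 0xFFFFFFFF


# Deliberately tiny table: only the status code seen in the log above.
KNOWN_STATUS = {0xC0000005: "STATUS_ACCESS_VIOLATION"}

code = to_unsigned32(-1073741819)
print(hex(code), KNOWN_STATUS.get(code, "unknown"))
```

The mask works because Windows reports the status as an unsigned 32-bit value, while the client log prints it as a signed integer.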



ID: 13389 · Rating: 0
Jeff Gilchrist
Joined: 7 Oct 05
Posts: 33
Credit: 2,398,990
RAC: 0
Message 13390 - Posted: 10 Apr 2006, 13:33:55 UTC - in response to Message 13359.  

The DC project that I was involved in during CASP 5 and CASP 6 has been shut down since Oct 2004 while they work on improved energy scoring functions.


Which one is that, distributed folding? I'm not sure if they are ever coming back...

ID: 13390 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org