Problems with Rosetta version 5.59

Rhiju
Volunteer moderator

Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 38870 - Posted: 2 Apr 2007, 20:09:48 UTC
Last modified: 3 Apr 2007, 0:49:07 UTC

Please post here about issues with 5.59. For information on what's changed, check out this thread!
Keith Akins

Joined: 22 Oct 05
Posts: 176
Credit: 71,779
RAC: 0
Message 38874 - Posted: 2 Apr 2007, 20:22:35 UTC

My problem is not any kind of error, but a steady drop in granted credit over the past two weeks, along with my five-month RAC falling from 225+ to 206.

I've heard that these even out. However, mine appears to be a steady, consistent drop.

There have been code efficiency problems in the past, and I'm wondering if any inefficiencies exist in 5.5X.

My system runs 24/7 except for a periodic system check every two weeks (approx. 5 to 10 minutes).

I don't have any garbage running as I check XP's task manager regularly. This only started about two weeks ago.
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 38878 - Posted: 2 Apr 2007, 20:44:25 UTC
Last modified: 2 Apr 2007, 20:45:40 UTC

Rhiju, the announcement on the home page of v5.59 received a date of March 20. So the date is wrong, just like the last time. Wonder what's up with that?

Oh, and could you add a link to your description of v5.59 to this "problems with...5.59" thread?
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
Rhiju
Volunteer moderator

Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 38900 - Posted: 3 Apr 2007, 0:51:26 UTC - in response to Message 38878.  

Well, the wrong date is just silliness; I typed it in wrong!

I'll talk to other team developers about the drop in credit, Keith; based on your post, it can't be inefficiencies in preempting/resuming, so we'll look for inefficiencies in the code as you suggest.

Keith Akins

Joined: 22 Oct 05
Posts: 176
Credit: 71,779
RAC: 0
Message 38920 - Posted: 3 Apr 2007, 8:38:09 UTC

OK. I've done a complete re-install of my system and the CPU is now running a consistent 99% in Task Manager. Before, I had some unusual background process activity eating as much as 4%.

Give me a few days to see if my RAC comes back around.

Talk about Spring Cleaning!
niko

Joined: 1 Apr 07
Posts: 3
Credit: 22,789
RAC: 0
Message 38934 - Posted: 3 Apr 2007, 15:37:05 UTC

I have no more "process *** not found" problems with the latest version on my Macs!
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 38948 - Posted: 3 Apr 2007, 22:54:37 UTC
Last modified: 3 Apr 2007, 22:56:02 UTC

Break free of the (side) chains that bind you!!

:)

Just a nit. I'm noticing that there is about a 5-minute period mid-model where no visible change to the graphic occurs. On my workunits called:

s029__BOINC_SYMM_FOLD_AND_DOCK_RELAX-s029_-truncate_hom014__1638_1775_0
This seems to hit around step 70,000. Just prior to that, the last redraw seems to get bad data, and you will see that many of the sidechains appear not to be connected to anything. If you get curious (as I always do) you rotate things around and find... well... yeah, THAT would NOT be connected to ANYTHING! Here's a link to my screenshot. Note that the first box seems to retain the sidechains intact.

The very next step brings everything back into line... but it takes more than 5 minutes to reach that next step at this point in the model. So, people have a much higher than average chance of catching this malformed frame when they pop in to admire their beautiful protein and RNA structures.
Ai-Leng

Joined: 14 Oct 06
Posts: 8
Credit: 4,715
RAC: 0
Message 38953 - Posted: 4 Apr 2007, 0:51:02 UTC

Well, I was one of the small number of people on a Mac with issues with v5.54.

Now, something odd has occurred. BOINC was crunching away on s029__BOINC_SYMM_FOLD_AND_DOCK_RELAX-s029_-truncate_hom014__1638_3094_0 and everything was going well; it had reached a progress of 95%. The next unit for Rosetta had even been downloaded, ready to start once this one had finished.

It was at this point that I switched my Powerbook off to head to work.

When I started my Powerbook up again, I noticed that the progress of the same work unit was 0% and the processing had restarted.

I didn't see any error messages when I switched my Mac back on.
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 38954 - Posted: 4 Apr 2007, 1:23:56 UTC
Last modified: 4 Apr 2007, 1:24:16 UTC

A problem with my first unit with the new app: when I shut down for the night it had 4 hrs 36 min at 46% for a 10 hr runtime. When I started up this morning the time was the same, but the % had gone back to 0%. Now it's at 6 hrs 36 min and showing 37.3% complete. I've never had that before.

P.S. 10hr runtime.

Keith Akins

Joined: 22 Oct 05
Posts: 176
Credit: 71,779
RAC: 0
Message 38955 - Posted: 4 Apr 2007, 2:54:54 UTC

Feet1st, I've noticed that too, along with the first three docking models completing within the first hour, while models from four on take from 45 minutes to over an hour.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 38956 - Posted: 4 Apr 2007, 3:30:03 UTC
Last modified: 4 Apr 2007, 3:50:07 UTC

Mags, if you happen to have a short runtime preference, and your crunching did not complete the first model, then what happened is just that the task never reached a checkpoint and the work up to that point was lost. This is why improved checkpointing is on the list for inclusion in the next release.

[edit] The other possibility is that you are seeing the 0% complete issue, even though the graphic shows models crunched and the total CPU time. I'm not positive whether this issue, which was observed on Ralph, still exists in the 5.59 version or not. This may be what Rhiju was referring to when he said sometimes at the beginning of the model the estimate can be "a little off". Even if that issue still exists, the tasks completed OK. It was just a bit confusing to monitor the % complete after powering down like that. You will find it increases quickly and will complete the task at the expected time.

Basically, if a task is restarted, work begins from the last checkpoint, or the last completed model. And it begins with the amount of CPU time you had at the time of the model completion or checkpoint... but for some reason, BOINC is not reporting that total CPU time to Rosetta. It is only reporting the time since this restart. So anyway, this is a kink that needs to be worked out. But it's an issue that the Project Team is already aware of.
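
To make the restart arithmetic concrete, here is a minimal Python sketch. It is not Rosetta or BOINC code: the names `Checkpoint`, `total_cpu_seconds` and `percent_complete` are hypothetical, and it assumes that % complete is simply CPU time divided by the runtime preference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Checkpoint:
    """State saved at a checkpoint or model end (hypothetical layout)."""
    models_completed: int   # models finished when the checkpoint was written
    cpu_seconds: float      # CPU time accumulated up to the checkpoint


def total_cpu_seconds(checkpoint: Optional[Checkpoint],
                      seconds_since_restart: float) -> float:
    """CPU time a restarted task should report.

    The saved CPU time is added to the time since the restart. If only
    the second term is used, the total appears to start again from zero.
    """
    saved = checkpoint.cpu_seconds if checkpoint is not None else 0.0
    return saved + seconds_since_restart


def percent_complete(cpu_seconds: float,
                     runtime_preference_seconds: float) -> float:
    """Assumed progress formula: CPU time as a share of the runtime preference."""
    return min(100.0, 100.0 * cpu_seconds / runtime_preference_seconds)
```

With a 10 hour preference (36,000 s) and a checkpoint holding 13,428 s of CPU time, a restart should show about 37.3% straight away. If the saved time is dropped and only the time since restart is counted, the same task shows 0% even though no checkpointed work was lost.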
Rosetta Moderator: Mod.Sense
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 38957 - Posted: 4 Apr 2007, 3:35:53 UTC

Peter, this is how it worked all along. The difference now is the improved indication of the % completed. I mean, what you would have observed before this would have been that you were crunching at 37.3% for the short period of time just before you shut down, and then when you powered back on you were back at 37.3%. But now... even though everything else is running the same, you see that progress indicator tick up from 37.3% through to 46% before you power down.

Some work is always lost when BOINC shuts down, or when the Rosetta application is otherwise removed from memory. With the planned improved checkpointing, you will still lose work, but significantly less will be lost. It is simply the way computers work. If you want to preserve any given piece of information, you must write it to disk. If you write to disk all the time, from an application that runs on your machine 24x7, or at least all the time the machine is powered on, then you will be using the disk drive too much. Over time that's not a good idea. The tradeoff for sparing the disk drive is that some work is lost. Sometimes more than an hour can be lost. This is why improved checkpointing is important.
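
The tradeoff can be put into rough numbers. The sketch below is only illustrative: `checkpoint_tradeoff` is a hypothetical helper, it assumes one disk write per checkpoint at a fixed interval, and it assumes a shutdown is equally likely at any moment between two checkpoints.

```python
def checkpoint_tradeoff(interval_minutes: float, hours_per_day: float = 24.0):
    """Return (disk writes per day, worst-case minutes lost, average minutes lost).

    Assumes a fixed checkpoint interval and a shutdown that is equally
    likely at any point between two checkpoints.
    """
    writes_per_day = hours_per_day * 60.0 / interval_minutes
    worst_case_lost = interval_minutes        # shutdown just before the next checkpoint
    average_lost = interval_minutes / 2.0     # shutdown at a uniformly random moment
    return writes_per_day, worst_case_lost, average_lost
```

Under these assumptions, checkpointing once an hour means 24 writes a day and up to an hour of lost work, while checkpointing every 5 minutes means 288 writes a day and at most 5 minutes lost.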
Rhiju
Volunteer moderator

Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 38958 - Posted: 4 Apr 2007, 5:11:05 UTC - in response to Message 38948.  
Last modified: 4 Apr 2007, 5:11:56 UTC

Thanks Mod.Sense for your explanations in the previous posts! David K. and I are working on checkpointing. So many users have expressed concerns about shutting off and starting BOINC -- I'm thinking about ways for Rosetta to be more independent of the CPU run time estimate provided by BOINC. During checkpoints, for example, we can record the time so far in the run.

Feet1st, that's a hilarious graphic! I noticed the same thing on test runs here, though it wouldn't freeze for 5 minutes ... just a few seconds. I wonder what's causing the craziness to continue for so long! Anyway, I'll look into fixing it -- didn't have time before due to all the urgent problems last week. If it's any consolation, those FOLD_AND_DOCK workunits are returning some pretty amazing results (which I'll probably start posting next week -- it's been a while since we've seen some "Top Predictions", huh?).
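
One way the idea of recording the time so far at each checkpoint could look is sketched below in Python. This is an assumption-laden illustration, not the planned Rosetta implementation: the JSON layout, the field names and the functions `write_checkpoint` and `read_checkpoint` are all hypothetical.

```python
import json
import os
import tempfile


def write_checkpoint(path, model_number, cpu_seconds):
    """Record progress, including CPU time so far (hypothetical file format)."""
    state = {"model_number": model_number, "cpu_seconds": cpu_seconds}
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first, then rename it over the old one,
    # so a shutdown mid-write never leaves a half-written checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def read_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {"model_number": 0, "cpu_seconds": 0.0}
```

On restart, the application would read `cpu_seconds` from its own file and add the time since restart to it, rather than relying on the figure reported by the client.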

Wow, things are quiet on this thread. I'll take that as a good sign.


Ai-Leng

Joined: 14 Oct 06
Posts: 8
Credit: 4,715
RAC: 0
Message 38970 - Posted: 4 Apr 2007, 11:04:22 UTC - in response to Message 38956.  

The thing is, the total CPU time also resets to zero and continues to crunch. This same work unit is still being worked on as it hasn't been completed. I switched my Powerbook off again to come home from work and also during a fire drill at work. Will be leaving it on all night tonight for completion and reporting. Not sure if this additional information helps you at all.
jimbreed

Joined: 7 May 06
Posts: 1
Credit: 90,298
RAC: 0
Message 38975 - Posted: 4 Apr 2007, 12:08:49 UTC
Last modified: 4 Apr 2007, 12:11:28 UTC

This morning I was looking at the progress on a SYMM_FOLD_AND_DOCK_RELAX work unit that had been running for almost 6 hours and was only on the second model. I have an 8 hour preference. (I have a 1.6GHz Pentium 4 running XP-Home.) I clicked in the Low Energy pane to rotate the model and when I moved the mouse, the graphics window disappeared, no error messages, no sign of completion, just poof, it was gone. Boinc downloaded another work unit and started processing.

The only message I got in Boinc was:
4/4/2007 6:52:45 AM|rosetta@home|Computation for task s029__BOINC_SYMM_FOLD_AND_DOCK_RELAX-s029_-truncate_hom005__1638_6637_0 finished


The result is 71154279. (Edited to correct the result id.)

I never saw any graphics problems with earlier versions of Rosetta.
anders n

Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 38976 - Posted: 4 Apr 2007, 12:20:07 UTC - in response to Message 38975.  

If the WU is finished, the graphics window shuts down by itself.
Anders n

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 38980 - Posted: 4 Apr 2007, 13:26:21 UTC

Mags, since the CPU time shown also dropped to zero, it means your work did not reach a checkpoint or a model end, i.e. the first case I was talking about. I've now confirmed on my machine that the CPU % upon restart drops back to zero, even when models have been completed.

Jimbreed, since these models take considerable time to complete, Rosetta is only able to get you within about 90 minutes of your runtime preference. To begin another model at that point would take you past your preference. And so I suspect it was not your attempt to manipulate the graphics, but rather just the model reaching the end, that caused it to take down the graphic window. But the % complete is really just based on your CPU runtime preference, so the % complete was not aware we were going to be ending a bit early on this one. It looks like the task reported normally. You say there was no sign of completion. Not sure what you were expecting, but the message you posted is the normal sign of completion.
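
The stopping behaviour described here can be sketched as follows. This is a guess at the general shape only: `should_start_next_model` is a hypothetical function, and estimating the next model's cost from the average of the completed models is an assumption, not Rosetta's documented logic.

```python
def should_start_next_model(cpu_seconds_used, models_completed,
                            runtime_preference_seconds):
    """Decide whether one more model fits inside the runtime preference.

    Assumption: the next model is expected to cost about as much CPU time
    as the average of the models completed so far.
    """
    if models_completed == 0:
        return True  # nothing to base an estimate on yet, so always try one model
    average_model_seconds = cpu_seconds_used / models_completed
    return cpu_seconds_used + average_model_seconds <= runtime_preference_seconds
```

For example, with an 8 hour preference and two models finished after 6 hours, a third model would be expected to end around the 9 hour mark, so the task finishes early. A progress bar computed as CPU time over the preference would read about 75% at that moment, which is why the window can close well short of 100%.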
Tim Kunz

Joined: 27 Dec 05
Posts: 9
Credit: 1,037,399
RAC: 1,713
Message 38985 - Posted: 4 Apr 2007, 15:07:10 UTC
Last modified: 4 Apr 2007, 15:34:13 UTC

I just had to reboot my computer for Windows updates... a computation that was over 95% complete apparently was not saved... it reset and is recomputing from the start.

And another PC that was shut down normally and rebooted twice apparently restarted its task from zero each time also. (These were earlier in their computations... < 20%).

This appears not to be checkpointing.
Keith Akins

Joined: 22 Oct 05
Posts: 176
Credit: 71,779
RAC: 0
Message 38994 - Posted: 4 Apr 2007, 19:41:02 UTC

Tim, I've noticed that too. Did your CPU time reset or did it remain the same?

% complete by itself will not affect work completed that has been checkpointed. If this happens again, double check your CPU time and model number. That will tell the story.
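
As a rough illustration of that check, the Python sketch below sorts a restart into the cases discussed in this thread. The function name `diagnose_restart` and its messages are hypothetical; the inputs are just the numbers read off the BOINC task display before shutdown and right after restart.

```python
def diagnose_restart(cpu_seconds_before, cpu_seconds_after, percent_after):
    """Classify a restart from the values shown before and after it.

    cpu_seconds_before: CPU time shown just before shutdown
    cpu_seconds_after:  CPU time shown right after restart
    percent_after:      % complete shown right after restart
    """
    if cpu_seconds_before > 0 and cpu_seconds_after == 0:
        # CPU time itself is gone: no checkpoint or model end was reached.
        return "work lost: no checkpoint was reached"
    if percent_after == 0:
        # CPU time survived but the progress figure did not.
        return "display issue: checkpointed work kept, % complete reset"
    return "resumed from checkpoint"
```

A reset CPU time points to lost work, while a reset % complete with the CPU time still present points to the display issue only.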
Tim Kunz

Joined: 27 Dec 05
Posts: 9
Credit: 1,037,399
RAC: 1,713
Message 38996 - Posted: 4 Apr 2007, 20:08:08 UTC - in response to Message 38994.  

The CPU time reset also....complete restart.

I'm allowing completion of current computations and redirecting CPUs to other projects until this is resolved.




©2024 University of Washington
https://www.bakerlab.org