Hanging Rosetta??? (sorry for the crosspost)

Message boards : Number crunching : Hanging Rosetta??? (sorry for the crosspost)

ephman

Joined: 21 Dec 05
Posts: 4
Credit: 1,410,336
RAC: 0
Message 7521 - Posted: 24 Dec 2005, 14:03:39 UTC

hi,

i'm running the latest linux version of boinc on a pretty quick machine. what i'm noticing is that when it's about 20% done with a rosetta unit, my cpu stops running at 100% and goes back down to normal levels. i've tried a couple of different units and the same thing happens. is this normal? any ideas how i can fix this?

thanks for the bandwidth,
ephman
ID: 7521 · Rating: 0
River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7523 - Posted: 24 Dec 2005, 14:31:59 UTC - in response to Message 7521.  
Last modified: 24 Dec 2005, 14:37:58 UTC

...when it's about 20% done with a rosetta unit, my cpu stops running at 100% and goes back down to normal levels...


for a few seconds? a few minutes? half an hour? or stays like that till the end of the job?

If it is just for a few minutes, this is what I think is happening, and I've seen it on my slow Linux box too. If it never recovers, then after maybe an hour I'd be feeling like aborting that WU and trying my luck with another.

If I am right about what is happening, the short answer is that there is nothing you can do about it. Now for the long answer:


What happens at each of the 'round number' steps in Rosetta's progress is that it is changing from analysing one part of the job to go on to do the next. It is almost like starting Rosetta again. It will be calling on parts of the program that have not been used since startup, or since the start of the previous stage. These may have been swapped out of RAM into virtual memory. (Cunningly, program code doesn't go into the swap file; it is 'swapped' back in from its original file on disk (the executable or shared library, or DLL on Windows), as the code cannot have changed.)

In addition, Rosetta may need new data from the data files. These may be in the cache, but parts of a file that have not been accessed before are more likely to be still on the hard disk, and have to be read from there.

All of this means that Rosetta is waiting for the virtual memory manager and the disk cache manager to figure out what to do, and then waiting for the hard disk to actually do it. While the process is disk-limited the CPU usage drops to normal levels as you have observed.
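
One way to check whether a quiet spell really is the process waiting on the disk, on Linux, is to look at the state field in /proc/<pid>/stat. This is only a rough sketch: the state letters are the kernel's, but the helper names are made up for illustration, and you would need to find the PID of the Rosetta process yourself (for example with ps or top).

```python
from pathlib import Path

# Single-letter process states as reported by the Linux kernel.
STATE_NAMES = {
    "R": "running (using CPU)",
    "S": "sleeping (waiting for an event)",
    "D": "uninterruptible sleep (usually waiting on disk I/O)",
    "T": "stopped",
    "Z": "zombie",
}


def parse_state(stat_line: str) -> str:
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split after the LAST ')' rather than on whitespace.
    rest = stat_line[stat_line.rfind(")") + 1:].split()
    return rest[0]


def describe_process(pid: int) -> str:
    """Read /proc/<pid>/stat (Linux only) and describe the process state."""
    state = parse_state(Path(f"/proc/{pid}/stat").read_text())
    return STATE_NAMES.get(state, f"other state {state!r}")
```

Calling describe_process(pid) a few times during a quiet spell and seeing mostly "D" would be consistent with the disk-limited explanation; seeing mostly "S" would suggest the process is waiting on something else.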

River~~

edit: PS, welcome to Rosetta and congrats on getting your first few credits!
ID: 7523 · Rating: 0
ephman

Joined: 21 Dec 05
Posts: 4
Credit: 1,410,336
RAC: 0
Message 7528 - Posted: 24 Dec 2005, 15:34:31 UTC - in response to Message 7523.  

hi,

i have a 2.4ghz linux (2.6.12-10-686) box, it's not too slow. the program seems to hang for more than just a few minutes but i've never timed it. i'm going to try and be patient and let it go longer, maybe a few hours then. basically you're telling me that there is nothing i can do but wait it out, right?

thanks
ephman


ID: 7528 · Rating: 0
N7QLT

Joined: 19 Dec 05
Posts: 2
Credit: 3,753,965
RAC: 0
Message 7531 - Posted: 24 Dec 2005, 17:03:40 UTC

I am brand new here, so advice from long-time users is probably more pertinent. On my Linux box I am running 2 projects. Rosetta was not switching back and forth gracefully. It seemed to hang as you are describing. I found a hint here somewhere that recommended setting the "Leave in memory" setting to YES.

FWIW,

Gene
ID: 7531 · Rating: 0
ephman

Joined: 21 Dec 05
Posts: 4
Credit: 1,410,336
RAC: 0
Message 7535 - Posted: 24 Dec 2005, 17:57:51 UTC - in response to Message 7531.  

i'm very new to this, and i searched for the answer for this question, but how do i actually make that setting change?

thanks
ephman

ID: 7535 · Rating: 0
River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7538 - Posted: 24 Dec 2005, 18:44:01 UTC - in response to Message 7535.  

... I found a hint here somewhere that recommended setting the "Leave in memory" setting to YES.


Thanks Gene you are absolutely right. Sometimes being new is a great help as you've just grappled with the same issue yourself.

ephman - sorry about my previous post, it was a complete red herring :-(

To set this pref, from this page click on the following links

[ home ]
My Account
Edit/View general preferences
Edit preferences

then select YES for the appropriate option, and click save (or is it update?)

THEN, go to your BOINC Manager, choose the Projects tab, highlight Rosetta, and click the Update button.

You may still see a minute or three when Rosetta goes quiet, but it should come back to 100% fairly quickly each time - a few minutes, no longer.
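
For anyone who prefers to set this locally rather than on the website: as far as I know, later BOINC clients also read a local override file, global_prefs_override.xml, from the BOINC data directory. The sketch below only builds the XML text for such a file. Treat the element name leave_apps_in_memory and the file name as my understanding of the client's conventions, not something confirmed in this thread, and check them against your own client version. The function name build_override is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_override(leave_in_memory: bool = True) -> str:
    """Build the text of a preferences override with one setting.

    Assumption: the client uses a <global_preferences> root element and a
    <leave_apps_in_memory> flag holding 1 (yes) or 0 (no).
    """
    root = ET.Element("global_preferences")
    flag = ET.SubElement(root, "leave_apps_in_memory")
    flag.text = "1" if leave_in_memory else "0"
    return ET.tostring(root, encoding="unicode")


# Show what would go into the file; writing it out is left to the reader
# because the data directory location differs between installations.
print(build_override())
```

After saving the output into the data directory, the client has to re-read its preferences (restarting the client is the safe way) before the change takes effect.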

River~~
ID: 7538 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 7574 - Posted: 25 Dec 2005, 4:18:19 UTC

I am not sure if it was clear from the earlier posts.

If you have the power save option turned on for the disk drives they WILL spin down. Then, when it is time to checkpoint, the program is going to have to wait for the drives to "spin-up" which takes about 30 seconds.

One of the current issues with Rosetta@Home is that it only checkpoints at the 10/20/etc. % points. So, either live with the pauses, or leave the drive spun up.

Programs like SETI@Home checkpoint at very frequent intervals so they will not let the drives spin down. Programs like Rosetta@Home and CPDN checkpoint less frequently, CPDN because the checkpoint file is so large, Rosetta@Home, um, not sure why...

But, this IS one of the things on the developers list of things to do ...
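
To put a rough number on the cost of those pauses, here is a back-of-the-envelope sketch. Only the 30 second spin-up figure comes from this post; the nine checkpoints (one each at 10% through 90%), the time between checkpoints, and the drive's spin-down timeout are all assumptions you would replace with your own values. The function name is made up for illustration.

```python
def checkpoint_stall_seconds(num_checkpoints: int,
                             spin_up_s: float,
                             interval_s: float,
                             spindown_timeout_s: float) -> float:
    """Estimate total seconds a work unit spends waiting for drive spin-up.

    The drive is assumed to spin down only if the gap between checkpoints
    is longer than its idle timeout, and nothing else touches the disk.
    """
    if interval_s <= spindown_timeout_s:
        # Checkpoints arrive often enough to keep the drive spinning.
        return 0.0
    return num_checkpoints * spin_up_s


# Assumed example: 9 checkpoints, 30 minutes apart, 10 minute idle timeout.
print(checkpoint_stall_seconds(9, 30, 30 * 60, 10 * 60))  # 270.0
# Frequent checkpoints (every minute) never let the drive spin down.
print(checkpoint_stall_seconds(9, 30, 60, 10 * 60))       # 0.0
```

Under those assumed numbers the spin-up waits add up to a few minutes per work unit, which is small next to a multi-hour run.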
ID: 7574 · Rating: 0




©2024 University of Washington
https://www.bakerlab.org