Rosetta work units freeze up on Mac OS 10.5.2 dual core machine

hrebml

Joined: 17 Feb 07
Posts: 5
Credit: 50,521
RAC: 0
Message 52164 - Posted: 28 Mar 2008, 15:24:05 UTC

We've just started using a MacBook Pro running OS X 10.5.2 with a 2.4 GHz Intel Core 2 Duo processor, and BOINC appears to be having trouble processing Rosetta work units. Right now we have several WUs ready to start and two marked "running, high priority", but only one of these is showing any activity, and it's been that way for 3-4 hours. This has been going on since we set up the machine at the start of the week. Even when there are no other work units available for the project, a WU that has frozen doesn't seem to restart. The third time we had this trouble, I suspended all the other Rosetta work units, thinking this might force the stalled WU to start again, but instead the WUs that had been suspended all froze up after they were resumed. Most of the problematic WUs have been Rosetta Beta 5.95; I think one or two were 5.96.

We're also processing for another project, but haven't noticed the same problem there.

Any thoughts on what might be going on here?
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 52165 - Posted: 28 Mar 2008, 17:00:43 UTC

...one of these is showing any activity...


...and two of your three machines only show one CPU available. BOINC only runs one task per CPU. And you can configure the maximum number of CPUs you wish for BOINC to use as well.
Rosetta Moderator: Mod.Sense
hrebml

Joined: 17 Feb 07
Posts: 5
Credit: 50,521
RAC: 0
Message 52167 - Posted: 28 Mar 2008, 20:43:07 UTC - in response to Message 52165.  

Thanks for your response. I'm still not clear so forgive me for adding some more details.

My preference for machines with multiple CPUs is to use at most two CPUs. I get that the maximum number of CPUs available for use is not a minimum, and that sometimes only one will be running BOINC projects. But several times in the past week, nothing was running on the dual core machine except BOINC, and a WU for another project was being processed while a Rosetta WU marked "running, high priority" was not running.

Further, when WUs for the other project cycled off, another WU for Rosetta would start up rather than the stalled one marked as "running, high priority", which remained locked up.

This seems like an unusual use of the terms "running" and "high priority", but I'm sure there's something else going on that I don't understand.

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 52168 - Posted: 29 Mar 2008, 0:22:59 UTC

Oh, thanks for clarifying. I wasn't clear how you were defining "running" and "freeze up".

I believe what you are saying is that the status shown in the BOINC manager is not matching up with what you are seeing in the Mac equivalent of the Windows task manager.

...and so on your dual core machine, we'd expect to see BOINC running two tasks, perhaps from different projects, at a time. But you are seeing that two show a status of running in the BOINC manager, and yet only one is actually getting any CPU time. Have I got it now?

What version of BOINC are you running? It looks like 5.10.45 on that machine?
Rosetta Moderator: Mod.Sense
hrebml

Joined: 17 Feb 07
Posts: 5
Credit: 50,521
RAC: 0
Message 52170 - Posted: 29 Mar 2008, 11:31:38 UTC - in response to Message 52168.  

Yes, that machine is running BOINC 5.10.45, and yes, I'm basing my question on what I'm seeing in the BOINC Manager. Before I went to sleep last night, there was a WU for the other project running, a WU for Rosetta running, and a second Rosetta WU with a status of "waiting to run". The Rosetta WU that was "waiting to run" has been stuck at the same CPU time and percentage of progress since some time yesterday morning, while its status has been variously listed as "running", "running, high priority" or "waiting to run". During that time, other WUs, both for Rosetta and for my other project, have been listed as "running" and have shown changes in CPU time and percentage of progress, while the "stuck" WU has shown no change in either, regardless of the status shown.

This morning the BOINC Manager on this machine shows two WUs completed: one for Rosetta (the one that had been "running" last night) and one for the other project. There is a WU "running" for the other project. The Rosetta WU that had been "waiting to run" last night is now listed as "running, high priority", but it has the exact same CPU time and percentage of progress that it has shown for about the last 18 hours.

Is this any more clear?
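
The symptom described in this post (a task listed as running whose CPU time never advances) can be written down as a simple check. The sketch below is only an illustration in Python: `TaskSnapshot`, `find_stuck_tasks`, and the task names are hypothetical, the values would have to be read off the BOINC Manager by hand, and nothing here calls any BOINC API.

```python
from dataclasses import dataclass


@dataclass
class TaskSnapshot:
    """One row of the task list, noted down by hand (hypothetical structure)."""
    name: str
    status: str          # status text as displayed, e.g. "running, high priority"
    cpu_seconds: float   # CPU time displayed at the moment of the snapshot


def find_stuck_tasks(earlier, later):
    """Return names of tasks shown as running in both snapshots
    whose CPU time did not advance between them."""
    before = {task.name: task for task in earlier}
    stuck = []
    for task in later:
        old = before.get(task.name)
        if old is None:
            continue  # task was not present in the earlier snapshot
        running_both = (old.status.startswith("running")
                        and task.status.startswith("running"))
        if running_both and task.cpu_seconds <= old.cpu_seconds:
            stuck.append(task.name)
    return stuck


# Made-up example values: one Rosetta task frozen, one other task progressing.
earlier = [TaskSnapshot("rosetta_a", "running, high priority", 3600.0),
           TaskSnapshot("other_b", "running", 1200.0)]
later = [TaskSnapshot("rosetta_a", "running, high priority", 3600.0),
         TaskSnapshot("other_b", "running", 1800.0)]
print(find_stuck_tasks(earlier, later))
```

A task that is "waiting to run" in either snapshot is deliberately not flagged, since no CPU time is expected for it.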
Snags

Joined: 22 Feb 07
Posts: 198
Credit: 2,888,320
RAC: 0
Message 52173 - Posted: 29 Mar 2008, 19:09:33 UTC - in response to Message 52170.  

If that wu is still stuck you might try stopping and restarting boinc completely instead of just suspending the one task. That worked for me when I had a similar situation, enabling the stuck wu to complete successfully. I would then post about the experience in the appropriate "Problems with x" thread to be sure the project staff gets the report. I wish I had an answer for why some tasks get stuck, though, with my limited knowledge, I suspect the answer would be beyond my understanding ;)

Snags


TCU Computer Science

Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 52174 - Posted: 29 Mar 2008, 20:59:31 UTC - in response to Message 52170.  

There is a WU "running" for the other project & the Rosetta WU that had been "waiting to run" last night while the Rosetta WU that's now completed was "running" is now listed as "running, high priority" but has the exact same CPU time and percentage of progress that it has shown for about the last 18 hours.


I run BOINC on 60+ computers. Most of them are Linux, some are Windows XP, and a few are Mac OS. I have seen the problem that you describe occur frequently on Mac OS, less often on Linux and rarely on the Windows machines. It is probably a bug in the BOINC Manager. But in my experience Rosetta seems to trigger the problem much more frequently than other projects that I run, more often on Mac OS than other platforms, and more often when there are multiple projects running on the machine.
hrebml

Joined: 17 Feb 07
Posts: 5
Credit: 50,521
RAC: 0
Message 52185 - Posted: 30 Mar 2008, 11:51:29 UTC - in response to Message 52173.  

Doh! Thanks for the suggestion. I checked this board late last night and tried this - stopping and restarting the BOINC manager seems to have worked just fine.

Because the problem was persistent but intermittent, I'll try this again if work units freeze up again and report back if this method fails to solve the problem consistently.
Neil

Joined: 7 Mar 07
Posts: 25
Credit: 135,539
RAC: 0
Message 52486 - Posted: 15 Apr 2008, 18:24:29 UTC - in response to Message 52185.  

- stopping and restarting the BOINC manager seems to have worked just fine.


Me too. I'm on Win XP. "Waiting to run" and "Waiting for memory" are two new somethings on my computer. They suddenly started popping up over the last day or two.

My main concern is that "stopping and restarting the BOINC manager" to give a kick to the stalled Work Unit is going to cause work to be lost. If it's been a while since a "Checkpoint" has been saved (or created, or whatever Checkpoints do...), then it could be hours of "CPU Time / Progress" that gets thrown away with capricious stops-and-restarts.

The name of one Work Unit sitting there with "Waiting to run" begins with "FRA_t038..."

-----

Scranton, Pennsylvania, where it's gray and dreary, even on the sunniest of days
Snags

Joined: 22 Feb 07
Posts: 198
Credit: 2,888,320
RAC: 0
Message 52617 - Posted: 19 Apr 2008, 15:59:08 UTC - in response to Message 52486.  

I'm probably not the best person to respond here but it's been three days and I hate to see you left hanging. Let's see if I can't at least get you pointed in the right direction.

"waiting to run" is not an indicator of a "stuck" wu, at least not of the type Herb was experiencing. It simply means boinc has suspended work on one wu to allow another project's wu to run for a while. (Remember boinc is designed to handle wus from many different projects at the same time.) There is a setting, "switch between applications every x minutes", that controls this. It looks like you only run R@h, which would explain why you have not seen this status before.

"waiting for memory" means boinc has bumped up against memory limits either because you simply don't have enough for all the things you are trying to run on your computer or because of the specific limits you have placed on memory usage in the boinc preferences.

Memory requirements for Rosetta wus vary, and in the last few months there have been some whose requirements have been considerably greater than those of other Rosetta wus. This could explain why you haven't seen the "waiting for memory" status before. (It may have shown up, but if it was while you were running some other program on your computer you might not have noticed it, since it would have gone away by the time you closed that program and opened the boinc manager.)

So here's my speculation on what's happening. Boinc starts a high memory task on your computer, hits the limit and suspends that wu with a "waiting for memory" status. It starts a second wu. The switch between applications interval is met, and provided some memory has been freed up in the meantime, boinc suspends computation on the second wu putting it in "waiting to run" status and resumes crunching the first wu.
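
The sequence speculated about above can be traced with a toy model. This Python sketch is only an illustration under loud assumptions: one CPU slot, a second wu that always fits in memory, and rescheduling only at the switch interval. The function `simulate` and its parameters are hypothetical and are not how the BOINC client is actually implemented.

```python
def simulate(free_memory_by_tick, first_task_mb, switch_interval):
    """Toy model: return one (first_wu_status, second_wu_status) pair per tick.

    free_memory_by_tick: memory available to BOINC at each tick, in MB (assumed input)
    first_task_mb:       memory the high-memory wu needs, in MB (assumed input)
    switch_interval:     ticks between scheduling decisions (assumed input)
    """
    history = []
    first_runs = False
    for tick, free_mb in enumerate(free_memory_by_tick):
        if tick % switch_interval == 0:
            # Scheduling point: the first wu gets the slot only if it fits.
            first_runs = free_mb >= first_task_mb
        if first_runs:
            history.append(("running", "waiting to run"))
        elif free_mb < first_task_mb:
            history.append(("waiting for memory", "running"))
        else:
            # Memory has been freed, but the next switch point has not arrived.
            history.append(("waiting to run", "running"))
    return history


# Made-up numbers: memory is short for two ticks, then frees up.
for statuses in simulate([300, 300, 600, 600], first_task_mb=500, switch_interval=2):
    print(statuses)
```

In this model neither wu is ever "stuck": each status change follows from the memory limit or the switch interval.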

There is another relevant preferences setting, "leave applications in memory while suspended". If you have this set to no, then when you shut down boinc you will lose any work after the last checkpoint. If you are closing boinc to try to "unstick" wus that are in "waiting for memory" you could lose a lot of crunching time. (If you do this repeatedly the watchdog will decide this wu isn't compatible with your computer, end the run and report it back as an error). Boinc won't switch to another application until a checkpoint has been reached regardless of the "switch between applications" interval (at least for the more recent clients) so the "waiting to run" task should be just fine.
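
The cost of a restart mentioned above comes down to simple arithmetic: a task that resumes from its last checkpoint has to repeat whatever CPU time it accumulated after that checkpoint. A minimal Python sketch, where the function name and the example figures are made up for illustration:

```python
def cpu_seconds_lost_on_restart(cpu_seconds_now, cpu_seconds_at_last_checkpoint):
    """CPU time that must be repeated if the task resumes from its last checkpoint."""
    if cpu_seconds_at_last_checkpoint > cpu_seconds_now:
        raise ValueError("checkpoint cannot be later than the current CPU time")
    return cpu_seconds_now - cpu_seconds_at_last_checkpoint


# Made-up example: 5 hours of CPU time so far, last checkpoint at 3.5 hours.
lost = cpu_seconds_lost_on_restart(5 * 3600, 3.5 * 3600)
print(lost / 3600, "hours of CPU time would be repeated")
```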

Hope this helps.

Snags
