Too many restarts with no progress

Message boards : Number crunching : Too many restarts with no progress

Sid Celery

Joined: 11 Feb 08
Posts: 2474
Credit: 46,499,576
RAC: 3,223
Message 67260 - Posted: 18 Aug 2010, 11:01:26 UTC
Last modified: 18 Aug 2010, 11:02:12 UTC

A new member of my team has just started crunching, but I'm not sure if they're having problems or not. I'm guessing they are even though all tasks are reporting success.

This task barely ran 30 minutes instead of 3 hours. Lots of messages came back as if it was struggling to run successfully and had to restart repeatedly, culminating in the message "Too many restarts with no progress. Keep application in memory while preempted" before closing down cleanly. I have lots of RAM and one of my jobs looks like this. I've asked him to ensure he has "Leave applications in memory while suspended" ticked - many fewer messages in the task.

He has a dual-core machine with 1 GB of RAM, which may be tight. He's raised the RAM limit from 60% to 90% while the computer is in use, and more recent jobs have run longer and with fewer messages.
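
To put rough numbers on that change, here is a small Python sketch. It assumes the percentage applies to the full 1 GB and that two tasks (one per core) share the allowance evenly; BOINC does not promise an even split, and `memory_budget_mb` is a made-up helper, not part of BOINC.

```python
def memory_budget_mb(total_ram_mb: float, use_fraction: float, tasks: int) -> float:
    """Return the RAM (in MB) each task could use under an even split.

    Hypothetical helper for illustration only; not a BOINC function.
    """
    if tasks < 1:
        raise ValueError("tasks must be at least 1")
    return total_ram_mb * use_fraction / tasks


TOTAL_MB = 1024  # 1 GB, as on the machine described above

# Compare the old 60% limit with the new 90% limit for two running tasks.
for fraction in (0.60, 0.90):
    allowed = TOTAL_MB * fraction
    per_task = memory_budget_mb(TOTAL_MB, fraction, tasks=2)
    print(f"{fraction:.0%} limit: {allowed:.0f} MB allowed, about {per_task:.0f} MB per task")
```

Under those assumptions the per-task allowance goes from roughly 307 MB to roughly 461 MB, which would fit with the later jobs running longer.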

Does this seem to be a memory-related issue, as I suspect, or could it indicate some other problem? 1 GB of RAM (on an XP machine) really ought to be plenty.

Any further suggestions I could look at? All advice appreciated.
Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 67261 - Posted: 18 Aug 2010, 11:33:26 UTC

Any further suggestions I could look at?


Morning Sid - from time to time I too have seen restarts on tasks with no other associated error messages - just something like "restarting from checkpoint ..."

My first thought is that this is NOT related to the amount of available memory - but I will admit that, not having the source to BOINC or the application, I have not seen the logic that triggers a restart.

You make no mention of any system related error messages - such as a segfault or a processor exception so I would not rush to jump on a hardware issue.

As far as available memory resources go, I would think that if real memory was not available you would either page fault and swap or the user would see the task go into the "waiting for memory" state - I saw that on my systems a few times when I first started using hex core processors without upgrading memory first.

One gig minus the OS overhead may not leave much for the Rosetta tasks but I used to successfully run a tri-core AMD with two gig on Linux.

If you still suspect system resources are the root cause of this issue then why not suggest that he bring up the system monitor and sort of watch things for a while - he should be able to see free memory and swap activity.

Another thing to try to isolate the issue to a shortage of memory would be to go to the Computing Preferences page on his account and set the maximum number of processors to use to 1. If it is being caused by a memory shortage that should help - if it dogs out other systems on the account oh well, at least you know what it is and what needs to be done to resolve it.

Let us know what you come up with.

mikey
Joined: 5 Jan 06
Posts: 1898
Credit: 12,723,752
RAC: 682
Message 67264 - Posted: 18 Aug 2010, 12:26:09 UTC - in response to Message 67261.  

Do you think it would help if he changed the 25% threshold at which BOINC stops crunching to 0%? I have done that on all of my machines and BOINC no longer stops crunching at all. All of my machines also have at least 2 gig of memory in them.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 67266 - Posted: 18 Aug 2010, 13:14:21 UTC

One potential cause of the message that we sometimes forget is people rebooting their machine. If you reboot before a checkpoint is reached, that is "no progress" on the next restart. Do this several times and the application figures something isn't going well for this combination of task and host, so that task is sent home (I think it takes 5 times restarting with no progress).
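
The rule described above can be sketched in Python as follows. This is only an illustration of the idea, not the actual Rosetta or BOINC code: the `RestartTracker` class, its method names, and the limit of 5 (taken from the "I think it takes 5" guess above) are all assumptions.

```python
MAX_RESTARTS_WITHOUT_PROGRESS = 5  # assumption: "I think it takes 5 times"


class RestartTracker:
    """Hypothetical counter for restarts that happen before any new checkpoint."""

    def __init__(self, limit: int = MAX_RESTARTS_WITHOUT_PROGRESS) -> None:
        self.limit = limit
        self.last_checkpoint = 0
        self.restarts_without_progress = 0

    def on_restart(self, checkpoint_count: int) -> bool:
        """Record one restart; return True if the task should give up.

        checkpoint_count is the number of checkpoints written so far.
        """
        if checkpoint_count > self.last_checkpoint:
            # A new checkpoint was reached since the last restart: progress.
            self.last_checkpoint = checkpoint_count
            self.restarts_without_progress = 0
        else:
            # Restarted (e.g. after a reboot) before reaching a new checkpoint.
            self.restarts_without_progress += 1
        return self.restarts_without_progress >= self.limit


tracker = RestartTracker()
# Five reboots in a row, each before the first checkpoint is written.
outcomes = [tracker.on_restart(checkpoint_count=0) for _ in range(5)]
print(outcomes)
```

In this sketch a single new checkpoint resets the count, so only an unbroken run of restarts with no progress ends the task.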

Another thought, have you reviewed the BOINC settings for disk page file ("swap") space? If this were set very low, perhaps odd problems would arise.
Rosetta Moderator: Mod.Sense
Sid Celery

Joined: 11 Feb 08
Posts: 2474
Credit: 46,499,576
RAC: 3,223
Message 67268 - Posted: 18 Aug 2010, 13:33:00 UTC - in response to Message 67261.  

Morning Sid - from time to time I too have seen restarts on tasks with no other associated error messages - just something like "restarting from checkpoint..."

It doesn't say that. It's just the repetition of previous messages.

You make no mention of any system related error messages - such as a segfault or a processor exception so I would not rush to jump on a hardware issue.

Fair comment. I'll ask if there are any other clues from the messages tab. At the moment I'm only going by the reported task details.

As far as available memory resources go, I would think that if real memory was not available you would either page fault and swap or the user would see the task go into the "waiting for memory" state - I saw that on my systems a few times when I first started using hex core processors without upgrading memory first.

Understood, but I'm not sure BOINC uses swap files all that well. Still guessing here.

If you still suspect system resources are the root cause of this issue then why not suggest that he bring up the system monitor and sort of watch things for a while - he should be able to see free memory and swap activity.

I'm in recruitment mode and I'm reluctant to suggest that BOINC/Rosetta needs this level of babysitting. I think I'd scare off more people than I recruit.

Another thing to try to isolate the issue to a shortage of memory would be to go to the Computing Preferences page on his account and set the maximum number of processors to use to 1. If it is being caused by a memory shortage that should help - if it dogs out other systems on the account oh well, at least you know what it is and what needs to be done to resolve it.

Nice idea. I'll keep that one up my sleeve for the moment if things don't settle down.
Sid Celery

Joined: 11 Feb 08
Posts: 2474
Credit: 46,499,576
RAC: 3,223
Message 67269 - Posted: 18 Aug 2010, 13:56:08 UTC - in response to Message 67264.  

Do you think it would help if he changed the 25% threshold at which BOINC stops crunching to 0%? I have done that on all of my machines and BOINC no longer stops crunching at all. All of my machines also have at least 2 gig of memory in them.

Very possible. This may explain better why it only seems to get so far then go back to the start unexpectedly.

Good suggestions from everyone. I've pointed out this thread to the user - hopefully one or all of the suggestions makes the difference. Thanks.
Sid Celery

Joined: 11 Feb 08
Posts: 2474
Credit: 46,499,576
RAC: 3,223
Message 67280 - Posted: 20 Aug 2010, 3:21:16 UTC - in response to Message 67266.  

I've asked him to ensure he has "Leave applications in memory while suspended" ticked...

It wasn't ticked.

One potential cause of the message that we sometimes forget is people rebooting their machine. If you reboot before a checkpoint is reached, that is "no progress" on the next restart. Do this several times and the application figures something isn't going well for this combination of task and host, so that task is sent home (I think it takes 5 times restarting with no progress).

This may be an ongoing issue too. No task has outright failed yet, but being aware of the possibilities always helps. Task details look much tidier now, so I'm happy enough to close this issue.

Thanks all.




©2025 University of Washington
https://www.bakerlab.org