Tasks getting stuck - calculating for hours without progress

Message boards : Number crunching : Tasks getting stuck - calculating for hours without progress

LigH
Joined: 7 Sep 09
Posts: 23
Credit: 8,895,348
RAC: 3,304
Message 68569 - Posted: 11 Nov 2010, 8:33:42 UTC

I know that this has been reported in a thread about version 2.16 already, but I believe it is not specific to one version but rather a general issue, and people may find it more easily in its own thread...

Several times now I have found tasks which need a suspiciously long time for just a few percent of progress (like 13 hours for 1%). Displaying the properties of such a task shows, for example (translated from the German messages):

Processor time at last checkpoint: ---
Processor time: 00:04:25
Elapsed time: 13:22:49

Such a task is possibly caught in an infinite loop. Is there anything we can do to help you solve such issues? Should we report the task names? Is there a log of such issues that we can send you? Or is cancelling the task all we can do?
Fun and success!

Jobs: holzon + 12angebote
Hobbies: doom9/Gleitz + PlaneShift

Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 68571 - Posted: 11 Nov 2010, 12:21:31 UTC

LigH -

I'm not telling you that this is "the problem", but it is worth checking how much memory the running tasks are taking (highlight the job and then click the Properties button).

Recently a bunch of tasks have passed through the pipeline which used an incredible amount of memory. I routinely saw tasks gobble up almost 2 gigabytes each.

I had so much paging activity that jobs which normally would have been cut off at 8 hours elapsed time were still going strong at 30 hours.

CH

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 11,809,072
RAC: 254
Message 68576 - Posted: 11 Nov 2010, 15:54:53 UTC

Quitting and restarting BOINC is usually a workaround for this problem. For the hung tasks, check the Task Manager if you're on Windows to see what percentage of CPU time they're getting. If a task is hung it'll be 0%; a working task will show 50% (on a dual-core machine).
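
The 50% figure follows from how Task Manager reports usage as a share of all cores: one fully busy single-threaded task on an N-core machine shows roughly 100/N percent. A minimal Python sketch of that arithmetic is below; the function names and the 0.5% tolerance are illustrative assumptions, not anything BOINC or Windows provides.

```python
def expected_share(cores: int) -> float:
    """Percent of total CPU that one fully busy single-threaded task shows."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 100.0 / cores


def looks_hung(cpu_percent: float, tolerance: float = 0.5) -> bool:
    """Treat a task as hung when its CPU share is at or near 0%.

    The tolerance is an assumed value that allows for display rounding.
    """
    return cpu_percent <= tolerance


print(expected_share(2))   # dual core: 50.0
print(expected_share(4))   # quad core: 25.0
print(looks_hung(0.0))     # True
print(looks_hung(25.0))    # False
```

On a quad-core machine a healthy task would therefore show about 25% rather than 50%.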

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 68577 - Posted: 11 Nov 2010, 16:21:55 UTC
Last modified: 11 Nov 2010, 16:23:19 UTC

"Elapsed time" is not the key point. The key point is CPU time used. If the task is not getting any CPU time (possibly because your machine is busy processing an infinite loop in some other program), then it's not a loop in Rosetta.
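
To put numbers on the difference, the two times from the first post (processor time 00:04:25, elapsed time 13:22:49) can be compared directly. The Python sketch below is only an illustration of that calculation; the helper names are made up for this example.

```python
def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_fraction(cpu_time: str, elapsed_time: str) -> float:
    """Fraction of the elapsed wall-clock time that was actually spent on the CPU."""
    elapsed = to_seconds(elapsed_time)
    if elapsed == 0:
        return 0.0
    return to_seconds(cpu_time) / elapsed


# Values reported in the first post of this thread.
fraction = cpu_fraction("00:04:25", "13:22:49")
print(f"{fraction:.2%}")  # 0.55%
```

A task stuck in an infinite loop would keep a core busy, so its CPU time would grow roughly in step with elapsed time. A ratio this low suggests the task was hardly scheduled at all.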

As for reporting such issues, that is best done with a link to the specific task(s) in question, posted in the thread for the Rosetta version that is running the task (you can see the version in the "application" column of the tasks tab).
Rosetta Moderator: Mod.Sense

LigH
Joined: 7 Sep 09
Posts: 23
Credit: 8,895,348
RAC: 3,304
Message 68776 - Posted: 7 Dec 2010, 10:22:21 UTC

At the moment I have 3/4 tasks hung (each 0% CPU), each using just ~300 MB RAM.

I'll try to find the tasks and link them.
Fun and success!

Jobs: holzon + 12angebote
Hobbies: doom9/Gleitz + PlaneShift

mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,773,304
RAC: 3,388
Message 68778 - Posted: 7 Dec 2010, 11:48:28 UTC - in response to Message 68776.  

At the moment I have 3/4 tasks hung (each 0% CPU), each using just ~300 MB RAM.

I'll try to find the tasks and link them.


What message is it giving you in the Messages tab of Boinc Manager?

LigH
Joined: 7 Sep 09
Posts: 23
Credit: 8,895,348
RAC: 3,304
Message 68780 - Posted: 7 Dec 2010, 13:01:33 UTC - in response to Message 68778.  
Last modified: 7 Dec 2010, 13:02:58 UTC

What message is it giving you in the Messages tab of Boinc Manager?


Paste -- I don't see anything suspicious. But the tasks possibly started last week already, and that period is no longer included in the log...

Tasks are linked in the minirosetta 2.17 thread.
Fun and success!

Jobs: holzon + 12angebote
Hobbies: doom9/Gleitz + PlaneShift

mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,773,304
RAC: 3,388
Message 68789 - Posted: 8 Dec 2010, 11:50:16 UTC - in response to Message 68780.  

What message is it giving you in the Messages tab of Boinc Manager?


Paste -- I don't see anything suspicious. But the tasks possibly started last week already, and that period is no longer included in the log...

Tasks are linked in the minirosetta 2.17 thread.


This is the line I was concerned about:
"06.12.2010 08:53:59 suspend work if non-BOINC CPU load exceeds 75 %"

I would change that value so the PC does not stop crunching. You change it in the BOINC Manager under Advanced Preferences, on the processor usage tab. The line starts out "While processor usage is less than [___] percent"; for my machines I put a zero in there and it seems to work for me. You MUST click OK at the bottom to save any changes!! These changes will only apply to that PC, not to all of them; to change all of them you must make the changes on the website.
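
How this preference behaves can be sketched in a few lines of Python. This is an assumed model based only on the log line quoted above and on the report that a zero works, not BOINC's actual code: work is suspended while the CPU load from other programs exceeds the threshold, and a threshold of zero switches the check off. The function name is hypothetical.

```python
def should_suspend(non_boinc_load: float, threshold: float) -> bool:
    """Assumed rule: suspend while non-BOINC CPU load exceeds the threshold.

    A threshold of 0 is assumed to disable the check entirely.
    """
    if threshold <= 0:
        return False
    return non_boinc_load > threshold


print(should_suspend(90.0, 75.0))  # True: heavy load from other programs
print(should_suspend(10.0, 75.0))  # False: machine is nearly idle
print(should_suspend(90.0, 0.0))   # False: check disabled
```

Under this model a nearly idle machine never reaches the 75% threshold, so the setting alone would not leave tasks at 0% CPU.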

LigH
Joined: 7 Sep 09
Posts: 23
Credit: 8,895,348
RAC: 3,304
Message 68807 - Posted: 15 Dec 2010, 8:30:33 UTC - in response to Message 68789.  

This is the line I was concerned about:
"06.12.2010 08:53:59 suspend work if non-BOINC CPU load exceeds 75 %"


This value is there for a good reason. BOINC may suspend tasks when other threads are really busy (like video encoding).

But I know my tools quite well; I can handle e.g. Process Explorer. A task suspended due to a busy machine is different from a task being stuck. An otherwise idle PC (except for text editors) should allow 4 BOINC tasks to run on a quad-core CPU, and when the CPU is not used up to 100% by 4x minirosetta in such nearly idle times, then something is buggy.
Fun and success!

Jobs: holzon + 12angebote
Hobbies: doom9/Gleitz + PlaneShift

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 68810 - Posted: 15 Dec 2010, 12:54:29 UTC

OK, now your description more clearly matches a problem that crops up on occasion, where BOINC shows tasks as "running" but they aren't getting any CPU time, even when idle CPU is available. Since BOINC is responsible for assigning CPU time to tasks, I tend to believe it is a BOINC problem, not a Rosetta problem. To date I've not been able to discern any pattern as to what causes the tasks to stop getting CPU.

The only way to get them started again, that I know of anyway, is to completely exit and restart BOINC. Especially when you've got many tasks running, that can mean losing a fair amount of CPU time, because each task restarts from its most recent checkpoint. If the machine periodically reboots (and thus completely restarts BOINC), you might just "suspend" the tasks that show a "running" status but are not accruing CPU time. Other tasks will then run. Then, after a reboot, "resume" the tasks again.

All that gets to be quite bothersome, so it is a good thing it is a fairly rare occurrence, although it seems that once one task has the problem, others are more likely to have it as well. Any insight as to what causes a task to stop getting CPU is appreciated. I believe Windows is the only platform where such things are being reported.
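
Picking out which "running" tasks to suspend comes down to noting each task's CPU time twice, some minutes apart, and seeing which ones did not advance. A minimal Python sketch follows; the task names, the readings, and the one-second minimum gain are all hypothetical, and the values would have to be copied by hand from the task properties.

```python
from typing import Dict, List


def stalled_tasks(before: Dict[str, float],
                  after: Dict[str, float],
                  min_gain: float = 1.0) -> List[str]:
    """Return the tasks whose CPU seconds grew by less than min_gain between readings."""
    return sorted(
        name
        for name, cpu_seconds in after.items()
        if name in before and cpu_seconds - before[name] < min_gain
    )


# Hypothetical CPU-time readings (in seconds) taken ten minutes apart.
first_reading = {"task_a": 265.0, "task_b": 1200.0, "task_c": 90.0}
second_reading = {"task_a": 265.0, "task_b": 1795.0, "task_c": 90.0}

print(stalled_tasks(first_reading, second_reading))  # ['task_a', 'task_c']
```

The tasks it lists are the candidates for "suspend" until the next full restart of BOINC.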
Rosetta Moderator: Mod.Sense

©2024 University of Washington
https://www.bakerlab.org