Stalled WU

Message boards : Number crunching : Stalled WU

Ed Machak

Joined: 10 Nov 16
Posts: 7
Credit: 17,339,411
RAC: 0
Message 105294 - Posted: 1 Mar 2022, 0:50:45 UTC

Hello,

I have run at least half a dozen WUs to over 99% complete, and then they stall. Time remaining goes to a few minutes and stays there. I've had to abort all six WUs because they ran past their due date.

Is this a common thing? It's been happening over the last month. I hate to waste all that CPU time that might go to better use on another project. I've been with R@H since 2011 and would like to continue to do useful work if possible.

Thank you,

Ed Machak
ID: 105294 · Rating: 0
.clair.

Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 105296 - Posted: 1 Mar 2022, 1:06:53 UTC

Some Python work units quickly turn into zombies,
after less than 60 seconds of CPU time.
If you see a work unit overrunning, click on it and then on "Properties"; if its elapsed time is a lot more than its CPU time, it's gone zombie.
Abort them.
There are far too many of them.
ID: 105296 · Rating: 0
Ed Machak

Joined: 10 Nov 16
Posts: 7
Credit: 17,339,411
RAC: 0
Message 105306 - Posted: 1 Mar 2022, 21:39:47 UTC - in response to Message 105296.  

Thanks for the tip.

Ed
ID: 105306 · Rating: 0
Greg_BE

Joined: 30 May 06
Posts: 5664
Credit: 5,711,666
RAC: 1,996
Message 105307 - Posted: 1 Mar 2022, 22:54:42 UTC

If the task does not complete within 12 hours of elapsed time, then it is stuck. No need to run it for 1 or 2 days.
You ran 2.5 days before killing it. After 1 day elapsed, kill it.

Before you kill it, go to the slot directory it is stored in and look at the stderr text.
Look at checkpoint times and elapsed times.

Like this:


Status Report: Elapsed Time: '6000.564621'
Status Report: CPU Time: '6877.687500'

Look at each status report from the bottom up and see how much it advanced, or whether there are any error messages in the text. This will give you a better idea of what's going on.
If everything looks normal up to the 95% or whatever mark and then it stalls, then it's something in the data itself and all you can do is kill the task.

You can also download and use eFMer's BoincTasks program and set up the columns so that CPU% is one of them; then you can tell whether a task is stalled or not. If it is using only a fraction of a percent of the CPU, then it's stuck and you can kill it. BT is a very useful program for monitoring.
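
A rough sketch of the stderr check described above, in Python. It assumes the status reports appear as "Elapsed Time" / "CPU Time" line pairs exactly like the two quoted in this post; the second pair in the sample is made up for illustration, and the function names are hypothetical, not part of BOINC.

```python
import re

# Matches lines such as: Status Report: Elapsed Time: '6000.564621'
STATUS_RE = re.compile(r"Status Report: (Elapsed Time|CPU Time): '([0-9.]+)'")


def read_status_reports(stderr_text):
    """Return a list of (elapsed_seconds, cpu_seconds) pairs, oldest first."""
    reports = []
    elapsed = None
    for line in stderr_text.splitlines():
        match = STATUS_RE.search(line)
        if not match:
            continue
        label, value = match.group(1), float(match.group(2))
        if label == "Elapsed Time":
            elapsed = value
        elif elapsed is not None:
            # A CPU Time line closes the report opened by the Elapsed Time line.
            reports.append((elapsed, value))
            elapsed = None
    return reports


def cpu_advance(reports):
    """CPU seconds gained between the last two reports, or None if fewer than two."""
    if len(reports) < 2:
        return None
    return reports[-1][1] - reports[-2][1]


# First pair is from the post above; the second pair is invented sample data.
sample = """\
Status Report: Elapsed Time: '6000.564621'
Status Report: CPU Time: '6877.687500'
Status Report: Elapsed Time: '12000.100000'
Status Report: CPU Time: '6877.900000'
"""
reports = read_status_reports(sample)
print(reports)
print(cpu_advance(reports))
```

If elapsed time keeps climbing between reports while CPU time barely moves, that matches the stalled pattern described in this thread.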
ID: 105307 · Rating: 0
UBT - Timbo

Joined: 25 Sep 05
Posts: 20
Credit: 2,275,059
RAC: 0
Message 105898 - Posted: 10 Apr 2022, 13:31:59 UTC
Last modified: 10 Apr 2022, 13:32:38 UTC

Hi all

I don't think I have "stalled" tasks - as the %age work done is still increasing - but they are taking AGES to complete...

task #1

Application - rosetta python projects 1.03 (vbox64)
Name - aagb-NMPHE_pp-NMVAL-GGLY-mACPenC12C_pp_7_2674773_4
State - Running
Received - 08/04/2022 00:41:54
Report deadline - 11/04/2022 00:41:56
Estimated computation size - 80,000 GFLOPs
CPU time - 00:34:13
CPU time since checkpoint - 00:00:06
Elapsed time - 1d 20:46:22
Estimated time remaining - 01:06:58
Fraction done - 97.568%
Virtual memory size - 101.57 MB
Working set size - 2.79 GB
Directory - slots/3
Process ID - 5000
Progress rate - 2.160% per hour
Executable - vboxwrapper_26203_windows_x86_64.exe

=========
task #2

Application - rosetta python projects 1.03 (vbox64)
Name - aagb-mAZE-mPHE-GPN-mB3PHG_pp_9_2612326_4
State - Running
Received - 08/04/2022 00:41:11
Report deadline - 11/04/2022 00:41:13
Estimated computation size - 80,000 GFLOPs
CPU time - 00:37:56
CPU time since checkpoint - 00:00:06
Elapsed time - 2d 02:10:06
Estimated time remaining - 00:47:31
Fraction done - 98.446%
Virtual memory size - 101.04 MB
Working set size - 2.79 GB
Directory - slots/1
Process ID - 7280
Progress rate - 1.800% per hour
Executable - vboxwrapper_26203_windows_x86_64.exe


And from Task Manager I see that CPU usage fluctuates between 0% and maybe 1%

This is very much a waste of computing time if the tasks are not actually doing much... but I don't want to abort them if the task is going to complete and the "result" file is of benefit...

Maybe some admin can provide more succinct answers as to why this is happening, as others seem to have reported similar issues with what appear to be "zombie" tasks...
regards,
Tim

Founder, UK BOINC Team
Join UK BOINC Team: http://www.ukboincteam.org.uk/newforum
ID: 105898 · Rating: 0
Bryn Mawr

Joined: 26 Dec 18
Posts: 376
Credit: 10,749,980
RAC: 6,154
Message 105899 - Posted: 10 Apr 2022, 15:24:13 UTC - in response to Message 105898.  

Look at the difference between CPU time and elapsed time: either there is something serious running alongside BOINC or, far more likely, those tasks are dead.
ID: 105899 · Rating: 0
UBT - Timbo

Joined: 25 Sep 05
Posts: 20
Credit: 2,275,059
RAC: 0
Message 105901 - Posted: 10 Apr 2022, 18:18:51 UTC - in response to Message 105899.  
Last modified: 10 Apr 2022, 18:24:11 UTC

Look at the difference between CPU time and elapsed time: either there is something serious running alongside BOINC or, far more likely, those tasks are dead.


Hi

Thanks for the feedback. :-)

I've seen this sort of behaviour before with other non-VBox projects and usually the rule of thumb is to "leave them be" and they will (eventually) complete...

But I've not had this happen with Rosetta's VBox tasks before - and indeed I have one other host, with the same OS (Win 7 Pro), the same VBox version and the same version of BOINC Manager, and that has been fairly rattling through the tasks...and both hosts have plenty of installed, working RAM - and no other significant non-BOINC tasks are taking place simultaneously.

eg: One VBoxHeadless.exe is taking up 71Mb, the other is at 39Mb and VirtualBox.exe is taking up 18.5Mb - which are minute amounts of RAM in the grand scheme of things...

So it might be that my old CPU on this one host is "past it" - maybe the right CPU "core-functions" are not up to the mark... but it works fine with LHC and QuChem VBox tasks...

Which leads one to assume there might be something peculiar with the Rosetta VBox tasks themselves... ?
regards,
Tim

Founder, UK BOINC Team
Join UK BOINC Team: http://www.ukboincteam.org.uk/newforum
ID: 105901 · Rating: 0
Bryn Mawr

Joined: 26 Dec 18
Posts: 376
Credit: 10,749,980
RAC: 6,154
Message 105902 - Posted: 10 Apr 2022, 18:34:55 UTC - in response to Message 105901.  

Which leads one to assume there might be something peculiar with the Rosetta VBox tasks themselves... ?


Yes, there is a problem with some of the Rosetta VBox tasks that causes this behaviour.
ID: 105902 · Rating: 0
.clair.

Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 105905 - Posted: 10 Apr 2022, 23:50:29 UTC - in response to Message 105901.  

So it might be that my old CPU on this one host is "past it" - maybe the right CPU "core-functions" are not up to the mark... but it works fine with LHC and QuChem VBox tasks...

Whatever the CPU is, it is good enough to run them, so it is OK in that way.

Which leads one to assume there might be something peculiar with the Rosetta VBox tasks themselves... ?

Now there is an understatement . . .
ID: 105905 · Rating: 0
UBT - Timbo

Joined: 25 Sep 05
Posts: 20
Credit: 2,275,059
RAC: 0
Message 105907 - Posted: 11 Apr 2022, 11:42:06 UTC - in response to Message 105902.  
Last modified: 11 Apr 2022, 11:44:01 UTC

Which leads one to assume there might be something peculiar with the Rosetta VBox tasks themselves... ?


Yes, there is a problem with some of the Rosetta VBox tasks that causes this behaviour.


Hi

Yup - it certainly seems like that :-(

You'd have thought that a "tech admin" overseeing the returned results would have recognised that a certain percentage were taking far too long to be reported, worked out that there was a problem, and fixed it.

Instead, the situation seems to be that volunteers' computers are wasting time, money and electricity by spinning their wheels, due to Rosetta's poor and inefficient management of the tasks they make available. :-(
regards,
Tim

Founder, UK BOINC Team
Join UK BOINC Team: http://www.ukboincteam.org.uk/newforum
ID: 105907 · Rating: 0
Bryn Mawr

Joined: 26 Dec 18
Posts: 376
Credit: 10,749,980
RAC: 6,154
Message 105908 - Posted: 11 Apr 2022, 13:51:45 UTC - in response to Message 105907.  


Instead, the situation seems to be that volunteers' computers are wasting time, money and electricity by spinning their wheels, due to Rosetta's poor and inefficient management of the tasks they make available. :-(


Simple solution, I just refuse to run the Python tasks - too buggy and too resource hungry.
ID: 105908 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2003
Credit: 38,584,443
RAC: 17,403
Message 105911 - Posted: 11 Apr 2022, 21:21:09 UTC - in response to Message 105898.  
Last modified: 11 Apr 2022, 21:29:51 UTC

I don't think I have "stalled" tasks - as the %age work done is still increasing

task #1
CPU time - 00:34:13
Elapsed time - 1d 20:46:22

=========
task #2

CPU time - 00:37:56
Elapsed time - 2d 02:10:06

And from Task Manager I see that CPU usage fluctuates between 0% and maybe 1%

This is very much a waste of computing time if the tasks are not actually doing much... but I don't want to abort them if the task is going to complete and the "result" file is of benefit...

Maybe some admin can provide more succinct answers as to why this is happening, as others seem to have reported similar issues with what appear to be "zombie" tasks...

There's no way of telling from the task manager which task is running or not.
The difference in CPU time and Elapsed time is telling you <exactly> what's happening with the task.
It's the very definition of "stalled" or a "zombie" task. Things are no more complicated than that.
Knowing why is the researcher's problem. We only need to know that they've stopped and, of the hundreds I've seen, they <never ever> restart and nothing you can do will change that.
Abort on sight. Don't worry about why, just do it and get on with your day.

Quoting my earlier message referring specifically to VBox tasks:
~~~
Repeating my earlier message for those who haven't seen it:
If you have a task you think is stalled or taking a long time, click on it and select properties on the left.
If there's a large difference between CPU time and Elapsed time, then it's stalled and you can only abort it. They <never> restart.

Also, if later tasks are completing before earlier tasks, it's a clue to check Properties of those earlier tasks and, if they've stalled in the way described above, abort them.
This wastes the least amount of processing time.

It's not your fault and there's <nothing> you can do to correct it
~~~
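
The CPU-time versus elapsed-time comparison above can be sketched in Python. This is only an illustration: it assumes the times are copied from the task's Properties dialog in the "00:34:13" or "1d 20:46:22" form shown earlier in the thread, and the 0.5 ratio and one-hour minimum are arbitrary thresholds chosen for the example, not values from BOINC or Rosetta.

```python
import re

# Accepts 'HH:MM:SS' with an optional leading 'Nd ' day count.
DURATION_RE = re.compile(r"^(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})$")


def to_seconds(text):
    """Convert '00:34:13' or '1d 20:46:22' to a number of seconds."""
    match = DURATION_RE.match(text.strip())
    if not match:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def looks_stalled(cpu_time, elapsed_time, min_ratio=0.5, min_elapsed=3600):
    """Flag a task whose CPU time is a small fraction of its elapsed time.

    min_ratio and min_elapsed are hypothetical thresholds for this sketch.
    """
    elapsed = to_seconds(elapsed_time)
    if elapsed < min_elapsed:
        # Too early to judge; a task that has just started proves nothing.
        return False
    return to_seconds(cpu_time) / elapsed < min_ratio


# Values taken from the two tasks posted earlier in this thread.
for cpu, elapsed in [("00:34:13", "1d 20:46:22"), ("00:37:56", "2d 02:10:06")]:
    ratio = to_seconds(cpu) / to_seconds(elapsed)
    print(f"CPU {cpu}, elapsed {elapsed}: ratio {ratio:.3f}, stalled={looks_stalled(cpu, elapsed)}")
```

Both tasks from the earlier post come out with a ratio of roughly 0.013, i.e. about 1% of the elapsed time was spent on the CPU.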
ID: 105911 · Rating: 0
dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 115,998,809
RAC: 64,136
Message 105917 - Posted: 12 Apr 2022, 17:42:58 UTC

BOINCTasks shows whether a task is using CPU time or not so you can see what to abort.


https://efmer.com/boinctasks/download-boinctasks/
ID: 105917 · Rating: 0
rjs5

Joined: 22 Nov 10
Posts: 273
Credit: 21,513,485
RAC: 18,969
Message 105928 - Posted: 13 Apr 2022, 16:39:02 UTC - in response to Message 105917.  

BOINCTasks shows whether a task is using CPU time or not so you can see what to abort.
https://efmer.com/boinctasks/download-boinctasks/


I use Windows BOINCTasks and it is very obvious when a Rosetta WU hangs. The CPU usage goes to zero and stays. I have never seen one finish after the CPU goes to 0%.
On Linux I use "top -i -c -d3" to get a similar display. I press "SHIFT P" to sort processes by CPU usage.

"-i" only show running processes
"-c" show the command line so you can see what is burning CPU
"-d 3" sample every 3 seconds so I can see the display

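For anyone who would rather script the Linux check than watch top, here is a minimal Python sketch that samples one process twice through /proc and reports its CPU usage over the interval. It is Linux-only, the function names are hypothetical, and the sample stat line is invented for illustration. For VBox tasks the work is presumably done by the VBoxHeadless process rather than the wrapper, so that is likely the PID to watch.

```python
import os
import time


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain spaces,
    # so split only the part after the last ')'.
    fields = stat_line[stat_line.rfind(")") + 2:].split()
    # fields[0] is the process state (field 3 in proc(5)); utime and stime are
    # fields 14 and 15, which are indexes 11 and 12 here.
    return int(fields[11]) + int(fields[12])


def cpu_percent(pid, interval=3.0):
    """Sample a process twice and return its CPU usage over the interval, in percent."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    path = f"/proc/{pid}/stat"
    with open(path) as f:
        before = cpu_ticks(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = cpu_ticks(f.read())
    return 100.0 * (after - before) / ticks_per_second / interval


# Invented stat line: utime=250 and stime=50 ticks.
sample = "5000 (VBox Headless) S 1 5000 5000 0 -1 4194304 100 0 0 0 250 50 0 0 20 0 1 0 12345"
print(cpu_ticks(sample))
```

Calling cpu_percent(pid) a few times on a suspect task and getting values at or near zero each time would match the hang described above; the 3-second default mirrors the "-d 3" sampling interval.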

I have two computers with near-identical configurations, and I saw the number of stalls/hangs increase SIGNIFICANTLY when I simply updated VirtualBox to a newer version than the one that comes with BOINC. When I uninstalled BOINC and VirtualBox and reinstalled them, the problems cleared up. It appears the Rosetta developers/integrators introduced some dependency on a particular VirtualBox version.

Using VirtualBox was supposed to reduce the Rosetta developers' problems with different environments. It looks more like they just put a 3 GB vbox wrapper around it and introduced a new set of problems.

BOINC startup time when running Rosetta WUs is now minutes instead of seconds.
Checkpoints that write gigabytes of data to the BOINC drive are going to kill volunteer hardware.
Excess memory demands exhaust memory and add to the unnecessary excess power needed to run Rosetta WUs.
ID: 105928 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2003
Credit: 38,584,443
RAC: 17,403
Message 105942 - Posted: 15 Apr 2022, 10:47:39 UTC - in response to Message 105928.  

I use Windows BOINCTasks and it is very obvious when a Rosetta WU hangs. The CPU usage goes to zero and stays. I have never seen one finish after the CPU goes to 0%.

A new one I've seen is a task with zero elapsed time and a status of "waiting to run".
They never ever start either. Annoying, but quicker to abort.
ID: 105942 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org