Summary of issues with VirtualBox tasks

Message boards : Number crunching : Summary of issues with VirtualBox tasks

dcdc
Joined: 3 Nov 05
Posts: 1831
Credit: 119,573,201
RAC: 8,994
Message 104693 - Posted: 7 Feb 2022, 13:55:43 UTC

Hey everyone, I thought it would be a good idea to create a post listing the issues with the VirtualBox tasks, which can then be updated as/if they get fixed. This isn't a thread to list the details of the issues - just to link to the wider discussions elsewhere:

1. Sometimes tasks don't start. They sit there at 0% and with no time used, but say "Running" in BOINC Manager. This tends to happen in batches for me and happens on multiple machines. I don't think there is a thread discussing this. Restarting BOINC Manager fixes this.

2. Some tasks never end. The % keeps climbing but they have to be aborted. My record that I've noticed is 4 days of CPU time. The error on the Vbox screen is always the same, but might be misleading (Spectre error). Thread here:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14897

3. Some machines cannot run Rosetta VirtualBox tasks due to the Intel MKL (Math Kernel Library) fatal error. I would guess this affects something like 20% of machines, including servers. This is not due to virtualisation extensions being disabled, as other VBox projects work fine. Thread here:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14886&postid=104574

Of course there are lots of other issues with VirtualBox tasks, like the disk space requirements and the volume of disk writes, which are not technically problems, but do have a significant impact on the amount of processing available to the project.

Have I missed any major issues?
ID: 104693 · Rating: 0
Jordan Toth
Joined: 19 Dec 16
Posts: 6
Credit: 172,398
RAC: 0
Message 104696 - Posted: 7 Feb 2022, 17:04:17 UTC - in response to Message 104693.  

I can't install Virtualbox - it states it's not compatible with my iMac, do I need to have it installed in order to run Rosetta@home? I haven't gotten any work for my computer to do.
ID: 104696 · Rating: 0
Falconet
Joined: 9 Mar 09
Posts: 353
Credit: 1,227,479
RAC: 1,506
Message 104699 - Posted: 7 Feb 2022, 17:10:25 UTC - in response to Message 104696.  

I can't install Virtualbox - it states it's not compatible with my iMac, do I need to have it installed in order to run Rosetta@home? I haven't gotten any work for my computer to do.



Virtualbox is available for Mac though: https://www.oracle.com/virtualization/technologies/vm/downloads/virtualbox-downloads.html

If you don't install Virtualbox, you will not get the Python tasks. You will be limited to the standard Rosetta 4.20 tasks which are quite rare these days.
ID: 104699 · Rating: 0
gbayler
Joined: 10 Apr 20
Posts: 14
Credit: 3,069,484
RAC: 0
Message 104735 - Posted: 8 Feb 2022, 21:44:24 UTC - in response to Message 104693.  


1. Sometimes tasks don't start. They sit there at 0% and with no time used, but say "Running" in BOINC Manager. This tends to happen in batches for me and happens on multiple machines. I don't think there is a thread discussing this. Restarting BOINC Manager fixes this.

2. Some tasks never end. The % keeps climbing but they have to be aborted. My record that I've noticed is 4 days of CPU time. The error on the Vbox screen is always the same, but might be misleading (Spectre error). Thread here:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14897

I observed sort of a combination of both problems, but maybe it is something different altogether. In my case, VirtualBox-tasks seem to start in the BOINC Manager (Status: "Running"), when actually the VM that belongs to that task hangs while booting. The progress indicated in the BOINC manager asymptotically approaches 100%, but never reaches it. These tasks run until they are manually aborted. They don't consume much CPU time. I observed the problem using Linux and using Windows. When using Windows, I checked the VBox screen: in one case it was completely empty, in the other case it showed the error message
Couldn't copy file: fwrite() failed

I have not seen the Spectre error yet. In Linux I haven't yet figured out how to check the VBox-screen.
@Jim1348 described such tasks in the forum as "0 CPU-tasks".
I have written a watchdog script to abort such tasks as soon as possible. This is a good workaround for me, but of course it would be better if this problem were fixed.
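On the open question of how to check the VBox screen on Linux: one possible approach, given here as an untested-on-BOINC sketch, is to ask VirtualBox for a screenshot of the running VM with `VBoxManage controlvm <vm> screenshotpng <file>`. The helper name `vm_screenshot` and the `VBOXMANAGE` override are made up for this example. The command presumably has to run as the same user, and in the same environment, as the BOINC client that started the VM, otherwise VBoxManage will not see it.

```bash
#!/bin/bash
# Hypothetical helper: save the current screen of a running VM as a PNG.
# VBOXMANAGE may be set to point at a non-standard VBoxManage binary.
vm_screenshot() {
	local vm="$1" out="$2"
	if [ -z "${vm}" ] || [ -z "${out}" ]; then
		echo "usage: vm_screenshot <vm-name-or-uuid> <file.png>" >&2
		return 2
	fi
	"${VBOXMANAGE:-VBoxManage}" controlvm "${vm}" screenshotpng "${out}"
}
```

`VBoxManage list runningvms` prints the names and UUIDs of the running VMs; one of those can then be passed in, e.g. `vm_screenshot <uuid> /tmp/screen.png`.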


Have I missed any major issues?

4. Some tasks get the status "Postponed: VM job unmanageable, restarting later." and block a slot where another work unit could be processed. This was discussed here and here. In my experience, after restarting the BOINC client, the results of such tasks are reported and new work units are downloaded and processed.
ID: 104735 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 104739 - Posted: 8 Feb 2022, 22:32:55 UTC - in response to Message 104735.  


4. Some tasks get the status "Postponed: VM job unmanageable, restarting later." and block a slot where another work unit could be processed. This was discussed here and here. In my experience, after restarting the BOINC client, the results of such tasks are reported and new work units are downloaded and processed.


What's that error in QuChem? Very similar... virtual environment unmanageable... restart later (paraphrase)
ID: 104739 · Rating: 0
xii5ku
Joined: 29 Nov 16
Posts: 22
Credit: 13,815,783
RAC: 190
Message 104802 - Posted: 12 Feb 2022, 7:52:00 UTC
Last modified: 12 Feb 2022, 7:59:07 UTC

(reposted from the AnandTech forum and the QuChemPedIA message board)

About the particular problem of tasks which want to run endlessly while consuming very little CPU time: I haven't looked deeply enough to find the cause, let alone a fix, but I have at least automated the only currently known workaround, which is to abort these tasks.

I am using the following script which periodically checks for the presence of tasks with CPU time << elapsed time and aborts these. The script interpreter is 'bash', hence it is not entirely straightforward to run on Windows. Cygwin should work, WSL might work. (I am only running Linux myself. You could also run the script on a Linux box and let it control Windows hosts.) Furthermore, the script requires a fairly recent version of 'boinccmd'. I don't know precisely how recent, but 7.16.17 works, 7.16.6 does not work. (The version of the boinc clients which are controlled by the script does not matter.)

#!/bin/bash

# Edit this:
#    a list of hosts, each optionally with GUI port number appended
#    (may be just a single host, or dozens of hosts)
hosts=(
	"localhost"
	"computer_a"
	"computer_b:31420"
)

# Edit this:
#    the password from gui_rpc_auth.cfg
#    This script expects the same password on all hosts.
#    Can be set to "" if you have empty gui_rpc_auth.cfg's.
password="$(cat /var/lib/boinc/gui_rpc_auth.cfg)"

# Edit this if you want to apply this to a different project.
project_url="https://boinc.bakerlab.org/rosetta/"

# Change this from "abort" to "suspend" if you prefer.
task_op="abort"

# Until a task has been executing for some time, other task stats
# may still be imprecise.  The script therefore does not touch any
# tasks which haven't been executing for at least this many seconds.
# You can use integer numbers here, but not floating point numbers.
# E.g.: 5 * 60 for 5 minutes.
min_elapsed_time=$((5 * 60))

# After tasks were aborted, boinc-client may cease to request
# new work due to "Communication deferred". To avoid this, should a
# project update be forced after one or more tasks were aborted?
# Set to 1 for yes, 0 for no.
force_project_update=1

# Loop intervals.
# You probably don't need to edit these.
check_every_n_minutes=10
timestamp_every_n_minutes=120

# That's it; there is probably no need to edit anything from here on.
delay=$((${check_every_n_minutes}*60/${#hosts[*]}+1))
ts=${timestamp_every_n_minutes}

echo "Monitoring ${hosts[*]}."
for ((;;))
do
	(( (ts += check_every_n_minutes) >= timestamp_every_n_minutes )) && { date; ts=0; }

	for host in "${hosts[@]}"
	do
		# Edit this if you run on Cygwin:
		#    boinccmd="/cygdrive/c/Program*Files/BOINC/boinccmd --host ${host} --passwd ${password}"
		if [ -n "${password}" ]
		then
			boinccmd="boinccmd --host ${host} --passwd ${password}"
		else
			boinccmd="boinccmd --host ${host}"
		fi

		tasks=$(${boinccmd} --get_tasks) || { sleep ${delay}; continue; }

		unset name url state ett cct
		while read -r line
		do
			case ${line} in
				[1-9]* )                 i=${line%)*};;
				"name: "* )              name[$i]=${line#*"name: "};;
				"project URL: "* )       url[$i]=${line#*"project URL: "};;
				"active_task_state: "* ) state[$i]=${line#*"active_task_state: "};;
				"elapsed task time: "* ) tmp=${line#*"elapsed task time: "}; ett[$i]=${tmp%.*};;
				"current CPU time: "* )  tmp=${line#*"current CPU time: "}; cct[$i]=${tmp%.*};;
			esac
		done <<< "${tasks}"

		n=0
		for j in ${!name[*]}
		do
			# Skip tasks
			#   - which do not belong to this project,
			#   - which are not currently running,
			#   - which have been running for less than $min_elapsed_time seconds,
			#   - which have a CPU time of more than 50% of elapsed time.
			[ "${url[$j]}"   != "${project_url}" ] && continue
			[ "${state[$j]}" != "EXECUTING"      ] && continue
			e=${ett[$j]}; ((e < min_elapsed_time)) && continue
			c=${cct[$j]}; ((e < 2*c)) && continue

			# Report host, operation and task name, then elapsed and CPU time as hh:mm:ss.
			printf '%s: %s %s\t' "${host}" "${task_op}" "${name[$j]}"
			printf "(elapsed: %02d:%02d:%02d," $((e/3600)) $((e%3600/60)) $((e%60))
			printf " CPU: %02d:%02d:%02d)\n"   $((c/3600)) $((c%3600/60)) $((c%60))
			${boinccmd} --task "${project_url}" "${name[$j]}" "${task_op}"
			((n++))
		done

		((force_project_update && n)) && { sleep 1; ${boinccmd} --project "${project_url}" update; }

		sleep ${delay}
	done
done


One thing to keep in mind though is that Rosetta@home configures the workunits with "max # of error/total/success tasks" = 1, 2, 1 which is rather low. That is, one task of a workunit might fail, but the next replica needs to succeed, otherwise the whole workunit fails. However, whenever I checked on the workunits of which I aborted a task of this 'neverending; little CPU time' kind, the replica task was eventually finished successfully by the wingman. That is, the chance that the replica errors out is luckily rather low.
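Coming back to the skip rules in the script above: they reduce to one integer test per task. As a minimal sketch for trying out thresholds without touching a client, the function below applies the same two conditions (elapsed time of at least `min_elapsed_time`, and CPU time of at most half the elapsed time). The name `is_stalled` is made up for this example and is not part of the script.

```bash
#!/bin/bash
# is_stalled ELAPSED CPU [MIN_ELAPSED]
# All arguments are whole seconds; MIN_ELAPSED defaults to 300 (5 minutes).
# Succeeds (exit status 0) if the task would be aborted by the script's rules:
# it has run for at least MIN_ELAPSED and used at most 50% of that as CPU time.
is_stalled() {
	local e="$1" c="$2" min="${3:-300}"
	(( e >= min && e >= 2 * c ))
}

# Example:
#   is_stalled 3600 12   && echo "would be aborted"
#   is_stalled 3600 3500 || echo "left alone"
```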
ID: 104802 · Rating: 0
.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 104809 - Posted: 12 Feb 2022, 22:14:18 UTC

'neverending; little CPU time'

I refer to them as `zombie` tasks, and the `kill` command seems a fun way to deal with them :-)
OK, I abort them as normal in BM; I read that joke in a magazine.
ID: 104809 · Rating: 0
xii5ku
Joined: 29 Nov 16
Posts: 22
Credit: 13,815,783
RAC: 190
Message 104818 - Posted: 14 Feb 2022, 17:49:45 UTC - in response to Message 104802.  

xii5ku wrote:
the script requires a fairly recent version of 'boinccmd'. I don't know precisely how recent, but 7.16.17 works, 7.16.6 does not work. (The version of the boinc clients which are controlled by the script does not matter.)
PS, you can tell whether your boinccmd is recent enough for the script by looking at the output of the --get_tasks call (towards any client which has one or more tasks in progress). If there is a line with "elapsed task time:" for each task, it will work.
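That check can itself be scripted. A small sketch, assuming the --get_tasks output looks as described above; the function name `has_elapsed_time` is made up for this example, and it reads from standard input so it can also be tried on saved output.

```bash
#!/bin/bash
# Succeeds if the boinccmd --get_tasks output on stdin contains at least
# one "elapsed task time:" line.
has_elapsed_time() {
	grep -q "elapsed task time:"
}

# Example (needs a client with at least one task in progress):
#   boinccmd --get_tasks | has_elapsed_time && echo "boinccmd is recent enough"
```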
ID: 104818 · Rating: 0
gbayler
Joined: 10 Apr 20
Posts: 14
Credit: 3,069,484
RAC: 0
Message 104819 - Posted: 14 Feb 2022, 18:08:17 UTC - in response to Message 104818.  

xii5ku wrote:
the script requires a fairly recent version of 'boinccmd'. I don't know precisely how recent, but 7.16.17 works, 7.16.6 does not work. (The version of the boinc clients which are controlled by the script does not matter.)
PS, you can tell whether your boinccmd is recent enough for the script by looking at the output of the --get_tasks call (towards any client which has one or more tasks in progress). If there is a line with "elapsed task time:" for each task, it will work.


I saw that too! According to issue #3463, "boinccmd --get_tasks is missing elapsed time", the problem was fixed in version 7.16.11.
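For scripts that would rather test the version number than the output, here is a hedged sketch of a dotted-version comparison. It assumes GNU `sort` with the `-V` (version sort) option; `version_ge` is a hypothetical helper, and how the installed boinccmd version string is obtained is left open.

```bash
#!/bin/bash
# version_ge A B: succeeds if dotted version A is greater than or equal to B.
# Uses GNU sort -V so that 7.16.6 sorts before 7.16.11, which a plain
# string comparison would get wrong.
version_ge() {
	[ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example:
#   version_ge 7.16.17 7.16.11 && echo "new enough"
#   version_ge 7.16.6  7.16.11 || echo "too old"
```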

Something else: is there a way to make a sticky post here? Otherwise, I have little hope that this thread will do what the thread starter/original poster @dcdc intended:
I thought it would be a good idea to create a post listing the issues with the VirtualBox tasks, which can then be updated as/if they get fixed. This isn't a thread to list the details of the issues - just to link to the wider discussions elsewhere:
ID: 104819 · Rating: 0
xii5ku
Joined: 29 Nov 16
Posts: 22
Credit: 13,815,783
RAC: 190
Message 105061 - Posted: 20 Feb 2022, 18:07:48 UTC

For what it's worth, the issue of "postponed" tasks and the issue of "infinite no-CPU-usage" tasks are both present on a computer with slots directory in a RAM disk (and plenty of free RAM, swap space disabled). I.e. disk latency is not the problem, as far as live task data are concerned.
ID: 105061 · Rating: 0
gbayler
Joined: 10 Apr 20
Posts: 14
Credit: 3,069,484
RAC: 0
Message 105073 - Posted: 20 Feb 2022, 20:29:01 UTC

I have compiled the issues mentioned in this thread into a Google Sheet: https://docs.google.com/spreadsheets/d/1lBP27MYx2RH9PYuweMoSwOLvmIaoqI77Q0_gC34e-Z0/edit?usp=sharing
The idea is to make it simpler for everybody to get an overview of the open issues.
Since the forum is spammed from time to time, the sheet's access rights are "comment only", that is, you cannot directly edit the sheet. If you have some additions, just let me know! I'll keep an eye on this thread anyway and update the sheet from time to time.
ID: 105073 · Rating: 0
dcdc
Joined: 3 Nov 05
Posts: 1831
Credit: 119,573,201
RAC: 8,994
Message 105114 - Posted: 21 Feb 2022, 22:47:56 UTC - in response to Message 105073.  

Good plan. Hopefully we can tick some of them off at some point!
ID: 105114 · Rating: 0
4J2TqEp9pPkmvLkFuy8PL3QqQrvy
Joined: 16 Aug 10
Posts: 6
Credit: 13,357,453
RAC: 10,194
Message 105118 - Posted: 22 Feb 2022, 8:01:22 UTC - in response to Message 104693.  
Last modified: 22 Feb 2022, 8:06:22 UTC


    Tasks sometimes get stuck occupying a CPU slot indefinitely, until the deadline.
    Tasks occupy much more RAM: on a Ryzen 5900X, 64 GB of RAM is not enough to utilize the entire host.
    KVM might not be available on Linux because another hypervisor is using it; this makes the tasks extremely slow (about 20 to 50 times slower).
    The heavy I/O that comes with the sudden downloads and start-up of the VMs causes Linux systems with NVMe SSDs to become unresponsive for half a minute at a time (this is due to polling vs. interrupt-based I/O).
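For the KVM point, a rough way to see whether another hypervisor currently holds KVM is to look at the use count of the vendor module. This is only a sketch under assumptions: it takes the third column of /proc/modules as the module's use count and treats a non-zero count on `kvm_intel` or `kvm_amd` as "in use". The function name `kvm_in_use` is made up, and it reads from standard input so it can be tried on saved text.

```bash
#!/bin/bash
# kvm_in_use: reads /proc/modules-style text on stdin and succeeds if
# kvm_intel or kvm_amd is listed with a non-zero use count (third column).
kvm_in_use() {
	awk '($1 == "kvm_intel" || $1 == "kvm_amd") && $3 > 0 { found = 1 }
	     END { exit !found }'
}

# Example:
#   kvm_in_use < /proc/modules && echo "KVM appears to be in use by another hypervisor"
```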



I think it is absolutely undoubtedly necessary we get an option to disable vbox tasks in the computing preferences menu and will abort all vbox tasks until then (sorry).

I'm contemplating writing a script that will only ask for new tasks once the scheduler shows regular tasks are available as well.

ID: 105118 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1679
Credit: 17,794,987
RAC: 22,743
Message 105119 - Posted: 22 Feb 2022, 8:08:28 UTC - in response to Message 105118.  
Last modified: 22 Feb 2022, 8:11:11 UTC

I think it is absolutely undoubtedly necessary we get an option to disable vbox tasks in the computing preferences menu and will abort all vbox tasks until then (sorry).
You can do that now per machine.
Go to your account page, Computing and Credit, Computers on this account, click on View, click on Details for the computer you are after; down the bottom somewhere (I think it is) should be a Skip button.
No more Python work.
Grant
Darwin NT
ID: 105119 · Rating: 0
4J2TqEp9pPkmvLkFuy8PL3QqQrvy
Joined: 16 Aug 10
Posts: 6
Credit: 13,357,453
RAC: 10,194
Message 105121 - Posted: 22 Feb 2022, 8:20:33 UTC - in response to Message 105119.  
Last modified: 22 Feb 2022, 8:21:04 UTC

Go to your account page, Computing and Credit, Computers on this account, click on View, click on Details for the computer you are after; down the bottom somewhere (I think it is) should be a Skip button.
No more Python work.


Thank you so much. The last time I checked the board for this, in October, it was not possible. I am very happy with this new feature :)
ID: 105121 · Rating: 0
kotenok2000
Joined: 22 Feb 11
Posts: 259
Credit: 489,025
RAC: 575
Message 105160 - Posted: 23 Feb 2022, 11:08:24 UTC - in response to Message 105121.  

Solution for postponed vbox tasks: install vbox 5.2.44
ID: 105160 · Rating: 0
xii5ku
Joined: 29 Nov 16
Posts: 22
Credit: 13,815,783
RAC: 190
Message 105161 - Posted: 23 Feb 2022, 11:40:12 UTC - in response to Message 105160.  
Last modified: 23 Feb 2022, 11:41:25 UTC

kotenok2000 wrote:
Solution for postponed vbox tasks: install vbox 5.2.44
Is it known whether this only reduces, or actually completely eliminates, the occurrence of "postponed" tasks?
On Windows? On Linux?

(Not asking for myself. I am accepting out-of-tree kernel drivers only in versions which are managed by the respective Linux distributor. In my case, this limits me to VirtualBox 6.1. On those of my computers which are used for relevant purposes besides distributed computing, I am not accepting out-of-tree kernel drivers at all. --- I suppose that many other Linux users likewise stick with software versions which are distro-managed.)
ID: 105161 · Rating: 0
computezrmle
Joined: 9 Dec 11
Posts: 63
Credit: 9,680,103
RAC: 0
Message 105162 - Posted: 23 Feb 2022, 12:53:45 UTC - in response to Message 105161.  

Even recent vboxwrapper versions on Windows do not (yet) support the COM interface version used by VirtualBox 6.1.
Hence, BOINC's download page provides both (with/without COM):
https://boinc.berkeley.edu/trac/wiki/VboxApps#Premadevboxwrapperexecutables

This results in the suggestion to use VirtualBox 5.2.44 on Windows which is supported by vboxwrapper.


Non-Windows versions of vboxwrapper always use the plain vboxmanage interface.
My personal experience with them is that the "Postponed ..." issue depends on the vboxwrapper sent by the projects.
Some versions may be compiled using well meant compiler flags that worsen the performance under heavier load.
Since I use a self compiled vboxwrapper that issue disappeared.


In the end, it's the job of the project team to create an app_version that works fine.
ID: 105162 · Rating: 0
kotenok2000
Joined: 22 Feb 11
Posts: 259
Credit: 489,025
RAC: 575
Message 105165 - Posted: 23 Feb 2022, 14:30:51 UTC - in response to Message 105162.  
Last modified: 23 Feb 2022, 14:35:16 UTC

On Windows, the system begins lagging if the number of VirtualBox 5.2.44 VMs is large enough.
With the latest version, the wrapper loses connection immediately. With 5.2.44, VirtualBox continues working.
ID: 105165 · Rating: 0
tullio
Joined: 10 May 20
Posts: 63
Credit: 630,125
RAC: 0
Message 105171 - Posted: 23 Feb 2022, 15:45:44 UTC

I am using VirtualBox 6.1.32 on three projects: this one, LHC@home (Atlas@home, CMS@home, Theory@home) and QuChemPedIA@home, all on Windows hosts, and I find no problem. Recently a few Rosetta 4.20 tasks failed; no Rosetta Python task ever failed.
Tullio
ID: 105171 · Rating: 0
©2024 University of Washington
https://www.bakerlab.org