Message boards : Number crunching : Summary of issues with VirtualBox tasks
Author | Message |
---|---|
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 6,993 |
Hey everyone, I thought it would be a good idea to create a post listing the issues with the VirtualBox tasks, which can then be updated as/if they get fixed. This isn't a thread to list the details of the issues - just to link to the wider discussions elsewhere: 1. Sometimes tasks don't start. They sit there at 0% with no time used, but show as "Running" in BOINC Manager. This tends to happen in batches for me and happens on multiple machines. I don't think there is a thread discussing this. Restarting BOINC Manager fixes it. 2. Some tasks never end. The % keeps climbing but they have to be aborted. The record I've noticed is 4 days of CPU time. The error on the VBox screen is always the same, but might be misleading (Spectre error). Thread here: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14897 3. Some machines cannot run Rosetta VirtualBox tasks due to the Intel MKL (Math Kernel Library) fatal error. I would guess this affects something like 20% of machines, including servers. This is not due to virtualisation extensions being disabled, as other VBox projects work fine. Thread here: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14886&postid=104574 Of course there are lots of other issues with VirtualBox tasks, like the disk space requirements and the volume of disk writes, which are not technically problems, but do have a significant impact on the amount of processing available to the project. Have I missed any major issues? |
Jordan Toth Send message Joined: 19 Dec 16 Posts: 6 Credit: 172,398 RAC: 0 |
I can't install Virtualbox - it states it's not compatible with my iMac, do I need to have it installed in order to run Rosetta@home? I haven't gotten any work for my computer to do. |
Falconet Send message Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 2,238 |
I can't install Virtualbox - it states it's not compatible with my iMac, do I need to have it installed in order to run Rosetta@home? I haven't gotten any work for my computer to do. Virtualbox is available for Mac though: https://www.oracle.com/virtualization/technologies/vm/downloads/virtualbox-downloads.html If you don't install Virtualbox, you will not get the Python tasks. You will be limited to the standard Rosetta 4.20 tasks which are quite rare these days. |
gbayler Send message Joined: 10 Apr 20 Posts: 14 Credit: 3,069,484 RAC: 0 |
I observed sort of a combination of both problems, but maybe it is something different altogether. In my case, VirtualBox tasks seem to start in the BOINC Manager (Status: "Running"), when actually the VM that belongs to that task hangs while booting. The progress indicated in the BOINC Manager asymptotically approaches 100%, but never reaches it. These tasks run until they are manually aborted. They don't consume much CPU time. I observed the problem on both Linux and Windows. On Windows, I checked the VBox screen: in one case it was completely empty, in the other case it showed the error message "Couldn't copy file: fwrite() failed". I have not seen the Spectre error yet. On Linux I haven't yet figured out how to check the VBox screen. @Jim1348 described such tasks in the forum as "0 CPU-tasks". I have written a watchdog script to abort such tasks as soon as possible. This is a good workaround for me, though of course it would be better if this problem were fixed.
4. Some tasks get the status "Postponed: VM job unmanageable, restarting later." and block a slot where another work unit could be processed. This was discussed here and here. In my experience, after restarting the BOINC client, the results of such tasks are reported and new work units are downloaded and processed. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
What's that error in QuChem? Very similar... virtual environment unmanageable... restart later (paraphrase) |
xii5ku Send message Joined: 29 Nov 16 Posts: 22 Credit: 13,815,783 RAC: 231 |
(reposted from the AnandTech forum and the QuChemPedIA message board) About the particular problem of tasks which want to run endlessly, consuming very little CPU time doing so: While I haven't looked deeply enough to find the cause, let alone a fix, I at least automated the only currently known workaround, which is to abort these tasks. I am using the following script which periodically checks for the presence of tasks with CPU time << elapsed time and aborts these. The script interpreter is 'bash', hence it is not entirely straightforward to run on Windows. Cygwin should work, WSL might work. (I am only running Linux myself. You could also run the script on a Linux box and let it control Windows hosts.) Furthermore, the script requires a fairly recent version of 'boinccmd'. I don't know precisely how recent, but 7.16.17 works, 7.16.6 does not work. (The version of the boinc clients which are controlled by the script does not matter.)

    #!/bin/bash

    # Edit this:
    # a list of hosts, each optionally with GUI port number appended
    # (may be just a single host, or dozens of hosts)
    hosts=(
        "localhost"
        "computer_a"
        "computer_b:31420"
    )

    # Edit this:
    # the password from gui_rpc_auth.cfg
    # This script expects the same password on all hosts.
    # Can be set to "" if you have empty gui_rpc_auth.cfg's.
    password="$(cat /var/lib/boinc/gui_rpc_auth.cfg)"

    # Edit this if you want to apply this to a different project.
    project_url="https://boinc.bakerlab.org/rosetta/"

    # Change this from "abort" to "suspend" if you prefer.
    task_op="abort"

    # Until a task has been executing for some time, other task stats
    # may still be imprecise.  The script therefore does not touch any
    # tasks which haven't been executing for at least this many seconds.
    # You can use integer numbers here, but not floating point numbers.
    # E.g.: 5 * 60 for 5 minutes.
    min_elapsed_time=$((5 * 60))

    # After tasks were aborted, boinc-client may cease to request
    # new work due to "Communication deferred".  To avoid this, should a
    # project update be forced after one or more tasks were aborted?
    # Set to 1 for yes, 0 for no.
    force_project_update=1

    # Loop intervals.
    # You probably don't need to edit these.
    check_every_n_minutes=10
    timestamp_every_n_minutes=120

    # That's it; there is probably no need to edit anything from here on.

    delay=$((${check_every_n_minutes}*60/${#hosts[*]}+1))
    ts=${timestamp_every_n_minutes}

    echo "Monitoring ${hosts[*]}."
    for ((;;))
    do
        (( (ts += check_every_n_minutes) >= timestamp_every_n_minutes )) && { date; ts=0; }

        for host in ${hosts[*]}
        do
            # Edit this if you run on Cygwin:
            # boinccmd="/cygdrive/c/Program*Files/BOINC/boinccmd --host ${host} --passwd ${password}"
            if [ -n "${password}" ]
            then
                boinccmd="boinccmd --host ${host} --passwd ${password}"
            else
                boinccmd="boinccmd --host ${host}"
            fi

            tasks=$(${boinccmd} --get_tasks) || { sleep ${delay}; continue; }

            # Parse the per-task fields into arrays indexed by task number.
            unset name url state ett cct
            while read line
            do
                case ${line} in
                    [1-9]* )                 i=${line%)*};;
                    "name: "* )              name[$i]=${line#*"name: "};;
                    "project URL: "* )       url[$i]=${line#*"project URL: "};;
                    "active_task_state: "* ) state[$i]=${line#*"active_task_state: "};;
                    "elapsed task time: "* ) tmp=${line#*"elapsed task time: "}; ett[$i]=${tmp%.*};;
                    "current CPU time: "* )  tmp=${line#*"current CPU time: "}; cct[$i]=${tmp%.*};;
                esac
            done <<< "${tasks}"

            n=0
            for j in ${!name[*]}
            do
                # Skip tasks
                #   - which do not belong to this project,
                #   - which are not currently running,
                #   - which have been running for less than $min_elapsed_time seconds,
                #   - which have a CPU time of more than 50% of elapsed time.
                [ "${url[$j]}" != "${project_url}" ] && continue
                [ "${state[$j]}" != "EXECUTING" ] && continue
                e=${ett[$j]}; ((e < min_elapsed_time)) && continue
                c=${cct[$j]}; ((e < 2*c)) && continue

                printf "${host}: ${task_op} ${name[$j]}\t"
                printf "(elapsed: %02d:%02d:%02d," $((e/3600)) $((e%3600/60)) $((e%60))
                printf " CPU: %02d:%02d:%02d)\n"   $((c/3600)) $((c%3600/60)) $((c%60))

                ${boinccmd} --task "${project_url}" "${name[$j]}" "${task_op}"
                ((n++))
            done

            ((force_project_update && n)) && { sleep 1; ${boinccmd} --project "${project_url}" update; }

            sleep ${delay}
        done
    done

One thing to keep in mind though is that Rosetta@home configures the workunits with "max # of error/total/success tasks" = 1, 2, 1 which is rather low. That is, one task of a workunit might fail, but the next replica needs to succeed, otherwise the whole workunit fails. However, whenever I checked on the workunits of which I aborted a task of this 'neverending; little CPU time' kind, the replica task was eventually finished successfully by the wingman. That is, the chance that the replica errors out is luckily rather low. |
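The selection rule in the script above (leave a task alone until it has run for `min_elapsed_time` seconds, then act on it only if its CPU time is at most half of its elapsed time) can be isolated into a small function for experimenting with thresholds. This is a minimal sketch, not part of the original script: the function name `is_stalled` and its argument order are hypothetical, and the default of 300 seconds simply mirrors the script's `min_elapsed_time`.

```shell
#!/bin/bash

# Hypothetical helper, not part of the original script.
# Usage: is_stalled ELAPSED_SECONDS CPU_SECONDS [MIN_ELAPSED_SECONDS]
# Succeeds (exit status 0) if the task looks stalled, i.e. it has been
# executing for at least MIN_ELAPSED_SECONDS and its CPU time is at most
# 50% of its elapsed time.  All arguments must be integers.
is_stalled() {
    local elapsed=$1 cpu=$2 min_elapsed=${3:-300}

    # Too young: task stats may still be imprecise, so do not touch it.
    (( elapsed >= min_elapsed )) || return 1

    # Same test as the script's "((e < 2*c)) && continue", inverted.
    (( elapsed >= 2 * cpu ))
}
```

For example, `is_stalled 3600 40 && echo "would abort"` reports a task with one hour elapsed and 40 seconds of CPU time, while a task with 3600 seconds elapsed and 3000 seconds of CPU time is left alone.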
.clair. Send message Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0 |
'neverending; little CPU time' I refer to them as `zombie` tasks, and the `kill` command seems a fun way to deal with them :-) OK, I abort them as normal in BM; I read that joke in a magazine. |
xii5ku Send message Joined: 29 Nov 16 Posts: 22 Credit: 13,815,783 RAC: 231 |
xii5ku wrote: the script requires a fairly recent version of 'boinccmd'. I don't know precisely how recent, but 7.16.17 works, 7.16.6 does not work. (The version of the boinc clients which are controlled by the script does not matter.) PS, you can tell whether your boinccmd is recent enough for the script by looking at the output of the --get_tasks call (against any client which has one or more tasks in progress). If there is a line with "elapsed task time:" for each task, it will work. |
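The check described here can be automated. The following is a minimal sketch, assuming that each task in the `--get_tasks` output has one line starting with `name: ` (the same prefix the watchdog script parses); the function name `has_elapsed_time` is hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: reads `boinccmd --get_tasks` output on stdin and
# succeeds if there is at least one task and every task entry comes with
# an "elapsed task time:" line.
has_elapsed_time() {
    local text names elapsed
    text=$(cat)

    # grep -c prints 0 but exits with status 1 when nothing matches,
    # hence the "|| true".  "WU name:" lines are deliberately not counted.
    names=$(grep -c '^ *name: ' <<< "${text}" || true)
    elapsed=$(grep -c '^ *elapsed task time: ' <<< "${text}" || true)

    (( names > 0 && names == elapsed ))
}
```

A possible invocation is `boinccmd --get_tasks | has_elapsed_time && echo "boinccmd is recent enough"`, run against a client that has at least one task in progress.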
gbayler Send message Joined: 10 Apr 20 Posts: 14 Credit: 3,069,484 RAC: 0 |
xii5ku wrote: the script requires a fairly recent version of 'boinccmd'. I don't know precisely how recent, but 7.16.17 works, 7.16.6 does not work. (The version of the boinc clients which are controlled by the script does not matter.) PS, you can tell whether your boinccmd is recent enough for the script by looking at the output of the --get_tasks call (against any client which has one or more tasks in progress). If there is a line with "elapsed task time:" for each task, it will work. I saw that too! According to "boinccmd --get_tasks is missing elapsed time" #3463, the issue was solved with version 7.16.11. Something else: is there a way to make a sticky post here? Otherwise, I have little hope that this thread will do what the thread starter/original poster @dcdc intended: I thought it would be a good idea to create a post listing the issues with the VirtualBox tasks, which can then be updated as/if they get fixed. This isn't a thread to list the details of the issues - just to link to the wider discussions elsewhere: |
xii5ku Send message Joined: 29 Nov 16 Posts: 22 Credit: 13,815,783 RAC: 231 |
For what it's worth, the issue of "postponed" tasks and the issue of "infinite no-CPU-usage" tasks are both present on a computer with slots directory in a RAM disk (and plenty of free RAM, swap space disabled). I.e. disk latency is not the problem, as far as live task data are concerned. |
gbayler Send message Joined: 10 Apr 20 Posts: 14 Credit: 3,069,484 RAC: 0 |
I have compiled the issues mentioned in this thread into a Google Sheet: https://docs.google.com/spreadsheets/d/1lBP27MYx2RH9PYuweMoSwOLvmIaoqI77Q0_gC34e-Z0/edit?usp=sharing The idea is to make it simpler for everybody to get an overview of the open issues. Since the forum is spammed from time to time, the sheet's access rights are "comment only", that is, you cannot directly edit the sheet. If you have some additions, just let me know! I'll keep an eye on this thread anyway and update the sheet from time to time. |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 6,993 |
Good plan. Hopefully we can tick some of them off at some point! |
4J2TqEp9pPkmvLkFuy8PL3QqQrvy Send message Joined: 16 Aug 10 Posts: 6 Credit: 13,261,645 RAC: 3,094 |
Tasks sometimes get stuck occupying a CPU slot indefinitely, until the deadline. Tasks occupy much more RAM: on a Ryzen 5900X, 64 GB of RAM is not enough to utilize the entire host. KVM might not be available on Linux because it is being used by another hypervisor; this makes the tasks extremely slow (about 20 to 50 times slower). The heavy I/O operations that come with the sudden downloads and starts of the VMs cause Linux systems with NVMe SSDs to become unresponsive for half a minute at a time (this is due to polling vs. interrupt-based I/O).
|
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1675 Credit: 17,697,747 RAC: 19,279 |
I think it is absolutely undoubtedly necessary we get an option to disable vbox tasks in the computing preferences menu and will abort all vbox tasks until then (sorry). You can do that now per machine. Go to your account page, Computing and Credit, Computers on this account, click on View, click on Details for the computer you are after, and somewhere down the bottom (I think it is) there should be a Skip button. No more Python work. Grant Darwin NT |
4J2TqEp9pPkmvLkFuy8PL3QqQrvy Send message Joined: 16 Aug 10 Posts: 6 Credit: 13,261,645 RAC: 3,094 |
Go to your account page, Computing and Credit, Computers on this account, click on View, click on Details for the computer you are after, and somewhere down the bottom (I think it is) there should be a Skip button. Thank you so much. The last time I went to the board for this, in October, this was not possible. I am very happy with this new feature :) |
kotenok2000 Send message Joined: 22 Feb 11 Posts: 259 Credit: 483,503 RAC: 74 |
Solution for postponed vbox tasks: install vbox 5.2.44 |
xii5ku Send message Joined: 29 Nov 16 Posts: 22 Credit: 13,815,783 RAC: 231 |
kotenok2000 wrote: Solution for postponed vbox tasks: install vbox 5.2.44 Is it known whether this only reduces, or actually completely eliminates, the occurrence of "postponed" tasks? On Windows? On Linux? (Not asking for myself. I am accepting out-of-tree kernel drivers only in versions which are managed by the respective Linux distributor. In my case, this limits me to VirtualBox 6.1. On those of my computers which are used for relevant purposes besides distributed computing, I am not accepting out-of-tree kernel drivers at all. --- I suppose that many other Linux users likewise stick with software versions which are distro-managed.) |
computezrmle Send message Joined: 9 Dec 11 Posts: 63 Credit: 9,680,103 RAC: 0 |
Even recent vboxwrapper versions on Windows do not (yet) support the COM interface version used by VirtualBox 6.1. Hence, BOINC's download page provides both (with/without COM): https://boinc.berkeley.edu/trac/wiki/VboxApps#Premadevboxwrapperexecutables This results in the suggestion to use VirtualBox 5.2.44 on Windows, which is supported by vboxwrapper. Non-Windows versions of vboxwrapper always use the plain vboxmanage interface. My personal experience with them is that the "Postponed ..." issue depends on the vboxwrapper sent by the projects. Some versions may be compiled using well-meant compiler flags that worsen the performance under heavier load. Since I started using a self-compiled vboxwrapper, that issue has disappeared. In the end it's the job of the project team to create an app_version that works fine. |
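To see at a glance whether a host already runs a VirtualBox from the 5.2 series suggested here for Windows, the installed version can be compared against that series. This is a minimal sketch, assuming that `VBoxManage --version` prints a string that starts with the dotted version number (for example `6.1.32r149290`); the function name `is_vbox_5_2` is hypothetical, and on Windows the sketch needs a bash environment such as Cygwin.

```shell
#!/bin/bash

# Hypothetical helper: takes a version string as printed by
# `VBoxManage --version` and succeeds if it belongs to the 5.2 series.
is_vbox_5_2() {
    local version=$1

    # Pattern match on the "5.2." prefix; "5.20.x" would not match.
    [[ ${version} == 5.2.* ]]
}
```

A possible invocation is `is_vbox_5_2 "$(VBoxManage --version)" || echo "not a 5.2.x VirtualBox"`.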
kotenok2000 Send message Joined: 22 Feb 11 Posts: 259 Credit: 483,503 RAC: 74 |
On Windows, the system begins lagging if the number of VirtualBox 5.2.44 VMs is large enough. With the latest version, the wrapper loses the connection immediately. With 5.2.44, VirtualBox continues working. |
tullio Send message Joined: 10 May 20 Posts: 63 Credit: 630,125 RAC: 0 |
I am using VirtualBox 6.1.32 on three projects, this one, LHC@home (Atlas@home, CMS@home, Theory@home) and QuChemPedIA@home, all on Windows hosts, and I find no problem. Recently a few Rosetta 4.20 tasks failed, but no rosetta python task ever failed. Tullio |
©2024 University of Washington
https://www.bakerlab.org