Some machines will not run VirtualBox tasks

Message boards : Number crunching : Some machines will not run VirtualBox tasks

Profile dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 117,106,572
RAC: 82,821
Message 104470 - Posted: 24 Jan 2022, 7:54:31 UTC

I've uploaded a screenshot to the VirtualBox forum thread.
Profile dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 117,106,572
RAC: 82,821
Message 104471 - Posted: 24 Jan 2022, 12:54:34 UTC
Last modified: 24 Jan 2022, 13:47:04 UTC

New info:

I hadn't appreciated that I could just look at the VirtualBox script that is running by hitting "Show". On the one machine I've looked at so far, it is failing on "mkl_lapack_ps_mc3_dsytrf_l_small", if that means anything to anyone?

Screenshot:
https://ibb.co/0JX2LL6

Maybe this is relevant:
https://conda.io/projects/conda/en/latest/user-guide/troubleshooting.html#numpy-mkl-library-load-failed
computezrmle

Joined: 9 Dec 11
Posts: 63
Credit: 9,680,103
RAC: 0
Message 104472 - Posted: 24 Jan 2022, 14:07:15 UTC - in response to Message 104471.  

dcdc wrote:
Screenshot:
https://ibb.co/0JX2LL6

This is a message from inside your VM.
The message clearly shows that the Rosetta team did not correctly configure their software environment.
No volunteer can do anything to solve this.


A few days ago I wrote this:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14886&postid=104332
What can be wrong:
.
.
.
- some other weird things causing the VM to crash


How to check the latter:
- Open the VirtualBox GUI and select the VM you want to check.
In case of a crash the small preview might show a hint.
Profile dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 117,106,572
RAC: 82,821
Message 104473 - Posted: 24 Jan 2022, 14:15:26 UTC - in response to Message 104472.  

My fault - I thought you meant the info given in "Details" rather than the preview. It looks like we might be able to do something about it if #1 in the conda.io link works. But I agree, fundamentally it's a project issue to solve.


computezrmle

Joined: 9 Dec 11
Posts: 63
Credit: 9,680,103
RAC: 0
Message 104474 - Posted: 24 Jan 2022, 14:38:59 UTC - in response to Message 104473.  

dcdc wrote:
... we might be able to do something about it ...

No way.
You (we) would have to patch something inside the VM.
The Rosetta team has to prepare a new vdi file where all required changes and settings are included.


Beside that:
The link to conda.io would not even help them since it describes how to solve an error regarding Windows.
The VM runs Debian Linux.
Profile dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 117,106,572
RAC: 82,821
Message 104475 - Posted: 24 Jan 2022, 15:07:46 UTC - in response to Message 104474.  

Yeah good point.
Profile dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 117,106,572
RAC: 82,821
Message 104481 - Posted: 25 Jan 2022, 10:23:18 UTC
Last modified: 25 Jan 2022, 10:24:31 UTC

I accept that this is unlikely to be the cause, but I'll point it out anyway just in case it is useful. Something that hadn't occurred to me before is that for me this is only affecting older Intel CPUs. My affected machines are:

Dual CPU Nehalem (1st gen Core) Xeon L5640
Haswell (4th gen Core) Pentium G3220
Skylake (6th gen Core) Pentium G4500

I have other Intel machines of similar age that are fine though. So I'm wondering if it might be due to something like a microcode update, or lack of one, on those CPUs causing some part of the LAPACK module to crash. I don't know enough about microcode updates to know whether that might be the issue or not, but I believe they're provided at boot by the BIOS first, and the OS second, so aren't persistent on the CPU.

What is confusing is that the error persists after installing multiple OSes on the same machine. It could be coincidence, but that seems unlikely.
computezrmle

Joined: 9 Dec 11
Posts: 63
Credit: 9,680,103
RAC: 0
Message 104482 - Posted: 25 Jan 2022, 11:25:05 UTC - in response to Message 104481.  

dcdc wrote:
... we might be able to do something ...

What recently came to my mind:
The vdi file might be corrupted for some reason (it's really large).
To get a fresh one from the project server:
- Shut down BOINC
- Delete the vdi file
- Restart BOINC (the client usually checks if all required files are present; if not -> download)
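
The refresh steps above could be scripted roughly as follows. This is only a sketch: the helper names `find_vdi_files` and `remove_vdi_files` are made up for illustration, the BOINC project directory differs per OS and install and has to be supplied by you, and the BOINC client must already be shut down. By default it only lists what it would delete.

```python
from pathlib import Path


def find_vdi_files(project_dir):
    """Return all .vdi files below project_dir, sorted for stable output."""
    return sorted(p for p in Path(project_dir).rglob("*.vdi") if p.is_file())


def remove_vdi_files(project_dir, dry_run=True):
    """List (dry_run=True) or delete the .vdi files below project_dir.

    Assumption: the BOINC client is not running while this is called.
    """
    found = find_vdi_files(project_dir)
    if not dry_run:
        for path in found:
            path.unlink()
    return found
```

Calling `remove_vdi_files(some_dir)` first and checking the returned paths before repeating with `dry_run=False` keeps the step reversible until you are sure the right file is targeted. After restarting BOINC, the client should fetch a fresh copy as described above.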


Regarding microcode etc.
The "BIOS" presented to the guest is not the hardware BIOS of the box.
Instead it's a special VirtualBox BIOS.

Microcode updates are done by the installed guest OS.
There's no difference between a real box and a virtual box.

Hence, you get what the guest OS has implemented.
Very unlikely that older CPUs are not supported (especially those from mainstream vendors).


Older systems sometimes suffer from dust that prevents the heat from dissipating (though rarely on 3 systems concurrently).
Profile dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 117,106,572
RAC: 82,821
Message 104483 - Posted: 25 Jan 2022, 11:39:35 UTC - in response to Message 104482.  
Last modified: 25 Jan 2022, 11:59:30 UTC

Definitely not a heat issue on my machines. They'll run Prime95, memtest and other BOINC projects including LHC all day at low temps (CPU ~45°C - it's cold in there at the moment!).

Regarding the VDI file, I did a binary comparison of that against a good machine and most of the other files in the slots folder. The vdi files were identical. There were plenty of other differences but nothing that looked obviously suspicious.
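
For anyone wanting to repeat that comparison without copying a multi-gigabyte image between machines, one option is to hash the file on each machine and compare the digests. A minimal sketch, assuming Python is available on both hosts; `sha256_of_file` is a hypothetical helper name, not part of BOINC or VirtualBox.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so a large disk image never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Matching digests from a working and a failing machine would point away from a corrupted download, which is consistent with the result reported here.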

Also, if it's a server issue then I wouldn't expect it to last through project resets or OS changes.

The PC name is another thing that has been persistent on my machines. I can try changing that but it seems very unlikely to be the cause.
Profile dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 117,106,572
RAC: 82,821
Message 104501 - Posted: 26 Jan 2022, 8:21:31 UTC

I have wiped Ubuntu and installed a fresh copy of Windows 10 (again) on my PowerEdge R410 server. It has completed some Rosetta 4.20 tasks successfully but all Vbox tasks have failed as expected. So that is 4 different OS installs on this machine (Win10, Ubuntu, Win11, Win10) and all have failed on Rosetta Vbox tasks but nothing else. Just to clarify, this machine will run LHC Vbox tasks successfully.

https://ibb.co/y8HbKjr

And the Vbox logs are here:
https://ufile.io/f/zityf

Next up I'm going to try:
* Resetting the BIOS settings
* adding a GPU
* Getting a PC running and then moving the hard drive to a PC that doesn't run these to confirm it's not a data issue

Any other ideas?
Sid Celery

Joined: 11 Feb 08
Posts: 2003
Credit: 38,906,845
RAC: 23,790
Message 104508 - Posted: 26 Jan 2022, 16:16:22 UTC - in response to Message 104472.  
Last modified: 26 Jan 2022, 16:16:37 UTC

computezrmle wrote:
Screenshot:
https://ibb.co/0JX2LL6

This is a message from inside your VM.
The message clearly shows that the Rosetta team did not correctly configure their software environment.
No volunteer can do anything to solve this.

I've picked this message up and sent it to the admin team to look at.
I'm not competent on this subject, so I have no idea if what you've said is correct, but if it is then hopefully they'll take a look at it.
Profile dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 117,106,572
RAC: 82,821
Message 104513 - Posted: 26 Jan 2022, 17:57:20 UTC - in response to Message 104508.  

I also sent Admin a DM. What we don't have an answer for is why it works reasonably reliably (ignoring the occasional spectre errors) on some machines but never on others, regardless of OS etc.
Sid Celery

Joined: 11 Feb 08
Posts: 2003
Credit: 38,906,845
RAC: 23,790
Message 104565 - Posted: 30 Jan 2022, 5:02:59 UTC - in response to Message 104508.  

I've had a reply. My PCs in 2 different locations are unable to access my email again, but I've managed to access it on my phone. It reads:

I haven't had much time to look into the Vbox issues in detail, but if anyone on the forums has a clear recommendation for troubleshooting, I'm all ears.
Many of the errors seem to be related to VT-x not being enabled

If you can be more explicit about the areas that need to be configured correctly, I'll pass that on again.
Profile dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 117,106,572
RAC: 82,821
Message 104574 - Posted: 30 Jan 2022, 18:16:02 UTC
Last modified: 30 Jan 2022, 18:51:34 UTC

Thanks Sid. This issue definitely isn't due to VT-x or similar not being enabled, or a bad download, unless there's a server issue. The proof of that is:

    1. On these machines the virtual machine starts, and the script starts, but always fails with the same error on the VBox screen after about 20 seconds:
    "Intel MKL FATAL ERROR: Error on loading function mkl_lapack_ps_mc3_dsytrf_l_small."
    
    See: https://ibb.co/0JX2LL6

    2. Other BOINC VBox tasks (e.g. LHC) run successfully on the affected machines.

    3. I can install and run an OS in VBox without issue. I have tried both Debian 64-bit and Ubuntu 20.04 64-bit.

    4. There are 3 apps listed here which identify whether Virtualization and VT-x are enabled:
    - https://www.ilovefreesoftware.com/05/featured/free-tools-check-hardware-virtualization-support-windows-10.html
    - None of these apps show any issues. Note that the Intel app button actually links to the Leomoon app, but you can search for the Intel app.

    5. I have tried countless OS reinstalls, firmware updates, firmware settings, stability tests etc.
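
Since the admin reply lumps these failures in with VT-x problems, it may help to sort captured VM output into buckets before reporting. The sketch below does that with plain substring checks. The MKL marker is the text quoted in point 1; the `"vt-x"` marker is only an assumption, because real VT-x messages may be worded differently, and `classify_failure` is a hypothetical helper name.

```python
MKL_MARKER = "Intel MKL FATAL ERROR"  # text quoted from the VBox screen in point 1


def classify_failure(console_text):
    """Sort captured VM output into coarse buckets using substring checks."""
    lowered = console_text.lower()
    if MKL_MARKER.lower() in lowered:
        return "mkl"
    # Assumed marker: actual VT-x related messages may use different wording.
    if "vt-x" in lowered:
        return "vt-x"
    return "other"
```

Counting the buckets across a batch of failed tasks would give the project team a clearer picture of how many errors are really virtualization-related and how many are the MKL load failure.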



So it seems it is an issue with the hardware/firmware (or maybe drivers) on the affected machines which causes the MKL library to fail. Machines of mine that are affected (about 1 in 4):
* Dell Optiplex 3040 with a dual-core Pentium G4500 Skylake 6th gen Intel CPU. Tried multiple OS's including Ubuntu, Win 10 and Win 11.
* Dual 6-core Xeon Dell R410 server, 2x Nehalem 1st gen Intel CPUs.
* Dual-core Pentium G3220 Haswell 4th gen Intel CPU.

One thing that I haven't confirmed yet is whether it is due to the hard drives being too slow. I brought the SSD home from what I thought was a working machine, to put in the Optiplex. The tasks failed as expected, but I hadn't realised that they hadn't been running on the machine I took the drive from. I will confirm that either way tomorrow. I think that's very unlikely to be the cause though.

The only thing I can think of is that maybe there's a time-out in the script and my cheap SSD isn't fast enough for it, so the file that it's looking for isn't available in time, causing the MKL library to crash. Other than that, I'm stumped. The only thing that I can think of that they all have in common is that they're a) all Intel, and b) all use DDR3. I have other machines of the same generation that are unaffected though.
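
One way to make the "what do they have in common" question more systematic is to compare CPU feature flags between machines that work and machines that fail. The sketch below assumes the hosts run Linux, where `/proc/cpuinfo` contains a `flags` line, and that you have saved that file's text from each machine. Both function names are hypothetical. Note that the guest inside VirtualBox may see a different flag set than the host, so this is a starting point rather than proof.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def flags_missing_on_failing(working, failing):
    """Flags present on every working machine and absent from every failing one.

    Both arguments are lists of flag sets, one set per machine.
    """
    if not working:
        return set()
    common_working = set.intersection(*working)
    seen_on_failing = set().union(*failing)
    return common_working - seen_on_failing
```

An empty result would suggest the difference lies elsewhere (firmware, drivers, or the VM configuration). A non-empty result would be a concrete list of instruction-set features to pass on to the project team.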

Updating the MKL and DFT-D4 versions used in the Vbox might be an easy fix though, in case it is that causing something strange that only affects certain computer hardware. DFT-D4 V3.3.0 is available now rather than the 2.5.0 being used.

D


Profile dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 117,106,572
RAC: 82,821
Message 104582 - Posted: 30 Jan 2022, 20:19:05 UTC
Last modified: 30 Jan 2022, 20:20:17 UTC

Oh yeah, also I can put these machines on Ralph to test any potential fixes.

Or I can run a virtual image if someone wants to send me one to test.
Profile Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,711,666
RAC: 514
Message 104583 - Posted: 30 Jan 2022, 20:23:56 UTC - in response to Message 104321.  

dcdc wrote:
Hi everyone

I have a new-to-me PC (Dell Optiplex 3040, Pentium dual core G4500 Skylake gen CPU, 16GB DDR3L RAM, Kingston 120GB SSD) and I've tried Windows 10, Ubuntu 20.04 and Windows 11. It will run other non-VirtualBox projects fine. I haven't tried any VB projects other than Rosetta until now, but it is currently happily running an LHC VB task on both cores at ~75% CPU utilisation so that looks good. It will download and start to run VirtualBox Rosetta tasks, but they never complete and CPU utilisation rarely rises above ~2% according to BOINCTasks which is backed up by task manager when I look at the machine.

Any ideas what might be wrong? The machine seems completely stable. It had the same issues under Ubuntu as it does under Windows. And I have another machine that behaves exactly the same - that's an old Dell dual CPU server (Poweredge R410) which I've also tried Win10 and Ubuntu on.

I tend to log in via Remote Desktop on this machine, but I didn't when it was running Ubuntu as I hadn't set that up yet.

Windows memory diagnostic hasn't detected any RAM errors, and I've tried resetting the project multiple times. I'm wondering if it might be due to:

* the SSD
* the CPU - temperatures are around ~40°C whilst it's running an LHC VB task.
* RAM - it passes the Windows Memory Diagnostic
* a setting in FW. The Leomoon app says that everything is enabled for virtualisation to work.
* Just very unlucky with the tasks it's getting?
* something else?

I'm running out of ideas! I'm tempted to move a hard drive from a working machine over to it to see if it runs the tasks in the other PC's queue.




I've gotten to 97% and then stalled out. I didn't notice it, and a day of elapsed time went by with little or no progress, so I had to kill them.
Profile dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 117,106,572
RAC: 82,821
Message 104586 - Posted: 30 Jan 2022, 20:38:43 UTC - in response to Message 104583.  
Last modified: 30 Jan 2022, 21:10:07 UTC


Greg_BE wrote:
I've gotten to 97% and then stalled out. I didn't notice it, and a day of elapsed time went by with little or no progress, so I had to kill them.


That is a different issue. If you look at the screen output in VirtualBox they will probably say something about a spectre error. This thread is for machines that never run VBox Rosetta tasks beyond the first minute and all error out due to an MKL error.

See this thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14897

It would be good if you could post in that thread if the failures you are seeing match the error I listed there.
Profile dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 117,106,572
RAC: 82,821
Message 104600 - Posted: 31 Jan 2022, 12:24:02 UTC - in response to Message 104574.  

dcdc wrote:
One thing that I haven't confirmed yet is whether it is due to the hard drives being too slow. I brought the SSD home from what I thought was a working machine, to put in the Optiplex. The tasks failed as expected, but I hadn't realised that they hadn't been running on the machine I took the drive from. I will confirm that either way tomorrow. I think that's very unlikely to be the cause though.

I have now confirmed that I can take the SSD from a machine that runs VBox tasks and put it in a machine that won't run them. The tasks then immediately fail in the non-working machine. When I put the SSD back in the original machine then the tasks start running correctly again. I think this eliminates the SSD speed, unless I guess it could be a bottleneck on the motherboard's SATA bus. That seems very unlikely.
computezrmle

Joined: 9 Dec 11
Posts: 63
Credit: 9,680,103
RAC: 0
Message 104602 - Posted: 31 Jan 2022, 13:28:29 UTC

@Sid Celery

Your efforts are nice and appreciated, but there's something I don't understand:
Sid Celery wrote:
I also sent Admin a DM
.
.
.
I've had a reply
.
.
.
If you can be more explicit ... I'll pass that on again

The admins/developers may sometimes need a bump to become aware of relevant posts.
Once this is done, why can't they directly communicate here in the forum?
The answer should be better than "no time" or "too much work".
Sid Celery

Joined: 11 Feb 08
Posts: 2003
Credit: 38,906,845
RAC: 23,790
Message 104604 - Posted: 31 Jan 2022, 13:56:44 UTC - in response to Message 104602.  

It should be, you're right. I'm not here to defend it.
On the previous occasion something came up and I pointed to a particular thread, they did post in that thread in the way you suggest.

It's been said to me before, the Rosetta project team is <very> small.
It might be better to assume there is <no> Admin team here. Rather there are only researchers, only a few of whom are tasked to look at the general admin when or if they have time - time they don't have.
Hence why they've asked for a specific pointer to the area where a correction is needed so they can get in, do something that works for people, then get back to their main job.

So when Greg regularly points out there's no-one monitoring the forum, that seems to be right. And regularly repeating the fact won't change it.
That's why I'm passing things on in the way I am. It's not ideal, but it is the reality, so I'm working with that. And trying not to make a nuisance of myself in the process.

On the plus side (and I'm prepared that it may not be a plus at all) I've discovered my own failures with Python tasks were because I hadn't turned virtualisation on at all in my BIOS.
Now I've done that and confirmed it with one of the utilities dcdc pointed me to. So I'm going to install the VirtualBox version of BOINC again and see if I'm any more successful than my own previous abortive effort.
Then I can join in much better with the miserableness



©2024 University of Washington
https://www.bakerlab.org