Message boards : Number crunching : Some machines will not run VirtualBox tasks
Author | Message |
---|---|
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 6,993 |
I've uploaded a screenshot to the Virtualbox forum thread. |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 6,993 |
New info: I hadn't appreciated that I could just look at the VirtualBox script that is running by hitting "Show". On the one machine I've looked at so far, it is failing on "mkl_lapack_ps_mc3_dsytrf_l_small", if that means anything to anyone. Screenshot: https://ibb.co/0JX2LL6 Maybe this is relevant: https://conda.io/projects/conda/en/latest/user-guide/troubleshooting.html#numpy-mkl-library-load-failed |
computezrmle Send message Joined: 9 Dec 11 Posts: 63 Credit: 9,680,103 RAC: 0 |
Screenshot: This is a message from inside your VM. The message clearly shows that the Rosetta team did not correctly configure their software environment. No volunteer can do anything to solve this. A few days ago I wrote this: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14886&postid=104332 What can be wrong: |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 6,993 |
My fault - I thought you meant the info given in "Details" rather than the preview. It looks like we might be able to do something about it if #1 in the conda.io link works. But I agree, fundamentally it's a project issue to solve.
|
computezrmle Send message Joined: 9 Dec 11 Posts: 63 Credit: 9,680,103 RAC: 0 |
"... we might be able to do something about it ..." No way. You (we) would have to patch something inside the VM. The Rosetta team has to prepare a new vdi file that includes all required changes and settings. Besides that, the link to conda.io would not even help them, since it describes how to solve an error on Windows. The VM runs Debian Linux. |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 6,993 |
Yeah good point. |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 6,993 |
I accept that this is unlikely to be the cause, but I'll point it out anyway in case it is useful. Something that hadn't occurred to me before is that, for me, this only affects older Intel CPUs. My affected machines are: dual-CPU Nehalem (1st gen Core) Xeon L5640; Haswell (4th gen Core) Pentium G3220; Skylake (6th gen Core) Pentium G4500. I have other Intel machines of similar age that are fine, though. So I'm wondering whether something like a microcode update, or the lack of one, on those CPUs is causing some part of the LAPACK module to crash. I don't know enough about microcode updates to know whether that might be the issue, but I believe they're applied at boot, by the BIOS first and the OS second, so they aren't persistent on the CPU. What confuses me is that the error persists after installing multiple OSes on the same machine. That could be coincidence, but it seems unlikely. |
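One way to compare the affected and unaffected hosts is to list which SIMD instruction sets each CPU reports. The sketch below is an illustration only and not something posted in the thread. It assumes a Linux host (or a Linux guest inside VirtualBox) where `/proc/cpuinfo` contains a `flags` line. The helper names `parse_cpu_flags` and `simd_summary` are hypothetical, and the choice of flags to report (`sse4_2`, `avx`, `avx2`) is an assumption, made because the "mc3" in the failing function name looks like a CPU-specific MKL code path.

```python
from pathlib import Path

# Flags worth comparing between a working and a failing host (an assumption).
SIMD_FLAGS = ("sse4_2", "avx", "avx2")


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'flags' line of cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_summary(flags: set[str]) -> dict[str, bool]:
    """Map each flag of interest to whether the CPU reports it."""
    return {name: name in flags for name in SIMD_FLAGS}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(simd_summary(parse_cpu_flags(cpuinfo.read_text())))
    else:
        print("No /proc/cpuinfo found; this sketch only covers Linux.")
```

Running this on one good and one bad machine and comparing the output would show whether the failing hosts share a missing instruction set, which would be a more specific lead than CPU age alone.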
computezrmle Send message Joined: 9 Dec 11 Posts: 63 Credit: 9,680,103 RAC: 0 |
"... we might be able to do something ..." What recently came to my mind: the vdi file might be corrupted for some reason (it's really large). To get a fresh one from the project server: - Shut down BOINC - Delete the vdi file - Restart BOINC (the client usually checks if all required files are present; if not -> download) Regarding microcode etc. The "BIOS" presented to the guest is not the hardware BIOS of the box. Instead it's a special VirtualBox BIOS. Microcode updates are done by the installed guest OS. There's no difference between a real box and a virtual box. Hence, you get what the guest OS has implemented. Very unlikely that older CPUs are not supported (especially those from mainstream vendors). Older systems sometimes suffer from dust which prevents getting the heat off (rarely 3 systems concurrently). |
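The "delete the vdi file" step can be sketched as a small script. This is a hedged illustration, not a tool from the project: the function names are hypothetical, the project directory must be supplied by the user (its location depends on the BOINC installation), and the script should only be run after BOINC has been shut down. It defaults to a dry run that lists the files without deleting anything.

```python
import sys
from pathlib import Path


def find_vdi_files(project_dir: Path) -> list[Path]:
    """Return all .vdi files below project_dir, sorted by path."""
    return sorted(p for p in project_dir.rglob("*.vdi") if p.is_file())


def delete_vdi_files(project_dir: Path, dry_run: bool = True) -> list[Path]:
    """List the .vdi files and delete them unless dry_run is True."""
    found = find_vdi_files(project_dir)
    if not dry_run:
        for path in found:
            path.unlink()
    return found


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: script.py <rosetta-project-dir> [--delete]")
    else:
        really_delete = "--delete" in sys.argv[2:]
        for vdi in delete_vdi_files(Path(sys.argv[1]), dry_run=not really_delete):
            print(("deleted " if really_delete else "would delete ") + str(vdi))
```

After the file is removed and BOINC is restarted, the client should notice the missing file and download it again, as described in the post above.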
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 6,993 |
"... we might be able to do something ..." Definitely not a heat issue on my machines. They'll run Prime95, memtest and other BOINC projects including LHC all day at low temps (CPU ~45°C - it's cold in there at the moment!). Regarding the VDI file, I did a binary comparison of it, and of most of the other files in the slots folder, against a good machine. The vdi files were identical. There were plenty of other differences but nothing that looked obviously suspicious. Also, if it's a server issue then I wouldn't expect it to last through project resets or OS changes. The PC name is another thing that has been persistent on my machines. I can try changing that, but it seems very unlikely to be the cause. |
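For anyone who wants to repeat the vdi comparison without copying a multi-gigabyte file between machines, one option is to compute a checksum on each host and compare the two strings. The following is a minimal sketch under that assumption. It is not the method used in the post above, and the function names are hypothetical.

```python
import hashlib
import sys
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so a large vdi file does not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(a: Path, b: Path) -> bool:
    """Compare two local files: cheap size check first, then full hashes."""
    if a.stat().st_size != b.stat().st_size:
        return False
    return sha256_of_file(a) == sha256_of_file(b)


if __name__ == "__main__":
    if len(sys.argv) == 2:
        print(sha256_of_file(Path(sys.argv[1])))
    else:
        print("usage: script.py <path-to-vdi-file>")
```

Matching digests on the good and bad machines would confirm the finding that the vdi file itself is not the difference.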
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 6,993 |
I have wiped Ubuntu and installed a fresh copy of Windows 10 (again) on my PowerEdge R410 server. It has completed some Rosetta 4.20 tasks successfully but all Vbox tasks have failed as expected. So that is 4 different OS installs on this machine (Win10, Ubuntu, Win11, Win10) and all have failed on Rosetta Vbox tasks but nothing else. Just to clarify, this machine will run LHC Vbox tasks successfully. https://ibb.co/y8HbKjr And the Vbox logs are here: https://ufile.io/f/zityf Next up I'm going to try: * Resetting the BIOS settings * adding a GPU * Getting a PC running and then moving the hard drive to a PC that doesn't run these to confirm it's not a data issue Any other ideas? |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2119 Credit: 41,179,074 RAC: 11,480 |
Screenshot: I've picked this message up and sent it to the admin team to look at. I'm not competent on this subject, so I have no idea if what you've said is correct, but if you are then hopefully they'll take a look at it. |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 6,993 |
Screenshot: I also sent Admin a DM. But what we don't have an answer for is why it works reasonably reliably (ignoring the occasional spectre errors) on some machines but never works on others, regardless of OS etc. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2119 Credit: 41,179,074 RAC: 11,480 |
Screenshot: I've had a reply, but my PCs in 2 different locations are unable to access my email again - but I've managed to access it on my phone. It reads: "I haven't had much time to look into the Vbox issues in detail, but if anyone on the forums has a clear recommendation for troubleshooting, I'm all ears." If you can be more explicit about the areas that need to be configured correctly, I'll pass that on again. |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 6,993 |
Thanks Sid. This issue definitely isn't due to VT-x or similar not being enabled, or a bad download, unless there's a server issue. The proof of that is:
1. "Intel MKL FATAL ERROR: Error on loading function mkl_lapack_ps_mc3_dsytrf_l_small." See: https://ibb.co/0JX2LL6
2. Other BOINC VBox tasks (e.g. LHC) run successfully on the affected machines.
3. I can install and run an OS in VBox without issue. I have tried both Debian 64-bit and Ubuntu 20.04 64-bit.
4. There are 3 apps listed here which identify whether Virtualization and VT-x are enabled: https://www.ilovefreesoftware.com/05/featured/free-tools-check-hardware-virtualization-support-windows-10.html None of these apps show any issues. Note that the Intel app button actually links to the Leomoon app, but you can google the Intel app.
5. I have tried countless OS reinstalls, firmware updates, firmware settings, stability tests etc.
|
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 6,993 |
Oh yeah, also I can put these machines on Ralph to test any potential fixes. Or I can run a virtual image if someone wants to send me one to test. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Hi everyone. I've gotten to 97% and then stalled out. I didn't notice, and a day of elapsed time went by with little or no progress, so I had to kill them. |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 6,993 |
That is a different issue. If you look at the screen output in VirtualBox they will probably say something about a spectre error. This thread is for machines that never run VBox Rosetta tasks beyond the first minute and all error out due to an MKL error. See this thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14897 It would be good if you could post in that thread if the failures you are seeing match the error I listed there. |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 6,993 |
One thing that I haven't confirmed yet is whether it is due to the hard drives being too slow. I brought the SSD home from what I thought was a working machine, to put in the Optiplex. The tasks failed as expected, but I hadn't realised that they hadn't been running on the machine I took the drive from. I will confirm that either way tomorrow. I think that's very unlikely to be the cause though. I have now confirmed that I can take the SSD from a machine that runs VBox tasks and put it in a machine that won't run them. The tasks then immediately fail in the non-working machine. When I put the SSD back in the original machine then the tasks start running correctly again. I think this eliminates the SSD speed, unless I guess it could be a bottleneck on the motherboard's SATA bus. That seems very unlikely. |
computezrmle Send message Joined: 9 Dec 11 Posts: 63 Credit: 9,680,103 RAC: 0 |
@Sid Celery Your efforts are nice and appreciated, but there's something I don't understand: Sid Celery wrote: "I also sent Admin a DM" The admins/developers may sometimes need a bump to become aware of relevant posts. Once this is done, why can't they communicate directly here in the forum? The answer should be better than "no time" or "too much work". |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2119 Credit: 41,179,074 RAC: 11,480 |
@Sid Celery It should be, you're right. I'm not here to defend it. On the previous occasion something came up and I pointed to a particular thread, they did post in that thread in the way you suggest. It's been said to me before, the Rosetta project team is <very> small. It might be better to assume there is <no> Admin team here. Rather there are only researchers, only a few of whom are tasked to look at the general admin when or if they have time - time they don't have. Hence why they've asked for a specific pointer to the area where a correction is needed so they can get in, do something that works for people, then get back to their main job. So when Greg regularly points out there's no-one monitoring the forum, that seems to be right. And regularly repeating the fact won't change it. That's why I'm passing things on in the way I am. It's not ideal, but it is the reality, so I'm working with that. And trying not to make a nuisance of myself in the process. On the plus side (and I'm prepared that it may not be a plus at all) I've discovered my own failures with Python tasks were because I hadn't turned virtualisation on at all in my BIOS. Now I've done that and confirmed it with one of the utilities dcdc pointed me to. So I'm going to install the VirtualBox version of Boinc again and see if I'm any more successful than my own previous abortive effort. Then I can join in much better with the miserableness |
©2024 University of Washington
https://www.bakerlab.org