Occasional VirtualBox failures

Message boards : Number crunching : Occasional VirtualBox failures

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 115,505,325
RAC: 57,089
Message 104500 - Posted: 26 Jan 2022, 8:02:04 UTC

In addition to the thread for computers that won't run any VirtualBox tasks (which seems to be hardware related somehow), there are regular failures on my machines that are usually happy to run VirtualBox tasks. I haven't looked at the VirtualBox preview for many of them yet, but I have now seen two in a row with this error:

Spectre V2 : Spectre mitigation: LFENCE not serializing, switching to generic retpoline

https://ibb.co/FDkPMtB

If the logs from these are useful then I'll collect and post them. I presume VBox.log, VBoxHardening.log and one of the BOINC logs would be the appropriate ones to post?
dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 115,505,325
RAC: 57,089
Message 104644 - Posted: 3 Feb 2022, 8:32:12 UTC

On my machines that usually run VirtualBox tasks successfully, most of the tasks that stop running still hang at this Spectre V2 message. Unfortunately, they will often sit like this for days if they're not spotted. Fortunately I have BOINCTasks running, so I usually spot them sooner than that on my local machines, but not on the remote ones.
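A hung task like this stops accumulating CPU time, so one way to spot it without watching the console is to compare two CPU-time samples taken some minutes apart. The following is only a minimal sketch of that comparison: the function name `find_stalled_tasks`, the threshold, and the idea of passing in plain dictionaries are all hypothetical, and how the samples are collected from each machine is left open.

```python
from typing import Dict, List


def find_stalled_tasks(
    earlier: Dict[str, float],
    later: Dict[str, float],
    min_progress_s: float = 1.0,
) -> List[str]:
    """Return task names whose CPU time barely moved between two samples.

    `earlier` and `later` map a task name to its CPU time in seconds.
    Both the mapping format and the threshold are assumptions made for
    this sketch, not something BOINC defines.
    """
    stalled = []
    for name, cpu_before in earlier.items():
        cpu_after = later.get(name)
        if cpu_after is None:
            # Task finished or was removed between the two samples.
            continue
        if cpu_after - cpu_before < min_progress_s:
            stalled.append(name)
    return sorted(stalled)
```

Tasks that appear only in the second sample are ignored, since there is nothing to compare them against yet. The interval between samples needs to be long enough that a healthy task would clearly have made progress.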
computezrmle

Joined: 9 Dec 11
Posts: 63
Credit: 9,680,103
RAC: 0
Message 104645 - Posted: 3 Feb 2022, 9:00:27 UTC - in response to Message 104644.  

The Spectre mitigation message appears at every VM start.
I don't think it causes the error.
Instead, it's the last info printed on the console before the VM hangs.
dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 115,505,325
RAC: 57,089
Message 104646 - Posted: 3 Feb 2022, 15:18:39 UTC - in response to Message 104645.  

The Spectre mitigation message appears at every VM start.
I don't think it causes the error.
Instead, it's the last info printed on the console before the VM hangs.


Ok yeah, that makes sense. So it's at least narrowed down to any point after that! Would the VBox logs be helpful in diagnosing it, assuming anyone on the project is interested?
.clair.

Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 104647 - Posted: 3 Feb 2022, 16:48:06 UTC

If the `Occasional VirtualBox failures` you are talking about are the ones that fail after a few seconds, lock up, and use no more CPU time, can you alter the thread title to show this?
From studying my own error rate and looking at wingmen that have completed the same task as valid:
Are the fail-at-startup work units more common on systems with a large number of CPUs/threads, where the system is so busy that the application borks itself by not waiting for another instance of Rosetta to finish reading from the file?
The output files from these work units often, or mostly, have a line in them like:
'F:\ProgramData\BOINC\slots\12\vm_image.vdi' is locked for reading by another task},
The `slots\number` part can appear several times in one output file with a different `number` each time, as if several instances of Rosetta are fighting each other to read the file {race condition} and so crash the work unit.
From what I have seen of it, any system with more than approximately 12 CPUs/threads seems more likely to have the startup faults than 4 or 8 core systems.
A full top-down view that only the Admin can get may rubbish this idea in seconds;
it's the best I have got on it, so here it is for you to consider [ and tell me I am talking carp ]

Also, things like {The object is not ready} make me think the app is tripping over itself.

{The object functionality is limited} could be because some of the required components of the `slots` folder have not loaded in time.
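To check how many different slots show up in one output file, the "locked for reading" lines can be counted per slot number. This is a rough sketch only: the function name `count_locked_slots` is made up for illustration, and the pattern assumes the message wording quoted above. It tolerates the path with or without separators, since pasted logs sometimes lose the backslashes.

```python
import re
from collections import Counter

# Matches the slot number in lines such as
#   '...\slots\12\vm_image.vdi' is locked for reading by another task
# The exact wording is taken from the message quoted in this post;
# other VirtualBox versions may phrase it differently.
LOCKED_RE = re.compile(
    r"slots[\\/]*(\d+)[\\/]*vm_image\.vdi'? is locked for reading by another task"
)


def count_locked_slots(log_text: str) -> Counter:
    """Count 'locked for reading' messages per slot number in a log."""
    return Counter(int(m.group(1)) for m in LOCKED_RE.finditer(log_text))
```

Several distinct slot numbers in a single task's output would fit the idea that more than one slot is involved, though it does not by itself show which process holds the lock.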
computezrmle

Joined: 9 Dec 11
Posts: 63
Credit: 9,680,103
RAC: 0
Message 104648 - Posted: 3 Feb 2022, 17:18:57 UTC - in response to Message 104647.  

The slots/x are the working directories for each task.
They should be cleared by BOINC when a task ends and (in case of vbox tasks) the vdi image should be deregistered.
Your messages point out that there is either a vdi file from a previous task in the slot, e.g. after a crash or a timeout, or that the corresponding entry has not been removed from the VirtualBox medium manager.

Both have to be cleaned up manually.
- Shut down BOINC
- wait until all corresponding processes are closed
- delete garbage from the slots; be careful not to remove anything from currently "in progress" tasks
- Open the VirtualBox Manager and run the medium manager from the menu
- Remove orphaned disk entries; also be careful to ... (same as above)
- Restart BOINC
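
For the medium manager step, the registered disks can also be listed from the command line with `VBoxManage list hdds`. The sketch below only reports candidates and deletes nothing. It assumes the usual `Key: value` output with a blank line between disks, and it treats a registered disk under the slots directory whose file no longer exists as a possible orphan. The function names and that rule are assumptions made for illustration, so check each entry against the "in progress" tasks before removing it.

```python
import os
import subprocess
from typing import Callable, Dict, List


def list_registered_disks() -> str:
    """Return the raw output of `VBoxManage list hdds` (must be on PATH)."""
    result = subprocess.run(
        ["VBoxManage", "list", "hdds"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_hdds(text: str) -> List[Dict[str, str]]:
    """Split the listing into one dict per disk, assuming 'Key: value' lines."""
    entries: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current disk entry.
            if current:
                entries.append(current)
                current = {}
            continue
        # Split on the first colon only, so Windows drive letters survive.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries


def orphan_candidates(
    entries: List[Dict[str, str]],
    slots_dir: str,
    exists: Callable[[str], bool] = os.path.exists,
) -> List[Dict[str, str]]:
    """Return disks registered under slots_dir whose file is missing."""
    prefix = os.path.normcase(os.path.normpath(slots_dir)) + os.sep
    found = []
    for entry in entries:
        location = entry.get("Location", "")
        if not location:
            continue
        normalized = os.path.normcase(os.path.normpath(location))
        if normalized.startswith(prefix) and not exists(location):
            found.append(entry)
    return found
```

With BOINC shut down, something like `orphan_candidates(parse_hdds(list_registered_disks()), slots_dir)` would print the entries to review, where `slots_dir` is the slots folder of your own BOINC data directory.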

My explanation would be that systems under heavy load (lots of concurrently running tasks with heavy I/O) sooner or later run into timeout problems and leave garbage in the slots.
.clair.

Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 104649 - Posted: 3 Feb 2022, 20:56:22 UTC

I did find two dud / zombies in there so will have to keep an eye on that, thanks.




©2024 University of Washington
https://www.bakerlab.org