Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,715,212
RAC: 8,590
Message 105734 - Posted: 28 Mar 2022, 10:59:57 UTC - in response to Message 105655.  

VirtualBox comes in two major versions, vbox and vbox64. The Python tasks use only the newer of these, vbox64. Since vbox emulates a 32-bit instruction set and vbox64 emulates a 64-bit instruction set, they are not interchangeable.

Each is a program, and therefore requires a certain list of instructions from the physical CPU core it runs on. BOINC makes a list of the major groups of instructions available as it starts up.

It appears that vbox has been in use long enough that it only uses CPU instructions available on nearly all computers still in use, but vbox64 hasn't.

VirtualBox

https://www.virtualbox.org/wiki/Downloads

https://www.virtualbox.org/

If some of you can identify specific emulated CPU instructions for which emulation fails and shuts down the emulation, you might give the details to Oracle and see if they will fix at least part of the problem, even if Rosetta@Home won't help.

The details you send them should include the list of CPU instruction groups produced when BOINC starts up.

One thing many of us might send them is a request that when the VM unmanageable error is given, vbox64 should give more details on why.
From the data collected, the instructions are one or more of avx, avx2, f16c, fma. What do you suggest we do now?

robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,848,401
RAC: 2,043
Message 105735 - Posted: 28 Mar 2022, 13:07:02 UTC - in response to Message 105734.  

Time to ask Oracle to produce more meaningful error messages if any of the required instructions are not present when vbox64 runs.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,715,212
RAC: 8,590
Message 105737 - Posted: 28 Mar 2022, 13:19:18 UTC - in response to Message 105735.  

It sounds like you know more about the way Oracle works than me - particularly whether Oracle or the program decides what instructions are available. Perhaps you should contact them? I would have thought Oracle just passes the available instruction set to the Python program, but maybe not.

robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,848,401
RAC: 2,043
Message 105739 - Posted: 28 Mar 2022, 14:10:46 UTC - in response to Message 105737.  

I tried contacting Oracle. They made it rather difficult.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,715,212
RAC: 8,590
Message 105740 - Posted: 28 Mar 2022, 14:48:19 UTC - in response to Message 105739.  

Well, we know Rosetta is impossible to speak to. The trouble is, are we sure who is to blame here? Is Oracle missing a feature, or is Rosetta programmed badly?

If you want to contact Oracle, there seem to be many ways to do so, here: https://www.virtualbox.org/wiki/Community

Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,703,329
RAC: 2,182
Message 105742 - Posted: 28 Mar 2022, 18:09:26 UTC - in response to Message 105740.  

Why should Oracle care about a little problem with a specific program that does not affect thousands or tens of thousands of users of its product? That is probably why they ran you off.

It's like me contacting a cold wear testing lab about a specific product they tested, for which they showed data for only 2 out of 12 zones, and neither of these zones is critical to the more important areas that get cold the fastest. I am only an individual contacting a company that tests for million-dollar industrial footwear companies. My request got round-filed or back-burnered.

robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,848,401
RAC: 2,043
Message 105760 - Posted: 31 Mar 2022, 0:03:53 UTC
Last modified: 31 Mar 2022, 0:07:08 UTC

A failing Python task.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1317287356

The section of the vbox_trace.txt file that looks relevant:

2022-03-28 14:55:36 (14760):
Command: VBoxManage -q showvminfo "boinc_35d83054a4475009" --machinereadable
Exit Code: -2135228415
Output:
VBoxManage.exe: error: Could not find a registered machine named 'boinc_35d83054a4475009'
VBoxManage.exe: error: Details: code VBOX_E_OBJECT_NOT_FOUND (0x80bb0001), component VirtualBoxWrap, interface IVirtualBox, callee IUnknown
VBoxManage.exe: error: Context: "FindMachine(Bstr(VMNameOrUuid).raw(), machine.asOutParam())" at line 2621 of file VBoxManageInfo.cpp

2022-03-28 14:55:36 (14760):
Command: VBoxManage -q showhdinfo "C:\ProgramData\BOINC\slots\10/vm_image.vdi"
Exit Code: 0
Output:
UUID: ef35dff9-d482-48f8-9519-fef6c1b23a3b
Parent UUID: base
State: created
Type: normal (base)
Location: C:\ProgramData\BOINC\slots\10\vm_image.vdi
Storage format: VDI
Format variant: dynamic default
Capacity: 8192 MBytes
Size on disk: 7115 MBytes
Encryption: disabled


Elapsed time MUCH greater than simulated CPU time.

I aborted it.

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 115,561,500
RAC: 59,108
Message 105762 - Posted: 31 Mar 2022, 16:19:29 UTC - in response to Message 105734.  
Last modified: 31 Mar 2022, 16:21:15 UTC

I think you guys have found the issue here. My machines that don't work are a 1st gen Nehalem Xeon (AVX was introduced in 2nd gen Sandy Bridge) and Pentiums which have AVX/AVX2 disabled. I'm not sure about F16C or FMA yet.

It looks like the Intel MKL doesn't require AVX, but if VirtualBox is telling it that it's available when it's not, then it's going to crash.

CPUs that don't work:
https://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20L5640%20-%20AT80614005133AB%20(BX80614L5640).html
https://www.cpu-world.com/CPUs/Pentium_Dual-Core/Intel-Pentium%20G3220.html
https://www.cpu-world.com/CPUs/Pentium_Dual-Core/Intel-Pentium%20G4500.html

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,715,212
RAC: 8,590
Message 105763 - Posted: 31 Mar 2022, 16:48:11 UTC - in response to Message 105762.  
Last modified: 31 Mar 2022, 16:48:23 UTC

If it's a case of "VirtualBox is telling it that it's available when it's not", then perhaps we ought to speak to VirtualBox?

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 115,561,500
RAC: 59,108
Message 105766 - Posted: 31 Mar 2022, 20:59:54 UTC - in response to Message 105763.  

It might be VirtualBox, but might it also just be that the script is set up to assume AVX (or whichever extension is missing) is available without checking?
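To make the "without checking" idea concrete, here is a minimal sketch of what a runtime check could look like inside an x86 Linux guest. It is only an illustration: nothing in this thread shows how the Rosetta Python tasks actually choose their code paths, and the variant names (`avx2`, `avx`, `sse4`) and both function names are hypothetical. The only real interface assumed is the `flags` line of `/proc/cpuinfo`.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def choose_code_path(flags):
    """Pick a (hypothetical) build variant from the flags that are really present."""
    if {"avx2", "fma"} <= flags:
        return "avx2"
    if "avx" in flags:
        return "avx"
    if "sse4_1" in flags:
        return "sse4"
    # Fail early with a readable message instead of crashing on an
    # unsupported instruction later on.
    raise RuntimeError(
        "CPU lacks sse4_1/avx/avx2; flags seen: " + " ".join(sorted(flags))
    )


# Sample text in /proc/cpuinfo format for a CPU without AVX.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2\n"
print(choose_code_path(parse_cpu_flags(sample)))
```

Inside a real guest, the text would come from reading `/proc/cpuinfo`. A check like this only helps if the flags the guest sees are accurate, so it would not rule out the other possibility raised above, that VirtualBox advertises an extension the host does not have.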

Sid Celery
Joined: 11 Feb 08
Posts: 1982
Credit: 38,461,917
RAC: 15,153
Message 105767 - Posted: 31 Mar 2022, 23:47:06 UTC

A new batch of Rosetta 4.20 tasks is out at the moment, named YIL10mer_YILstub*.
I'm getting a lot of computation errors here: unhandled exception errors all over the place after just a few seconds.

After a couple of attempts, I appear to have all 4 cores running tasks right now, but it's been a struggle.
Beware.

MStenholm
Joined: 18 Apr 20
Posts: 17
Credit: 22,630,993
RAC: 27,990
Message 105768 - Posted: 1 Apr 2022, 0:07:58 UTC - in response to Message 105767.  

It seems to be another batch that prefers Linux, judging by what I can see from my team members. I have two sets of 16 running on Linux, about one hour in, and as far as I can see the Windows ones error out fast, seconds in.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,715,212
RAC: 8,590
Message 105778 - Posted: 1 Apr 2022, 14:29:40 UTC - in response to Message 105767.  

Errors after a few seconds don't bother me. Cosmology@home wasting the whole task time before deciding to crash is what's annoying.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1861
Credit: 8,161,694
RAC: 8,174
Message 105779 - Posted: 1 Apr 2022, 14:30:52 UTC

Some errors on VirtualBox WUs:

<message>
(unknown error) - exit code 2159738884 (0x80bb0004)</message>
<stderr_txt>
2022-04-01 16:23:15 (16520): Detected: vboxwrapper 26202
2022-04-01 16:23:15 (16520): Detected: BOINC client v7.16.20
2022-04-01 16:23:15 (16520): Detected: VirtualBox VboxManage Interface (Version: 6.1.28)
2022-04-01 16:23:16 (16520): Feature: Checkpoint interval offset (388 seconds)
2022-04-01 16:23:16 (16520): Detected: Minimum checkpoint interval (600.000000 seconds)
2022-04-01 16:24:04 (16520): Create VM. (boinc_dcbfc0c30d52b14d, slot#0)
2022-04-01 16:24:58 (16520): Error in create for VM: -2135228412
Command:
VBoxManage -q createvm --name "boinc_dcbfc0c30d52b14d" --basefolder "C:\ProgramData\BOINC\slots" --ostype "Debian_64" --register
Output:
VBoxManage.exe: error: Machine settings file 'C:\ProgramData\BOINC\slots\boinc_dcbfc0c30d52b14d\boinc_dcbfc0c30d52b14d.vbox' already exists
VBoxManage.exe: error: Details: code VBOX_E_FILE_ERROR (0x80bb0004), component MachineWrap, interface IMachine, callee IUnknown
VBoxManage.exe: error: Context: "CreateMachine(bstrSettingsFile.raw(), bstrName.raw(), ComSafeArrayAsInParam(groups), bstrOsTypeId.raw(), createFlags.raw(), machine.asOutParam())" at line 280 of file VBoxManageMisc.cpp

2022-04-01 16:24:58 (16520): Could not create VM
2022-04-01 16:24:58 (16520): ERROR: VM failed to start
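
The same failure shows up in this log in three forms: the unsigned exit code 2159738884, the hex value 0x80bb0004, and the signed value -2135228412. They are one 32-bit number written three ways. The sketch below does the conversion, and also covers the -2135228415 from the vbox_trace.txt excerpt in Message 105760. The helper name `to_hresult` is made up for this example, and the two error names are just the ones VBoxManage printed in this thread, not a complete table.

```python
def to_hresult(exit_code):
    """Format a signed or unsigned 32-bit exit code as an 8-digit hex string."""
    return "0x{:08x}".format(exit_code & 0xFFFFFFFF)


# Error names copied from the VBoxManage output quoted in this thread.
KNOWN_CODES = {
    "0x80bb0001": "VBOX_E_OBJECT_NOT_FOUND",
    "0x80bb0004": "VBOX_E_FILE_ERROR",
}

for code in (2159738884, -2135228412, -2135228415):
    hex_code = to_hresult(code)
    print(code, hex_code, KNOWN_CODES.get(hex_code, "unknown"))
```

Masking with 0xFFFFFFFF maps the negative value onto its unsigned 32-bit equivalent, which is why -2135228412 and 2159738884 produce the same hex code.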


.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 105784 - Posted: 1 Apr 2022, 20:41:57 UTC

A bit late to the party with this, but here are some CPU specs:

Runs python ok , with only `normal` zombies
Processor: 16 AuthenticAMD AMD Opteron(TM) Processor 6276 [Family 21 Model 1 Stepping 2]
Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni ssse3 cx16 sse4_1 sse4_2 popcnt aes syscall nx lm avx svm sse4a osvw ibs xop skinit wdt lwp fma4 topx page1gb rdtscp
OS: Microsoft Windows 7: Ultimate x64 Edition, Service Pack 1, (06.01.7600.00)
........
Also Runs python ok , with only `normal` zombies
Processor: 48 GenuineIntel Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz [Family 6 Model 62 Stepping 4]
Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 sse4_2 popcnt aes f16c rdrand syscall nx lm avx vmx smx tm2 dca pbe fsgsbase smep
OS: Microsoft Windows 7: Ultimate x64 Edition, Service Pack 1, (06.01.7600.00)
...........
Will not run VB tasks for rosetta@home or cosmology@home; everything craps out after a few seconds
Processor: 4 GenuineIntel Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz [Family 6 Model 23 Stepping 7]
Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 syscall nx lm vmx smx tm2 pbe

Those are the only three systems I have infected with Virtual pox
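
For anyone who wants to compare their own hosts in the same way, here is a small sketch that checks a BOINC "Processor features" line against the four extensions named earlier in the thread (avx, avx2, f16c, fma). The function name and the dictionary are made up for this example, and the feature strings are shortened copies of the three lists above.

```python
# Instruction-set extensions named earlier in the thread as suspects.
SUSPECT_FEATURES = ("avx", "avx2", "f16c", "fma")


def missing_features(feature_line, required=SUSPECT_FEATURES):
    """Return the required features absent from a 'Processor features' line."""
    present = set(feature_line.lower().split())
    return [name for name in required if name not in present]


# Abbreviated feature lists from the three hosts above.
hosts = {
    "Opteron 6276": "sse4_1 sse4_2 popcnt aes syscall nx lm avx svm sse4a xop fma4",
    "Xeon E5-2697 v2": "sse4_1 sse4_2 popcnt aes f16c nx lm avx vmx smx",
    "Core 2 Quad Q9450": "ssse3 cx16 sse4_1 syscall nx lm vmx smx tm2 pbe",
}

for name, features in hosts.items():
    print(name, "is missing:", missing_features(features))
```

Whole tokens are compared, so `fma4` on the Opteron does not count as `fma`. On these three hosts, the two that run the Python tasks both report avx and the one that fails reports none of the four. That is consistent with avx being the one that matters, but three machines is a very small sample.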

.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 105786 - Posted: 2 Apr 2022, 0:07:03 UTC
Last modified: 2 Apr 2022, 0:38:00 UTC

From the data collected, the instructions are one or more of avx, avx2, f16c, fma. What do you suggest we do now?

From this short list, the only one that I see is avx, and I have no idea why VB + Python tasks + Cosmology would need it.
So, is it a simple matter of the Rosetta admins blocking all work to systems that don't have avx . . .
or removing the requirement, if possible . . .
Hmm . . .
I'll go back under my rock now :-)

Well, actually, it's an old metal bin lid, like on `The Clangers` planet.
OK, I admit to having three `Clangers` DVDs.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,715,212
RAC: 8,590
Message 105787 - Posted: 2 Apr 2022, 0:24:18 UTC - in response to Message 105786.  

Cosmology doesn't need it. I can run Cosmology on all 7 of my machines, most are missing AVX. The only thing that annoys Cosmology is VB 6. VB 5 is ok.

.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 105788 - Posted: 2 Apr 2022, 1:01:52 UTC - in response to Message 105787.  

I had a look: the Q9450 is on BOINC 7.16.20, so it's got VB 6.1.2.
I will finish all work and revert/uninstall/nuke back to BOINC 7.14.2, which uses VB 5.2.8, to see what happens.
I have got versions of boinc mangler going back to 5.10.13.
Oh! That's 45 altogether, in Win/Lin, 32/64, VB or not. Sad case . . .
Just in case.
But sometimes they come in useful.

Sid Celery
Joined: 11 Feb 08
Posts: 1982
Credit: 38,461,917
RAC: 15,153
Message 105790 - Posted: 2 Apr 2022, 1:20:14 UTC - in response to Message 105768.  

I haven't been reporting anything recently, but I will send another message pointing out this Linux/Windows issue, because it has turned up in several separate batches of work now.
One-off little issues I don't bother with, but this seems systemic to me.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,715,212
RAC: 8,590
Message 105791 - Posted: 2 Apr 2022, 1:59:24 UTC - in response to Message 105788.  
Last modified: 2 Apr 2022, 2:01:57 UTC

The BOINC version and the VB version are not linked. Just install the older VB from the Oracle site; it will install on top of a newer one. Be sure to get the correct extensions along with it. I've not found any project that needs 6.

If you change the BOINC version, you could break other things like SSL, and you won't be able to contact some projects.
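
Since the VirtualBox version cannot be inferred from the BOINC version, one way to see what is installed is to ask VirtualBox itself. This is a rough sketch, assuming `VBoxManage` is on the PATH (on Windows it often is not, in which case the function returns None) and that `VBoxManage --version` prints something like `6.1.28r147628`. Both function names are made up for this example.

```python
import re
import shutil
import subprocess


def parse_vbox_version(text):
    """Extract (major, minor, patch) from a string such as '6.1.28r147628'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in: " + repr(text))
    return tuple(int(part) for part in match.groups())


def installed_vbox_version():
    """Return the installed VirtualBox version, or None if VBoxManage is not on PATH."""
    executable = shutil.which("VBoxManage")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    return parse_vbox_version(result.stdout)


# The same parser also works on the version line vboxwrapper writes to stderr.
print(parse_vbox_version("VirtualBox VboxManage Interface (Version: 6.1.28)"))
```

The first element of the returned tuple is the major version, which is enough to tell a 5.x install from a 6.x one.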

©2024 University of Washington
https://www.bakerlab.org