Bug: rosetta_4.08_x86_64-pc-linux-gnu uses unsupported CPU features.



ShimmerFairy

Joined: 26 Mar 20
Posts: 3
Credit: 2,060
RAC: 0
Message 92529 - Posted: 29 Mar 2020, 13:32:34 UTC

I've been having trouble running the 64-bit Rosetta software, while 32-bit is working fine. First I made sure to enable legacy vsyscall emulation, which I was warned I'd need, but rosetta_4.08_x86_64-pc-linux-gnu still breaks. (Judging by how different the backtraces in the coredumps look before and after, legacy vsyscall did at least seem to solve one problem; it just then revealed another.) So, suspecting that the problem might be unsupported CPU features being assumed in the code, I looked at the backtrace for all my segfaulting tasks.

In all of the coredumps, the address of the last frame before the first <signal handler called> indicator is 0x00000000013bc5f8. So I disassembled that immediate area and found:

(gdb) disassemble 0x13bc5f8,0x13bc608
Dump of assembler code from 0x13bc5f8 to 0x13bc608:
   0x00000000013bc5f8:  pshufb %xmm1,%xmm0
   0x00000000013bc5fd:  movdqa %xmm0,0x0(%r13)
   0x00000000013bc603:  jae    0x13bc6cc
End of assembler dump.


A quick bit of looking shows that the pshufb instruction was introduced in SSSE3. The problem with this is that my poor old AMD Athlon X2 does not have support for the SSSE3 instruction set, so of course it falls apart the instant the program tries to do a pshufb.

As far as I can tell, there are four solutions to this problem:

  1. If this unusable assembler is the result of some hand-written assembler in the project for optimization purposes, update rosetta_4.08_x86_64-pc-linux-gnu so that it can determine at runtime if the CPU features used are actually supported, and take a slower codepath if not.
  2. If this unusable assembler is the result of compiler flags telling the compiler to generate these instructions, recompile the program with the compiler set to not generate them (this would punish capable CPUs that can use these instructions, though I can't say if the effect would be significant).
  3. Have BOINC or whoever else be able to tell Rosetta "So, the CPU architecture is right, but it doesn't have the necessary features" and have the system avoid sending my computer tasks via 64-bit software that I can't actually run. (Assuming I haven't missed an already-existing option for this I can change in the client or on the website.)
  4. Do nothing and let my computer keep failing tasks given by Rosetta that require these unusable programs until the end of time.
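Option 1 (runtime dispatch) can be illustrated with a minimal sketch. This is not Rosetta's code: the function names are hypothetical, and the "fast" path is only a placeholder standing in for a pshufb-based routine. The point is just the shape of the idea: check the CPU's features once, then call whichever implementation is safe.

```python
def shuffle_portable(data, indices):
    """Baseline path: plain Python loop, usable on any CPU."""
    return bytes(data[i] for i in indices)


def shuffle_ssse3(data, indices):
    """Placeholder for a pshufb-based fast path (hypothetical).
    In this sketch it only reuses the same logic so the example stays runnable."""
    return bytes(map(data.__getitem__, indices))


def select_shuffle(cpu_flags):
    """Pick the implementation once at startup from a set of CPU flag names."""
    return shuffle_ssse3 if "ssse3" in cpu_flags else shuffle_portable


# Example: an older CPU with SSE3 ("pni") but no SSSE3 gets the baseline path.
shuffle = select_shuffle({"sse", "sse2", "pni"})
result = shuffle(b"abcd", [3, 2, 1, 0])
```

In a compiled program the same decision would normally be made with a CPUID query rather than a set of strings, but the control flow is the same.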



I've done enough to know that, at the very least, there is nothing I can do on my end about this problem that wouldn't involve spending hundreds of dollars on a new computer.

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92533 - Posted: 29 Mar 2020, 15:20:09 UTC

I've been a moderator for a long time, and I think this is the first time anyone has reported a problem along with an analysis of a disassembly! I am not the developer of the code, so forgive some naive questions, but I do wish to get as much information as I can.

I notice that your machine is AMD. In one case, a work unit that you failed on was completed without incident by an Intel machine running Linux.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1020967004
Their machine reports as:
GenuineIntel Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz [Family 6 Model 158 Stepping 9]
Am I correct to presume that an i5-7500 has the SSSE support? (a quick search, it looks like "SSSE" is Intel's term, so that's probably true)

Sounds like perhaps there is a missing conditional compilation directive somewhere.

One question: you ran a disassembly and see the SSSE instruction there... do you have any way to confirm whether your machine actually attempted to run that instruction? I believe the compiled code probably has conditional sections that are automatically placed by the compiler options. So seeing the instruction in the code would be expected, but your machine should have branched around it. ...I'm sure you know what I'm trying to say.

This may explain an issue I've been seeing with Linux machines. I had presumed it was related to machines with only 1GB per CPU. But typically those would be older machines, and typically older machines would be the ones that might not support the SSSE, so perhaps that is actually the nail that I've been looking for. Thank you for the fresh perspective. I have informed the Project Team with your information.
Rosetta Moderator: Mod.Sense
Admin
Project administrator

Joined: 1 Jul 05
Posts: 4805
Credit: 0
RAC: 0
Message 92555 - Posted: 29 Mar 2020, 18:06:12 UTC - in response to Message 92529.  

Indeed this is an issue, and thanks for your detective work. We can address this in the next planned app update, which is in the works. We'll have to push out the current apps that are being tested on Ralph@h tomorrow so that we can get started on dependent COVID-19 related tasks sooner rather than later, but we also plan to do another update soon to include a specific COVID-19 protocol. For the update we are currently working on, we can include specific sse and non-sse apps (your options 2 and 3). We can set up a plan class on Ralph, which should address option 3, and test the server functionality on Ralph soon with the existing apps; after the next update, we can test the functionality with sse and non-sse apps (option 2).

Thanks again! And thank you Mod.Sense for also bringing this to our attention.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92557 - Posted: 29 Mar 2020, 18:31:34 UTC
Last modified: 29 Mar 2020, 18:34:01 UTC

So, to try and generalize here a bit:

App version showing problem: Rosetta v4.08 x86_64-pc-linux-gnu
Symptom: Task ends early with signal 11
Affected systems: CPUs that do not support SSSE, which is often AMD CPUs (especially older ones, "prior to bulldozer").

The good news is it seems to often trip in the first 2 minutes of execution of a task.
Rosetta Moderator: Mod.Sense
ShimmerFairy

Joined: 26 Mar 20
Posts: 3
Credit: 2,060
RAC: 0
Message 92564 - Posted: 29 Mar 2020, 19:34:47 UTC

First, I want to stress that the issue is with SSSE3, not with any of the other very similarly-named extensions to the x86 line of CPUs. I only want to stress it because, in particular, there's an insidious difference between SSE3 (referred to as "pni" in the CPU flags in /proc/cpuinfo, and which my CPU does support) and SSSE3 (the instruction set my CPU doesn't support). It's also worth noting that my CPU unsurprisingly doesn't support any of SSE4 either.

Here's the set of flags directly from my /proc/cpuinfo, for reference:

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good nopl cpuid extd_apicid pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dnowprefetch vmmcall lbrv


So, to sum up, my CPU in particular can handle SSE, SSE2, and SSE3 ("pni" in the above), but not SSSE3 or any of SSE4.
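For anyone who wants to run the same check on their own machine, here is a minimal sketch that reads the flags line and reports the SIMD extensions discussed in this thread. The sample text is an excerpt of the flags quoted above; the helper names are made up for illustration. Note the mapping of SSE3 to "pni", which is the naming trap described earlier.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


# Kernel flag names for the extensions in question. SSE3 shows up as "pni".
SIMD_FLAGS = {
    "SSE": "sse",
    "SSE2": "sse2",
    "SSE3": "pni",
    "SSSE3": "ssse3",
    "SSE4.1": "sse4_1",
    "SSE4.2": "sse4_2",
    "AVX": "avx",
}


def simd_support(flags):
    """Map each extension name to True/False depending on the flag set."""
    return {name: flag in flags for name, flag in SIMD_FLAGS.items()}


# Excerpt of the flags line quoted in this post (not the full list).
SAMPLE = "flags           : fpu vme de pse tsc mmx fxsr sse sse2 ht syscall nx lm pni cx16 lahf_lm"

support = simd_support(parse_cpu_flags(SAMPLE))
for name, ok in support.items():
    print(f"{name:7s} {'yes' if ok else 'no'}")
```

To check a real machine, pass the contents of /proc/cpuinfo to parse_cpu_flags instead of SAMPLE.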

A very quick look seems to tell me that SSE and SSE2 were in the very first x86-64 CPUs, so those don't realistically need to be conditioned in 64-bit code, and SSE3 showed up not at first but fairly early on. (With SSSE3 taking some more time to show up, at least on the AMD side.)

About "did the instruction actually run?", I got the location in question from looking at the backtraces in each of the coredumps generated from the segfault, with the aid of GDB. For example:

(gdb) bt
#0  0x0000000008389940 in ?? ()
#1  0x0000000005f2e604 in ?? ()
#2  <signal handler called>
#3  0x0000000008389940 in ?? ()
#4  0x0000000005f2e604 in ?? ()
#5  <signal handler called>
#6  0x00000000013bc5f8 in ?? ()
#7  0x0000000003aec103 in ?? ()
#8  0x0000000003b01e08 in ?? ()
#9  0x0000000003a8bf0d in ?? ()
#10 0x0000000003a32025 in ?? ()
#11 0x0000000003a37de5 in ?? ()
#12 0x0000000003a9513c in ?? ()
#13 0x0000000003a95434 in ?? ()
#14 0x0000000003deb5ab in ?? ()
#15 0x00000000027480d3 in ?? ()
#16 0x0000000002749f3a in ?? ()
#17 0x0000000002f70884 in ?? ()
#18 0x000000000371db00 in ?? ()
#19 0x000000000371eccb in ?? ()
#20 0x000000000371f44e in ?? ()
#21 0x00000000037866f8 in ?? ()
#22 0x0000000003788221 in ?? ()
#23 0x0000000003826698 in ?? ()
#24 0x00000000038261a3 in ?? ()
#25 0x00000000004135e6 in ?? ()
#26 0x0000000005ff3ccc in ?? ()
#27 0x00000000006108e7 in ?? ()


As you can see, address 0x00000000013bc5f8 comes from the innermost call frame before the signal handler first gets tripped, and every single failed attempt after I fixed my vsyscall support has that same address just before the signal handler gets called.
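Picking that frame out of a long backtrace can be scripted. The sketch below assumes the text was copied from gdb's "bt" output in the symbol-less form shown above ("#N  0xADDR in ?? ()"); the function name is hypothetical. It returns the address of the frame listed right after the last <signal handler called> marker, which is the frame that was executing when the first signal arrived.

```python
import re

FRAME_ADDR_RE = re.compile(r"#\d+\s+(0x[0-9a-fA-F]+)\s+in\b")


def faulting_address(backtrace_text):
    """Return the address (as int) of the frame listed right after the last
    '<signal handler called>' line, or None if no such frame is found."""
    lines = backtrace_text.splitlines()
    last_marker = None
    for idx, line in enumerate(lines):
        if "<signal handler called>" in line:
            last_marker = idx
    if last_marker is None:
        return None
    for line in lines[last_marker + 1:]:
        match = FRAME_ADDR_RE.match(line.strip())
        if match:
            return int(match.group(1), 16)
    return None


# First eight frames of the backtrace quoted above.
SAMPLE_BT = """\
#0  0x0000000008389940 in ?? ()
#1  0x0000000005f2e604 in ?? ()
#2  <signal handler called>
#3  0x0000000008389940 in ?? ()
#4  0x0000000005f2e604 in ?? ()
#5  <signal handler called>
#6  0x00000000013bc5f8 in ?? ()
#7  0x0000000003aec103 in ?? ()
"""

address = faulting_address(SAMPLE_BT)
print(hex(address))
```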
Admin
Project administrator

Joined: 1 Jul 05
Posts: 4805
Credit: 0
RAC: 0
Message 92579 - Posted: 30 Mar 2020, 2:13:54 UTC - in response to Message 92564.  

noted, SSSE3. thanks
ShimmerFairy

Joined: 26 Mar 20
Posts: 3
Credit: 2,060
RAC: 0
Message 93323 - Posted: 4 Apr 2020, 0:42:13 UTC

Just a quick addition: I got worried that the program might contain other instruction sets my computer can't run. If the compatibility version you're making only disables SSSE3 specifically, I'd keep coming back over and over again to say "now it doesn't work for this reason". So I wrote a quick program to go through rosetta_4.08_x86_64-pc-linux-gnu and check ahead of time. Being a very quick project, it doesn't do anything fancy like figuring out whether the too-new instructions sit inside a conditional that ensures they only get used on capable computers. With that caveat in mind, I did find other instruction sets I can't use on my CPU (I'll list them at the bottom).

The easiest way to handle this would be for the compatibility version to be compiled for a specific kind of CPU, instead of just trying to disable specific features. For example, if you want to support all x86-64 CPUs, then you could compile them for the K8 family of AMD processors (the first ones to implement x86-64). If you're using GCC, for instance, that should be doable by making sure that -march=k8 shows up in the options to gcc without any other options contradicting it. (This is all of course assuming none of the instruction sets in question show up as a result of hand-written assembler or through the use of compiler intrinsics that make for a slightly easier version of hand-written assembler.)

I also just want to say that I appreciate you guys trying to support old CPUs like mine. My Athlon X2 is over a decade old now (in fact it's part of that K8 family, though a bit later on in the series), and I wouldn't be surprised if you ultimately decided "if your x86-64 CPU is too old, then it can only run the 32-bit stuff".

The specific instruction sets I found in the program that I can't handle, in case this is of interest, were SSE4.1, SSE4.2, AVX, and more niche feature sets like Restricted Transactional Memory, xgetbv, and rdrand.
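The program itself isn't posted here, so the following is only a sketch of one way such a scan could work, not the actual tool. It assumes the binary has already been disassembled to text with "objdump -d" (tab-separated address, bytes, instruction), and it uses a small hand-picked mnemonic table that is far from complete. Like the scan described above, it cannot tell whether an instruction sits behind a runtime feature check.

```python
from collections import Counter

# Small, hand-picked sample of mnemonics per extension (not exhaustive).
MNEMONIC_TO_EXTENSION = {
    "pshufb": "SSSE3",
    "palignr": "SSSE3",
    "pblendvb": "SSE4.1",
    "ptest": "SSE4.1",
    "pcmpestri": "SSE4.2",
    "pcmpgtq": "SSE4.2",
    "vzeroupper": "AVX",
    "xbegin": "RTM",
    "xend": "RTM",
    "xgetbv": "xgetbv",
    "rdrand": "rdrand",
}


def extensions_used(objdump_text):
    """Count instructions per extension in 'objdump -d' style text."""
    counts = Counter()
    for line in objdump_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # labels, blank lines, byte-only continuation lines
        words = fields[2].split()
        if not words:
            continue
        extension = MNEMONIC_TO_EXTENSION.get(words[0])
        if extension is not None:
            counts[extension] += 1
    return dict(counts)


# Sample lines written by hand in objdump's layout, for illustration only.
SAMPLE_DISASSEMBLY = (
    " 13bc5f8:\t66 0f 38 00 c1       \tpshufb %xmm1,%xmm0\n"
    " 13bc5fd:\t66 41 0f 7f 45 00    \tmovdqa %xmm0,0x0(%r13)\n"
    " 13bc603:\t0f 83 c3 00 00 00    \tjae    13bc6cc\n"
)

found = extensions_used(SAMPLE_DISASSEMBLY)
print(found)
```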
Germano_0x

Joined: 27 Dec 13
Posts: 3
Credit: 2,493,872
RAC: 0
Message 94046 - Posted: 10 Apr 2020, 9:28:32 UTC

Hi ShimmerFairy, can you tell us the whole procedure you followed, so that we (other users) can do the same in case of similar problems?
For example, one thing I don't know is how to attach GDB to a work unit that is no longer running because it has failed.
Thank you
sspseudoo

Joined: 4 Mar 20
Posts: 7
Credit: 23,843
RAC: 0
Message 96393 - Posted: 12 May 2020, 12:33:51 UTC

I have problems again: https://boinc.bakerlab.org/rosetta/results.php?userid=2083373

I did "strace" a crashing process and these are the last lines:

--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x14c77b8} ---
ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(0, SNDCTL_TMR_CONTINUE or TCSETSF, {B38400 opost isig icanon echo ...}) = 0
ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
+++ killed by SIGSEGV (core dumped) +++


Maybe again "wrong" instructions?
Some packages seem to work, the others crash after a minute or so...

@Germano_0x:
I did it as follows:
Wait for a new task that is about to crash.
Find the PID using "top", or look in your BOINC Manager under properties or similar.
Use "strace -p <PID>" to attach to the process.
Watch for the crash and note the address, as above.

Then (using Fedora) I just did a "coredumpctl gdb" and the coredump was automatically loaded into gdb.
In gdb I used "bt" to get a backtrace of the crash. The last address before <signal handler called> is the same address as shown in "strace".
(So the strace step I did at the beginning was not necessary at all, but it already pointed in the right direction, I suppose.)
Next I disassembled the area as described in the first post: "(gdb) disassemble 0x14c77b8,0x14c77c8"

Dump of assembler code from 0x14c77b8 to 0x14c77c8:
   0x00000000014c77b8:	pshufb %xmm1,%xmm0
   0x00000000014c77bd:	movdqa %xmm0,0x0(%r13)
   0x00000000014c77c3:	jae    0x14c788c
End of assembler dump.


As for how large an area to disassemble: I do not really know; I just used an end address a bit higher than the start.
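The step from the strace output to the gdb command can be sketched in a few lines. This assumes the SIGILL line looks like the one quoted above; the function name and the default 0x10-byte window (the same distance used in the command above) are illustrative choices, not anything strace or gdb prescribes.

```python
import re

SIGILL_RE = re.compile(r"--- SIGILL \{.*?si_addr=(0x[0-9a-fA-F]+)\}")


def gdb_disassemble_command(strace_output, window=0x10):
    """Build a gdb 'disassemble START,END' command from the first SIGILL line,
    or return None if the output contains no SIGILL with an address."""
    match = SIGILL_RE.search(strace_output)
    if match is None:
        return None
    start = int(match.group(1), 16)
    return f"disassemble {start:#x},{start + window:#x}"


# The SIGILL line from the strace output quoted above.
SAMPLE_STRACE = "--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x14c77b8} ---"

command = gdb_disassemble_command(SAMPLE_STRACE)
print(command)
```

A larger window shows more of the surrounding instructions; the illegal instruction itself is the first one listed.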

The unsupported pshufb instruction still seems to be used in rosetta_4.20_x86_64-pc-linux-gnu on this "older" computer.




©2024 University of Washington
https://www.bakerlab.org