x86-32: floating-point return values undergo implicit format conversion #66803

jyknight · 2023-09-19T19:00:25Z

With SSE2 disabled, the floating-point semantics are pretty hopeless, with implicit conversions happening all over the place (e.g. spilling registers to stack). Most of those issues go away with SSE2 enabled, because we use SSE2 instructions and registers for all float/double operations.

The remaining issue with SSE2 enabled is that the default C ABI requires that float and double values are returned in x87 registers. Returning a float or double value thus converts to x86_fp80 (and then back, in the caller). This conversion means that a signaling NaN cannot be returned, because the behind-the-scenes conversion to x86_fp80 will raise an FP invalid exception and quiet the NaN.

LLVM does support other ABIs which don't have this problem: you can either use an alternative calling convention on the function (such as "fastcc"), or annotate the return type with "inreg" (as seen here):

llvm-project/llvm/lib/Target/X86/X86CallingConv.td, lines 300 to 304 in 575a648:

    // The X86-32 calling convention returns FP values in FP0, unless marked
    // with "inreg" (used here to distinguish one kind of reg from another,
    // weirdly; this is really the sse-regparm calling convention) in which
    // case they use XMM0, otherwise it is the same as the common X86 calling
    // conv.

While this is a fundamental problem with the x86-32 ABI, I believe we could potentially fix it on the LLVM side, without breaking the ABI, because loading/storing an 80-bit value from an x87 FPU register does not trigger a conversion operation. Thus, we could potentially write custom conversion routines to go from 32/64-bit float to 80-bit float (and back), and use those at the call boundary.

Such a routine would have runtime overhead vs using the X87 FPU's native conversion support, and it's also unclear whether anyone cares enough about precise x86-32 FP semantics in order to actually bother implementing it. But, it seemed worth at least recording the issue, and a possible resolution.
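To make the sNaN problem concrete, here is a minimal sketch in portable C that classifies binary32 bit patterns and models what the implicit round trip through x86_fp80 does to them. The function names are made up for illustration, and the model only covers the bit-level result (the x87 unit sets the quiet bit of a signaling NaN); it does not model the FP invalid exception flag that is raised at the same time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if `bits` encodes a binary32 NaN: exponent all ones, fraction nonzero. */
bool f32_is_nan(uint32_t bits) {
    return (bits & 0x7f800000u) == 0x7f800000u && (bits & 0x007fffffu) != 0;
}

/* True if `bits` encodes a signaling NaN: a NaN whose quiet bit (bit 22) is clear. */
bool f32_is_snan(uint32_t bits) {
    return f32_is_nan(bits) && (bits & 0x00400000u) == 0;
}

/* Hypothetical model of returning a float through an x87 register:
 * a signaling NaN comes back with its quiet bit set, everything else
 * comes back unchanged. The raised exception flag is not modeled. */
uint32_t f32_after_x87_round_trip(uint32_t bits) {
    uint32_t out = f32_is_snan(bits) ? (bits | 0x00400000u) : bits;
    assert(!f32_is_snan(out)); /* the result is never signaling */
    return out;
}
```

Under this model, 0x7fa00000 (an sNaN) comes back as 0x7fe00000, while quiet NaNs, infinities, and finite values keep their bit patterns.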

llvmbot · 2023-09-19T19:00:44Z

@llvm/issue-subscribers-backend-x86


jyknight · 2023-09-19T19:08:58Z

See also Rust investigations linked from rust-lang/rust#115567 @RalfJung

DimitryAndric · 2023-09-20T07:50:12Z

See also #29774 (7 years old now :) where it keeps asserting when SSE2 is disabled, due to some internal inconsistency. This has never been fixed, and there are lots and lots of duplicates...

RalfJung · 2023-09-20T11:48:40Z

This issue is specifically about the problems that remain when SSE2 is enabled.

phoebewang · 2023-09-20T13:18:37Z

> Such a routine would have runtime overhead vs using the X87 FPU's native conversion support, and it's also unclear whether anyone cares enough about precise x86-32 FP semantics in order to actually bother implementing it. But, it seemed worth at least recording the issue, and a possible resolution.

I think we may need to extend the -fexcess-precision=fast|standard support in the backend to balance both precision and performance. IIRC, this is how GCC solves this problem.

RalfJung · 2023-09-20T16:39:20Z

> While this is a fundamental problem with the x86-32 ABI, I believe we could potentially fix it on the LLVM side, without breaking the ABI, because loading/storing an 80-bit value from an x87 FPU register does not trigger a conversion operation. Thus, we could potentially write custom conversion routines to go from 32/64-bit float to 80-bit float (and back), and use those at the call boundary.

Could this routine be called for NaN values only (to preserve their bits perfectly), using the native routine for other values? Then it could be skipped entirely when nnan can be deduced.

jyknight · 2023-12-14T19:08:40Z

> Could this routine be called for NaN values only

The requirement is that a round trip from float/double (in memory) to an x87 register and back to memory preserves the exact bit pattern of the input, and raises no FP exception flags. I believe this is true for all values other than sNaN.

So, totally doable. But, it's still pretty complex, and will have some performance/code-size cost, so I doubt we want to enable this by default. sNaN is a pretty niche feature, and x86-32 is becoming a pretty niche architecture. The combination of people who care about sNaN on x86-32 is likely pretty negligible. But, it would be nice to at least offer a mode where this works.

We could use a routine like below to convert from a float in integer form, to a value in an x87 register (almost-C code, but hand-waving away the x87 calling convention boundaries).

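/* pushes result onto x87 stack */ void nansafe_float_to_x87(uint32_t in) {
  char buf[10];
  /* sNaN: exponent all ones, quiet bit clear, remaining fraction bits nonzero */
  if ((in & 0x7fc00000) == 0x7f800000 && (in & ~0xffc00000) != 0) {
    /* The shift moves the fraction to bits 62..40 and drops the sign and upper
       exponent bits; the low exponent bit (always 1 here) becomes the explicit
       integer bit 63. */
    uint64_t out_lo = ((uint64_t)in) << 40;
    memcpy(buf, &out_lo, 8);
    /* exponent all ones, sign bit moved from bit 31 to bit 15 */
    uint16_t out_hi = 0x7fff | ((in & 0x80000000) >> 16);
    memcpy(buf+8, &out_hi, 2);
    /* use FLD m80 to copy 'buf' into x87 */
  } else {
    /* use FLD m32 to format-convert 'in' into x87 */
  }
}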
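The bit manipulation in the slow path can be checked without any x87 code. Below is a sketch in portable C of just that part, assuming a hypothetical helper name (`f32_snan_to_x87_bytes`) and writing the 10 bytes in explicit little-endian order rather than relying on `memcpy` and host endianness. It sets the integer bit directly instead of relying on the shifted exponent bit; the resulting bytes are the same.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* If `in` is a binary32 sNaN, write the equivalent x87 80-bit encoding into
 * `buf` (little-endian, as FLD m80 expects) and return true. Otherwise return
 * false and leave `buf` untouched; the caller would use FLD m32 instead. */
bool f32_snan_to_x87_bytes(uint32_t in, uint8_t buf[10]) {
    assert(buf != NULL);
    bool is_snan = (in & 0x7fc00000u) == 0x7f800000u   /* exponent all ones, quiet bit clear */
                && (in & 0x003fffffu) != 0;            /* remaining fraction bits nonzero */
    if (!is_snan) {
        return false;
    }
    /* Explicit integer bit (bit 63) set; 23 fraction bits placed at bits 62..40. */
    uint64_t significand = UINT64_C(0x8000000000000000)
                         | ((uint64_t)(in & 0x007fffffu) << 40);
    /* Exponent all ones in bits 14..0; sign moved from bit 31 to bit 15. */
    uint16_t sign_exp = (uint16_t)(0x7fffu | ((in >> 16) & 0x8000u));

    for (int i = 0; i < 8; i++) {
        buf[i] = (uint8_t)(significand >> (8 * i));
    }
    buf[8] = (uint8_t)(sign_exp & 0xffu);
    buf[9] = (uint8_t)(sign_exp >> 8);
    return true;
}
```

For example, the negative sNaN 0xffa00001 yields the significand 0xa000010000000000 with sign/exponent word 0xffff, while a quiet NaN such as 0x7fc00000 is rejected and left to the native conversion.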

Then, of course, in the caller, we'd need to do the same thing in reverse, to convert from 80-bit x87 back to float. Since you already need to copy and store out the 80-bit value from the x87 register to memory, to even detect that it's an sNaN, a separate inline fast-path doesn't seem that doable.

Something like:

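void nansafe_x87_to_float(/* value on top of x87 stack */ in, /* 10-byte scratch */ char *buf, /* 4-byte result */ char *out) {
  /* use FLD st(0) to duplicate top of x87 stack. */
  /* use FSTP m80 to pop x87 stack into buf */
  uint32_t in_mid;
  uint16_t in_hi;
  memcpy(&in_mid, buf+4, 4);  /* upper 32 bits of the significand */
  memcpy(&in_hi, buf+8, 2);   /* sign and exponent */

  /* sNaN: exponent all ones, integer bit set, quiet bit clear, and a nonzero
     fraction in the bits that fit in a float (integer and quiet bits masked off) */
  if((in_hi & 0x7fff) == 0x7fff && (in_mid & 0xc0000000) == 0x80000000 && ((in_mid >> 8) & ~0xffc00000) != 0) {
    /* use FSTP m80 again, to pop extra stack register */
    /* cast before shifting so the sign bit does not overflow a signed int */
    uint32_t out_num = 0x7f800000 | ((uint32_t)(in_hi & 0x8000) << 16) | (in_mid >> 8);
    memcpy(out, &out_num, 4);
  } else {
    /* use FSTP m32 to pop top of stack and format-convert to float in 'out'. */
  }
}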
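As with the other direction, the detection and repacking logic can be exercised on plain bytes. This is a sketch in portable C under the same assumptions: the helper name (`x87_bytes_to_f32_snan`) is hypothetical, the 10-byte buffer is read in explicit little-endian order, and only sNaNs whose payload fits in a float's fraction bits are recognized (the low 40 significand bits are ignored, as in the routine above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* If the 80-bit value in `buf` (little-endian, as written by FSTP m80) is an
 * sNaN with a payload in the float-representable fraction bits, store the
 * matching binary32 bit pattern in `*out` and return true. Otherwise return
 * false; the caller would fall back to FSTP m32. */
bool x87_bytes_to_f32_snan(const uint8_t buf[10], uint32_t *out) {
    assert(buf != NULL && out != NULL);
    /* Upper 32 bits of the 64-bit significand (bytes 4..7). */
    uint32_t sig_hi = (uint32_t)buf[4]
                    | ((uint32_t)buf[5] << 8)
                    | ((uint32_t)buf[6] << 16)
                    | ((uint32_t)buf[7] << 24);
    /* Sign (bit 15) and exponent (bits 14..0). */
    uint16_t sign_exp = (uint16_t)(buf[8] | (buf[9] << 8));

    bool exp_all_ones = (sign_exp & 0x7fffu) == 0x7fffu;
    bool int_set_quiet_clear = (sig_hi & 0xc0000000u) == 0x80000000u;
    /* Fraction bits below the quiet bit, aligned to binary32 bits 21..0. */
    uint32_t fraction = (sig_hi >> 8) & 0x003fffffu;

    if (!(exp_all_ones && int_set_quiet_clear && fraction != 0)) {
        return false;
    }
    *out = 0x7f800000u | ((uint32_t)(sign_exp & 0x8000u) << 16) | fraction;
    return true;
}
```

Feeding it the bytes 00 00 00 00 00 01 00 a0 ff ff gives back 0xffa00001, which is the input from the float-to-x87 direction; infinities and quiet NaNs return false.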

Should be relatively straightforward to add a set of routines like that to compiler-rt (written in asm, so they can use a nonstandard ABI to deal with x87 stack I/O), and have the backend emit a call instead of native FST/FLD instructions.

Could some of it be inlined instead? Perhaps, but IMO not terribly worthwhile, given that the check for whether to take the slow-path is already half the code.

I don't plan to work on this, but if someone else does, hope the above helps. :)

jyknight added the backend:X86 label Sep 19, 2023

jyknight mentioned this issue Sep 19, 2023

[LangRef] Specify NaN behavior more precisely #66579

Merged

RalfJung mentioned this issue Sep 20, 2023

Tracking issue: 32bit x86 targets lose float NaN payload in return values rust-lang/rust#115567

Open

beetrees mentioned this issue Jul 8, 2024

LLVM miscompiles passing/returning half on several backends by using lossy conversions #97981

Open

7 tasks

beetrees mentioned this issue Oct 28, 2024

Test more targets against a custom-built musl libm rust-lang/libm#300

Merged
