x86-32: floating-point return values undergo implicit format conversion #66803

Open
jyknight opened this issue Sep 19, 2023 · 7 comments

@jyknight
Member

With SSE2 disabled, the floating-point semantics are pretty hopeless, with implicit conversions happening all over the place (e.g. spilling registers to stack). Most of those issues go away with SSE2 enabled, because we use SSE2 instructions and registers for all float/double operations.

The remaining issue with SSE2 enabled is that the default C ABI requires float and double values to be returned in x87 registers. Returning a float or double value thus converts it to x86_fp80 (and then back, in the caller). This conversion means that a signaling NaN cannot be returned, because the behind-the-scenes conversion to x86_fp80 will raise an FP invalid exception and quiet the NaN.
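
To make the effect concrete, here is a minimal sketch in portable C that models, on raw bit patterns only, what the format conversion does to a float sNaN: the value comes back with its quiet bit (bit 22) set. The names `is_snan_f32`, `model_x87_roundtrip_f32` and `demo_quieting` are hypothetical and not part of LLVM; the model ignores the FP invalid flag and does not execute any x87 instruction.

```c
#include <assert.h>
#include <stdint.h>

/* A float is a signaling NaN when the exponent is all ones, the quiet bit
   (bit 22) is clear, and the remaining mantissa bits are not all zero. */
int is_snan_f32(uint32_t bits) {
  return (bits & 0x7fc00000u) == 0x7f800000u && (bits & 0x003fffffu) != 0;
}

/* Hypothetical model of the return-value round trip (FLD m32 in the callee,
   FSTP m32 in the caller): an sNaN comes back quieted, and every other bit
   pattern is assumed to come back unchanged. */
uint32_t model_x87_roundtrip_f32(uint32_t bits) {
  return is_snan_f32(bits) ? (bits | 0x00400000u) : bits;
}

/* A few sample bit patterns and what the model predicts for them. */
void demo_quieting(void) {
  assert(model_x87_roundtrip_f32(0x7f800001u) == 0x7fc00001u); /* sNaN is quieted */
  assert(model_x87_roundtrip_f32(0x7fc00001u) == 0x7fc00001u); /* qNaN unchanged */
  assert(model_x87_roundtrip_f32(0x7f800000u) == 0x7f800000u); /* +inf unchanged */
}
```

The payload bits survive; only the signaling/quiet distinction is lost, which is exactly what a caller relying on sNaN semantics would observe.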

LLVM does support other ABIs which don't have this problem: you can either use an alternative calling convention on the function (such as "fastcc"), or annotate the return type with "inreg" (as seen here):

// The X86-32 calling convention returns FP values in FP0, unless marked
// with "inreg" (used here to distinguish one kind of reg from another,
// weirdly; this is really the sse-regparm calling convention) in which
// case they use XMM0, otherwise it is the same as the common X86 calling
// conv.

While this is a fundamental problem with the x86-32 ABI, I believe we could potentially fix it on the LLVM side, without breaking the ABI, because loading/storing an 80-bit value from x87 FPU register does not trigger a conversion operation. Thus, we could potentially write custom conversion routines to go from 32/64-bit float to 80-bit float (and back), and use that at the call boundary.

Such a routine would have runtime overhead vs using the X87 FPU's native conversion support, and it's also unclear whether anyone cares enough about precise x86-32 FP semantics in order to actually bother implementing it. But, it seemed worth at least recording the issue, and a possible resolution.

@jyknight
Member Author

See also Rust investigations linked from rust-lang/rust#115567 @RalfJung

@DimitryAndric
Collaborator

See also #29774 (7 years old now :) where it keeps asserting when SSE2 is disabled, due to some internal inconsistency. This has never been fixed, and there are lots and lots of duplicates...

@RalfJung
Contributor

This issue is specifically about the problems that remain when SSE2 is enabled.

@phoebewang
Contributor

> Such a routine would have runtime overhead vs using the X87 FPU's native conversion support, and it's also unclear whether anyone cares enough about precise x86-32 FP semantics in order to actually bother implementing it. But, it seemed worth at least recording the issue, and a possible resolution.

I think we may need to extend the -fexcess-precision=fast|standard support in the backend to balance both precision and performance. IIRC, this is how GCC solves this problem.

@RalfJung
Contributor

> While this is a fundamental problem with the x86-32 ABI, I believe we could potentially fix it on the LLVM side, without breaking the ABI, because loading/storing an 80-bit value from x87 FPU register does not trigger a conversion operation. Thus, we could potentially write custom conversion routines to go from 32/64-bit float to 80-bit float (and back), and use that at the call boundary.

Could this routine be called for NaN values only (to preserve their bits perfectly), using the native routine for other values? Then it could be skipped entirely when nnan can be deduced.

@jyknight
Member Author

> Could this routine be called for NaN values only

The requirement is that a round-trip from float/double (in memory) to x87 register and back to memory preserves the exact bitpattern of the input, and raises no fp exception flags. I believe this is true for all values other than sNaN.

So, totally doable. But, it's still pretty complex, and will have some performance/code-size cost, so I doubt we want to enable this by default. sNaN is a pretty niche feature, and x86-32 is becoming a pretty niche architecture. The combination of people who care about sNaN on x86-32 is likely pretty negligible. But, it would be nice to at least offer a mode where this works.

We could use a routine like below to convert from a float in integer form, to a value in an x87 register (almost-C code, but hand-waving away the x87 calling convention boundaries).

/*pushes onto x87 stack*/ nansafe_float_to_x87(uint32_t in) {
  char buf[10];
  if ((in & 0x7fc00000) == 0x7f800000 && (in & ~0xffc00000) != 0) {
    uint64_t out_lo = ((uint64_t)in) << 40;
    memcpy(buf, &out_lo, 8);
    uint16_t out_hi = 0x7fff | ((in & 0x80000000) >> 16);
    memcpy(buf+8, &out_hi, 2);
    /* use FLD m80 to copy 'buf' into x87 */
  } else {
    /* use FLD m32 to format-convert 'in' into x87 */
  }
}

Then, of course, in the caller, we'd need to do the same thing in reverse, to convert from 80-bit x87 back to float. Since you already need to copy and store out the 80-bit value from the x87 register to memory, to even detect that it's an sNaN, a separate inline fast-path doesn't seem that doable.

Something like:

void nansafe_x87_to_float(/* value on top of x87 stack input*/ in, /* 10-byte */ char *buf, float *out) {
  /* use FLD st(0) to duplicate top of x87 stack. */
  /* use FSTP m80 to pop x87 stack into buf */
  uint32_t in_mid;
  uint16_t in_hi;
  memcpy(&in_mid, buf+4, 4);
  memcpy(&in_hi, buf+8, 2);

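  /* sNaN: exponent all ones, integer bit set, quiet bit clear, payload bits 29..8 of in_mid nonzero */
  if ((in_hi & 0x7fff) == 0x7fff && (in_mid & 0xc0000000) == 0x80000000 && (in_mid & 0x3fffff00) != 0) {
    /* use FSTP m80 again, to pop extra stack register */
    /* cast before shifting so the sign bit is not shifted into a signed int */
    uint32_t out_num = 0x7f800000 | ((uint32_t)(in_hi & 0x8000) << 16) | ((in_mid >> 8) & 0x007fffff);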
    memcpy(out, &out_num, 4);
  } else {
    /* use FSTP m32 to pop top of stack and format-convert to float in 'out'. */
  }
}
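
The bit manipulation in the two sNaN paths can be checked without any x87 code. Below is a minimal sketch, assuming nothing beyond the layouts discussed above, that encodes a float sNaN into the 10-byte extended layout and decodes it again using plain integer operations. The function names are hypothetical, and bytes are written explicitly in little-endian order so the check does not depend on the host.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a float sNaN bit pattern into the 10-byte x87 extended layout.
   The mantissa lands in significand bits 62..40, and the lowest exponent bit
   (always 1 for a NaN) becomes the explicit integer bit 63. */
void snan_f32_to_x87_bytes(uint32_t in, unsigned char buf[10]) {
  uint64_t lo = (uint64_t)in << 40;
  uint16_t hi = (uint16_t)(0x7fffu | ((in & 0x80000000u) >> 16));
  for (int i = 0; i < 8; i++)
    buf[i] = (unsigned char)(lo >> (8 * i));
  buf[8] = (unsigned char)(hi & 0xffu);
  buf[9] = (unsigned char)(hi >> 8);
}

/* Decode the 10-byte layout back to a float sNaN bit pattern.
   Returns 1 and writes *out on the sNaN path; returns 0 when the value is
   not an sNaN with a float-sized payload (the real routine would then fall
   back to the native FSTP m32 conversion). */
int x87_bytes_to_snan_f32(const unsigned char buf[10], uint32_t *out) {
  uint32_t mid = (uint32_t)buf[4] | (uint32_t)buf[5] << 8 |
                 (uint32_t)buf[6] << 16 | (uint32_t)buf[7] << 24;
  uint16_t hi = (uint16_t)(buf[8] | buf[9] << 8);
  if ((hi & 0x7fffu) != 0x7fffu) return 0;          /* exponent not all ones */
  if ((mid & 0xc0000000u) != 0x80000000u) return 0; /* need integer bit set, quiet bit clear */
  if ((mid & 0x3fffff00u) == 0) return 0;           /* no payload in the float-sized bits */
  *out = 0x7f800000u | ((uint32_t)(hi & 0x8000u) << 16) | ((mid >> 8) & 0x007fffffu);
  return 1;
}

/* Round-trip check: encoding then decoding must reproduce the input bits. */
void check_snan_f32_roundtrip(uint32_t in) {
  unsigned char buf[10];
  uint32_t out = 0;
  snan_f32_to_x87_bytes(in, buf);
  assert(x87_bytes_to_snan_f32(buf, &out));
  assert(out == in);
}
```

Calling `check_snan_f32_roundtrip(0x7f800001u)` or `check_snan_f32_roundtrip(0xffbfffffu)` exercises the smallest positive and the largest negative sNaN payloads.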

Should be relatively straightforward to add a set of routines like that to compiler-rt (written in asm, so they can use a nonstandard ABI to deal with x87 stack I/O), and have the backend emit a call instead of native FST/FLD instructions.
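
The routines above only cover float, while the issue applies to double as well. The same shift trick appears to carry over: a double's 52 mantissa bits land in significand bits 62..11 after a left shift by 11. The following is a sketch under that assumption, not something taken from the thread; `x87_bits` and the function names are hypothetical, and only the sNaN path is modeled (the other path would use the native FLD m64 / FSTP m64).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory view of an x87 extended value: a 64-bit significand
   (explicit integer bit at bit 63) plus 16 bits of sign and exponent. */
typedef struct {
  uint64_t significand;
  uint16_t sign_exp;
} x87_bits;

/* double sNaN: exponent all ones, quiet bit (bit 51) clear, payload nonzero. */
int is_snan_f64(uint64_t in) {
  return (in & 0x7ff8000000000000ull) == 0x7ff0000000000000ull &&
         (in & 0x0007ffffffffffffull) != 0;
}

/* Caller is expected to have checked is_snan_f64(in). The lowest exponent
   bit (bit 52, always 1 here) becomes the explicit integer bit 63. */
x87_bits snan_f64_to_x87(uint64_t in) {
  x87_bits out;
  out.significand = in << 11;
  out.sign_exp = (uint16_t)(0x7fffu | ((in >> 63) << 15));
  return out;
}

/* Returns 1 and writes *out on the sNaN path, 0 otherwise. */
int x87_to_snan_f64(x87_bits in, uint64_t *out) {
  if ((in.sign_exp & 0x7fffu) != 0x7fffu) return 0;
  if ((in.significand >> 62) != 2) return 0; /* integer bit set, quiet bit clear */
  if ((in.significand & 0x3ffffffffffff800ull) == 0) return 0;
  *out = 0x7ff0000000000000ull | ((uint64_t)(in.sign_exp & 0x8000u) << 48) |
         ((in.significand >> 11) & 0x000fffffffffffffull);
  return 1;
}

/* Round-trip check: encoding then decoding must reproduce the input bits. */
void check_snan_f64_roundtrip(uint64_t in) {
  uint64_t out = 0;
  assert(is_snan_f64(in));
  assert(x87_to_snan_f64(snan_f64_to_x87(in), &out));
  assert(out == in);
}
```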

Could some of it be inlined instead? Perhaps, but IMO not terribly worthwhile, given that the check for whether to take the slow-path is already half the code.

I don't plan to work on this, but if someone else does, hope the above helps. :)
