Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Determine the strategy that should be used by the Jit when folding implementation-defined casts #47478

Closed
SingleAccretion opened this issue Jan 26, 2021 · 12 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@SingleAccretion
Copy link
Contributor

SingleAccretion commented Jan 26, 2021

Today, the Jit supports folding casts like these: (uint)float.MaxValue. The value returned by this cast when done by the Jit is defined by the C++ compiler it was compiled with, which may or may not match the value returned by the instruction that will be emitted for an equivalent dynamic cast. It also may or may not match the value Roslyn will produce (0) when constant-folding these casts at compile time.

One of the manifestations of this issue was observed in #47374, where MSVC x86 compiler gave a different answer to a question that would "normally" return 0.

However, a scenario like the above is actually not the main concern that I would like to raise for consideration in this issue. A much bigger problem is cross-compilation, which, presumably, will become fairly common once Crossgen2 is fully productized, where the AOT Jit can potentially give a different answer to the same question than the JIT Jit.

As I see it, there are a couple of options that can pursued here:

  1. Do nothing. As per the ECMA specification, overflowing casts from floating-point types to integers are undefined behavior. This could lead to some problems (described below), but hopefully the impact will be minimal as such casts are not exactly widespread, and only an incredibly small fraction of them will involve constants (I conjecture this can only occur in real-world code when inlining).

  2. Do not fold when the result is undefined (as per ECMA), neither in AOT, nor the JIT mode. This options has a couple of benefits associated with it:

    • Behavior of folded code and non-folded code will be consistent for one architecture. This is in particular means that Release and Debug build should behave the same.
    • AOT mode will be consistent with the JIT mode "for free", as no implementation-defined folding will be done at AOT time.
    • It is a relatively simple approach, implementation wise.
      However, this would mean that some folding opportunities will be missed and some UB not taken advantage of when optimizing, which can have cascading effects on, for example, inlining.
  3. Only fold when not in AOT mode. This will eliminate the cross-compilation concerns, but can still potentially results in inconsistencies, depending on how the folding will be implemented (see below).

  4. Fold always. This will be a stronger option than the above, and have the same problems/benefits, just magnified by the amount of code that will be exposed.

Now, one critical decision to make if it is to be decided that the folding is to be performed for these edge cases, is how should it be done? Again, there are a couple of options:

  1. Fold everything to zero. This is what Roslyn does today, but it would mean that there will be inconsistencies between the folding and the dynamic execution, which existing comments in the code highlight as undesirable:
    // to an integer, ..., the value returned is unspecified." However, it would
    // at least be desirable to have the same value returned for casting an overflowing
    // constant to an int as would obtained by passing that constant as a parameter
    // then casting that parameter to an int type. We will assume that the C compiler's

    This is the reason why e. g. (uint)float.PositiveInfinity is not folded today, as are conversions from NaN. Opting for this behavior is also simple in terms of the implementation, but it comes with the usual baggage that UB has: Debug vs Release inconsistencies, inlining vs no inlining inconsistencies (which may become dynamic with the advent of PGO and such), etc.
  2. Fold to the same value the native instruction of the target architecture doing the conversion would produce when casting dynamically. This means that for x64, for example, the Jit would follow vcvttss2si's behavior, for ARM64 - fcvtzus, and so on. This is a relatively expensive option implementation-wise as the existing codegen would need to be surveyed for the exact instructions used, their behavior determined and stored in a dedicated table, which would need to be kept in sync with any future changes in the code generator. Hopefully, this should be a fixed-sized cost however, as it is not really expected that the emitted instructions will be changing frequently, if at all.

As I am no expert of the matter, this is only an overview of the problem, with the detailed analysis on the cost/benefit ratios of the options missing, with the intent on starting the discussion and determining those concretely.

It may turn out that the decision that will conclude this is that this simply is a non-issue extremely unlikely to be hit by anyone in the real world and thus doesn't merit any effort being dedicated to fixing it.

category:design
theme:optimization

@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Jan 26, 2021
@tannergooding
Copy link
Member

Fold to the same value the native instruction of the target architecture doing the conversion would produce when casting dynamically.
This is a relatively expensive option implementation-wise as the existing codegen would need to be surveyed for the exact instructions used, their behavior determined and stored in a dedicated table, which would need to be kept in sync with any future changes in the code generator.

I think this would only be required if the folding was done at R2R/AOT time. For JIT time scenarios, where you are targeting the current machine, you can just execute the underlying C/C++ conversion and don't need to maintain any table outside of what cases are undefined behavior.

@SingleAccretion
Copy link
Contributor Author

One benefit of using the table anyway (of course, it'll only be used for cases that overflow) is that we'll get guaranteed consistent behavior across compilers. It also seems like using the table if we already have one (note that we say "the table", but really it is a couple of #ifs on the TARGET) would be simpler in terms of control flow?

@JulieLeeMSFT
Copy link
Member

CC @AndyAyersMS

@AndyAyersMS
Copy link
Member

@SingleAccretion thanks for the writeup.
cc @davidwrighton who has also done some thinking in this area.

We should try to get this sorted during .NET 6.

@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Jan 29, 2021
@davidwrighton
Copy link
Member

I'm working on a writeup for what we should be doing here. It's irritatingly complicated. My current thinking on the matter is that we should provide a CoreCLR definition for the the behavior of all casting that is unified across architectures and platforms. It will cause a slight performance impact, but even microbenchmarks show that a helper function could be used without a significant impact on real applications. (They show a 25-50% slowdown for an application which does nothing but casting, and as we all know casting may be common, but its not THAT common.)

@tannergooding Unfortunately, relying on the C++ compiler for casting behavior in undefined scenarios is a complete mess. For instance, the MSVC compiler has a variety of different conversion behaviors, none of which exactly match the our codegen in various edge cases even on X64 Windows.

@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.0 milestone Feb 10, 2021
@tannergooding
Copy link
Member

but even microbenchmarks show that a helper function could be used without a significant impact on real applications

Could you clarify what types of applications were tested? I'd expect a heavier hit in things like ML.NET (as well as the Pytorch and Tensorflow ports) since they tend to be more heavily dependent on doing the same floating-point operation over and over in a loop.
I'd imagine it likewise negatively impacts something like ImageSharp (although less as they move to using SIMD for more scenarios).

@tannergooding
Copy link
Member

The JIT has always had undefined behavior for things like this and for over/under shifting of a value (with C# handling the latter by explicitly emitting & mask on all shifts) and even when referring to a single architecture like x86/x64, you have cases where Intel and AMD differ in behavior (like with FLD, FSTP, BSF, and BSR). Its largely just with the introduction of ARM that these are becoming more visible due to explicit differences in behavior.

We also have much larger inconsistencies in behavior elsewhere that users already have to deal with. Like in certain IO/file handling between Unix and Windows or in the results returned by Math.Sin across x86, x64, ARM32, ARM64, Windows, Linux, macOS, and all the variations therein.

So while I think we should first and foremost aim to be IEEE compliant, I also feel where those differences are undefined behavior that its fine to have them be inconsistent between different machines.

@tannergooding
Copy link
Member

tannergooding commented Feb 10, 2021

For instance, the MSVC compiler has a variety of different conversion behaviors, none of which exactly match the our codegen in various edge cases even on X64 Windows.

For what its worth, at least one of these cases is a clear bug and causes us to not be IEEE compliant: #43895
Given .NET core doesn't have the same compat requirements, I'd expect us to be fixing this.

The only real edge cases that exist are values outside the range of the target type. For double->float, there is clear IEEE required behavior here.
For double/float->integral the behavior is undefined, but most implementations normalize to one of 0, MaxValue, or MinValue (where hardware support is not available).

@JulieLeeMSFT
Copy link
Member

Ping @davidwrighton, do you plan to work on this in .NET 6 or will you move this to .NET 7?

@davidwrighton davidwrighton modified the milestones: 6.0.0, 7.0.0 Aug 4, 2021
@davidwrighton davidwrighton modified the milestones: 7.0.0, 8.0.0 Aug 4, 2022
@JulieLeeMSFT JulieLeeMSFT modified the milestones: 8.0.0, 9.0.0 Jul 28, 2023
@davidwrighton
Copy link
Member

@tannergooding can you take this over? Or have we already done this?

@tannergooding
Copy link
Member

We now have a unified behavior for casting of floating-point to integer. I don't think we have done the work to enable the JIT to constant fold in those scenarios (including for R2R/Crossgen/etc)

@davidwrighton
Copy link
Member

Sounds like this is done then. Constant folding is an optimization detail, not a detail of how folding should work.

@github-actions github-actions bot locked and limited conversation to collaborators May 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
Status: Done
Development

No branches or pull requests

5 participants