
JIT: implied Boolean range assertions for local prop #109481

Merged
merged 5 commits into dotnet:main on Nov 25, 2024

Conversation

AndyAyersMS
Member

If morph sees an assignment of 0/1 to a local, have it generate a [0..1] range assertion in addition to the constant assertion.

This helps morph propagate more Boolean ranges at merges, which allows morph to elide more casts.

Also, bump up the local assertion table size since morph is now generating more assertions.
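
As a rough illustration (not code from the PR, and the names are made up), this is the kind of source pattern the change targets; the comments describe the intended morph-time bookkeeping rather than actual JIT output:

```cpp
// Hypothetical example: both definitions of 'flag' are the constants 0/1.
// With this change, morph records "flag in [0..1]" alongside the
// "flag == 0" / "flag == 1" constant assertions, so the range fact survives
// the merge below and a later normalization of 'flag' (e.g. a cast back to a
// 0/1 value) can be elided.
bool ComputeFlag(int value)
{
    bool flag;
    if (value > 0)
    {
        flag = true;  // assertions: flag == 1, flag in [0..1]
    }
    else
    {
        flag = false; // assertions: flag == 0, flag in [0..1]
    }
    return flag;      // at the merge, only "flag in [0..1]" remains
}
```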

@dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Nov 2, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS
Member Author

Split off from #109469. Seems beneficial on its own. Let's see how costly it is..

@AndyAyersMS
Member Author

> Split off from #109469. Seems beneficial on its own. Let's see how costly it is..

Surprisingly costly... let's try smaller BVs.

[image]

@AndyAyersMS
Member Author

Might need to do some actual data modeling to get a better idea of how to right-size these bit vectors.

The old formula I created used lvaCount, which seems a bit wrongheaded; we should only be generating assertions for tracked locals. However, it is not clear whether lvaTrackedCount is better or even accurate (it may be an underestimate).

We have metrics tracking (roughly) how many assertions we drop because of table size limits (rough because once the table is full, repeated attempts to add, say, x == 0 will fail and each will count as a miss, though it would only have created one table entry). So we could start with that. But perhaps we'd need to run with super-large tables, collect the never-miss table data along with the available measurements, look at the distribution, and then try to find a good compromise: a formula based on the measurements that gets us most of the assertions in most of the cases.

@AndyAyersMS
Member Author

Here's some data from ASP.NET, with the assertion table size set to 4096. About 82000 methods. The current table sizing loses about 29000 assertions (with the change in this PR, which creates more assertions than usual).

There is a decent linear trend of roughly 1.17 assertions per tracked local, but that is too conservative for many small methods.

[image]

The number of tracked locals is fairly predictably about 0.76 of the total number of locals, so either measure can be used to build a sizing heuristic.

[image]

Adding sizing and a trend line to the original chart for the new heuristic (orange below), it does a good job of not missing too many assertions (about 3.5K get dropped). And (not shown) it almost always picks bigger vectors than current main (bigger or the same for 79.5K methods, smaller for 2.5K), and perhaps not surprisingly the vectors are about 32 bits longer on average.

[image]

Unfortunately Pin is broken on my box, so I can't dig in with instruction counts and see whether the TP culprit is the extra BV width, the extra assertion construction cost, or the extra morph work.

Zooming in on the smaller methods (fewer than 200 tracked locals):

[image]

On 64-bit hosts the BV size gets rounded up to the nearest multiple of 64, so those 96-bit vectors are actually 128 bits. So, assuming BV cost is the predominant TP impact, we can claw some back for the small cases (which are overwhelmingly the common cases) by limiting the BV size to 64. If we do this for cases with fewer than 24 live locals (see the extreme detail below), we only lose assertions in a handful of methods:

[image]

@AndyAyersMS
Member Author

Proposed revised heuristic:

[image]
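
The actual formula is in the screenshot above; purely as a sketch, a heuristic built from the numbers quoted in this thread (the ~1.17 assertions-per-tracked-local trend, the 64-bit BV word granularity, and the <24 tracked-locals cutoff) might look something like this. The function name and the clamping parameter are illustrative, not the PR's code:

```cpp
#include <algorithm>

// Sketch only (constants taken from the discussion above, not the PR's code):
// size the local assertion table from the tracked-local count, using a single
// 64-bit BV word for small methods and ~1.17 assertions per tracked local
// otherwise, rounded up to the 64-bit granularity the bit vectors use anyway.
unsigned EstimateAssertionTableSize(unsigned trackedLocalCount, unsigned maxAssertionCount)
{
    unsigned size;
    if (trackedLocalCount < 24)
    {
        // Small methods: one BV word is almost always enough.
        size = 64;
    }
    else
    {
        size = (unsigned)(trackedLocalCount * 1.17 + 0.5); // linear trend from the data
        size = ((size + 63) / 64) * 64;                    // round up to BV word granularity
    }
    return std::min(size, maxAssertionCount);
}
```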

@jakobbotsch
Member

I wonder how it would affect things if we used a dynamically expansible (up to some maximum) bit vector implementation instead... Say use the lowest bit to indicate whether this is a pointer to a (length, words) tuple or just inline 63 (or 31) bits.
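
For concreteness, a minimal sketch of what such a tagged representation could look like (assumed layout and names, not existing JIT code): the low bit distinguishes an inline 63-bit set from a pointer to an out-of-line (length, words) block.

```cpp
#include <cstdint>

// Sketch of a tagged small/large bit vector (assumed layout, not JIT code).
// Low bit set   => the remaining 63 bits are an inline bit set (members 0..62).
// Low bit clear => the value is a pointer to an out-of-line { length, words[] } block.
struct BitVecLong
{
    uint32_t length;   // number of 64-bit words that follow
    uint64_t words[1]; // flexible-array-style payload
};

inline bool IsInline(uintptr_t bv)
{
    return (bv & 1) != 0;
}

inline bool IsMember(uintptr_t bv, unsigned index)
{
    if (IsInline(bv))
    {
        // Inline form: bit 0 is the tag, bits 1..63 hold members 0..62.
        return (index < 63) && (((bv >> (index + 1)) & 1) != 0);
    }
    const BitVecLong* p    = (const BitVecLong*)bv;
    unsigned          word = index / 64;
    return (word < p->length) && (((p->words[word] >> (index % 64)) & 1) != 0);
}
```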

@AndyAyersMS
Member Author

> I wonder how it would affect things if we used a dynamically expansible (up to some maximum) bit vector implementation instead... Say use the lowest bit to indicate whether this is a pointer to a (length, words) tuple or just inline 63 (or 31) bits.

Phoenix had a fairly sophisticated sparse BV setup; maybe we can just steal it.

I need to profile this to see if the BV size is the culprit -- will try one of my other boxes.

@AndyAyersMS
Member Author

Looks like BV cost is not the dominant factor:

range\local-boolean-range.csproj]
  8796189684   9164784738  0.06%  |  public: struct GenTree * __cdecl Compiler::optAssertionProp_LclVar(unsigned __int64 *const &, struct GenTreeLclVarCommon *, struct Statement *)
           0    284474182  0.05%  |  `Compiler::fgAssertionGen'::`2'::<lambda_2>::operator()
   977146327   1196551328  0.04%  |  private: void __cdecl Compiler::fgAssertionGen(struct GenTree *)
  3532984967   3751969201  0.04%  |  public: unsigned short __cdecl Compiler::optAddAssertion(struct Compiler::AssertionDsc *)
   721753622    857188262  0.02%  |  public: static unsigned __int64 * __cdecl BitSetOps<unsigned __int64 *, 1, struct BitVecTraits *, struct BitVecTraits>::MakeCopy(struct BitVecTraits *, unsigned __int64 *)
 11823108036  11925253145  0.02%  |  public: void * __cdecl ArenaAllocator::allocateMemory(unsigned __int64)
  1017382342   1103530350  0.01%  |  public: static void __cdecl BitSetOps<unsigned __int64 *, 1, struct BitVecTraits *, struct BitVecTraits>::IntersectionD(struct BitVecTraits *, unsigned __int64 *&, unsigned __int64 *)
  2747803608   2806466352  0.01%  |  __security_check_cookie
   186332106    237772711  0.01%  |  public: static void __cdecl BitSetOps<unsigned __int64 *, 1, struct BitVecTraits *, struct BitVecTraits>::DiffD(struct BitVecTraits *, unsigned __int64 *&, unsigned __int64 *)
   952832850    984661705  0.01%  |  public: unsigned __int64 *& __cdecl Compiler::GetAssertionDep(unsigned int)

Instead it's the cost of making the new assertions and then later filtering through them. I think we can improve the filtering part a little. Will do that separately, as it should be a no-diff change without the other changes here.

@AndyAyersMS
Member Author

AndyAyersMS commented Nov 23, 2024

Local TP of this PR plus #110091 vs. baseline (win-x64 libraries tests, no TC) shows we come out ahead:

  8796189684   3168571960  -0.97%  |  public: struct GenTree * __cdecl Compiler::optAssertionProp_LclVar(unsigned __int64 *const &, struct GenTreeLclVarCommon *, struct Statement *)
  3540234974    305186960  -0.56%  |  public: struct GenTree * __cdecl Compiler::optCopyAssertionProp(struct Compiler::AssertionDsc *, struct GenTreeLclVarCommon *, struct Statement *)
   721753622   1331032982   0.10%  |  public: static unsigned __int64 * __cdecl BitSetOps<unsigned __int64 *, 1, struct BitVecTraits *, struct BitVecTraits>::MakeCopy(struct BitVecTraits *, unsigned __int64 *)
  1017382342   1505682806   0.08%  |  public: static void __cdecl BitSetOps<unsigned __int64 *, 1, struct BitVecTraits *, struct BitVecTraits>::IntersectionD(struct BitVecTraits *, unsigned __int64 *&, unsigned __int64 *)
   952832850   1311702304   0.06%  |  public: unsigned __int64 *& __cdecl Compiler::GetAssertionDep(unsigned int)
  1660842818   1967928213   0.05%  |  protected: void __cdecl JitExpandArray<struct ValueNumStore::Chunk *>::EnsureCoversInd(unsigned int)
           0    284474182   0.05%  |  `Compiler::fgAssertionGen'::`2'::<lambda_2>::operator()
 11823108036  12090888934   0.05%  |  public: void * __cdecl ArenaAllocator::allocateMemory(unsigned __int64)
   977146327   1196551328   0.04%  |  private: void __cdecl Compiler::fgAssertionGen(struct GenTree *)
  3532984967   3751969201   0.04%  |  public: unsigned short __cdecl Compiler::optAddAssertion(struct Compiler::AssertionDsc *)

though that perhaps had outsized benefits from #110091

@AndyAyersMS
Member Author

AndyAyersMS commented Nov 23, 2024

New diffs

Worst-case TP impact is about 0.3%, which was paid for by #110091.

SPMI failures are unrelated, likely something to do with the recent ISA additions; opened #110106.

[02:08:27] ISSUE: <ASSERT> #140 D:\a\_work\1\s\src\coreclr\jit\compiler.cpp (2281) - Assertion failed 'instructionSetFlags.Equals(EnsureInstructionSetFlagsAreValid(instructionSetFlags))' in 'PlatformBenchmarks.Startup:Configure(Microsoft.AspNetCore.Builder.IApplicationBuilder):this' during 'Pre-import' (IL size 1; hash 0x79211bb5; Optimization-Level-Not-Yet-Set)

@EgorBo PTAL
cc @dotnet/jit-contrib

@EgorBo self-requested a review November 24, 2024 07:40
Member


@AndyAyersMS I am a bit confused - if a local already has an lcl = 1 (O1K_LCLVAR OAK_EQUAL O2K_CONST_INT) assertion, why does it need a separate OAK_SUBRANGE one?

Member Author


AP's merge strategy is very simple: it just intersects bit vectors. So the merge of X=1 from one path and X=[0,1] from another is the empty set (since those are different assertions) instead of the more reasonable X=[0,1].

A trickier case might be X=0 on one path and X=1 on another.

Adding merge smarts to AP would likely slow things down quite a bit (though we could try it as an alternative). But by adding a partially redundant X=[0,1] whenever we have X=0 or X=1, things just work out naturally, and the number of extra assertions isn't too bad.
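
A tiny worked example of that intersection (illustrative assertion indices and plain 64-bit words, not the real ASSERT_TP encoding):

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative merge: bit 0 = "x == 0", bit 1 = "x == 1", bit 2 = "x in [0..1]".
int main()
{
    uint64_t pathA = 0b101; // x = 0 generates { x == 0, x in [0..1] }
    uint64_t pathB = 0b110; // x = 1 generates { x == 1, x in [0..1] }

    uint64_t merged = pathA & pathB; // assertion prop merges by intersecting bit vectors
    std::printf("merged            = %llu\n", (unsigned long long)merged); // 4: only the range survives

    uint64_t withoutRange = 0b001 & 0b010; // without the redundant range assertion
    std::printf("without the range = %llu\n", (unsigned long long)withoutRange); // 0: empty set
    return 0;
}
```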

Member


Ah, I see, makes sense then

@AndyAyersMS
Member Author

Thought this might fix #110019, but the problem there is that we lose the fact that the call to SequenceEqual returns a bool, so we need to re-test the result.

Seems like we could perhaps annotate such (managed) calls as really returning bools, and have AP generate a suitable subtype assertion...

@EgorBo
Member

EgorBo commented Nov 24, 2024

> Seems like we could perhaps annotate such (managed) calls as really returning bools, and have AP generate a suitable subtype assertion...

Yeah, I've tried that once: I added an lvBoolean to LclDesc and set that flag in the importervectorization.cpp code that produces a boolean QMARK - it had some diffs.

@MichalPetryka
Contributor

> annotate such (managed) calls as really returning bool

Isn't that unsafe for arbitrary calls since non 0/1 bools are legal in managed code?

@AndyAyersMS
Member Author

> annotate such (managed) calls as really returning bool
>
> Isn't that unsafe for arbitrary calls since non 0/1 bools are legal in managed code?

If the jit is generating the code it will guarantee the result is 0/1. For interop it depends on whether it's going via an IL stub or not.

@AndyAyersMS
Member Author

/ba-g SPMI failures

@AndyAyersMS merged commit 0349889 into dotnet:main Nov 25, 2024
103 of 108 checks passed
@jakobbotsch
Member

> If the jit is generating the code it will guarantee the result is 0/1. For interop it depends on whether it's going via an IL stub or not.

I don't think we do any sort of normalization like this. E.g. something like bool Foo(byte x) => ldarg 0; ret can produce unnormalized bools.

@EgorBo
Member

EgorBo commented Nov 25, 2024

> If the jit is generating the code it will guarantee the result is 0/1. For interop it depends on whether it's going via an IL stub or not.
>
> I don't think we do any sort of normalization like this. E.g. something like bool Foo(byte x) => ldarg 0; ret can produce unnormalized bools.

We can at least do that for some [Intrinsic] including the one in question - SequenceEqual

@AndyAyersMS
Member Author

> If the jit is generating the code it will guarantee the result is 0/1. For interop it depends on whether it's going via an IL stub or not.
>
> I don't think we do any sort of normalization like this. E.g. something like bool Foo(byte x) => ldarg 0; ret can produce unnormalized bools.
>
> We can at least do that for some [Intrinsic] including the one in question - SequenceEqual

It also might be enough for us to assert the result is [0..255].
