JIT: remove incremental ref count updates #19345
Conversation
@dotnet/jit-contrib PTAL. There didn't seem to be any way to do this without such a large change. It is mostly deletions. Diffs are surprisingly minimal -- no diffs in minopts. When optimizing:
The second ref count recompute is needed to avoid code bloat from dead stores. Release TP seems a wash, or perhaps slightly slower. We could avoid the second recompute, and perhaps save a bit of time, if we kept track of the IR alterations liveness makes.
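As a rough illustration of what a from-scratch recompute involves, here is a minimal standalone sketch; the `LocalVar`/`LocalRef` types are hypothetical stand-ins, not the JIT's actual `LclVarDsc`/`GenTree` data structures:

```cpp
// Minimal standalone sketch of a full ref count recompute. Types here are
// illustrative, not the actual coreclr JIT structures.
#include <cstdint>
#include <vector>

struct LocalVar
{
    uint32_t refCount    = 0; // raw appearance count
    uint32_t weightedCnt = 0; // block-weight-scaled count
};

struct LocalRef
{
    unsigned lclNum;
    uint32_t blockWeight; // weight of the basic block containing this ref
};

// Recompute counts from scratch: zero everything, then walk every surviving
// local reference. Running this *after* dead-store removal is what keeps
// counts (and hence tracking decisions) accurate.
void RecomputeRefCounts(std::vector<LocalVar>& locals, const std::vector<LocalRef>& refs)
{
    for (LocalVar& v : locals)
    {
        v.refCount    = 0;
        v.weightedCnt = 0;
    }
    for (const LocalRef& r : refs)
    {
        locals[r.lclNum].refCount    += 1;
        locals[r.lclNum].weightedCnt += r.blockWeight;
    }
}
```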
@dotnet-bot test Windows_NT x64 Build and Test
Hmm, I guess that makes improving sorting potentially more useful.
That post-lower liveness is a somewhat unfortunate case of ordering. We enter lowering with dead stores, so we'll end up lowering stuff that's not needed. Lowering itself is not likely to introduce many new dead stores, so it would be useful to run liveness before lowering. But then lowering can break liveness, and liveness is needed in LSRA. Oh well... I wonder if precise ref counting wouldn't enable us to get rid of at least some dead stores without liveness. It's possible that some dead stores are simply single defs without uses, not killed defs. And single defs without uses could be detected while ref counting.
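A minimal sketch of that idea -- counting defs and uses separately during the ref counting walk so that single-def, no-use locals fall out for free. The types are hypothetical, and a real implementation would also have to exclude address-exposed locals:

```cpp
// Sketch: while counting refs anyway, split the counts into defs and uses;
// a local with exactly one def and zero uses is a dead-store candidate that
// can be found without running full liveness.
#include <vector>

struct RefCounts
{
    unsigned defs = 0;
    unsigned uses = 0;
};

struct LocalRef
{
    unsigned lclNum;
    bool     isDef;
};

std::vector<unsigned> FindSingleDefNoUseLocals(unsigned lclCount,
                                               const std::vector<LocalRef>& refs)
{
    std::vector<RefCounts> counts(lclCount);
    for (const LocalRef& r : refs)
    {
        (r.isDef ? counts[r.lclNum].defs : counts[r.lclNum].uses)++;
    }

    std::vector<unsigned> deadStoreCandidates;
    for (unsigned i = 0; i < lclCount; i++)
    {
        // A real implementation would skip address-exposed locals here;
        // this sketch assumes there are none.
        if ((counts[i].defs == 1) && (counts[i].uses == 0))
        {
            deadStoreCandidates.push_back(i);
        }
    }
    return deadStoreCandidates;
}
```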
LGTM with one minor suggestion
src/jit/regalloc.cpp

```cpp
if (opts.compDbgCode && !stkFixedArgInVarArgs && lclNum < info.compLocalsCount)
{
    needSlot |= true;
    assert(varDsc->lvRefCnt() > 0);
```
Would it make sense here to set it to one if it is, in fact, zero, to avoid having to make this a noway_assert?
I suppose it can't hurt -- it should be quite difficult now to create unreferenced locals in debug or minopts, but this seems like cheap insurance.
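A standalone sketch of that cheap-insurance pattern (the `LclVarDscSketch` type and its accessors are illustrative stand-ins for the actual `LclVarDsc` API):

```cpp
// Sketch of "set it to one if it is zero" instead of hard-asserting.
// Names are hypothetical, not the real JIT accessors.
struct LclVarDscSketch
{
    unsigned refCnt = 0;
    unsigned lvRefCnt() const { return refCnt; }
    void     setLvRefCnt(unsigned c) { refCnt = c; }
};

void EnsureRefCntForDebugSlot(LclVarDscSketch* varDsc)
{
    // Unreferenced locals should not reach here in debug codegen, but
    // bump the count rather than asserting so release builds stay safe.
    if (varDsc->lvRefCnt() == 0)
    {
        varDsc->setLvRefCnt(1);
    }
}
```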
The x86 failure with 97d12d9 is a bit odd. Test case is jit\Regression\CLR-x86-JIT\V1.1-M1-Beta1\b143840. I need to root cause it further, but the precipitating change is the following codegen diff for ExternalClass:ThrowException():this:

```diff
 ; Assembly listing for method ExternalClass:ThrowException():this
 ; Emitting BLENDED_CODE for generic X86 CPU
 ; optimized code
 ; esp based frame
 ; partially interruptible
 ; Final local variable assignments
 ;
-; V00 this    [V00,T00] ( 6, 3 )  ref  ->  esi          this class-hnd
+; V00 this    [V00,T00] ( 4, 1 )  ref  ->  [esp+0x00]   this class-hnd
 ;
-; Lcl frame size = 0
+; Lcl frame size = 4
 G_M56674_IG01:
        56           push     esi
-       90909090     nop
-       8BF1         mov      esi, ecx
+       50           push     eax
+       909090       nop
+       890C24       mov      gword ptr [esp], ecx
 G_M56674_IG02:
-       8BCE         mov      ecx, esi
-       E8C2D2F60B   call     CORINFO_HELP_MON_ENTER
+       8B0C24       mov      ecx, gword ptr [esp]
+       E8C0D25D0E   call     CORINFO_HELP_MON_ENTER
 G_M56674_IG03:
+       8B3424       mov      esi, gword ptr [esp]
        8B4E04       mov      ecx, gword ptr [esi+4]
-       E84A45F70B   call     CORINFO_HELP_THROW
+       E845455E0E   call     CORINFO_HELP_THROW
        8BCE         mov      ecx, esi
-       E843D6F60B   call     CORINFO_HELP_MON_EXIT
+       E83ED65D0E   call     CORINFO_HELP_MON_EXIT
 G_M56674_IG04:
+       59           pop      ecx
        5E           pop      esi
        C3           ret
-; Total bytes of code 31, prolog size 5 for method ExternalClass:ThrowException():this
+; Total bytes of code 37, prolog size 5 for method ExternalClass:ThrowException():this
```

The extra prolog bumps for reg parameters were not kicking in for implicitly referenced locals, and … The upshot is that I'll make the prolog bumps more general, but this need for prolog bumps is likely hiding some other bug.
@mikedn agree there is more here that can be re-examined. Not sure sorting is really all that valuable, so perhaps first we should look at just getting rid of it. Also not sure why we can't run DCE / dead stores much earlier, gain some TP win from not carrying around extra IR, and avoid iterative removal -- if we have pruned SSA, surely it is not that hard to find the closure set of defs with no real use. And as you say, most of the subsequent changes should rarely introduce new dead stores.
Yep, typical methods have fewer than 512 lclvars, so we can simply track all vars that aren't disqualified for other reasons (e.g. address taken). So far I have only found this CSE code (https://github.com/dotnet/coreclr/blob/master/src/jit/optcse.cpp#L1413-L1456) where the order seems to matter. It traverses the tracked variables in sorted order and stores the first ref count it finds under certain conditions -- sort of a "get the max ref count", it seems.
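To illustrate that order dependence with a hypothetical sketch: if the tracked-variable table is sorted by descending ref count, taking the first qualifying entry's count is the same as taking the max, so an explicit max scan would remove the dependence on sort order:

```cpp
// Sketch of making the "first qualifying ref count" idiom order-independent.
// TrackedVar and the `qualifies` flag stand in for CSE's actual conditions.
#include <algorithm>
#include <vector>

struct TrackedVar
{
    unsigned refCnt;
    bool     qualifies; // stands in for the "certain conditions" in optcse.cpp
};

unsigned MaxQualifyingRefCnt(const std::vector<TrackedVar>& tracked)
{
    unsigned maxCnt = 0;
    for (const TrackedVar& v : tracked)
    {
        if (v.qualifies)
        {
            // Explicit max: correct whether or not `tracked` is sorted.
            maxCnt = std::max(maxCnt, v.refCnt);
        }
    }
    return maxCnt;
}
```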
This is going to conflict with #19351 so will need a rebase at some point.
GC info for the "bad" x86 codegen above: …
From the failure it looks like the runtime is trying to sanity-check the sync state inside a sync region and can't find the this pointer from the frame. In the good case we have: …
I think that might be good - especially when there are fewer than 512.
The register allocator will allocate the incoming parameters in order, which naturally gives preference to those that are more heavily accessed. Sorting just those might make sense, if nothing else.
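A sketch of the "just track everything when it fits" idea under those assumptions (the names, the disqualifier, and the fallback handling are illustrative):

```cpp
// Sketch: if the number of candidate locals is at or below the tracking
// limit, skip sorting entirely and track all non-disqualified locals.
#include <vector>

constexpr unsigned kMaxTracked = 512; // limit mentioned in the discussion

struct Local
{
    bool addressExposed; // example disqualifier (address-taken local)
    bool tracked = false;
};

bool TryTrackAll(std::vector<Local>& locals)
{
    unsigned candidates = 0;
    for (const Local& l : locals)
    {
        if (!l.addressExposed)
        {
            candidates++;
        }
    }
    if (candidates > kMaxTracked)
    {
        return false; // fall back to sorting by weighted ref count
    }
    for (Local& l : locals)
    {
        l.tracked = !l.addressExposed;
    }
    return true;
}
```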
Ah, the issue is that …
There is logic in liveness to keep the this pointer alive, and in emit to watch for the kept-alive this even when it is on the stack. Not sure why it's not kicking in in the bad case.
Not 100% sure, but I think the issue is in … So … causes … (Actually …)
Arm64 machine had some kind of hiccup. Need to rebase and force push anyways, so won't rerun.
Force-pushed from 5711b2e to 8905995.
@dotnet-bot test Windows_NT x64 Build and Test
Hmm, the winx64 failure is actually a timeout, and other successful runs look like they come close to the 10 minute limit (see history). Don't see any codegen diffs. So will retry... @dotnet-bot test Windows_NT x64 Checked Build and Test
Seems unfortunate that one must type @dotnet-bot Windows_NT x64 Build and Test to trigger "Windows_NT x64 Checked Build and Test"
Or rather @dotnet-bot test Windows_NT x64 Build and Test
baseservices\threading\generics\threadstart\GThread23 has timed out on me 3 times. But no jit diffs in the test code, and it runs quickly when run by itself.
Going to retry the other two failing legs while I puzzle out GThread23. @dotnet-bot test Windows_NT x64 Checked CoreFX Tests
@dotnet-bot test Windows_NT x64 Checked corefx_baseline
This is unfortunately historic. I've aggressively made the non-x64 cases more regular. E.g., my current PR makes things a little more regular: #19350. The GThread23 failures have been happening regularly: https://github.com/dotnet/coreclr/issues/19339
Thanks Bruce, I'll assume GThread23 is an unrelated issue. That leaves the CoreFx failure in …
That's also known: https://github.com/dotnet/coreclr/issues/19295
Looks like it should be fixed now, according to the issue.
Thanks again Bruce. CoreFx update probably hasn't made it over here yet. Any other test suites you think I should run?
Actually, our corefx tests (except "innerloop", aka "CoreFX Tests") check out the live corefx tree, so I'm skeptical the issue is fixed. I just kicked off all x86/x64 …
Make sure you have a mix of pri1 runs including, IMO, all architectures. Maybe add R2R, and some JIT stress.
x86 jitstress2 failures seem to be instances of #18986.
The Crst failures are related: https://github.com/dotnet/coreclr/issues/19008
x64 jitstress2: … now returns 108 instead of 100. Works ok w/o jitstress=2. Was also ok before I pulled in changes in df8e05d..a7dbd1f. So it was something recent.
Adding @kouvel. It's possible the GCCoverage solution is just to change inc/CrstTypes.def so that GCCover is marked as being acquired before CrstReJITSharedDomainTable. Is this issue new? I see the failure is occurring when tiered compilation is off, so this code path should have been running as-is for quite some time. I'm not sure what would have caused a recent regression, but it would also seem surprising for a failure this large to go unnoticed for a long span of time.
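If the fix does land in CrstTypes.def, a sketch of what such an entry might look like (the entry contents, ordering target, and exact syntax here are assumptions, not the actual change that was committed):

```
// Hypothetical CrstTypes.def entry -- names and syntax assumed, not the real fix:
Crst GCCover
    AcquiredBefore ReJITSharedDomainTable
End
```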
Looks like it was broken sometime on or after 7/31, so yes, fairly new. We only run gc stress once a week (I think?) via automation, so it sometimes takes a while to see these fail.
@davmason @AndyAyersMS @noahfalk If I revert #19054, GCStress=c works again.
@BruceForstall @AndyAyersMS I'm taking a look now. Is there an easy way to repro locally? Is it just running the normal tests with complus_gcstress set?
Yes, just pick (probably) any test, and run it with …
Build the test overlay like normal, then use … Then …
Thanks guys. I was able to get a repro and debug it; the issue was that GCStress 8 or higher didn't work with rejit enabled. That didn't use to be an issue, but that change enabled rejit by default. It looks like the locks don't have any circular dependencies with each other, so I should be able to just update CrstTypes.def. I'm testing the fix now; I should have a PR open in the next little bit, assuming the tests start passing.
Tests are passing on my machine, so I opened #19401
Looks like …
Not easy to assess the state of this change given the various pre-existing failures in PR/rolling legs we don't run that often, but I think most of the failing cases are known issues. Will wait for the GC stress fix to land and rerun, though in testing it is turning up some failures too. One of the gripes I have with our rolling/PR test split is that it is not easy to tell if a PR test failure is "known", as rolling failures get logged in different bins. So if xxx rolling hits a failure, and we rarely run xxx PR, then when we do run xxx PR and it fails, one can waste a fair amount of time figuring out what is going on.
@dotnet-bot test Windows_NT x64 Checked gcstress0xc
GcStress failures seem like known issues too. @dotnet/jit-contrib any thoughts on whether this PR is ready to merge, or should I do more testing?
Will likely conflict with #19489.
Remove almost all of the code in the jit that tries to maintain local ref counts incrementally. Also remove `lvaSortAgain` and related machinery. Explicitly sort locals before post-lower-liveness when optimizing to get the best set of tracked locals. Explicitly recount after post-lower liveness to get accurate counts after dead stores. This can lead to tracked unreferenced arguments; tolerate this during codegen.
Force-pushed from 8905995 to c56d1d7.
Rebased and force pushed to handle merge conflict.
Rerunning some of the non-default legs: @dotnet-bot test Windows_NT x64 Build and Test
The ref count update traversal in the optimizer is not doing anything, so remove it. This was overlooked when we changed away from incremental updates in #19345. Also: fix up comments and documentation to reflect the current approach to local var ref counts. (dotnet/coreclr#22954)