Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Abandon loop search if we reach the end of the bbNext chain #69503

Merged
merged 1 commit into from
May 18, 2022

Conversation

AndyAyersMS
Copy link
Member

In #69323 the 6.0.4 jit caused an AV because it walked off the end of the
bbNext chain during optFindNaturalLoops.

Analysis of a customer-provided dump showed that MakeCompactAndFindExits
might fail to find an expected loop block and so walk the entire bbNext chain
and then fall off the end. Details from the dump suggested that this happened
because a prior call to MakeCompactAndFindExits had moved most but not all of
a loop's blocks later in bbNext order, leaving that loop's bottom block earlier
in the bbNext chain then it's top. This ordering was unexpected.

I cannot repro this failure. The customer was using PGO and it's likely that
earlier PGO-driven block reordering contributed to this problem by interleaving
the blocks from two loops. We can recover the root method PGO schema from the
dump, but applying this is insufficient to cause the problem. This method does
quite a bit of inlining so it's likely that some inlinee PGO data must also be
a contributing factor.

At any rate, we can guard against this case easily enough, and simply abandon
recognition of any loop where we fail to find an expected loop block during
the bbNext chain walk.

…t chain

In dotnet#69323 the 6.0.4 jit caused an AV because it walked off the end of the
bbNext chain during `optFindNaturalLoops`.

Analysis of a customer-provided dump showed that `MakeCompactAndFindExits`
might fail to find an expected loop block and so walk the entire bbNext chain
and then fall off the end. Details from the dump suggested that this happened
because a prior call to `MakeCompactAndFindExits` had moved most but not all of
a loop's blocks later in bbNext order, leaving that loop's bottom block earlier
in the bbNext chain then it's top. This ordering was unexpected.

I cannot repro this failure. The customer was using PGO and it's likely that
earlier PGO-driven block reordering contributed to this problem by interleaving
the blocks from two loops. We can recover the root method PGO schema from the
dump, but applying this is insufficient to cause the problem. This method does
quite a bit of inlining so it's likely that some inlinee PGO data must also be
a contributing factor.

At any rate, we can guard against this case easily enough, and simply abandon
recognition of any loop where we fail to find an expected loop block during
the bbNext chain walk.
@ghost ghost assigned AndyAyersMS May 18, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 18, 2022
@ghost
Copy link

ghost commented May 18, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

In #69323 the 6.0.4 jit caused an AV because it walked off the end of the
bbNext chain during optFindNaturalLoops.

Analysis of a customer-provided dump showed that MakeCompactAndFindExits
might fail to find an expected loop block and so walk the entire bbNext chain
and then fall off the end. Details from the dump suggested that this happened
because a prior call to MakeCompactAndFindExits had moved most but not all of
a loop's blocks later in bbNext order, leaving that loop's bottom block earlier
in the bbNext chain then it's top. This ordering was unexpected.

I cannot repro this failure. The customer was using PGO and it's likely that
earlier PGO-driven block reordering contributed to this problem by interleaving
the blocks from two loops. We can recover the root method PGO schema from the
dump, but applying this is insufficient to cause the problem. This method does
quite a bit of inlining so it's likely that some inlinee PGO data must also be
a contributing factor.

At any rate, we can guard against this case easily enough, and simply abandon
recognition of any loop where we fail to find an expected loop block during
the bbNext chain walk.

Author: AndyAyersMS
Assignees: AndyAyersMS
Labels:

area-CodeGen-coreclr

Milestone: -

@AndyAyersMS
Copy link
Member Author

@jakobbotsch PTAL
cc @dotnet/jit-contrib

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented May 18, 2022

Some notes on repro attempts. Dump analysis indicated the jit AV happened while tier1 jitting of WorkerThreadStart from SPC.
I ported #58294 back to 6.0 and used it to apply random PGO counts to a PMI-driven jitting of this method (release SPC/runtime plus checked jit).

This requires a bit of ingenuity because random PGO only applies if there is underlying PGO data. So I first had to jit all methods via PMI at Tier0 with TieredPGO and write out the textual pgo data (all zeros for most methods), then read this text PGO data back in for the randomized tests.

I also tried editing the text PGO data to fill in the observed PGO counts from the dump, and confirmed those values were getting picked up, but this did not lead to the same behavior.

With this setup, randomized PGO runs turned up a few other asserts in other methods that I'll be opening issues for:

  • in some cases recognized loops are not properly nested
  • in some cases we see unexpected IR nodes cloning trees because of a flow optimization we do after inserting gc polls (I think this one is fixed in 7.0)

Because I was more interested in reproing the problem in WorkerThreadStart I added the ability to apply randomized PGO data just a specific method range (something I'll also PR shortly) to my local 6.0 JIT and ran 10,000 iterations without being able to repro that failure (the gc poll issue came up 6/10,000 times).

So it seems like there is some value in running randomized PGO with a much greater range of seed values. I am going to analyze how much actual diversity I see in those 10,000 instances but at a glance it appears we have indeed produced hundreds or thousands of different outputs. What I did above seems quite automatable.

We might also consider better SOS-level diagnostics, eg to extract the PGO data more readily or find all the PGO data read by the jit during a run (here the root method plus all 100 odd inlines). The dump we got was not a full dump and it seemed unlikely that I could locate all the relevant data. But abstractly it would be nice of we could create "text form" PGO from a dump for a method.

There was also a small mismatch in some of the IL offsets in the customer dump and my local repros; I need to check back with the customer to see if they're doing some form of IL modification. Their partial dump did not allow me to see the method IL but the overall method size was the same, so this mismatch is puzzling.

"script" for doing some of the above here: https://gist.github.com/AndyAyersMS/1a7187cd46f6023831d9d81927413aa2

@AndyAyersMS
Copy link
Member Author

Failure seems to be #69190

@AndyAyersMS AndyAyersMS merged commit 41491a3 into dotnet:main May 18, 2022
@AndyAyersMS
Copy link
Member Author

/backport to release/6.0

@github-actions
Copy link
Contributor

Started backporting to release/6.0: https://github.com/dotnet/runtime/actions/runs/2348597178

@AndyAyersMS
Copy link
Member Author

Finally got a repro using random pgo and was able to validate my guess as to how things go wrong and the fix.

Here there is a potential loop from 57...137. We try and compact and move the range 60...109. But then we fail to compact after that.

The next loop runs from 71...136, but we've moved the first part of the loop down past the seocnd, so that 136 now sits above 71. And when we try to compact that loop we walk off the end of the bbNext chain.

Relocated blocks [BB60..BB109] inserted after BB137
New Basic Block BB149 [0333] created.
Setting edge weights for BB59 -> BB149 to [0 .. 3.402823e+38]
Added an unconditional jump to BB60 after block BB59
Failed to compact potential loop from [BB57 to BB137]
Did not find expected loop block when walking from BB148 [possible loop was from top BB71 to bottom BB136]
Failed to compact potential loop from [BB71 to BB136]

@kunalspathak
Copy link
Member

Is this one of the examples of too much relying on bbNum?

SOS-level diagnostics

I agree. Either that or at least a documentation on how to debug and extract info as much as possible will be helpful.

@AndyAyersMS
Copy link
Member Author

Is this one of the examples of too much relying on bbNum?

Sort of ... we certainly need to be careful if we're making assumptions about bbNum telling us something about the bbNext chain ordering. There is a complicated-to-describe invariant in the logic for MakeCompactAndFindExits -- if we move things carefully and consistently I think it works out, but in cases like we see here where we only move some of the non-loop code out of a loop we can get in trouble.

@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone May 23, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jun 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants