JIT: Abandon loop search if we reach the end of the bbNext chain #69503
Conversation
In dotnet#69323 the 6.0.4 jit caused an AV because it walked off the end of the bbNext chain during `optFindNaturalLoops`.

Analysis of a customer-provided dump showed that `MakeCompactAndFindExits` might fail to find an expected loop block and so walk the entire bbNext chain and then fall off the end. Details from the dump suggested that this happened because a prior call to `MakeCompactAndFindExits` had moved most but not all of a loop's blocks later in bbNext order, leaving that loop's bottom block earlier in the bbNext chain than its top. This ordering was unexpected.

I cannot repro this failure. The customer was using PGO and it's likely that earlier PGO-driven block reordering contributed to this problem by interleaving the blocks from two loops. We can recover the root method PGO schema from the dump, but applying this is insufficient to cause the problem. This method does quite a bit of inlining, so it's likely that some inlinee PGO data must also be a contributing factor.

At any rate, we can guard against this case easily enough, and simply abandon recognition of any loop where we fail to find an expected loop block during the bbNext chain walk.
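To make the shape of the guard concrete, here is a minimal, self-contained sketch of the idea; it is not the actual `MakeCompactAndFindExits` code, and the simplified `BasicBlock` and `TryFindLoopBlock` names are stand-ins: the walk returns failure when it reaches the end of the chain instead of assuming the expected block will be found, so the caller can abandon recognition of that loop.

```cpp
// Minimal sketch of the guard, assuming a BasicBlock reduced to just a number
// and its bbNext link; this is not the real MakeCompactAndFindExits code.
#include <cstdio>

struct BasicBlock
{
    unsigned    bbNum;
    BasicBlock* bbNext;
};

// Walk the bbNext chain from 'start' looking for 'expected'. Returns false
// (rather than walking past the end of the chain) when the chain is
// exhausted, so the caller can simply give up on recognizing this loop.
bool TryFindLoopBlock(BasicBlock* start, BasicBlock* expected)
{
    for (BasicBlock* cur = start; cur != nullptr; cur = cur->bbNext)
    {
        if (cur == expected)
        {
            return true;
        }
    }
    return false; // expected loop block was never reached: abandon this loop
}

int main()
{
    // The expected block sits *before* the walk's starting point, so a
    // forward walk never finds it.
    BasicBlock top{71, nullptr};
    BasicBlock bottom{136, &top}; // chain: 136 -> 71
    printf("found = %d\n", TryFindLoopBlock(&top, &bottom)); // prints: found = 0
    return 0;
}
```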
Tagging subscribers to this area: @JulieLeeMSFT (issue details are the same as the PR description above).
@jakobbotsch PTAL
Some notes on repro attempts. Dump analysis indicated the jit AV happened while tier1 jitting of the root method, so I tried to reproduce it locally using random PGO. This requires a bit of ingenuity because random PGO only applies if there is underlying PGO data. So I first had to jit all methods via PMI at Tier0 with TieredPGO and write out the textual PGO data (all zeros for most methods), then read this text PGO data back in for the randomized tests. I also tried editing the text PGO data to fill in the observed PGO counts from the dump, and confirmed those values were getting picked up, but this did not lead to the same behavior.

With this setup, randomized PGO runs turned up a few other asserts in other methods that I'll be opening issues for.
Because I was more interested in reproing the problem in the original method, I ran a large number of randomized PGO instances of it with different seeds. So it seems like there is some value in running randomized PGO with a much greater range of seed values. I am going to analyze how much actual diversity I see in those 10,000 instances, but at a glance it appears we have indeed produced hundreds or thousands of different outputs.

What I did above seems quite automatable. We might also consider better SOS-level diagnostics, e.g. to extract the PGO data more readily, or to find all the PGO data read by the jit during a run (here the root method plus all 100-odd inlines). The dump we got was not a full dump and it seemed unlikely that I could locate all the relevant data. But abstractly it would be nice if we could create "text form" PGO from a dump for a method.

There was also a small mismatch in some of the IL offsets in the customer dump and my local repros; I need to check back with the customer to see if they're doing some form of IL modification. Their partial dump did not allow me to see the method IL, but the overall method size was the same, so this mismatch is puzzling.

"Script" for doing some of the above is here: https://gist.github.com/AndyAyersMS/1a7187cd46f6023831d9d81927413aa2
Failure seems to be #69190
/backport to release/6.0
Started backporting to release/6.0: https://github.com/dotnet/runtime/actions/runs/2348597178
Finally got a repro using random PGO and was able to validate my guess as to how things go wrong, and the fix.

Here there is a potential loop from 57...137. We try to compact it and move the range 60...109, but then fail to compact further. The next loop runs from 71...136, but we've moved the first part of that loop down past the second, so that 136 now sits above 71. When we try to compact that loop, we walk off the end of the bbNext chain.
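To illustrate the mechanism, here is a toy model of the reordering described above; the names (`MoveRangeAfter`, `Reaches`) and the tiny block list are illustrative assumptions, not RyuJIT code. Splicing the span 60...109 (which contains block 71) down past block 136 leaves 136 above 71, after which a forward bbNext walk from 71 runs off the end of the chain without ever finding 136.

```cpp
// Toy model (not RyuJIT code) of how a partial compaction of one loop can
// invert the bbNext order of another loop's top and bottom blocks.
#include <cstdio>
#include <vector>

struct BasicBlock
{
    int         bbNum;
    BasicBlock* bbNext;
};

// Splice the sub-chain [first, last] out of the list and re-link it
// immediately after 'insertAfter'. 'prevOfFirst' currently links to 'first'.
void MoveRangeAfter(BasicBlock* prevOfFirst, BasicBlock* first, BasicBlock* last, BasicBlock* insertAfter)
{
    prevOfFirst->bbNext = last->bbNext;        // unlink [first, last]
    last->bbNext        = insertAfter->bbNext; // splice it back in ...
    insertAfter->bbNext = first;               // ... right after 'insertAfter'
}

// Forward bbNext walk from 'from' looking for 'target'; false if the chain ends first.
bool Reaches(BasicBlock* from, BasicBlock* target)
{
    for (BasicBlock* cur = from; cur != nullptr; cur = cur->bbNext)
    {
        if (cur == target)
        {
            return true;
        }
    }
    return false;
}

int main()
{
    // A few representative blocks in original bbNext order:
    // 57 -> 60 -> 71 -> 109 -> 110 -> 136 -> 137.
    std::vector<BasicBlock> blocks = {{57, nullptr}, {60, nullptr}, {71, nullptr}, {109, nullptr},
                                      {110, nullptr}, {136, nullptr}, {137, nullptr}};
    for (size_t i = 0; i + 1 < blocks.size(); i++)
    {
        blocks[i].bbNext = &blocks[i + 1];
    }
    BasicBlock* bb57  = &blocks[0];
    BasicBlock* bb60  = &blocks[1];
    BasicBlock* bb71  = &blocks[2];
    BasicBlock* bb109 = &blocks[3];
    BasicBlock* bb136 = &blocks[5];

    printf("before move: 71 reaches 136? %d\n", Reaches(bb71, bb136)); // prints 1

    // Compacting the 57...137 loop moves 60...109 (which contains 71) down past 136.
    MoveRangeAfter(bb57, bb60, bb109, bb136);

    // Now 136 sits above 71: a forward walk from 71 falls off the end of the
    // chain without ever finding 136 -- the case the fix has to tolerate.
    printf("after move:  71 reaches 136? %d\n", Reaches(bb71, bb136)); // prints 0
    return 0;
}
```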
Is this one of the examples of relying too much on `bbNum` ordering?
I agree. Either that, or at least some documentation on how to debug and extract as much info as possible would be helpful.
Sort of ... we certainly need to be careful if we're making assumptions about bbNum telling us something about the bbNext chain ordering. There is a complicated-to-describe invariant in the logic for `MakeCompactAndFindExits`.
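As a sketch of what such caution could look like, here is a hypothetical debug-style helper (not existing RyuJIT code; the name and simplified `BasicBlock` are assumptions) that flags the first place where bbNext order and bbNum order disagree, i.e. where the "bbNum tells us the chain order" assumption breaks down.

```cpp
// Hypothetical debug helper: find where bbNext order and bbNum order diverge.
#include <cstdio>

struct BasicBlock
{
    unsigned    bbNum;
    BasicBlock* bbNext;
};

// Return the first block whose bbNum is smaller than its predecessor's,
// or nullptr if the chain is numbered in increasing order throughout.
const BasicBlock* FirstOutOfOrderBlock(const BasicBlock* head)
{
    for (const BasicBlock* cur = head; (cur != nullptr) && (cur->bbNext != nullptr); cur = cur->bbNext)
    {
        if (cur->bbNext->bbNum < cur->bbNum)
        {
            return cur->bbNext;
        }
    }
    return nullptr;
}

int main()
{
    // Chain ordered 57 -> 136 -> 71: bbNum order and bbNext order disagree at 71.
    BasicBlock bb71{71, nullptr};
    BasicBlock bb136{136, &bb71};
    BasicBlock bb57{57, &bb136};

    const BasicBlock* bad = FirstOutOfOrderBlock(&bb57);
    printf("first out-of-order block: %u\n", (bad != nullptr) ? bad->bbNum : 0u); // prints 71
    return 0;
}
```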