-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[release/6.0] JIT: Abandon loop search if we reach the end of the bbNext chain #69525
Conversation
…t chain In #69323 the 6.0.4 jit caused an AV because it walked off the end of the bbNext chain during `optFindNaturalLoops`. Analysis of a customer-provided dump showed that `MakeCompactAndFindExits` might fail to find an expected loop block and so walk the entire bbNext chain and then fall off the end. Details from the dump suggested that this happened because a prior call to `MakeCompactAndFindExits` had moved most but not all of a loop's blocks later in bbNext order, leaving that loop's bottom block earlier in the bbNext chain then it's top. This ordering was unexpected. I cannot repro this failure. The customer was using PGO and it's likely that earlier PGO-driven block reordering contributed to this problem by interleaving the blocks from two loops. We can recover the root method PGO schema from the dump, but applying this is insufficient to cause the problem. This method does quite a bit of inlining so it's likely that some inlinee PGO data must also be a contributing factor. At any rate, we can guard against this case easily enough, and simply abandon recognition of any loop where we fail to find an expected loop block during the bbNext chain walk.
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsBackport of #69503 to release/6.0 /cc @AndyAyersMS Customer ImpactTestingRiskIMPORTANT: If this change touches code that ships in a NuGet package, please make certain that you have added any necessary package authoring and gotten it explicitly reviewed.
|
cc @dotnet/jit-contrib |
@jeffschwMSFT fyi |
@AndyAyersMS, could you please put a short description of the fix right above the customer impact section? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved. We will take for consideration in 6.0.x
cc @carlossanlop for the servicing approved PR. |
Servicing-approved label applied. CI passed. Area owners signed-off. |
Backport of #69503 to release/6.0
/cc @AndyAyersMS
Analysis of a customer dump shows the JIT is causing an AV within
System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart
while being rejitted at Tier1 with PGO data. This is with 6.0.4 running on x64 ubuntu.See #69323.
The fix is that during the "make compact" stage of loop recognition, when the JIT is scanning the list of blocks looking for the next in loop block, it will abandon the search if it reaches the end of the list. When the search is abandoned like this, the JIT will report that this sequence of blocks is not a proper loop. There is existing logic to handle this sort of reporting already, as loop recognition can fail for other reasons.
Normally the make compact algorithm expects to find such a block before reaching the end, but in this particular case the block list has been reordered. Without this fix the JIT will unexpectedly reach the end of the list and dereference
nullptr
, causing the crash.Customer Impact
Customer updated to .NET 6 recently and enabled Full Dynamic PGO. Application was stable running on .NET 5.
After the update, a key application started to crash unexpectedly. Customer reports this crash has a significant impact on business as this disrupts data processing needed for the next day's work.
Customer has (so far) been able to work around the issue by not setting
DOTNET_TC_QuickJitForLoops=1
which makes this particular method ineligible for PGO.Testing
The cause of the AV in the JIT is self-evident and there is a simple mitigation to avoid crashing and still produce correct code. Details in the dump suggest a plausible but rare sequence of events that could lead to this crash.
I was able to reproduce this issue in local testing by applying randomly generated PGO data to the method (see note), and verified the fix.
Risk
Low, we avoid crashing the jit (and hence the app) and forgo recognizing a loop.