Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release/6.0] JIT: Abandon loop search if we reach the end of the bbNext chain #69525

Merged
merged 1 commit into from
Jun 9, 2022

Conversation

github-actions[bot]
Copy link
Contributor

@github-actions github-actions bot commented May 18, 2022

Backport of #69503 to release/6.0

/cc @AndyAyersMS

Analysis of a customer dump shows the JIT is causing an AV within System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart while being rejitted at Tier1 with PGO data. This is with 6.0.4 running on x64 ubuntu.

See #69323.

The fix is that during the "make compact" stage of loop recognition, when the JIT is scanning the list of blocks looking for the next in loop block, it will abandon the search if it reaches the end of the list. When the search is abandoned like this, the JIT will report that this sequence of blocks is not a proper loop. There is existing logic to handle this sort of reporting already, as loop recognition can fail for other reasons.

Normally the make compact algorithm expects to find such a block before reaching the end, but in this particular case the block list has been reordered. Without this fix the JIT will unexpectedly reach the end of the list and dereference nullptr, causing the crash.

Customer Impact

Customer updated to .NET 6 recently and enabled Full Dynamic PGO. Application was stable running on .NET 5.

After the update, a key application started to crash unexpectedly. Customer reports this crash has a significant impact on business as this disrupts data processing needed for the next day's work.

Customer has (so far) been able to work around the issue by not setting DOTNET_TC_QuickJitForLoops=1 which makes this particular method ineligible for PGO.

Testing

The cause of the AV in the JIT is self-evident and there is a simple mitigation to avoid crashing and still produce correct code. Details in the dump suggest a plausible but rare sequence of events that could lead to this crash.

I was able to reproduce this issue in local testing by applying randomly generated PGO data to the method (see note), and verified the fix.

Risk

Low, we avoid crashing the jit (and hence the app) and forgo recognizing a loop.

…t chain

In #69323 the 6.0.4 jit caused an AV because it walked off the end of the
bbNext chain during `optFindNaturalLoops`.

Analysis of a customer-provided dump showed that `MakeCompactAndFindExits`
might fail to find an expected loop block and so walk the entire bbNext chain
and then fall off the end. Details from the dump suggested that this happened
because a prior call to `MakeCompactAndFindExits` had moved most but not all of
a loop's blocks later in bbNext order, leaving that loop's bottom block earlier
in the bbNext chain then it's top. This ordering was unexpected.

I cannot repro this failure. The customer was using PGO and it's likely that
earlier PGO-driven block reordering contributed to this problem by interleaving
the blocks from two loops. We can recover the root method PGO schema from the
dump, but applying this is insufficient to cause the problem. This method does
quite a bit of inlining so it's likely that some inlinee PGO data must also be
a contributing factor.

At any rate, we can guard against this case easily enough, and simply abandon
recognition of any loop where we fail to find an expected loop block during
the bbNext chain walk.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 18, 2022
@ghost
Copy link

ghost commented May 18, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Backport of #69503 to release/6.0

/cc @AndyAyersMS

Customer Impact

Testing

Risk

IMPORTANT: If this change touches code that ships in a NuGet package, please make certain that you have added any necessary package authoring and gotten it explicitly reviewed.

Author: github-actions[bot]
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@AndyAyersMS
Copy link
Member

cc @dotnet/jit-contrib

@AndyAyersMS
Copy link
Member

@jeffschwMSFT fyi

@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.x milestone May 20, 2022
@JulieLeeMSFT
Copy link
Member

@AndyAyersMS, could you please put a short description of the fix right above the customer impact section?

Copy link
Member

@jeffschwMSFT jeffschwMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. We will take for consideration in 6.0.x

@JulieLeeMSFT JulieLeeMSFT added the Servicing-consider Issue for next servicing release review label May 24, 2022
@rbhanda rbhanda added Servicing-approved Approved for servicing release and removed Servicing-consider Issue for next servicing release review labels May 31, 2022
@rbhanda rbhanda modified the milestones: 6.0.x, 6.0.7 May 31, 2022
@JulieLeeMSFT
Copy link
Member

cc @carlossanlop for the servicing approved PR.

@carlossanlop
Copy link
Member

Servicing-approved label applied. CI passed. Area owners signed-off.
:shipit:

@carlossanlop carlossanlop merged commit 4f8654f into release/6.0 Jun 9, 2022
@carlossanlop carlossanlop deleted the backport/pr-69503-to-release/6.0 branch June 9, 2022 23:15
@ghost ghost locked as resolved and limited conversation to collaborators Jul 10, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Servicing-approved Approved for servicing release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants