[3.1] Protect against a rare invalid lock acquision attempt during etw processing during shutdown #27241

kouvel · 2019-10-16T22:22:11Z

Targeted and partial fix for https://github.com/dotnet/coreclr/issues/27129

This is not a generic fix for the issue above, it is only a very targeted fix for an issue seen (a new issue introduced in 3.x). For a generic fix and more details, see the fix in 5.0: Process/send etw rundown events only during graceful shutdown and not during process detach (a form of abrupt shutdown) #27238.
This change avoids taking a lock during process detach - a point in time when all other threads have already been abruptly shut down by the OS and locks may have been orphaned.
The issue leads to a hang during shutdown when ETW tracing is enabled and the .NET process being traced begins the shutdown sequence at an unfortunate time - this is a probably rare timing issue. It would take the shutdown sequence to begin at just the point when a thread holds a particular lock and is terminated by the OS while holding the lock, then the OS sends the process detach event to the CLR, work during which then tries to acquire the lock and cannot because it is orphaned.
The generic fix has broader consequences and is unlikely to be a reasonable change to make so late in the cycle, such a change needs some bake time and feedback. Hence this targeted fix for 3.x.

Issue

https://github.com/dotnet/coreclr/issues/27129

A recent regression from adding optimization tier info to method JIT events. A lock is taken to get the optimization tier. On shutdown, ETW rundown was always performed after all except one thread had already been abruptly terminated by the OS. At that point, the lock may be orphaned (probably a rare case), leading to a hang during shutdown.

Customer impact

We have not seen this issue occur frequently yet in customer scenarios, and has shown up relatively frequently when using the JIT diff tool while profiling, see issue above. It is a narrow-window timing issue.

Fix description

The fix is to set global state indicating process detach before doing ETW rundown, which indicates that all other threads have been abruptly terminated already, and the state of the system is unreliable, and to avoid taking the offending lock in that case. Instead, during ETW rundown method JIT events would indicate Unknown for the optimization tier. Indicating Unknown is not ideal but that is the best we can do for now.

The fix for 5.0 (#27238) is a more complete long-term fix and has wider implications (PR has more details), so a more targeted fix is being done for 3.0.x and 3.1.

Risk

There is currently no risk from sending Unknown optimization tiers for method events, as rundown information sent by the runtime, and optimization tiers in general, are not currently used by PerfView. Once a PR and issue are fixed and the information starts getting used by PerfView in a noticeable way, then methods that were already jitted prior to starting profiling would not show an optimization tier. In my opinion that is relatively low risk.

…essing during abrupt shutdown Targeted and partial fix for https://github.com/dotnet/coreclr/issues/27129 - This is not a generic fix for the issue above, it is only a very targeted fix for an issue seen (a new issue introduced in 3.x). For a generic fix and more details, see the fix in 5.0: dotnet#27238. - This change avoids taking a lock during process detach - a point in time when all other threads have already been abruptly shut down by the OS and locks may have been orphaned. - The issue leads to a hang during shutdown when ETW tracing is enabled and the .NET process being traced begins the shutdown sequence at an unfortunate time - this is a probably rare timing issue. It would take the shutdown sequence to begin at just the point when a thread holds a particular lock and is terminated by the OS while holding the lock, then the OS sends the process detach event to the CLR, work during which then tries to acquire the lock and cannot because it is orphaned. - The generic fix has broader consequences and is unlikely to be a reasonable change to make so late in the cycle, such a change needs some bake time and feedback. Hence this targeted fix for 3.x.

noahfalk

LGTM

kouvel · 2019-10-18T14:32:54Z

Applied feedback from #27242

brianrob · 2019-10-18T17:02:52Z

FYI, I don't think you will need this PR. If you submit a fix to 3.0 it should be auto-ported to 3.1.

kouvel · 2019-10-19T04:26:35Z

Ah ok, I'll check the port and will close it

brianrob

LGTM

MeiChin-Tsai · 2019-10-24T19:58:25Z

approved for 3.1

kouvel added area-Tracing area-System.Diagnostics.Tracing labels Oct 16, 2019

kouvel added this to the 3.1 milestone Oct 16, 2019

kouvel requested review from brianrob, noahfalk and jkotas October 16, 2019 22:22

kouvel self-assigned this Oct 16, 2019

noahfalk approved these changes Oct 18, 2019

View reviewed changes

Report tier as unknown when it cannot be determined

ef16c6d

Return unknown only on process detach

24730c5

brianrob approved these changes Oct 20, 2019

View reviewed changes

kouvel changed the title ~~Protect against a rare invalid lock acquision attempt during etw processing during abrupt shutdown~~ Protect against a rare invalid lock acquision attempt during etw processing during shutdown Oct 21, 2019

kouvel changed the title ~~Protect against a rare invalid lock acquision attempt during etw processing during shutdown~~ [3.1] Protect against a rare invalid lock acquision attempt during etw processing during shutdown Oct 21, 2019

kouvel merged commit 12f48fe into dotnet:release/3.1 Oct 24, 2019

kouvel deleted the EtwShutdownLockFix31 branch October 24, 2019 20:07

kouvel mentioned this pull request Jan 8, 2020

ETW + GetJitOptimizationTier causes application to hang on exit dotnet/runtime#540

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[3.1] Protect against a rare invalid lock acquision attempt during etw processing during shutdown #27241

[3.1] Protect against a rare invalid lock acquision attempt during etw processing during shutdown #27241

kouvel commented Oct 16, 2019 •

edited

Loading

noahfalk left a comment

kouvel commented Oct 18, 2019

brianrob commented Oct 18, 2019

kouvel commented Oct 19, 2019

brianrob left a comment

MeiChin-Tsai commented Oct 24, 2019

[3.1] Protect against a rare invalid lock acquision attempt during etw processing during shutdown #27241

[3.1] Protect against a rare invalid lock acquision attempt during etw processing during shutdown #27241

Conversation

kouvel commented Oct 16, 2019 • edited Loading

Issue

Customer impact

Fix description

Risk

noahfalk left a comment

Choose a reason for hiding this comment

kouvel commented Oct 18, 2019

brianrob commented Oct 18, 2019

kouvel commented Oct 19, 2019

brianrob left a comment

Choose a reason for hiding this comment

MeiChin-Tsai commented Oct 24, 2019

kouvel commented Oct 16, 2019 •

edited

Loading