Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

[3.1] Protect against a rare invalid lock acquision attempt during etw processing during shutdown #27241

Merged
merged 3 commits into from
Oct 24, 2019

Conversation

kouvel
Copy link
Member

@kouvel kouvel commented Oct 16, 2019

Targeted and partial fix for https://github.com/dotnet/coreclr/issues/27129

  • This is not a generic fix for the issue above, it is only a very targeted fix for an issue seen (a new issue introduced in 3.x). For a generic fix and more details, see the fix in 5.0: Process/send etw rundown events only during graceful shutdown and not during process detach (a form of abrupt shutdown) #27238.
  • This change avoids taking a lock during process detach - a point in time when all other threads have already been abruptly shut down by the OS and locks may have been orphaned.
  • The issue leads to a hang during shutdown when ETW tracing is enabled and the .NET process being traced begins the shutdown sequence at an unfortunate time - this is a probably rare timing issue. It would take the shutdown sequence to begin at just the point when a thread holds a particular lock and is terminated by the OS while holding the lock, then the OS sends the process detach event to the CLR, work during which then tries to acquire the lock and cannot because it is orphaned.
  • The generic fix has broader consequences and is unlikely to be a reasonable change to make so late in the cycle, such a change needs some bake time and feedback. Hence this targeted fix for 3.x.

Issue

https://github.com/dotnet/coreclr/issues/27129

A recent regression from adding optimization tier info to method JIT events. A lock is taken to get the optimization tier. On shutdown, ETW rundown was always performed after all except one thread had already been abruptly terminated by the OS. At that point, the lock may be orphaned (probably a rare case), leading to a hang during shutdown.

Customer impact

We have not seen this issue occur frequently yet in customer scenarios, and has shown up relatively frequently when using the JIT diff tool while profiling, see issue above. It is a narrow-window timing issue.

Fix description

The fix is to set global state indicating process detach before doing ETW rundown, which indicates that all other threads have been abruptly terminated already, and the state of the system is unreliable, and to avoid taking the offending lock in that case. Instead, during ETW rundown method JIT events would indicate Unknown for the optimization tier. Indicating Unknown is not ideal but that is the best we can do for now.

The fix for 5.0 (#27238) is a more complete long-term fix and has wider implications (PR has more details), so a more targeted fix is being done for 3.0.x and 3.1.

Risk

There is currently no risk from sending Unknown optimization tiers for method events, as rundown information sent by the runtime, and optimization tiers in general, are not currently used by PerfView. Once a PR and issue are fixed and the information starts getting used by PerfView in a noticeable way, then methods that were already jitted prior to starting profiling would not show an optimization tier. In my opinion that is relatively low risk.

…essing during abrupt shutdown

Targeted and partial fix for https://github.com/dotnet/coreclr/issues/27129
- This is not a generic fix for the issue above, it is only a very targeted fix for an issue seen (a new issue introduced in 3.x). For a generic fix and more details, see the fix in 5.0: dotnet#27238.
- This change avoids taking a lock during process detach - a point in time when all other threads have already been abruptly shut down by the OS and locks may have been orphaned.
- The issue leads to a hang during shutdown when ETW tracing is enabled and the .NET process being traced begins the shutdown sequence at an unfortunate time - this is a probably rare timing issue. It would take the shutdown sequence to begin at just the point when a thread holds a particular lock and is terminated by the OS while holding the lock, then the OS sends the process detach event to the CLR, work during which then tries to acquire the lock and cannot because it is orphaned.
- The generic fix has broader consequences and is unlikely to be a reasonable change to make so late in the cycle, such a change needs some bake time and feedback. Hence this targeted fix for 3.x.
Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kouvel
Copy link
Member Author

kouvel commented Oct 18, 2019

Applied feedback from #27242

@brianrob
Copy link
Member

FYI, I don't think you will need this PR. If you submit a fix to 3.0 it should be auto-ported to 3.1.

@kouvel
Copy link
Member Author

kouvel commented Oct 19, 2019

Ah ok, I'll check the port and will close it

Copy link
Member

@brianrob brianrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kouvel kouvel changed the title Protect against a rare invalid lock acquision attempt during etw processing during abrupt shutdown Protect against a rare invalid lock acquision attempt during etw processing during shutdown Oct 21, 2019
@kouvel kouvel changed the title Protect against a rare invalid lock acquision attempt during etw processing during shutdown [3.1] Protect against a rare invalid lock acquision attempt during etw processing during shutdown Oct 21, 2019
@MeiChin-Tsai
Copy link

approved for 3.1

@kouvel kouvel merged commit 12f48fe into dotnet:release/3.1 Oct 24, 2019
@kouvel kouvel deleted the EtwShutdownLockFix31 branch October 24, 2019 20:07
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants