Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Process/send etw rundown events only during graceful shutdown and not during process detach (a form of abrupt shutdown) #27238

Merged
merged 1 commit into from
Oct 18, 2019

Conversation

kouvel
Copy link
Member

@kouvel kouvel commented Oct 16, 2019

Longer-term fix for https://github.com/dotnet/coreclr/issues/27129:

  • Etw rundown events sent during process shutdown are currently (and have for a long time) been sent during process detach. By that time, all other threads have been abruptly terminated by the OS, and as a result the state of the system is fundamentally unpredictable.
  • In this particular case, locks have been orphaned by threads that have been abruptly terminated, so taking locks is not feasible during processing of rundown events, and if acquiring locks were to be avoided based on such knowledge (not recommended, this would get messy), we'd have to resort to providing information that would not accurately reflect the state, in the events
  • I consider any situation where process detach occurs before an opportunity to handle graceful shutdown (that is, the runtime is unaware that a shutdown is about to happen and does not have an opportunity to handle shutdown prior to process detach (before the OS already shuts some things down)), then that is abrupt shutdown and in that scenario all bets are off - in the case of this change, etw rundown events would not be sent
  • This change has the following effects:
    • Graceful shutdown such as returning from Main or Environment.Exit() will send rundown events very slighly earlier than before. Background threads will still be running and there may be other etw events interspersed among rundown events and sent after rundown events.
    • On Windows, Ctrl+C and Ctrl+Break are not handled by the runtime and by default result in abrupt termination. The only indication the runtime gets is the process detach event, by which time the OS has already terminated all other threads
      • When these events are not handled (by the runtime or by the app), this is an abrupt shutdown scenario and rundown events will not be sent
      • When these events are handled by the app and canceled along with Environment.Exit(), that converts these events into graceful shutdown (see above). If an app handles these events and chooses to not cancel the event, the event remains unhandled and leads to abrupt shutdown (see immediately above).
    • On Unixes, there is no significant change. SIGTERM is graceful shutdown as described above and there are no similar issues of abrupt shutdown.
  • There is an option of sending rundown events upon process detach (when we don't have an opportunity to do so gracefully), but as I described above that will get messy and is not a path that we should be headed down

… during process detach (a form of abrupt shutdown)

Longer-term fix for https://github.com/dotnet/coreclr/issues/27129:
- Etw rundown events sent during process shutdown are currently (and have for a long time) been sent during process detach. By that time, all other threads have been abruptly terminated by the OS, and as a result the state of the system is fundamentally unpredictable.
- In this particular case, locks have been orphaned by threads that have been abruptly terminated, so taking locks is not feasible during processing of rundown events, and if acquiring locks were to be avoided based on such knowledge (not recommended, this would get messy), we'd have to resort to providing information that would not accurately reflect the state, in the events
- I consider any situation where process detach occurs before an opportunity to handle graceful shutdown (that is, the runtime is unaware that a shutdown is about to happen and does not have an opportunity to handle shutdown prior to process detach (before the OS already shuts some things down)), then that is abrupt shutdown and in that scenario all bets are off - in the case of this change, etw rundown events would not be sent
- This change has the following effects:
  - Graceful shutdown such as returning from `Main` or `Environment.Exit()` will send rundown events very slighly earlier than before. Background threads will still be running and there may be other etw events interspersed among rundown events and sent after rundown events.
  - On Windows, Ctrl+C and Ctrl+Break are not handled by the runtime and by default result in abrupt termination. The only indication the runtime gets is the process detach event, by which time the OS has already terminated all other threads
    - When these events are not handled (by the runtime or by the app), this is an abrupt shutdown scenario and rundown events will not be sent
    - When these events are handled by the app and canceled along with `Environment.Exit()`, that converts these events into graceful shutdown (see above). If an app handles these events and chooses to not cancel the event, the event remains unhandled and leads to abrupt shutdown (see immediately above).
  - On Unixes, there is no significant change. SIGTERM is graceful shutdown as described above and there are no similar issues of abrupt shutdown.
- There is an option of sending rundown events upon process detach (when we don't have an opportunity to do so gracefully), but as I described above that will get messy and is not a path that we should be headed down
@kouvel
Copy link
Member Author

kouvel commented Oct 16, 2019

Adding one more point that I forgot to mention:

  • PerfView does not seem to use these rundown events. While it shows the event counts in the summary it does not list or use these events to identify methods.
  • For example, when tracing is started after many methods have been jitted, then the process exits (including sending these rundown events), then tracing is stopped, PerfView is not able to identify those methods. Probably a bug in PerfView.
  • It is unlikely that this would have any significant impact on perf tracing due to the above

@brianrob
Copy link
Member

Do we have a sense of where else we see hangs during rundown? I'm trying to rationalize the thought of not getting any rundown events on processes where someone hits CTRL+C. This is fairly common for ASP.NET Core apps, and I wouldn't want to not get rundown events for this.

I'm wondering if there are other paths here that we know of where we've seen hangs, or if it's just this particular lock. If it's just this lock, you could imagine attempting to special case this one.

I'm also surprised that the rundown events don't impact tracing at all. I would expect PerfView to use these events if you stop the trace after the process shuts down, as this would be the only time that module/method rundown occurs.

There are some other concerns around EventPipe as well, as it will need to flush the rest of the file to disk. I'm not sure how this impacts the experience in dotnet-trace, but @noahfalk will probably have a better idea of this.

I guess all of this is to say that I'm not yet ready to give up on rundown in these cases just yet.

@mikedn
Copy link

mikedn commented Oct 16, 2019

On Windows, Ctrl+C and Ctrl+Break are not handled by the runtime and by default result in abrupt termination.

Perhaps they should be handled by the runtime if the application itself does not handle these?

kouvel added a commit to kouvel/coreclr that referenced this pull request Oct 16, 2019
…essing during abrupt shutdown

Targeted and partial fix for https://github.com/dotnet/coreclr/issues/27129
- This is not a generic fix for the issue above, it is only a very targeted fix for an issue seen (a new issue introduced in 3.x). For a generic fix and more details, see the fix in 5.0: dotnet#27238.
- This change avoids taking a lock during process detach - a point in time when all other threads have already been abruptly shut down by the OS and locks may have been orphaned.
- The issue leads to a hang during shutdown when ETW tracing is enabled and the .NET process being traced begins the shutdown sequence at an unfortunate time - this is a probably rare timing issue. It would take the shutdown sequence to begin at just the point when a thread holds a particular lock and is terminated by the OS while holding the lock, then the OS sends the process detach event to the CLR, work during which then tries to acquire the lock and cannot because it is orphaned.
- The generic fix has broader consequences and is unlikely to be a reasonable change to make so late in the cycle, such a change needs some bake time and feedback. Hence this targeted fix for 3.x.
kouvel added a commit to kouvel/coreclr that referenced this pull request Oct 16, 2019
…essing during abrupt shutdown

Targeted and partial fix for https://github.com/dotnet/coreclr/issues/27129
- This is not a generic fix for the issue above, it is only a very targeted fix for an issue seen (a new issue introduced in 3.x). For a generic fix and more details, see the fix in 5.0: dotnet#27238.
- This change avoids taking a lock during process detach - a point in time when all other threads have already been abruptly shut down by the OS and locks may have been orphaned.
- The issue leads to a hang during shutdown when ETW tracing is enabled and the .NET process being traced begins the shutdown sequence at an unfortunate time - this is a probably rare timing issue. It would take the shutdown sequence to begin at just the point when a thread holds a particular lock and is terminated by the OS while holding the lock, then the OS sends the process detach event to the CLR, work during which then tries to acquire the lock and cannot because it is orphaned.
- The generic fix has broader consequences and is unlikely to be a reasonable change to make so late in the cycle, such a change needs some bake time and feedback. Hence this targeted fix for 3.x.
@kouvel
Copy link
Member Author

kouvel commented Oct 16, 2019

Perhaps they should be handled by the runtime if the application itself does not handle these?

Yes if that's the behavior we would want - that those events would result in graceful shutdown, then that should be spec'ed and handling them would be the proper thing to do. That has not been the case for a long time though, and whether that would be desirable is a good question and needs to be determined.

@kouvel
Copy link
Member Author

kouvel commented Oct 16, 2019

I'm trying to rationalize the thought of not getting any rundown events on processes where someone hits CTRL+C. This is fairly common for ASP.NET Core apps, and I wouldn't want to not get rundown events for this.

If Ctrl+C is an expected way to exit the process, I expect that the app would appropriately handle that in a way that would lead to graceful shutdown. I don't know the details though, I'll defer to others who would probably know more - CC @sebastienros @davidfowl

I'm wondering if there are other paths here that we know of where we've seen hangs, or if it's just this particular lock. If it's just this lock, you could imagine attempting to special case this one.

That is what I'm attempting to do for 3.x, however, that is not a reasonable long-term solution since there is a fundamental problem here.

I'm also surprised that the rundown events don't impact tracing at all. I would expect PerfView to use these events if you stop the trace after the process shuts down, as this would be the only time that module/method rundown occurs.

I've pointed this out before,, try it out yourself, it's a pretty easy repro. Aside from a potential fix to this issue from the PerfView side, I think the rundown architecture is generally broken and there are issues it cannot solve, but that is a topic that is beyond the scope of this PR.

There are some other concerns around EventPipe as well, as it will need to flush the rest of the file to disk. I'm not sure how this impacts the experience in dotnet-trace, but @noahfalk will probably have a better idea of this.

This has no effect on EventPipe. Prior to this change, EventPipe is shut down prior to processing rundown events, so EventPipe would already not be getting rundown events in the graceful shutdown scenarios. Only in abrupt shutdown of Crtl+C/Break, there may have been an impact on EventPipe with regards to rundown events. However, with this change, in both cases (graceful and abrupt shutdown) from an EventPipe perspective the behavior would be the same as before. In graceful shutdown EventPipe gets shut down before rundown, so same as before. In abrupt shutdown, prior to this change EventPipe gets shut down before rundown, so no change.

@kouvel
Copy link
Member Author

kouvel commented Oct 16, 2019

Well that last part is not entirely accurate :(:

  • In abrupt shutdown scenarios EventPipe is not shut down (because it is only shut down in graceful shutdown scenarios). So there is a likelihood that EventPipe would have been involved during process detach, even though the shutdown path clearly indicates (logically) that that is not the intention. This is again one of those things that is not meant to happen but is maybe happening unintentionally and accidentally due to a different fundamental problem.

@mjsabby
Copy link

mjsabby commented Oct 16, 2019

It'd be really nice if we could opt out of these rundowns. Fundamentally rundowns are really harmful for low latency programs with large code foot prints. I understand this is improving a case but in general it'd be nice to have this configurable.

@kouvel
Copy link
Member Author

kouvel commented Oct 16, 2019

It'd be really nice if we could opt out of these rundowns. Fundamentally rundowns are really harmful for low latency programs with large code foot prints. I understand this is improving a case but in general it'd be nice to have this configurable.

In the typical scenario ideally we would realize that there are no events that have been subscribed for, and avoid the vast majority of the work. I'm not sure that that is happening currently, as I kind of saw that more work is being done than necessary under a debugger when I didn't have a profiler active. Someone should probably verify. The rundown path should not have any impact on perf when not profiling. Or @mjsabby are you suggesting that you want to avoid the perf of rundown while profiling?

@noahfalk
Copy link
Member

Or @mjsabby are you suggesting that you want to avoid the perf of rundown while profiling?

I am guessing @mjsabby is referring to the issue described here.

@noahfalk will probably have a better idea of this.

EventPipe already bit the bullet here. If the trace is still running when the app abruptly shuts down you get a broken trace and we just tell users not to expect that to work. So far it hasn't created much noise though I imagine the usage level isn't that high yet. My inclination is to go the same way with ETW. Failing to get a complete trace may be annoying for the person who tried to collect it, but trying to gracefully complete work in an ungraceful shutdown feels like a recipe for ongoing bugs. Its hard for me to know how many .NET Core customers have a workflow that expects this work, but I'd be willing to get some unhappy feedback on it during upcoming previews if that lets us put ourselves on a more solid foundation.

@kouvel
Copy link
Member Author

kouvel commented Oct 17, 2019

I'm also surprised that the rundown events don't impact tracing at all. I would expect PerfView to use these events if you stop the trace after the process shuts down, as this would be the only time that module/method rundown occurs.

Looks like I was wrong about that. I see the method unload events and they are being used, as I'm seeing the methods in the CPU stacks view. I'm still seeing most of the samples in unknown address frames though, not sure why, but the info in the event seems to be accurate and the unknown address shown in the stack view falls within the range of the info given in the event. That may be an unrelated issue.

RE ASP.NET, I tested with a web app and shutting down the server with Ctrl+C is causing graceful shutdown of the process, so there shouldn't be any issues there.

@brianrob
Copy link
Member

Ok, good. I feel better. The fact that app-handled CTRL+C includes ASP.NET makes me much less concerned about this. I agree that trying to duck and dodge these sorts of issues isn't really a tenable plan. Let's do this for 5.0 and see how things go throughout the release. If we run into issues, we can re-visit how we handle rundown on shutdown.

@kouvel
Copy link
Member Author

kouvel commented Oct 17, 2019

Looks like I was wrong about that. I see the method unload events and they are being used, as I'm seeing the methods in the CPU stacks view.

On second thought, I think those method unload events are coming from a different path. PerfView is showing the rundown event count (DCStopVerbose, and the IL to native map events) but not showing them even with -KeepAllEvents, and I don't believe it's using them, which is probably why I'm seeing unknown addresses in the stack view. Probably it doesn't use info from the UnloadVerbose event to identify methods. Not sure why I'm seeing the method also in the stack view (but with very few samples), that is probably an unrelated issue.

@kouvel
Copy link
Member Author

kouvel commented Oct 17, 2019

Filed microsoft/perfview#1016 for that issue

@kouvel
Copy link
Member Author

kouvel commented Oct 18, 2019

Regarding https://github.com/dotnet/diagnostics/issues/570, it looks like when the process crashes, we don't even get the process detach event. It would be kind of unfortunate if taking a long trace that would have useful info and then the process crashes we wouldn't get rundown info. I really think we should do rundown when starting to profile, as that would also work with dynamic and collectible methods that may not exist anymore by process shutdown time or by the time profiling is stopped. Filed microsoft/perfview#1017 for that.

@kouvel kouvel deleted the EtwShutdownLockFix branch October 18, 2019 14:35
kouvel added a commit that referenced this pull request Oct 24, 2019
…w processing during shutdown (#27241)

* Protect against a rare invalid lock acquision attempt during etw processing during abrupt shutdown

Targeted and partial fix for https://github.com/dotnet/coreclr/issues/27129
- This is not a generic fix for the issue above, it is only a very targeted fix for an issue seen (a new issue introduced in 3.x). For a generic fix and more details, see the fix in 5.0: #27238.
- This change avoids taking a lock during process detach - a point in time when all other threads have already been abruptly shut down by the OS and locks may have been orphaned.
- The issue leads to a hang during shutdown when ETW tracing is enabled and the .NET process being traced begins the shutdown sequence at an unfortunate time - this is a probably rare timing issue. It would take the shutdown sequence to begin at just the point when a thread holds a particular lock and is terminated by the OS while holding the lock, then the OS sends the process detach event to the CLR, work during which then tries to acquire the lock and cannot because it is orphaned.
- The generic fix has broader consequences and is unlikely to be a reasonable change to make so late in the cycle, such a change needs some bake time and feedback. Hence this targeted fix for 3.x.

* Report tier as unknown when it cannot be determined

* Return unknown only on process detach
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants