-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Process/send etw rundown events only during graceful shutdown and not during process detach (a form of abrupt shutdown) #27238
Conversation
… during process detach (a form of abrupt shutdown) Longer-term fix for https://github.com/dotnet/coreclr/issues/27129: - Etw rundown events sent during process shutdown are currently (and have for a long time) been sent during process detach. By that time, all other threads have been abruptly terminated by the OS, and as a result the state of the system is fundamentally unpredictable. - In this particular case, locks have been orphaned by threads that have been abruptly terminated, so taking locks is not feasible during processing of rundown events, and if acquiring locks were to be avoided based on such knowledge (not recommended, this would get messy), we'd have to resort to providing information that would not accurately reflect the state, in the events - I consider any situation where process detach occurs before an opportunity to handle graceful shutdown (that is, the runtime is unaware that a shutdown is about to happen and does not have an opportunity to handle shutdown prior to process detach (before the OS already shuts some things down)), then that is abrupt shutdown and in that scenario all bets are off - in the case of this change, etw rundown events would not be sent - This change has the following effects: - Graceful shutdown such as returning from `Main` or `Environment.Exit()` will send rundown events very slighly earlier than before. Background threads will still be running and there may be other etw events interspersed among rundown events and sent after rundown events. - On Windows, Ctrl+C and Ctrl+Break are not handled by the runtime and by default result in abrupt termination. The only indication the runtime gets is the process detach event, by which time the OS has already terminated all other threads - When these events are not handled (by the runtime or by the app), this is an abrupt shutdown scenario and rundown events will not be sent - When these events are handled by the app and canceled along with `Environment.Exit()`, that converts these events into graceful shutdown (see above). If an app handles these events and chooses to not cancel the event, the event remains unhandled and leads to abrupt shutdown (see immediately above). - On Unixes, there is no significant change. SIGTERM is graceful shutdown as described above and there are no similar issues of abrupt shutdown. - There is an option of sending rundown events upon process detach (when we don't have an opportunity to do so gracefully), but as I described above that will get messy and is not a path that we should be headed down
Adding one more point that I forgot to mention:
|
Do we have a sense of where else we see hangs during rundown? I'm trying to rationalize the thought of not getting any rundown events on processes where someone hits CTRL+C. This is fairly common for ASP.NET Core apps, and I wouldn't want to not get rundown events for this. I'm wondering if there are other paths here that we know of where we've seen hangs, or if it's just this particular lock. If it's just this lock, you could imagine attempting to special case this one. I'm also surprised that the rundown events don't impact tracing at all. I would expect PerfView to use these events if you stop the trace after the process shuts down, as this would be the only time that module/method rundown occurs. There are some other concerns around EventPipe as well, as it will need to flush the rest of the file to disk. I'm not sure how this impacts the experience in dotnet-trace, but @noahfalk will probably have a better idea of this. I guess all of this is to say that I'm not yet ready to give up on rundown in these cases just yet. |
Perhaps they should be handled by the runtime if the application itself does not handle these? |
…essing during abrupt shutdown Targeted and partial fix for https://github.com/dotnet/coreclr/issues/27129 - This is not a generic fix for the issue above, it is only a very targeted fix for an issue seen (a new issue introduced in 3.x). For a generic fix and more details, see the fix in 5.0: dotnet#27238. - This change avoids taking a lock during process detach - a point in time when all other threads have already been abruptly shut down by the OS and locks may have been orphaned. - The issue leads to a hang during shutdown when ETW tracing is enabled and the .NET process being traced begins the shutdown sequence at an unfortunate time - this is a probably rare timing issue. It would take the shutdown sequence to begin at just the point when a thread holds a particular lock and is terminated by the OS while holding the lock, then the OS sends the process detach event to the CLR, work during which then tries to acquire the lock and cannot because it is orphaned. - The generic fix has broader consequences and is unlikely to be a reasonable change to make so late in the cycle, such a change needs some bake time and feedback. Hence this targeted fix for 3.x.
…essing during abrupt shutdown Targeted and partial fix for https://github.com/dotnet/coreclr/issues/27129 - This is not a generic fix for the issue above, it is only a very targeted fix for an issue seen (a new issue introduced in 3.x). For a generic fix and more details, see the fix in 5.0: dotnet#27238. - This change avoids taking a lock during process detach - a point in time when all other threads have already been abruptly shut down by the OS and locks may have been orphaned. - The issue leads to a hang during shutdown when ETW tracing is enabled and the .NET process being traced begins the shutdown sequence at an unfortunate time - this is a probably rare timing issue. It would take the shutdown sequence to begin at just the point when a thread holds a particular lock and is terminated by the OS while holding the lock, then the OS sends the process detach event to the CLR, work during which then tries to acquire the lock and cannot because it is orphaned. - The generic fix has broader consequences and is unlikely to be a reasonable change to make so late in the cycle, such a change needs some bake time and feedback. Hence this targeted fix for 3.x.
Yes if that's the behavior we would want - that those events would result in graceful shutdown, then that should be spec'ed and handling them would be the proper thing to do. That has not been the case for a long time though, and whether that would be desirable is a good question and needs to be determined. |
If Ctrl+C is an expected way to exit the process, I expect that the app would appropriately handle that in a way that would lead to graceful shutdown. I don't know the details though, I'll defer to others who would probably know more - CC @sebastienros @davidfowl
That is what I'm attempting to do for 3.x, however, that is not a reasonable long-term solution since there is a fundamental problem here.
I've pointed this out before,, try it out yourself, it's a pretty easy repro. Aside from a potential fix to this issue from the PerfView side, I think the rundown architecture is generally broken and there are issues it cannot solve, but that is a topic that is beyond the scope of this PR.
This has no effect on |
Well that last part is not entirely accurate :(:
|
It'd be really nice if we could opt out of these rundowns. Fundamentally rundowns are really harmful for low latency programs with large code foot prints. I understand this is improving a case but in general it'd be nice to have this configurable. |
In the typical scenario ideally we would realize that there are no events that have been subscribed for, and avoid the vast majority of the work. I'm not sure that that is happening currently, as I kind of saw that more work is being done than necessary under a debugger when I didn't have a profiler active. Someone should probably verify. The rundown path should not have any impact on perf when not profiling. Or @mjsabby are you suggesting that you want to avoid the perf of rundown while profiling? |
I am guessing @mjsabby is referring to the issue described here.
EventPipe already bit the bullet here. If the trace is still running when the app abruptly shuts down you get a broken trace and we just tell users not to expect that to work. So far it hasn't created much noise though I imagine the usage level isn't that high yet. My inclination is to go the same way with ETW. Failing to get a complete trace may be annoying for the person who tried to collect it, but trying to gracefully complete work in an ungraceful shutdown feels like a recipe for ongoing bugs. Its hard for me to know how many .NET Core customers have a workflow that expects this work, but I'd be willing to get some unhappy feedback on it during upcoming previews if that lets us put ourselves on a more solid foundation. |
Looks like I was wrong about that. I see the method unload events and they are being used, as I'm seeing the methods in the CPU stacks view. I'm still seeing most of the samples in unknown address frames though, not sure why, but the info in the event seems to be accurate and the unknown address shown in the stack view falls within the range of the info given in the event. That may be an unrelated issue. RE ASP.NET, I tested with a web app and shutting down the server with Ctrl+C is causing graceful shutdown of the process, so there shouldn't be any issues there. |
Ok, good. I feel better. The fact that app-handled CTRL+C includes ASP.NET makes me much less concerned about this. I agree that trying to duck and dodge these sorts of issues isn't really a tenable plan. Let's do this for 5.0 and see how things go throughout the release. If we run into issues, we can re-visit how we handle rundown on shutdown. |
On second thought, I think those method unload events are coming from a different path. PerfView is showing the rundown event count (DCStopVerbose, and the IL to native map events) but not showing them even with |
Filed microsoft/perfview#1016 for that issue |
Regarding https://github.com/dotnet/diagnostics/issues/570, it looks like when the process crashes, we don't even get the process detach event. It would be kind of unfortunate if taking a long trace that would have useful info and then the process crashes we wouldn't get rundown info. I really think we should do rundown when starting to profile, as that would also work with dynamic and collectible methods that may not exist anymore by process shutdown time or by the time profiling is stopped. Filed microsoft/perfview#1017 for that. |
…w processing during shutdown (#27241) * Protect against a rare invalid lock acquision attempt during etw processing during abrupt shutdown Targeted and partial fix for https://github.com/dotnet/coreclr/issues/27129 - This is not a generic fix for the issue above, it is only a very targeted fix for an issue seen (a new issue introduced in 3.x). For a generic fix and more details, see the fix in 5.0: #27238. - This change avoids taking a lock during process detach - a point in time when all other threads have already been abruptly shut down by the OS and locks may have been orphaned. - The issue leads to a hang during shutdown when ETW tracing is enabled and the .NET process being traced begins the shutdown sequence at an unfortunate time - this is a probably rare timing issue. It would take the shutdown sequence to begin at just the point when a thread holds a particular lock and is terminated by the OS while holding the lock, then the OS sends the process detach event to the CLR, work during which then tries to acquire the lock and cannot because it is orphaned. - The generic fix has broader consequences and is unlikely to be a reasonable change to make so late in the cycle, such a change needs some bake time and feedback. Hence this targeted fix for 3.x. * Report tier as unknown when it cannot be determined * Return unknown only on process detach
Longer-term fix for https://github.com/dotnet/coreclr/issues/27129:
Main
orEnvironment.Exit()
will send rundown events very slighly earlier than before. Background threads will still be running and there may be other etw events interspersed among rundown events and sent after rundown events.Environment.Exit()
, that converts these events into graceful shutdown (see above). If an app handles these events and chooses to not cancel the event, the event remains unhandled and leads to abrupt shutdown (see immediately above).