Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define and implement behavior of tracing functionality when EventSource is disabled via a feature switch #43657

Closed
Tracked by #93172 ...
vitek-karas opened this issue Oct 20, 2020 · 15 comments
Assignees
Labels
area-System.Diagnostics.Tracing enhancement Product code improvement that does NOT require public API changes/additions linkable-framework Issues associated with delivering a linker friendly framework
Milestone

Comments

@vitek-karas
Copy link
Member

In .NET 5 we introduced a new feature switch EventSourceSupport - it effectively disables all event sources (typical observed behavior is that they can't be enabled and thus don't produce any events). For the most part this has been implemented for Mono/WASM scenarios which don't support event source anyway, so the code was effectively unused.

If used on CoreCLR, there's still quite a bit of code left in the app (which linker can't remove right now) related to EventPipe - currently all of this code is rooted. It's easy to remove it via the EventSourceSupport feature switch, but it's unclear what that will do to the EventPipe infrastructure.

Specifically:

  • We need to make sure this doesn't cause failures in the application itself
  • We need to determine the desired behavior of all in-proc tracing scenarios (this probably already works to a point - as in in-proc listeners will work, but won't receive any events).
  • We need to determine the desired behavior of all out-of-proc tracing scenarios - specifically dotnet trace collect and similar.

This issue is meant to track both the design an implementation of these changes.

@vitek-karas vitek-karas added area-System.Diagnostics.Tracing linkable-framework Issues associated with delivering a linker friendly framework labels Oct 20, 2020
@vitek-karas vitek-karas added this to the 6.0.0 milestone Oct 20, 2020
@ghost
Copy link

ghost commented Oct 20, 2020

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Oct 20, 2020
@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Oct 20, 2020
@noahfalk noahfalk added feature-request enhancement Product code improvement that does NOT require public API changes/additions and removed feature-request labels Nov 10, 2020
@LakshanF LakshanF self-assigned this Dec 6, 2020
@LakshanF
Copy link
Member

LakshanF commented Dec 6, 2020

Validated the application and in-proc behavior with the latest Corelib changes that includes the fix when native code calls this feature.

@eerhardt added test related to this that should be sufficient.

Need to chat with the diagnostics folks to that I'm validating the tool behavior correctly.

@vitek-karas
Copy link
Member Author

Related to #43635.

@LakshanF
Copy link
Member

LakshanF commented Dec 7, 2020

@noahfalk, Can you review additional testing that would be interesting for this please?

I removed the types EventPipe, NativeRuntimeEventSource from CoreLib's root descriptor to explore the behavior when the feature-switch, EventSourceSupport, is set to false. I tested this lightly as follows;

  1. Smoke test that does not have any references to types in the System.Diagnostics.Tracing ns

  2. Test that has EventSource and EventListener in it: EventSource.GetSources().Count() returns null and other light testing shows reasonable behavior

Interestingly, if I turn the feature switch on, the behavior seems to be as expected for these types (linker is doing its job) but I'm not very confident that I test all interesting scenarios here

  1. dotnet-trace: the tool immediately exits after the initial header information which seems reasonable (no error message though). The resulting trace.nettrace (around 1K) cannot be opened by perfview

FastSerialization.IOStreamStreamReader.Fill(Int32 minimum) ("Error: Exception EventPipe conversion: System.Exception: Read past end of stream.")

@eerhardt
Copy link
Member

eerhardt commented Dec 7, 2020

@eerhardt added test related to this that should be sufficient.

That test only covers the scenario where the app is not trimmed, and the EventSource.IsSupported AppContext switch is set to false.

If we are going to unroot the EventPipe code, we should probably add a "TrimmingTest" that trims the app with the EventSource feature switch set to false, and then ensure the app still works correctly.

See

<TestConsoleAppSourceFiles Include="VerifyResourcesGetTrimmedTest.cs">
<!-- Setting the Trimming feature switch to make sure that the Resources get trimmed by the linker
as this test will ensure exceptions are using Resource keys -->
<ExtraTrimmerArgs>--feature System.Resources.UseSystemResourceKeys true</ExtraTrimmerArgs>
</TestConsoleAppSourceFiles>

for how to set feature switches in our current "TrimmingTests" support.

@tommcdon
Copy link
Member

tommcdon commented Dec 7, 2020

@sywhang @josalem

@sywhang
Copy link
Contributor

sywhang commented Dec 7, 2020

Thanks for the heads up @tommcdon.

If used on CoreCLR, there's still quite a bit of code left in the app (which linker can't remove right now) related to EventPipe - currently all of this code is rooted. It's easy to remove it via the EventSourceSupport feature switch, but it's unclear what that will do to the EventPipe infrastructure.

Actually I've been meaning to remove a large chunk of this rooted EventPipe code. I've managed to get the internal partner team that uses reflection to access these private methods to stop doing that and move to officially supported API. I got the OK from them on removing these, so I'll follow up with a PR to remove the unused bits (there are some tests that use these, so I'll have to rewrite some of them) in a day or two.

@josalem
Copy link
Contributor

josalem commented Dec 8, 2020

@LakshanF, you may want to also check for behavior of other tools, specifically ETW tracing, e.g., PerfView collection.

Another potential pitfall is the infrastructure for forwarding native events to managed EventSources. You need to validate that turning on the native providers via ETW/EventPipe while EventSource* is trimmed out doesn't cause the native diagnostics code to break.

dotnet-trace: the tool immediately exits after the initial header information which seems reasonable (no error message though). The resulting trace.nettrace (around 1K) cannot be opened by perfview

This sounds like something went wrong while tracing, causing the trace to be cut prematurely (In the face of most failures, the IPC connections are closed unceremoniously by the runtime). What providers did you turn on when you were collecting this trace? What was the state of the diagnostics server after this, could you start a second trace?

In my mind, trimming out EventSource shouldn't preclude a user from doing CPU profiling. That doesn't require any managed eventing code if I recall. It also shouldn't break the Diagnostic Port infrastructure (IPC for diagnostic tools) even if eventing sources aren't available.

@marek-safar
Copy link
Contributor

/cc @lateralusX

@LakshanF
Copy link
Member

The consensus is to honor the developer intent when they switch off EventSource feature and not enable any events from the application. The user experience when they use an external tool like dotnet-trace should be to fail gracefully. The current behavior with dotnet-trace is not optimal since it crashes the app and generates a trace file that is corrupted (issue #46047).
The reason for honoring the developer intention with turning the feature switch off is as follows;

  • We don’t expect for example server apps to turn off event source – it provides valuable functionality, so it should be kept
  • We expect for example UI apps, or simple command line tools to choose to disable event source – it doesn’t provide much value and the size requirements are stricter in these cases.
  • We expect to keep event source enabled by default for majority of app models (Issue System.Diagnostics.ProcessTests.ProcessTest.Process_MainModule test failure in CI #14475) it would be a developer explicit choice to disable it.

We also need to document the expected behavior very clearly when this feature is turned off. In terms of diagnostics impact, there may be additional things that disabling EventSource may introduce that are not very obvious to the end user:

  • No metrics can be emitted out of the app because EventCounters will not work.
  • Several VS profilers (i.e. Visual Studio Async Profiler, Database Profiler) will not work – These depend on TplEventSource and DiagnosticSourceEventSource working.
  • OpenTelemetry tracing – tracing with OT will be broken because it depends on DiagnosticSourceEventSource.
  • EventListeners will not work as they are tightly coupled with EventSource. Separating them out will be a significant chunk of work.

@LakshanF
Copy link
Member

LakshanF commented Mar 5, 2021

Still have the NativeRuntimeEventSource to unroot but updating with the latest thinking from the team

External Tooling Experience

Impact on EventPipe-dependent tools (dotnet-trace)
We can add logic in diagnostics server that will check with System.Private.CoreLib.dll to detect that the EventSource support is turned off. Since .NET Core diagnostic tools such as dotnet-trace communicate with target runtime through the diagnostics server, these tools will be able to recognize that the EventSource support is turned off in the impacted application. The tools can then provide a warning to the user that the application will not emit any events. Specifically, to enable dotnet-trace to report the issue as a warning immediately, and also log an event via the CLR public provider. This would make it really easy to (a) see the issue immediately for scenarios where we can control this, and (b) make it easy to tell from a trace why events are missing. Scenario (b) is super important, because we often get traces where something went wrong, and being able to diagnose this from the trace itself is very valuable. (c) - document the boundary of what works and doesn't so that users (and even ourselves) know what to expect. This would be both at the tooling level and also at the API / RPC interfaces where tool authors and runtime developers interact.

Impact on Visual Studio Profiler
For tools that rely on ETW such as VS Profiler, we can emit an event under the “Microsoft-Windows-DotNETRuntime” for indicating that EventSource is turned off that is always fired (from native side) when EventSource support is turned off. For example, VS profiler will not be able to get any events from Windows apps that turned off EventSource – but it will receive an event from “Microsoft-Windows-DotNETRuntime” that indicates the target app turned off EventSource via EventSourceSupport switch.

Impact on Distributed Tracing
DiagnosticSourceEventSource is an EventSource, so it will impact distributed tracing/OpenTelemetry scenarios as well. But DiagnosticSourceEventSource doesn't play a signficant role in distributed tracing scenarios today. Its reasonable for apps that turned off EventSource support to lose this option until later.

Experience with turning on EventSource Providers
External processes that turn on specific EventSource Providers will not get events from these providers. EventPipe, ETW, and LTTng will cache configuration of enabled providers until they come into existence in the application. Since the EventSource support is turned off, these EventSource provider will never come into existence in the impacted application

Ability for a well known provider to give information that EventSource is turned off
The decision to turn off native events like GC events in the case where EventSource support is turned off has not been decided yet. Many of these can work as they are emitted directly from the native side of the runtime – but some of the events being fired from managed components such as the managed threadpool will not work properly. Other managed events that depend on EventSource such as TplEventSource, DiagnosticSourceEventSource will be impacted.

@vitek-karas
Copy link
Member Author

We should not make functional decisions based on the current location of the event firing code. Currently it's easy to keep all of the native code call sites even if EventSource is turned off, but it's something we should not do. It would prevent us from freely choosing if a given functionality can be in native or managed. We're already migrating large parts of the native runtime to managed and if we did the above this would become problematic.

Having one special event fired from native seems OK, but I would try to limit only to that and nothing else. Ideally we would not even have that - in the most size constrained environments we would like to remove all of event tracing code from the app, including the native parts (one day). So ideally the tooling should be able to handle those cases - at least somewhat gracefully.

It's similar to debugging - I could imagine one day we support some special way to produce an app which doesn't even start the debugger thread and doesn't open the debugger IPC. (Both as a security/reliability as well as size feature).

@noahfalk
Copy link
Member

noahfalk commented Mar 10, 2021

+1 on @vitek-karas answer. I think there is a few other reasons that lead to a similar conclusion:

  1. If we disable functionality we have to describe the scope of what was disabled to users so they understand what to expect. Saying "if you do X, all events will be disabled" is fairly straightforward. "If you do X all of the events implemented in managed code will be disabled, the exact list of what this includes varies by runtime SKU and version" is harder to understand.
  2. Disabling all events offers a simple implementation option for EventPipe - fail the initial request to create a session and fail the user's request to do tracing in the tool UI. Dealing with partial success/partial error is much messier.

@sywhang
Copy link
Contributor

sywhang commented Mar 10, 2021

As long as we can describe the diagnostic impact to the users in a clear way when the feature switch is disabled, I think it is fine to just disable all tracing bottoms-up.

As previously discussed, for tools that rely on ETW (i.e. VS Profiler), it may be worth firing an event that shows that the feature switch has been turned off in the app, so that it is diagnosable when no events show up in the trace.

@LakshanF
Copy link
Member

Closing this issue for this round of changes that have been done:

  1. Worked on the 2 types, EventPipe, NativeRuntimeEventSource in https://github.com/dotnet/runtime/blob/main/src/coreclr/System.Private.CoreLib/src/ILLink/ILLinkTrim.xml
  2. Opened a doc issue and the tooling issues (links above) after discussions with the Diagnostics team and other stakeholders
  3. There is around 12% size reduction with the feature switch turned off
  4. The remaining types in the namespace System.Diagnostics.Tracing other than enums are EventSource and a couple of FX event sources (FrameworkEventSource, NativeRuntimeEventSource), and ActivityTracker

@ghost ghost locked as resolved and limited conversation to collaborators Apr 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Diagnostics.Tracing enhancement Product code improvement that does NOT require public API changes/additions linkable-framework Issues associated with delivering a linker friendly framework
Projects
None yet
Development

No branches or pull requests

10 participants