-
Notifications
You must be signed in to change notification settings - Fork 67
[WIP] Library instrumentation guidance review #917
Changes from 1 commit
cd4b50a
9b8fd2e
4899086
4e27fb5
e3b689e
b3a1d3a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,53 +1,115 @@ | ||
# Guidance for instrumenting libraries with Diagnostic Source | ||
|
||
This document provides guidance for adding Diagnostic Source instrumentation to external libraries, which allows Application Insights and other monitoring solutions to collect meaningful and rich telemetry. | ||
This document provides guidance for adding diagnostic instrumentation to the library, which allows any monitoring solutions to collect meaningful and rich telemetry inside the process. | ||
|
||
# Instrumentation principles | ||
|
||
* **No instrumentation in user application**. When tracing system is injected (at runtime or compile-time), it automatically discovers all diagnostic sources in the app and may subscribe to any of them. | ||
* **Minimal performance impact**. In absence of listener, instrumentation has neglectable impact on the performance. When listener is present, it is able to sample events to minimize the reasonable impact instrumentation may have. | ||
* **Baked into the platform**. To make sure everything could be instrumented (e.g. .NET HttpClient or SqlClient), instrumentation primitives are part of .NET | ||
* **General purpose**. Could be consumed by any tracing/monitoring/profiling tool and the tool can choose level of details it gives its users | ||
|
||
## Diagnostic Source and Activities | ||
|
||
[Diagnostic Source][DiagnosticSourceGuide] is a simple module that allows code to be instrumented for production-time logging of rich data payloads for consumption within the process that was instrumented. At runtime, consumers can dynamically discover data sources and subscribe to the ones of interest. | ||
|
||
[Activity][ActivityGuide] is a class that allows storing and accessing diagnostics context and consuming it with logging system. | ||
[Activity][ActivityGuide] is a class that allows storing and accessing diagnostics context and consuming it with logging system. Activity flows with async calls and anything can request Activity.Current and correlate the trace/log with it. | ||
|
||
Both Diagnostic Source and Activity have been used to instrument [System.Net.Http][SystemNetHttp] and [Microsoft.AspNetCore.Hosting][MicrosoftAspNetCoreHosting]. | ||
|
||
More recently two new libraries were instrumented and that work was the basis for this guidance. These libraries are client SDKs for [Azure Event Hubs][MicrosoftEventHubs] and [Azure Service Bus][MicrosoftServiceBus], both of which support high throughput scenarios. | ||
|
||
[This document][DiagnosticSourceActivityHowto] goes into more details on how to efficiently use Diagnostic Source. | ||
|
||
## What should be instrumented | ||
|
||
The goal of instrumentation is to give the users the visibility to how particular operations are being performed inside the library. This information can be later used to diagnose performance or issues. It is up to the library authors to identify operations that the library performs and are worthy of monitoring. These operations can match the exposed API but also cover more specific internal logic (like outgoing service calls, retries, locking, cache utilization, awaiting system events, etc.). This way the users can get a good understanding of what's going on under the hood when they need it. | ||
|
||
## Instrumentation | ||
## Bare Minimum Instrumentation | ||
|
||
In the simplest case, the operation that is being monitored has to be wrapped by an activity. However in order to minimize performance impact the activity should be only created if there is any listener waiting for it. | ||
In the simplest case, the operation that is being monitored has to be wrapped by an activity. | ||
|
||
```csharp | ||
// create DiagnosticListener with unique name | ||
static DiagnosticListener source = new DiagnosticListener("Example.MyLibrary"); | ||
|
||
Activity activity = null; | ||
// create activity only when requested | ||
if (source.IsEnabled() && source.IsEnabled("Example.MyLibrary.MyOperation")) | ||
|
||
// check if listener is interested in this operation | ||
if (source.IsEnabled() && source.IsEnabled("MyOperation")) | ||
{ | ||
activity = new Activity("Example.MyLibrary.MyOperation"); | ||
source.StartActivity(activity, new { Input = input }); | ||
activity = new Activity("MyOperation"); | ||
activity.Start(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You can simplify to activity = new Activity("MyOperation").Start(); But we probably want to make a helper 'DiagnosticListener.StartIfEnabled(string)' which does the lines above (and inlines), and thus the guidance is to do
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See other comment. Firing the "Start"-event should be a MUST. It's a complete mess if this is optional and only done by a few libraries. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Having "activity.Start()" and "source.StartActivity()" is very confusing anyway and I too think that this should be simplified somehow in the API. |
||
} | ||
|
||
object output = null; | ||
try | ||
{ | ||
// perform the actual operation | ||
output = RunOperation(input); | ||
output = RunOperation(input); | ||
} | ||
catch (Exception ex) | ||
{ | ||
activity?.AddTag("error", true); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Tags values are strings so this has to be activity?.AddTag("error", "true"); There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should probably add special helpers like 'AddErrorTag' for canonical tags as we find they are very generic. That makes the code simpler and allows Activity to encode it more efficiently in the future. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. having 'status' on the Activity would be great once we agree on the contract! TL;DR; Status should still provide "OK" or "something else" (like OpenCensus does). Having strict enum-kind of mapping could be an option too, but it seems hard to do this mapping for each exception in the library Since the result of operation is special and essential, I think we should have a I put "error" as a placeholder and do not think we should use it. Boolean value as OpenTracing suggests here does not seem to be informative enough: HTTP, gRPC and perhaps other communication would rather provide a status code. OpenCensus suggests status code + details. While it is useful to stick to the tags suggested by OpenTracing, I don't really understand the guidance there as well. @cwe1ss, could you please advise us, what is the OpenTracing recommendation for errors I see suggestion to use
where possible. What about generic (non http) status recommendation in the tags? This recommendation translates to DiagnosticSource 'Exception' event rather than a tag on the Span/Activity. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Status codeOpenTracing currently only defines the "http.status_code" header and no non-http status codes. As there's no real protocol-independent standard (or none that I know of), I doubt that it makes sense to have a generic "status" tag/property. Instead, I think that libraries that have status codes should add their own tags - e.g. "grpc.status_code". ErrorsAs far as I understand, the OpenTracing "error" tag is for "Exception"-like cases, yes. Most OpenTracing instrumentation I've seen adds the "Exception" details via a Not sure what the best solution is here. The instrumentation code could only set the "error" tag and not log/raise the exception. This means it would rely on some parent span/activity do log it but what if there isn't one? Also, it might make debugging harder as the exception details are associated with an activity/span that did not raise it. Imagine there's a retry-scenario where something fails twice and succeeds on the third attempt. Who's responsible for logging the exceptions? There's also a difference between OpenTracing and Activity because with OpenTracing's "Log" concept, the exception always stays with the span whereas this is more decoupled in |
||
} | ||
finally | ||
{ | ||
// stop activity if started | ||
if (activity != null) | ||
source.StopActivity(activity, new { Input = input, Output = output }); | ||
source.StopActivity(activity, new MyStopPayload { Input = input, Output = output }); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should have a discussion about whether we should guide people to use explicit payload classes. They can improve efficiency and ease of use on the subscriber side, HOWEVER, they also are a mildly painful to use and do couple the receiver to the exact version of the sender. We do have 'fast' reflection APIs (see PropertyFetch class in DIagnosticSourceEventSource) that we can publically expose. These are 'pretty fast (< 10 instructions to fetch), and are probably good enough. If that is not sufficient, and we really want to recommend everyone making their own types, you should include the definition of MyStopPayload in the example and we need give guidance on naming (the type should live in the YOUR_COMPONENT.DiagnosticSource namespace) and we need to describe versioning, and the fact that using Properties (not fields) is better. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So we had a bit of discussion about it with @NickCraver in StackExchange/StackExchange.Redis#833 and https://github.com/dotnet/corefx/issues/18114. It seems we do recommend using explicit types in DiagnosticListnener guide already. Property fetching: we have Microsoft.Extensions.DiagnosticAdapter and while it does efficient payload parsing, it does not play well with Activities/sampling scenarios and deprecated events. Anyway, I agree we need more info about naming/versioning. I will add guidance from DiagnosticListnener guide. Eventually we can convert this doc +diagnostic listener + activity guide to several articles on docs.ms.com There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It would be great if "PropertyFetcher" became a first class citizen. I tried the DiagnosticAdapter project but I removed it again because I thought it was too complicated and of limited use. Using a |
||
} | ||
``` | ||
> ### *__TODO__ - provide a pointer to a document with more advanced instrumentation examples (WIP)* | ||
|
||
This minimum instrumentation give tracing system a chance to trace something like '<timestamp> [Example.MyLibrary] MyOperation, Id 'abc.1.2.3.', took <duration> ms, with properties [{ "error": true }]'. | ||
While in majority of cases some other details are desirable (and we will get to it [here](TBD)), this instrumentation gives bare minimum. | ||
|
||
### Instrumentation control and sampling considerations | ||
|
||
**DO** make `DiagnosticListener` name globally unique. | ||
Every DiagnosticListener has a name. Tracing system discovers all available sources and subscribes to them based on the name. | ||
[DiagnosticSource guide](https://github.com/dotnet/corefx/blob/master/src/System.Diagnostics.DiagnosticSource/src/DiagnosticSourceUsersGuide.md#naming-conventions) provides more naming considerations. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I've said in #913, this document should be merged with the existing UserGuides. |
||
|
||
**DO** call 'IsEnabled()' for every operation. | ||
The first `source.IsEnabled()` call efficiently tell if there is any listener in the system for our source. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. s/tell/tells maybe |
||
|
||
**DO** make operation name unique within the DiagnosticListener. Operation name is coarse name that remains the same for ALL operations of this kind (e.g. `HttpRequestIn` or `HttpRequestOut`) | ||
|
||
**DO** call 'IsEnabled(OperationName)' for every operation | ||
If there is a listener, we ask tracing system if it is interested in 'MyOperation' with `source.IsEnabled("MyOperation")`. It enables filtering based on the operation name, but more importantly gives a tracing system a chance to sample this trace out based on the current Activity: tracing system samples events consistently to trace all or nothing for high-level operation. | ||
|
||
**CONSIDER** providing more context via 'IsEnabled(OperationName, Input)' overload instaed of 'IsEnabled(OperationName)' | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. instead* |
||
In some cases tracing system needs more context to filter operations. A good example is tracing system that tracks HTTP calls and sends telemetry over HTTP as well. To avoid recursive tracing and save performance tracing system want to prevent such calls to be instrumented at all based on some request properties. | ||
|
||
**CONSIDER** firing 'MyOperation.Start' event | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Firing the "Start" event should be a MUST in my opinion. Having only some libraries that do this is a complete mess. It also completely breaks adapters that must do stuff on every activity start event. |
||
In many cases it makes sense to send Start event to tracing system. It may use it to inject some context into the `Input`. However tracing system may not be interested in such event at all, so guard 'Start' event with `IsEnabled` cliche, | ||
but make sure `Activity` is started in any case: | ||
|
||
|
||
```csharp | ||
if (source.IsEnabled("MyOperation.Start")) | ||
source.StartActivity(activity, new MyStartPayload {Input = input}); | ||
else | ||
activity.Start(); | ||
``` | ||
|
||
DO NOT cache results of IsEnabled: tracing system may subscribe/unsubscribe at any moment. Results of IsEnabled are dynamic and depend on the context (`Activity.Current`, `Input`). | ||
|
||
### Context Propagation | ||
|
||
In distributed systems, one user operation may be processed by many applications. To trace operation end-to-end, some context must be propagated along with external call. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This needs more meat. Basically an example (or leave it out, because as it is it does not help anyone). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You know my concerns about having hard-coded header names etc. in instrumentation code. We should have a proper discussion about this at some point. Until then, "context propagation" should not be documented IMO. |
||
|
||
#### Injection (outgoing calls) | ||
|
||
Library should inject the context into the wired protocols if there is a standard defined for this protocol and library implements this protocol and let tracing system add/change correlation context. | ||
|
||
* DO fire 'MyOperation.Start' (guarded by IsEnabled) event if your operation involves out-of-process process call. Provide raw request (`Input`) in the event payload so tracing system could inject context. | ||
* DO include raw response into the 'MyOperation.Stop' event payload (`Output`): tracing system may want to access response and parse context that back propagated back in response. | ||
* DO propagate context if the standard for the propagation exists. Some servers (e.g. SQL) could support correlating internal logs with external operations and define metadata fields for correlation identifiers, however may require specific identifier(s) format. Use `activity.Id` if possible, otherwise generate a compatible id and add a tag: `activity.AddTag("sqlCorrelationId", myGeneratedId)` | ||
* DO NOT propagate context if your library only uses HTTP protocol. .NET handles HTTP layer, and tracing system may alter it. | ||
* CONSIDER If the standard does not exists, but you see a value in propagating context for correlation, use the names and format suggested by the [W3C distributed tracing](https://w3c.github.io/distributed-tracing). Use [TBD] field in `tracestate` for `activity.Id`. | ||
|
||
#### Extraction (incoming calls) | ||
|
||
TODO | ||
|
||
### Payload | ||
|
||
When starting and stopping an activity, it is possible to pass a payload object, which will be available for the listener. This way the listener can access raw, non-serialized data in real time. It can be useful for pulling out additional diagnostic information but also manipulating data as it is being processed (for example, inject diagnostic context before an outbound call is made). Good examples of such payload are ```HttpRequestMessage``` or messages that are being passed through queues. | ||
|
@@ -56,6 +118,14 @@ Mind that payload is not preserved as part of Activity and is only available whe | |
|
||
#### Payload format | ||
|
||
TODO: | ||
DO Use public concrete types. | ||
DO Use PascalCase for properties | ||
DO ensure back-compat, consider it as public API area | ||
Remove defined payload properties and define tags | ||
Describe payload usage scenarios, | ||
How to efficiently parse payloads for anonymous types | ||
|
||
Diagnostic source event and activity start/stop API allows to specify only a single payload object. However, in order to pass more data and allow future additions the recommendation is to use dynamic objects. Since these are .NET objects the names of particular properties should follow [standard .NET naming convention][DotNetPropertyNamingConvention]. | ||
|
||
Here are some recommendations for typical payload property names: | ||
|
@@ -67,26 +137,10 @@ Here are some recommendations for typical payload property names: | |
| `Status` | The ```TaskStatus``` of a completed asynchronous task | | ||
| `Exception` | The captured exception object | | ||
|
||
### Performance | ||
|
||
In order to avoid unnecessary overhead, it is highly recommended to check if there is any listener for given activity: | ||
|
||
```csharp | ||
source.IsEnabled() && source.IsEnabled("Example.MyLibrary.MyOperation") | ||
``` | ||
|
||
The parameterless ```source.IsEnabled()``` check should be put before any other as it is very efficient and can virtually prevent any overhead in absence of any listener for given diagnostic source. | ||
|
||
It is also possible to specify additional context payload when making that call to allow the listener to make a more informed decision (for example, listeners may only be interested in activities for certain endpoint or partition). However, since this call is performed for every operation, it is NOT recommended to build a dynamic payload object as it was described earlier. Instead the raw input objects should be specified directly in the call - the Diagnostic Source API allows to specify up to two payload objects: | ||
|
||
```csharp | ||
source.IsEnabled("Example.MyLibrary.MyOperation", input1, input2) | ||
``` | ||
|
||
For more detailed performance considerations, refer to [Diagnostic Source][DiagnosticSourceGuide] and [Activity][ActivityGuide] guides. | ||
|
||
### Tags | ||
|
||
TODO: describe tags-only scenarios | ||
|
||
Activities can have additional tracing information in tags. Tags are meant to be easily, efficiently consumable and are expected to be logged with the activity without any processing. As such they should only contain essential information that should be made available to users to determine if the activity is of interest to them. All of the rich details should be made available in the payload. | ||
|
||
Tags can be added to activity at any time of its existence until it is stopped. This way the activity can be enriched with information from operation input, output and/or exceptions. Note however that they should be specified as early as possible so that the listeners can have the most context. In particular, all available tags should be set before starting the activity. | ||
|
@@ -106,7 +160,6 @@ A single application can include numerous libraries instrumented with Diagnostic | |
|
||
[DiagnosticSourceGuide]: https://github.com/dotnet/corefx/blob/master/src/System.Diagnostics.DiagnosticSource/src/DiagnosticSourceUsersGuide.md | ||
[ActivityGuide]: https://github.com/dotnet/corefx/blob/master/src/System.Diagnostics.DiagnosticSource/src/ActivityUserGuide.md | ||
[DiagnosticSourceActivityHowto]: https://github.com/lmolkova/correlation/wiki/How-to-instrument-library-with-Activity-and-DiagnosticSource | ||
[OpenTracingNamingConvention]: https://github.com/opentracing/specification/blob/master/semantic_conventions.md#span-tags-table | ||
[SystemNetHttp]: https://github.com/dotnet/corefx/blob/master/src/System.Net.Http/src/System/Net/Http/DiagnosticsHandler.cs | ||
[MicrosoftAspNetCoreHosting]: https://github.com/aspnet/Hosting/blob/dev/src/Microsoft.AspNetCore.Hosting/Internal/HostingApplicationDiagnostics.cs | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change Example.MyLibrary, to something like 'NameSpace.ComponentName' to give people guidance on what a good name looks like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This static variable should be explained in a separate code block IMO to better differentiate it from the actual instrumentation code.