diff --git a/oteps/0001-telemetry-without-manual-instrumentation.md b/oteps/0001-telemetry-without-manual-instrumentation.md index 870ac348d77..4a65f59fe22 100644 --- a/oteps/0001-telemetry-without-manual-instrumentation.md +++ b/oteps/0001-telemetry-without-manual-instrumentation.md @@ -1,7 +1,5 @@ # (Open) Telemetry Without Manual Instrumentation -**Status:** `approved` - _Cross-language requirements for automated approaches to extracting portable telemetry data with zero source code modification._ ## Motivation @@ -25,29 +23,28 @@ Many people have correctly observed that “agent” design is highly language-d ### Requirements Without further ado, here are a set of requirements for “official” OpenTelemetry efforts to accomplish zero-source-code-modification instrumentation (i.e., “OpenTelemetry agents”) in any given language: + * _Manual_ source code modifications "very strongly discouraged", with an exception for languages or environments that leave no credible alternatives. Any code changes must be trivial and `O(1)` per source file (rather than per-function, etc). * Licensing must be permissive (e.g., ASL / BSD) * Packaging must allow vendors to “wrap” or repackage the portable (OpenTelemetry) library into a single asset that’s delivered to customers - * That is, vendors do not want to require users to comprehend both an OpenTelemetry package and a vendor-specific package + * That is, vendors do not want to require users to comprehend both an OpenTelemetry package and a vendor-specific package * Explicit, whitebox OpenTelemetry instrumentation must interoperate with the “automatic” / zero-source-code-modification / blackbox instrumentation. 
- * If the blackbox instrumentation starts a Span, whitebox instrumentation must be able to discover it as the active Span (and vice versa) - * Relatedly, there also must be a way to discover and avoid potential conflicts/overlap/redundancy between explicit whitebox instrumentation and blackbox instrumentation of the same libraries/packages - * That is, if a developer has already added the “official” OpenTelemetry plugin for, say, gRPC, then when the blackbox instrumentation effort adds gRPC support, it should *not* “double-instrument” it and create a mess of extra spans/etc + * If the blackbox instrumentation starts a Span, whitebox instrumentation must be able to discover it as the active Span (and vice versa) + * Relatedly, there also must be a way to discover and avoid potential conflicts/overlap/redundancy between explicit whitebox instrumentation and blackbox instrumentation of the same libraries/packages + * That is, if a developer has already added the “official” OpenTelemetry plugin for, say, gRPC, then when the blackbox instrumentation effort adds gRPC support, it should *not* “double-instrument” it and create a mess of extra spans/etc * From the standpoint of the actual telemetry being gathered, the same standards and expectations (about tagging, metadata, and so on) apply to "whitebox" instrumentation and automatic instrumentation * The code in the OpenTelemetry package must not take a hard dependency on any particular vendor/vendors (that sort of functionality should work via a plugin or registry mechanism) - * Further, the code in the OpenTelemetry package must be isolated to avoid possible conflicts with the host application (e.g., shading in Java, etc) - + * Further, the code in the OpenTelemetry package must be isolated to avoid possible conflicts with the host application (e.g., shading in Java, etc) ### Nice-to-have properties * Run-time integration (vs compile-time integration) * Automated and modular testing of individual library/package 
plugins - * Note that this also makes it easy to test against multiple different versions of any given library + * Note that this also makes it easy to test against multiple different versions of any given library * A fully pluggable architecture, where plugins can be registered at runtime without requiring changes to the central repo at github.com/open-telemetry - * E.g., for ops teams that want to write a plugin for a proprietary piece of legacy software they are unable to recompile + * E.g., for ops teams that want to write a plugin for a proprietary piece of legacy software they are unable to recompile * Augmentation of whitebox instrumentation by blackbox instrumentation (or, perhaps, vice versa). That is, not only can the trace context be shared by these different flavors of instrumentation, but even things like in-flight Span objects can be shared and co-modified (e.g., to use runtime interposition to grab local variables and attach them to a manually-instrumented span). - ## Trade-offs and mitigations Approaching a problem this language-specific at the cross-language altitude is intrinsically challenging since "different languages are different" – e.g., in Go there is no way to perform the kind of runtime interpositioning that's possible in Python, Ruby, or even Java. @@ -59,6 +56,7 @@ There is also a school of thought that we should only be focusing on the bits an ### What is our desired end state for OpenTelemetry end-users? To reiterate much of the above: + * First and foremost, **portable OpenTelemetry instrumentation can be installed without manual source code modification** * There’s one “clear winner” when it comes to portable, automatic instrumentation; just like with OpenTracing and OpenCensus, this is a situation where choice is not necessarily a good thing. End-users who wish to contribute instrumentation plugins should not have their enthusiasm and generosity diluted across competing projects.
* As much as such a thing is possible, consistency across languages @@ -72,7 +70,7 @@ Given the desired end state, the Datadog tracers seem like the closest-fit, perm ### The overarching (technical) process, per-language -* Start with [the Datadog `dd-trace-foo` tracers](https://github.com/DataDog?utf8=✓&q=dd-trace&type=source&language=) +* Start with [the Datadog `dd-trace-foo` tracers](https://github.com/DataDog) * For each language: * Fork the Datadog `datadog/dd-trace-foo` repo into a `open-telemetry/auto-instr-foo` OpenTelemetry repo (exact naming TBD) * In parallel: @@ -102,12 +100,14 @@ Each `auto-instr-foo` repository must have at least one [Maintainer](https://git ## Prior art and alternatives There are many proprietary APM language agents – no need to survey them all here. There is a much smaller list of "APM agents" (or other auto-instrumentation efforts) that are already permissively-licensed OSS. For instance, when we met to discuss options for JVM (longer notes [here](https://docs.google.com/document/d/1ix0WtzB5j-DRj1VQQxraoqeUuvgvfhA6Sd8mF5WLNeY/edit#heading=h.kjctiyv4rxup)), we came away with the following list: + * [Honeycomb's Java beeline](https://github.com/honeycombio/beeline-java) * [Datadog's Java tracer](https://github.com/datadog/dd-trace-java) * [Glowroot](https://glowroot.org/) * [SpecialAgent](https://github.com/opentracing-contrib/java-specialagent) The most obvious "alternative approach" would be to choose "starting points" independently in each language. 
This has several problems: + * Higher likelihood of "hard forks": we want to avoid an end state where two projects (the OpenTelemetry version, and the original version) evolve – and diverge – independently * Higher likelihood of "concept divergence" across languages: while each language presents unique requirements and challenges, the Datadog auto-instrumentation libraries were written by a single organization with some common concepts and architectural requirements (they were also written to be OpenTracing-compatible, which greatly increases our odds of success given the similarities to OpenTelemetry) * Datadog would also like a uniform strategy here, and this donation requires their consent (unless we want to do a hard fork, which is suboptimal for everyone). So starting with the Datadog libraries in "all but one" (or "all but two", etc) languages makes this less palatable for them diff --git a/oteps/0002-remove-spandata.md b/oteps/0002-remove-spandata.md index 0368af39157..d128e06861c 100644 --- a/oteps/0002-remove-spandata.md +++ b/oteps/0002-remove-spandata.md @@ -1,7 +1,5 @@ # Remove SpanData -**Status:** `approved` - Remove and replace SpanData by adding span start and end options. ## Motivation @@ -24,7 +22,7 @@ I'd like to propose getting rid of SpanData and `tracer.recordSpanData()` and re ## Trade-offs and mitigations -From https://github.com/open-telemetry/opentelemetry-specification/issues/71: If the underlying SDK automatically adds tags to spans such as thread-id, stacktrace, and cpu-usage when a span is started, they would be incorrect for out of band spans as the tracer would not know the difference between in and out of band spans. This can be mitigated by indicating that the span is out of band to prevent attaching incorrect information, possibly with an `isOutOfBand()` option on `startSpan()`. 
+From <https://github.com/open-telemetry/opentelemetry-specification/issues/71>: If the underlying SDK automatically adds tags to spans such as thread-id, stacktrace, and cpu-usage when a span is started, they would be incorrect for out of band spans as the tracer would not know the difference between in and out of band spans. This can be mitigated by indicating that the span is out of band to prevent attaching incorrect information, possibly with an `isOutOfBand()` option on `startSpan()`. ## Prior art and alternatives @@ -38,7 +36,7 @@ There also seems to be some hidden dependency between SpanData and the sampler A We might want to include attributes as a start option to give the underlying sampler more information to sample with. We also might want to include optional events, which would be for bulk adding events with explicit timestamps. -We will also want to ensure, assuming the span or subtrace is being created in the same process, that the timestamps use the same precision and are monotonic. +We will also want to ensure, assuming the span or subtrace is being created in the same process, that the timestamps use the same precision and are monotonic. ## Related Issues diff --git a/oteps/0003-measure-metric-type.md b/oteps/0003-measure-metric-type.md index 5e93e5ea2fa..4f1d2a1ceb8 100644 --- a/oteps/0003-measure-metric-type.md +++ b/oteps/0003-measure-metric-type.md @@ -1,12 +1,10 @@ # Consolidate pre-aggregated and raw metrics APIs -**Status:** `approved` - -# Foreword +## Foreword A working group convened on 8/21/2019 to discuss and debate the two metrics RFCs (0003 and 0004) and several surrounding concerns. This document has been revised with related updates that were agreed upon during this working session. See the [meeting notes](https://docs.google.com/document/d/1d0afxe3J6bQT-I6UbRXeIYNcTIyBQv4axfjKF4yvAPA/edit#). -# Overview +## Overview Introduce a `Measure` kind of metric object that supports a `Record` API method. Like existing `Gauge` and `Cumulative` metrics, the new `Measure` metric supports pre-defined labels.
A new `RecordBatch` measurement API is introduced for recording multiple metric observations simultaneously. @@ -18,7 +16,7 @@ Since this document will be read in the future after the proposal has been writt The preceding specification used the term `TimeSeries` to describe an instrument bound with a set of pre-defined labels. In this document, [the term "Handle" is used to describe an instrument with bound labels](0009-metric-handles.md). In a future OTEP this will be again changed to "Bound instrument". The term "Handle" is used throughout this document to refer to a bound instrument. -# Motivation +## Motivation In the preceding `Metric.GetOrCreateTimeSeries` API for Gauges and Cumulatives, the caller obtains a `TimeSeries` handle for repeatedly recording metrics with certain pre-defined label values set. This enables an important optimization for exporting pre-aggregated metrics, since the implementation is able to compute the aggregate summary "entry" using a pointer or fast table lookup. The efficiency gain requires that the aggregation keys be a subset of the pre-defined labels. @@ -28,7 +26,7 @@ The preceding raw statistics API did not specify support for pre-defined labels. The preceding raw statistics API supported all-or-none recording for interdependent measurements using a common label set. This RFC introduces a `RecordBatch` API to support recording batches of measurements in a single API call, where a `Measurement` is now defined as a pair of `MeasureMetric` and `Value` (integer or floating point). -# Explanation +## Explanation The common use for `MeasureMetric`, like the preceding raw statistics API, is for reporting information about rates and distributions over structured, numerical event data. Measure metrics are the most general-purpose of metrics. 
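To make the proposed surface concrete, here is a minimal, self-contained Python sketch of `Record` on a `Measure` instrument and the batch-recording call described above. All names here (`Measure`, `Measurement`, `record_batch`) are illustrative stand-ins, not the final OpenTelemetry API:

```python
from dataclasses import dataclass, field

@dataclass
class Measure:
    """General-purpose instrument: each record() call is one metric event."""
    name: str
    values: list = field(default_factory=list)

    def record(self, value, labels=None):
        # One primary key=value (the metric name and a numerical value),
        # plus any number of secondary key=values (the labels).
        self.values.append((value, dict(labels or {})))

@dataclass
class Measurement:
    """A pair of measure instrument (not a Handle) and a recorded value."""
    instrument: Measure
    value: float

def record_batch(measurements, labels=None):
    """Record interdependent measurements together, under a common label set."""
    for m in measurements:
        m.instrument.record(m.value, labels)

latency = Measure("http.latency_ms")
size = Measure("http.request_bytes")
record_batch([Measurement(latency, 12.7), Measurement(size, 834.0)],
             labels={"route": "/home"})
```

The batch form makes the all-or-none intent explicit: both observations share the same labels and are recorded in a single call.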
Informally, the individual metric event has a logical format expressed as one primary key=value (the metric name and a numerical value) and any number of secondary key=values (the labels, resources, and context). @@ -72,7 +70,7 @@ Metric instrument Handles combine a metric instrument with a set of pre-defined By separation of API and implementation in OpenTelemetry, we know that an implementation is free to do _anything_ in response to a metric API call. By the low-level interpretation defined above, all metric events have the same structural representation, only their logical interpretation varies according to the metric definition. Therefore, we select metric kinds based on two primary concerns: 1. What should be the default implementation behavior? Unless configured otherwise, how should the implementation treat this metric variable? -1. How will the program source code read? Each metric uses a different verb, which helps convey meaning and describe default behavior. Cumulatives have an `Add()` method. Gauges have a `Set()` method. Measures have a `Record()` method. +2. How will the program source code read? Each metric uses a different verb, which helps convey meaning and describe default behavior. Cumulatives have an `Add()` method. Gauges have a `Set()` method. Measures have a `Record()` method. To guide the user in selecting the right kind of metric for an application, we'll consider the following questions about the primary intent of reporting given data. We use "of primary interest" here to mean information that is almost certainly useful in understanding system behavior. Consider these questions: @@ -106,7 +104,7 @@ For gauge metrics, the default OpenTelemetry implementation exports the last val Measure metrics express a distribution of measured values. This kind of metric should be used when the count or rate of events is meaningful and either: 1. The sum is of interest in addition to the count (rate) -1. Quantile information is of interest. +2. 
Quantile information is of interest. The key property of a measure metric event is that computing quantiles and/or summarizing a distribution (e.g., via a histogram) may be expensive. Not only will implementations have various capabilities and algorithms for this task, users may wish to control the quality and cost of aggregating measure metrics. @@ -135,7 +133,7 @@ Applications sometimes want to act upon multiple metric instruments in a single A single measurement is defined as: -- Instrument: the measure instrument (not a Handle) +- Instrument: the measure instrument (not a Handle) - Value: the recorded floating point or integer data The batch measurement API uses a language-specific method name (e.g., `RecordBatch`). The entire batch of measurements takes place within an (implicit or explicit) context. @@ -148,7 +146,7 @@ Prometheus supports the notion of vector metrics, which are those that support p ### `GetHandle` argument ordering -Argument ordering has been proposed as the way to pass pre-defined label values in `GetHandle`. The argument list must match the parameter list exactly, and if it doesn't we generally find out at runtime or not at all. This model has more optimization potential, but is easier to misuse than the alternative. The alternative approach is to always pass label:value pairs to `GetOrCreateTimeseries`, as opposed to an ordered list of values. +Argument ordering has been proposed as the way to pass pre-defined label values in `GetHandle`. The argument list must match the parameter list exactly, and if it doesn't we generally find out at runtime or not at all. This model has more optimization potential, but is easier to misuse than the alternative. The alternative approach is to always pass label:value pairs to `GetOrCreateTimeseries`, as opposed to an ordered list of values.
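The trade-off above can be illustrated with a small sketch (the helper names are hypothetical, not the real `GetHandle` signature): the ordered form is cheap but misaligned arguments fail late or never, while the label:value form binds each value to its key at the call site:

```python
LABEL_KEYS = ("service", "route")  # pre-defined label keys, in declaration order

def get_handle_ordered(*values):
    # Ordered form: values must line up with LABEL_KEYS positionally.
    # Swapping the "service" and "route" values is accepted silently.
    if len(values) != len(LABEL_KEYS):
        raise ValueError("label value count mismatch")
    return dict(zip(LABEL_KEYS, values))

def get_handle_named(**labels):
    # label:value form: each value is tied to its key at the call site,
    # and an unknown (e.g. misspelled) key fails immediately.
    unknown = set(labels) - set(LABEL_KEYS)
    if unknown:
        raise ValueError(f"unknown label keys: {sorted(unknown)}")
    return labels

# Both produce the same handle key, but only the named form
# catches a misspelled label at the call site.
assert get_handle_ordered("checkout", "/pay") == get_handle_named(service="checkout", route="/pay")
```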
### `RecordBatch` argument ordering diff --git a/oteps/0005-global-init.md b/oteps/0005-global-init.md index e68c4a51e6a..6c8ec8e0378 100644 --- a/oteps/0005-global-init.md +++ b/oteps/0005-global-init.md @@ -1,6 +1,6 @@ # Global SDK initialization -*Status: proposed* +**Status**: proposed Specify the behavior of OpenTelemetry APIs and implementations at startup. diff --git a/oteps/0006-sampling.md b/oteps/0006-sampling.md index fd39bd4dc72..92ec4c9573d 100644 --- a/oteps/0006-sampling.md +++ b/oteps/0006-sampling.md @@ -1,14 +1,14 @@ # Sampling API -*Status: approved* - ## TL;DR + This section tries to summarize all the changes proposed in this RFC: + 1. Move the `Sampler` interface from the API to SDK package. Apply some minor changes to the `Sampler` API. - 1. Add capability to record `Attributes` that can be used for sampling decision during the `Span` + 2. Add capability to record `Attributes` that can be used for sampling decision during the `Span` creation time. - 1. Remove `addLink` APIs from the `Span` interface, and allow recording links only during the span + 3. Remove `addLink` APIs from the `Span` interface, and allow recording links only during the span construction time. ## Motivation @@ -50,19 +50,21 @@ OpenTelemetry according to their needs. Honeycomb ``` + ## Explanation We outline five different use cases (who may be overlapping sets of people), and how they should interact with OpenTelemetry: ### Library developer + Examples: gRPC, Express, Django developers. - * They must only depend upon the OpenTelemetry API and not upon the SDK. - * For testing only they may depend on the SDK with InMemoryExporter. - * They are shipping source code that will be linked into others' applications. - * They have no explicit runtime control over the application. - * They know some signal about what traces may be interesting (e.g. unusual control plane requests) +* They must only depend upon the OpenTelemetry API and not upon the SDK. 
+ * For testing only they may depend on the SDK with InMemoryExporter. +* They are shipping source code that will be linked into others' applications. +* They have no explicit runtime control over the application. +* They know some signal about what traces may be interesting (e.g. unusual control plane requests) or uninteresting (e.g. health-checks), but have to write fully generically. **Solution:** @@ -72,135 +74,153 @@ This is intentional to avoid premature optimizations, and it is based on the fac backwards incompatible compared to adding a new API. ### Infrastructure package/binary developer + Examples: HBase, Envoy developers. - * They are shipping self-contained binaries that may accept YAML or similar run-time configuration, - but are not expected to support extensibility/plugins beyond the default OpenTelemetry SDK, +* They are shipping self-contained binaries that may accept YAML or similar run-time configuration, + but are not expected to support extensibility/plugins beyond the default OpenTelemetry SDK, OpenTelemetry SDKTracer, and OpenTelemetry wire format exporter. - * They may have their own recommendations for sampling rates, but don't run the binaries in +* They may have their own recommendations for sampling rates, but don't run the binaries in production, only provide packaged binaries. So their sampling rate configs, and sampling strategies need to be a finite "built in" set from OpenTelemetry's SDK. - * They need to deal with upstream sampling decisions made by services that call them. +* They need to deal with upstream sampling decisions made by services that call them. **Solution:** - * Allow different sampling strategies by default in OpenTelemetry SDK, all configurable easily via + +* Allow different sampling strategies by default in OpenTelemetry SDK, all configurable easily via YAML or feature flags. See [default samplers](#default-samplers). 
### Application developer + These are the folks we've been thinking the most about for OpenTelemetry in general. - * They have full control over the OpenTelemetry implementation or SDK configuration. When using the +* They have full control over the OpenTelemetry implementation or SDK configuration. When using the SDK they can configure custom exporters, custom code/samplers, etc. - * They can choose to implement runtime configuration via a variety of means (e.g. baking in feature +* They can choose to implement runtime configuration via a variety of means (e.g. baking in feature flags, reading YAML files, etc.), or even configure the library in code. - * They make heavy usage of OpenTelemetry for instrumenting application-specific behavior, beyond +* They make heavy usage of OpenTelemetry for instrumenting application-specific behavior, beyond what may be provided by the libraries they use such as gRPC, Django, etc. **Solution:** - * Allow application developers to link in custom samplers or write their own when using the + +* Allow application developers to link in custom samplers or write their own when using the official SDK. - * These might include dynamic per-field sampling to achieve a target rate - (e.g. https://github.com/honeycombio/dynsampler-go) - * Sampling decisions are made within the start Span operation, after attributes relevant to the + * These might include dynamic per-field sampling to achieve a target rate + (e.g. <https://github.com/honeycombio/dynsampler-go>) +* Sampling decisions are made within the start Span operation, after attributes relevant to the span have been added to the Span start operation but before a concrete Span object exists (so that either a NoOpSpan can be made, or an actual Span instance can be produced depending upon the sampler's decision). - * Span.IsRecording() needs to be present to allow costly span attribute/log computation to be +* Span.IsRecording() needs to be present to allow costly span attribute/log computation to be skipped if the span is a NoOp span.
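The `Span.IsRecording()` point can be sketched as follows (simplified, hypothetical span classes; the real API shape is language-specific):

```python
class NoOpSpan:
    """Returned when the sampler decides not to record."""
    def is_recording(self):
        return False
    def set_attribute(self, key, value):
        pass  # silently dropped

class RecordingSpan:
    def __init__(self):
        self.attributes = {}
    def is_recording(self):
        return True
    def set_attribute(self, key, value):
        self.attributes[key] = value

def annotate(span, expensive_fn):
    # Guard the costly computation so NoOp spans never pay for it.
    if span.is_recording():
        span.set_attribute("debug.snapshot", expensive_fn())

calls = []
def expensive():
    calls.append(1)
    return "snapshot"

annotate(NoOpSpan(), expensive)   # expensive() is never invoked
recording = RecordingSpan()
annotate(recording, expensive)    # expensive() runs exactly once
```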
- + ### Application operator + Often the same people as the application developers, but not necessarily - - * They care about adjusting sampling rates and strategies to meet operational needs, debugging, + +* They care about adjusting sampling rates and strategies to meet operational needs, debugging, and cost. - + **Solution:** - * Use config files or feature flags written by the application developers to control the + +* Use config files or feature flags written by the application developers to control the application sampling logic. - * Use the config files to configure libraries and infrastructure package behavior. +* Use the config files to configure libraries and infrastructure package behavior. ### Telemetry infrastructure owner + They are the people who provide an implementation for the OpenTelemetry API by using the SDK with -custom `Exporter`s, `Sampler`s, hooks, etc. or by writing a custom implementation, as well as +custom `Exporter`s, `Sampler`s, hooks, etc. or by writing a custom implementation, as well as running the infrastructure for collecting exported traces. - * They care about a variety of things, including efficiency, cost effectiveness, and being able to +* They care about a variety of things, including efficiency, cost effectiveness, and being able to gather spans in a way that makes sense for them. **Solution:** - * Infrastructure owners receive information attached to the span, after sampling hooks have already + +* Infrastructure owners receive information attached to the span, after sampling hooks have already been run. ## Internal details -In Dapper based systems (or systems without a deferred sampling decision) all exported spans are + +In Dapper based systems (or systems without a deferred sampling decision) all exported spans are stored to the backend, thus some of these systems usually don't scale to a high volume of traces, -or the cost to store all the Spans may be too high. 
In order to support this use-case and to +or the cost to store all the Spans may be too high. In order to support this use-case and to ensure the quality of the data we send, OpenTelemetry needs to natively support sampling with some requirements: - * Send as many complete traces as possible. Sending just a subset of the spans from a trace is + +* Send as many complete traces as possible. Sending just a subset of the spans from a trace is less useful because in this case the interaction between the spans may be missing. - * Allow application operator to configure the sampling frequency. - -For new modern systems that need to collect all the Spans and later may or may not make a deferred -sampling decision, OpenTelemetry needs to natively support a way to configure the library to +* Allow application operator to configure the sampling frequency. + +For new modern systems that need to collect all the Spans and later may or may not make a deferred +sampling decision, OpenTelemetry needs to natively support a way to configure the library to collect and export all the Spans. This is possible (even though OpenTelemetry supports sampling) by setting a default config to always collect all the spans. ### Sampling flags + OpenTelemetry API has two flags/properties: - * `RecordEvents` - * This property is exposed in the `Span` interface (e.g. `Span.isRecordingEvents()`). - * If `true` the current `Span` records tracing events (attributes, events, status, etc.), + +* `RecordEvents` + * This property is exposed in the `Span` interface (e.g. `Span.isRecordingEvents()`). + * If `true` the current `Span` records tracing events (attributes, events, status, etc.), otherwise all tracing events are dropped. - * Users can use this property to determine if expensive trace events can be avoided. - * `SampledFlag` - * This flag is propagated via the `TraceOptions` to the child Spans (e.g. + * Users can use this property to determine if expensive trace events can be avoided. 
+* `SampledFlag` + * This flag is propagated via the `TraceOptions` to the child Spans (e.g. `TraceOptions.isSampled()`). For more details see the w3c definition [here][trace-flags]. - * In Dapper based systems this is equivalent to `Span` being `sampled` and exported. - + * In Dapper based systems this is equivalent to `Span` being `sampled` and exported. + The flag combination `SampledFlag == false` and `RecordEvents == true` means that the current `Span` -does record tracing events, but most likely the child `Span` will not. This combination is +does record tracing events, but most likely the child `Span` will not. This combination is necessary because: - * Allow users to control recording for individual Spans. - * OpenCensus has this to support z-pages, so we need to keep backwards compatibility. -The flag combination `SampledFlag == true` and `RecordEvents == false` can cause gaps in the +* Allow users to control recording for individual Spans. +* OpenCensus has this to support z-pages, so we need to keep backwards compatibility. + +The flag combination `SampledFlag == true` and `RecordEvents == false` can cause gaps in the distributed trace, and because of this OpenTelemetry API should NOT allow this combination. -It is safe to assume that users of the API should only access the `RecordEvents` property when +It is safe to assume that users of the API should only access the `RecordEvents` property when instrumenting code and never access `SampledFlag` unless used in context propagators. 
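A tiny sketch of the flag rules above (an illustrative helper, not a proposed API): the one combination the API should reject is `SampledFlag == true` with `RecordEvents == false`:

```python
def validate_span_flags(sampled: bool, record_events: bool) -> dict:
    # SampledFlag without RecordEvents would export a span with no data,
    # leaving gaps in the distributed trace, so it is disallowed.
    if sampled and not record_events:
        raise ValueError("SampledFlag=true with RecordEvents=false is not allowed")
    return {"sampled": sampled, "record_events": record_events}

validate_span_flags(True, True)    # sampled and recorded
validate_span_flags(False, True)   # locally recorded only (e.g. z-pages)
validate_span_flags(False, False)  # fully dropped
try:
    validate_span_flags(True, False)
    raise AssertionError("should have been rejected")
except ValueError:
    pass
```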
### Sampler interface + The interface for the Sampler class that is available only in the OpenTelemetry SDK: - * `TraceID` - * `SpanID` - * Parent `SpanContext` if any - * `Links` - * Span name - * `SpanKind` - * Initial set of `Attributes` for the `Span` being constructed + +* `TraceID` +* `SpanID` +* Parent `SpanContext` if any +* `Links` +* Span name +* `SpanKind` +* Initial set of `Attributes` for the `Span` being constructed It produces an output called `SamplingResult` that includes: - * A `SamplingDecision` enum [`NOT_RECORD`, `RECORD`, `RECORD_AND_PROPAGATE`]. - * A set of span Attributes that will also be added to the `Span`. - * These attributes will be added after the initial set of `Attributes`. - * (under discussion in separate RFC) the SamplingRate float. - + +* A `SamplingDecision` enum [`NOT_RECORD`, `RECORD`, `RECORD_AND_PROPAGATE`]. +* A set of span Attributes that will also be added to the `Span`. + * These attributes will be added after the initial set of `Attributes`. +* (under discussion in separate RFC) the SamplingRate float. + ### Default Samplers + These are the default samplers implemented in the OpenTelemetry SDK: - * ALWAYS_ON - * ALWAYS_OFF - * ALWAYS_PARENT - * Trust parent sampling decision (trusting and propagating parent `SampledFlag`). - * For root Spans (no parent available) returns `NOT_RECORD`. - * Probability - * Allows users to configure to ignore the parent `SampledFlag`. - * Allows users to configure if probability applies only for "root spans", "root spans and remote + +* ALWAYS_ON +* ALWAYS_OFF +* ALWAYS_PARENT + * Trust parent sampling decision (trusting and propagating parent `SampledFlag`). + * For root Spans (no parent available) returns `NOT_RECORD`. +* Probability + * Allows users to configure to ignore the parent `SampledFlag`. + * Allows users to configure if probability applies only for "root spans", "root spans and remote parent", or "all spans". - * Default is to apply only for "root spans and remote parent". 
- * Remote parent property should be added to the SpanContext see specs [PR/216][specs-pr-216] - * Sample with 1/N probability - + * Default is to apply only for "root spans and remote parent". + * Remote parent property should be added to the SpanContext; see specs [PR/216][specs-pr-216] + * Sample with 1/N probability + **Root Span Decision:** |Sampler|RecordEvents|SampledFlag| @@ -220,26 +240,31 @@ These are the default samplers implemented in the OpenTelemetry SDK: |Probability|`Same as SampledFlag`|`ParentSampledFlag OR Probability`| ### Links + This RFC proposes that Links will be recorded only during the start `Span` operation, because: + * Link's `SampledFlag` can be used in the sampling decision. * OpenTracing supports adding references only during the `Span` creation. * OpenCensus supports adding links at any moment, but this was mostly used to record child Links -* Allowing links to be recorded after the sampling decision is made will cause samplers to not +* Allowing links to be recorded after the sampling decision is made will cause samplers to not work correctly and lead to unexpected sampling behavior. -### When does sampling happen? +### When does sampling happen + The sampling decision will happen before a real `Span` object is returned to the user, because: - * If child spans are created they need to know the 'SampledFlag'. - * If `SpanContext` is propagated on the wire the 'SampledFlag' needs to be set. - * If user records any tracing event the `Span` object needs to know if the data are kept or not. + +* If child spans are created they need to know the 'SampledFlag'. +* If `SpanContext` is propagated on the wire the 'SampledFlag' needs to be set. +* If user records any tracing event the `Span` object needs to know if the data are kept or not.
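Putting the pieces above together, here is a rough Python sketch of a sampler being consulted at span start with the inputs listed in the Sampler interface section. The method name `should_sample`, its exact parameter list, and the trace-id hashing scheme are all illustrative assumptions, not the specified API:

```python
import enum
from dataclasses import dataclass, field

class SamplingDecision(enum.Enum):
    NOT_RECORD = 0
    RECORD = 1
    RECORD_AND_PROPAGATE = 2

@dataclass
class SamplingResult:
    decision: SamplingDecision
    attributes: dict = field(default_factory=dict)  # added after the initial set

class ProbabilitySampler:
    """Sample roughly 1/N of root spans; trust a sampled remote parent."""
    def __init__(self, rate: float):
        self.rate = rate

    def should_sample(self, trace_id, parent_sampled, name, kind, attributes, links):
        if parent_sampled:  # propagate the parent's decision
            return SamplingResult(SamplingDecision.RECORD_AND_PROPAGATE)
        # Deterministic per-trace decision: compare the trace id to rate * 2^64.
        if trace_id % (1 << 64) < int(self.rate * (1 << 64)):
            return SamplingResult(SamplingDecision.RECORD_AND_PROPAGATE,
                                  {"sampling.probability": self.rate})
        return SamplingResult(SamplingDecision.NOT_RECORD)

# The SDK would make this call during the start-span operation,
# before a concrete Span object exists.
sampler = ProbabilitySampler(rate=1.0)
result = sampler.should_sample(trace_id=123, parent_sampled=False, name="GET /",
                               kind="SERVER", attributes={}, links=[])
```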
It may be possible to always collect all the events until the sampling decision is made but this is an important optimization. There are two important use-cases to be considered: - * All information that may be used for sampling decisions are available at the moment when the + +* All information that may be used for sampling decisions are available at the moment when the logical `Span` operation should start. This is the most common case. - * Some information that may be used for sampling decision are NOT available at the moment when the +* Some information that may be used for sampling decision are NOT available at the moment when the logical `Span` operation should start (e.g. `http.route` may be determined later). The current [span creation logic][span-creation] facilitates the first use-case very well, but @@ -253,64 +278,69 @@ address the delayed sampling in a different RFC when that becomes a high priorit The SDK must call the `Sampler` every time a `Span` is created during the start span operation. **Alternative considerations:** - * We considered, to offer a delayed span construction mechanism: - * For languages where a `Builder` pattern is used to construct a `Span`, to allow users to + +* We considered offering a delayed span construction mechanism: + * For languages where a `Builder` pattern is used to construct a `Span`, to allow users to create a `Builder` where the start time of the Span is considered when the `Builder` is created. - * For languages where no intermediate object is used to construct a `Span`, to allow users maybe + * For languages where no intermediate object is used to construct a `Span`, to allow users maybe via a `StartSpanOption` object to start a `Span`. The `StartSpanOption` allows users to set all the start `Span` properties. - * Pros: - * Would resolve the second use-case posted above.
- * Cons: - * We could not identify too many real case examples for the second use-case and decided to + * Pros: + * Would resolve the second use-case posted above. + * Cons: + * We could not identify many real-world examples for the second use-case and decided to postpone the decision rather than make it prematurely. - * We considered, instead of requiring that sampling decision happens before the `Span` is +* We considered, instead of requiring that the sampling decision happens before the `Span` is created, to add an explicit `MakeSamplingDecision(SamplingHint)` on the `Span`. Attempts to create a child `Span`, or to access the `SpanContext` would fail if `MakeSamplingDecision()` had not yet been run. - * Pros: - * Simplifies the case when all the attributes that may be used for sampling are not available + * Pros: + * Simplifies the case when all the attributes that may be used for sampling are not available when the logical `Span` operation should start. - * Cons: - * The most common case would have required an extra API call. - * Error prone, users may forget to call the extra API. - * Unexpected and hard to find errors if user tries to create a child `Span` before calling + * Cons: + * The most common case would have required an extra API call. + * Error-prone: users may forget to call the extra API. + * Unexpected and hard-to-find errors if the user tries to create a child `Span` before calling MakeSamplingDecision(). - * We considered allowing the sampling decision to be arbitrarily delayed, but guaranteed before +* We considered allowing the sampling decision to be arbitrarily delayed, but guaranteed to happen before any child `Span` is created, or `SpanContext` is accessed, or before `Span.end()` finished. - * Pros: - * Similar and smaller API that supports both use-cases defined ahead.
+ * Cons: + * If `SamplingHint` also needs to be recorded with a delay, then an extra API on `Span` is required to set this. - * Does not allow optimization to not record tracing events, all tracing events MUST be + * Does not allow the optimization of not recording tracing events; all tracing events MUST be recorded before the sampling decision is made. ## Prior art and alternatives + Prior art for Zipkin and other Dapper-based systems: all client-side sampling decisions are made at head. Thus, we need to retain compatibility with this. ## Open questions + This RFC does not necessarily resolve the question of how to propagate sampling rate values between different spans and processes. A separate RFC will be opened to cover this case. ## Future possibilities + In the future, we propose that library developers may be able to defer the decision on whether to recommend the trace be sampled or not sampled until mid-way through execution. ## Related Issues - * [opentelemetry-specification/189](https://github.com/open-telemetry/opentelemetry-specification/issues/189) - * [opentelemetry-specification/187](https://github.com/open-telemetry/opentelemetry-specification/issues/187) - * [opentelemetry-specification/164](https://github.com/open-telemetry/opentelemetry-specification/issues/164) - * [opentelemetry-specification/125](https://github.com/open-telemetry/opentelemetry-specification/issues/125) - * [opentelemetry-specification/87](https://github.com/open-telemetry/opentelemetry-specification/issues/87) - * [opentelemetry-specification/66](https://github.com/open-telemetry/opentelemetry-specification/issues/66) - * [opentelemetry-specification/65](https://github.com/open-telemetry/opentelemetry-specification/issues/65) - * [opentelemetry-specification/53](https://github.com/open-telemetry/opentelemetry-specification/issues/53) - * [opentelemetry-specification/33](https://github.com/open-telemetry/opentelemetry-specification/issues/33) - *
[opentelemetry-specification/32](https://github.com/open-telemetry/opentelemetry-specification/issues/32) - * [opentelemetry-specification/31](https://github.com/open-telemetry/opentelemetry-specification/issues/31) - -[trace-flags]: https://github.com/w3c/trace-context/blob/master/spec/20-http_header_format.md#trace-flags + +* [opentelemetry-specification/189](https://github.com/open-telemetry/opentelemetry-specification/issues/189) +* [opentelemetry-specification/187](https://github.com/open-telemetry/opentelemetry-specification/issues/187) +* [opentelemetry-specification/164](https://github.com/open-telemetry/opentelemetry-specification/issues/164) +* [opentelemetry-specification/125](https://github.com/open-telemetry/opentelemetry-specification/issues/125) +* [opentelemetry-specification/87](https://github.com/open-telemetry/opentelemetry-specification/issues/87) +* [opentelemetry-specification/66](https://github.com/open-telemetry/opentelemetry-specification/issues/66) +* [opentelemetry-specification/65](https://github.com/open-telemetry/opentelemetry-specification/issues/65) +* [opentelemetry-specification/53](https://github.com/open-telemetry/opentelemetry-specification/issues/53) +* [opentelemetry-specification/33](https://github.com/open-telemetry/opentelemetry-specification/issues/33) +* [opentelemetry-specification/32](https://github.com/open-telemetry/opentelemetry-specification/issues/32) +* [opentelemetry-specification/31](https://github.com/open-telemetry/opentelemetry-specification/issues/31) + +[trace-flags]: https://github.com/w3c/trace-context/blob/master/spec/20-http_request_header_format.md#trace-flags [specs-pr-216]: https://github.com/open-telemetry/opentelemetry-specification/pull/216 -[span-creation]: https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/api-tracing.md#span-creation +[span-creation]: 
https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/api.md#span-creation diff --git a/oteps/0007-no-out-of-band-reporting.md b/oteps/0007-no-out-of-band-reporting.md index 36ede4e40e8..286b86e19a3 100644 --- a/oteps/0007-no-out-of-band-reporting.md +++ b/oteps/0007-no-out-of-band-reporting.md @@ -1,44 +1,49 @@ # Remove support to report out-of-band telemetry from the API -*Status: approved* - ## TL;DR + This section tries to summarize all the changes proposed in this RFC: + 1. Remove API requirement to support reporting out-of-band telemetry. -1. Move Resource to SDK, API will always report telemetry for the current application so no need to +2. Move Resource to SDK; the API will always report telemetry for the current application, so there is no need to allow configuring the Resource in any instrumentation. -1. New APIs should be designed without this requirement. +3. New APIs should be designed without this requirement. ## Motivation + Currently the API package is designed with a goal to support reporting out-of-band telemetry, but this requirement forces a lot of trade-offs and unnecessarily complicated APIs (e.g. `Resource` must be exposed in the API package to allow telemetry to be associated with the source of the telemetry). Reporting out-of-band telemetry is a requirement for the OpenTelemetry ecosystem, but this can be done via a few other options that do not require the API package: + * The OpenTelemetry Service: users can write a simple [receiver][otelsvc-receiver] that parses and produces the OpenTelemetry data. * Using the SDK's exporter framework, users can write OpenTelemetry data directly. ## Internal details + Here is a list of decisions and trade-offs related to supporting out-of-band reporting: + 1. Add `Resource` concept into the API. * Example: in the create metric we need to allow users to specify the resource; see [here][create-metric].
The developer that writes the instrumentation has no knowledge about where the monitored resource is deployed, so there is no way to configure the right resource. -1. [RFC](./0002-remove-spandata.md) removes support to report SpanData. +2. [RFC](./0002-remove-spandata.md) removes support to report SpanData. * This will require that the trace API has to support all the possible fields to be configured via the API; for example, we need to allow users to set a pre-generated `SpanId`, which can be avoided if we do not support out-of-band reporting. -1. Sampling logic for out-of-band spans will get very complicated because it will be incorrect to +3. Sampling logic for out-of-band spans will get very complicated because it will be incorrect to sample these data. -1. Associating the source of the telemetry with the telemetry data gets very simple. All data +4. Associating the source of the telemetry with the telemetry data gets very simple. All data produced by one instance of the API implementation belongs to only one Application. This can be rephrased as "one API implementation instance" can report telemetry about only the current Application. ### Resource changes + This RFC does not suggest removing the `Resource` concept or modifying any API in this interface; it only suggests moving this concept to the SDK level. @@ -48,8 +53,9 @@ Application running (e.g. Java application server), every application will have an instance configured with its own `Resource`.
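The "one API implementation instance reports only for the current application" rule can be sketched in a few lines. This is illustrative code with hypothetical types, not the proposed API: the `Resource` lives only on the SDK side, configured once at setup time and stamped onto exported data, so no API call ever needs a `Resource` parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ResourceInSdk {

    // Hypothetical SDK-side types; the API surface below takes no Resource at all.
    static final class Resource {
        final Map<String, String> labels;
        Resource(Map<String, String> labels) { this.labels = labels; }
    }

    static final class SpanData {
        final String name;
        final Resource resource;
        SpanData(String name, Resource resource) { this.name = name; this.resource = resource; }
    }

    // One SDK instance reports telemetry for exactly one application,
    // so the Resource is configured once, when the SDK is set up.
    static final class SdkTracer {
        private final Resource resource;
        final List<SpanData> exported = new ArrayList<>();

        SdkTracer(Resource resource) { this.resource = resource; }

        // What instrumentation calls: no Resource parameter anywhere.
        void endSpan(String name) {
            exported.add(new SpanData(name, resource));
        }
    }

    public static void main(String[] args) {
        SdkTracer tracer = new SdkTracer(new Resource(Map.of("service.name", "checkout")));
        tracer.endSpan("GET /cart");
        System.out.println(tracer.exported.get(0).resource.labels);
    }
}
```

In a multi-application host (the Java application server case above), each application simply gets its own `SdkTracer` instance configured with its own `Resource`.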
## Related Issues - * [opentelemetry-specification/62](https://github.com/open-telemetry/opentelemetry-specification/issues/62) - * [opentelemetry-specification/61](https://github.com/open-telemetry/opentelemetry-specification/issues/61) - + +* [opentelemetry-specification/62](https://github.com/open-telemetry/opentelemetry-specification/issues/62) +* [opentelemetry-specification/61](https://github.com/open-telemetry/opentelemetry-specification/issues/61) + [otelsvc-receiver]: https://github.com/open-telemetry/opentelemetry-service#config-receivers -[create-metric]: https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/api-metrics.md#create-metric +[create-metric]: https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/metrics/api.md#create-metric diff --git a/oteps/0009-metric-handles.md b/oteps/0009-metric-handles.md index d1cd689541e..a8370d5a335 100644 --- a/oteps/0009-metric-handles.md +++ b/oteps/0009-metric-handles.md @@ -1,7 +1,5 @@ # Metric Handle API specification -**Status:** `accepted` - Specify the behavior of the Metrics API "Handle" type, for efficient repeated-use of metric instruments. ## Motivation @@ -37,4 +35,3 @@ OpenCensus has the notion of a metric attachment, allowing the application to in [Agreements reached on handles and naming in the working group convened on 8/21/2019](https://docs.google.com/document/d/1d0afxe3J6bQT-I6UbRXeIYNcTIyBQv4axfjKF4yvAPA/edit#). [`record` should take a generic `Attachment` class instead of having tracing dependency](https://github.com/open-telemetry/opentelemetry-specification/issues/144) - diff --git a/oteps/0010-cumulative-to-counter.md b/oteps/0010-cumulative-to-counter.md index 391e7d92c45..1b6bc95b1f8 100644 --- a/oteps/0010-cumulative-to-counter.md +++ b/oteps/0010-cumulative-to-counter.md @@ -1,7 +1,5 @@ # Rename "Cumulative" to "Counter" in the metrics API -**Status:** `approved` - Prefer the name "Counter" as opposed to "Cumulative". 
## Motivation @@ -39,7 +37,7 @@ It is possible that reducing all of these cases into the broad term "Counter" cr ## Internal details -Simply replace every "Cumulative" with "Counter", then edit for grammar. +Simply replace every "Cumulative" with "Counter", then edit for grammar. ## Prior art and alternatives diff --git a/oteps/0016-named-tracers.md b/oteps/0016-named-tracers.md index 1268da44a57..e85bad15976 100644 --- a/oteps/0016-named-tracers.md +++ b/oteps/0016-named-tracers.md @@ -1,7 +1,5 @@ # Named Tracers and Meters -**Status:** `approved` - _Associate Tracers and Meters with the name and version of the instrumentation library which reports telemetry data by parameterizing the API which the library uses to acquire the Tracer or Meter._ ## Suggested reading @@ -20,7 +18,7 @@ For an operator of an application using OpenTelemetry, there is currently no way ### Instrumentation library identification -If an instrumentation library hasn't implemented [semantic conventions](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/data-semantic-conventions.md) correctly or those conventions change over time, it's currently hard to interpret and sanitize data produced by it selectively. The produced Spans or Metrics cannot later be associated with the library which reported them, either in the processing pipeline or the backend. +If an instrumentation library hasn't implemented [semantic conventions](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/overview.md#semantic-conventions) correctly or those conventions change over time, it's currently hard to interpret and sanitize data produced by it selectively. The produced Spans or Metrics cannot later be associated with the library which reported them, either in the processing pipeline or the backend. 
### Disable instrumentation of pre-instrumented libraries @@ -29,8 +27,9 @@ It is the eventual goal of OpenTelemetry that library vendors implement the Open ## Solution This proposal attempts to solve the stated problems by introducing the concept of: - * _Named Tracers and Meters_ which are associated with the **name** (e.g. _"io.opentelemetry.contrib.mongodb"_) and **version** (e.g._"semver:1.0.0"_) of the library which acquired them. - * A `TracerProvider` / `MeterProvider` as the only means of acquiring a Tracer or Meter. + +* _Named Tracers and Meters_ which are associated with the **name** (e.g. _"io.opentelemetry.contrib.mongodb"_) and **version** (e.g._"semver:1.0.0"_) of the library which acquired them. +* A `TracerProvider` / `MeterProvider` as the only means of acquiring a Tracer or Meter. Based on the name and version, a Provider could provide a no-op Tracer or Meter to specific instrumentation libraries, or a Sampler could be implemented that discards Spans or Metrics from certain libraries. Also, by providing custom Exporters, Span or Metric data could be sanitized before it gets processed in a back-end system. However, this is beyond the scope of this proposal, which only provides the fundamental mechanisms. @@ -38,7 +37,8 @@ Based on the name and version, a Provider could provide a no-op Tracer or Meter From a user perspective, working with *Named Tracers / Meters* and `TracerProvider` / `MeterProvider` is conceptually similar to how e.g. the [Java logging API](https://docs.oracle.com/javase/7/docs/api/java/util/logging/Logger.html#getLogger(java.lang.String)) and logging frameworks like [log4j](https://www.slf4j.org/apidocs/org/slf4j/LoggerFactory.html) work. In analogy to requesting Logger objects through LoggerFactories, an instrumentation library would create specific Tracer / Meter objects through a TracerProvider / MeterProvider. -New Tracers or Meters can be created by providing the name and version of an instrumentation library. 
The version (following the convention proposed in https://github.com/open-telemetry/oteps/pull/38) is basically optional but *should* be supplied since only this information enables following scenarios: +New Tracers or Meters can be created by providing the name and version of an instrumentation library. The version (following the convention proposed in <https://github.com/open-telemetry/oteps/pull/38>) is basically optional but *should* be supplied, since only this information enables the following scenarios: + * Only a specific range of versions of a given instrumentation library need to be suppressed, while other versions are allowed (e.g. due to a bug in those specific versions). * Go modules allow multiple versions of the same middleware in a single build so those need to be determined at runtime.
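The suppression scenario above can be sketched with a toy provider. This is illustrative code only (hypothetical types, not the OpenTelemetry API): a provider hands out a no-op Tracer whenever the requesting library's (name, version) pair matches a suppression rule, so only the buggy version range is silenced.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiPredicate;

public class NamedTracers {

    interface Tracer { boolean isRecording(); }

    static final Tracer NOOP = () -> false;
    static final Tracer RECORDING = () -> true;

    // Provider that hands out a no-op Tracer for suppressed (name, version) pairs.
    static final class TracerProvider {
        private final BiPredicate<String, String> suppressed;
        private final Map<String, Tracer> tracers = new ConcurrentHashMap<>();

        TracerProvider(BiPredicate<String, String> suppressed) {
            this.suppressed = suppressed;
        }

        Tracer get(String name, String version) {
            return tracers.computeIfAbsent(name + "@" + version,
                k -> suppressed.test(name, version) ? NOOP : RECORDING);
        }
    }

    public static void main(String[] args) {
        // Suppress only the 1.3.x releases of one instrumentation library.
        TracerProvider provider = new TracerProvider(
            (name, version) -> name.equals("io.opentelemetry.contrib.mongodb")
                && version.startsWith("semver:1.3."));
        System.out.println(
            provider.get("io.opentelemetry.contrib.mongodb", "semver:1.3.2").isRecording());
    }
}
```

Because the decision is keyed on the acquiring library's own name and version, other versions of the same library, and all other libraries, keep recording normally.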
In cases where the library is instrumenting itself using the OpenTelemetry API, they may be the same. example: If the `http` version `semver:3.0.0` library is being instrumented by a library with the name `io.opentelemetry.contrib.http` and version `semver:1.3.2`, then the tracer name and version are also `io.opentelemetry.contrib.http` and `semver:1.3.2`. If that same `http` library has built-in instrumentation through use of the OpenTelemetry API, then the tracer name and version would be `http` and `semver:3.0.0`. -#### Meter namespace +### Meter namespace + Meter name is used as a namespace for all metrics created by it. This allows a telemetry library to register a metric using any name, such as `latency`, without worrying about collisions with a metric registered under the same name by a different library. example: The libraries `redis` and `io.opentelemetry.redis` may both register metrics with the name `latency`. These metrics can still be uniquely identified even though they have the same name because they are registered under different namespaces (`redis` and `io.opentelemetry.redis` respectively). In this case, the operator may disable one of these metrics because they are measuring the same thing. @@ -115,7 +119,7 @@ Overall, this would not change a lot compared to the `TracerProvider` since the Instead of setting the `component` property based on the given Tracer names, those names could also be used as *prefixes* for produced span names (e.g. ``). However, with regard to data quality and semantic conventions, a dedicated `component` set on spans is probably preferred. -Instead of using plain strings as an argument for creating new Tracers, a `Resource` identifying an instrumentation library could be used. Such resources must have a _version_ and a _name_ label (there could be semantic convention definitions for those labels). 
This implementation alternative mainly depends on the availability of the `Resource` data type on an API level (see https://github.com/open-telemetry/opentelemetry-specification/pull/254). +Instead of using plain strings as an argument for creating new Tracers, a `Resource` identifying an instrumentation library could be used. Such resources must have a _version_ and a _name_ label (there could be semantic convention definitions for those labels). This implementation alternative mainly depends on the availability of the `Resource` data type on an API level (see <https://github.com/open-telemetry/opentelemetry-specification/pull/254>). ```java // Create resource for given instrumentation library information (name + version) diff --git a/oteps/0035-opentelemetry-protocol.md b/oteps/0035-opentelemetry-protocol.md index 55c0110455b..49bd39adc62 100644 --- a/oteps/0035-opentelemetry-protocol.md +++ b/oteps/0035-opentelemetry-protocol.md @@ -1,38 +1,35 @@ # OpenTelemetry Protocol Specification -_Author: Tigran Najaryan, Omnition Inc._ - -**Status:** `approved` +**Author**: Tigran Najaryan, Omnition Inc. OpenTelemetry Protocol (OTLP) specification describes the encoding, transport and delivery mechanism of telemetry data between telemetry sources, intermediate nodes such as collectors and telemetry backends.
## Table of Contents - - [Motivation](#motivation) - - [Protocol Details](#protocol-details) - - [Export Request and Response](#export-request-and-response) - - [OTLP over gRPC](#otlp-over-grpc) - - [Export Response](#export-response) - - [Throttling](#throttling) - - [gRPC Service Definition](#grpc-service-definition) - - [Other Transports](#other-transports) - - [Implementation Recommendations](#implementation-recommendations) - - [Multi-Destination Exporting](#multi-destination-exporting) - - [Trade-offs and mitigations](#trade-offs-and-mitigations) - - [Request Acknowledgements](#request-acknowledgements) - - [Duplicate Data](#duplicate-data) - - [Partial Success](#partial-success) - - [Future Versions and Interoperability](#future-versions-and-interoperability) - - [Prior Art, Alternatives and Future Possibilities](#prior-art-alternatives-and-future-possibilities) - - [Open Questions](#open-questions) - - [Appendix A - Protocol Buffer Definitions](#appendix-a---protocol-buffer-definitions) - - [Appendix B - Performance Benchmarks](#appendix-b---performance-benchmarks) - - [Throughput - Sequential vs Concurrent](#throughput---sequential-vs-concurrent) - - [CPU Usage - gRPC vs WebSocket/Experimental](#cpu-usage---grpc-vs-websocketexperimental) - - [Benchmarking Raw Results](#benchmarking-raw-results) - - [Glossary](#glossary) - - [Acknowledgements](#acknowledgements) - +- [Motivation](#motivation) +- [Protocol Details](#protocol-details) + - [Export Request and Response](#export-request-and-response) + - [OTLP over gRPC](#otlp-over-grpc) + - [Export Response](#export-response) + - [Throttling](#throttling) + - [gRPC Service Definition](#grpc-service-definition) + - [Other Transports](#other-transports) +- [Implementation Recommendations](#implementation-recommendations) + - [Multi-Destination Exporting](#multi-destination-exporting) +- [Trade-offs and mitigations](#trade-offs-and-mitigations) + - [Request Acknowledgements](#request-acknowledgements) + - 
[Duplicate Data](#duplicate-data) + - [Partial Success](#partial-success) +- [Future Versions and Interoperability](#future-versions-and-interoperability) +- [Prior Art, Alternatives and Future Possibilities](#prior-art-alternatives-and-future-possibilities) +- [Open Questions](#open-questions) +- [Appendix A - Protocol Buffer Definitions](#appendix-a---protocol-buffer-definitions) +- [Appendix B - Performance Benchmarks](#appendix-b---performance-benchmarks) + - [Throughput - Sequential vs Concurrent](#throughput---sequential-vs-concurrent) + - [CPU Usage - gRPC vs WebSocket/Experimental](#cpu-usage---grpc-vs-websocketexperimental) + - [Benchmarking Raw Results](#benchmarking-raw-results) +- [Glossary](#glossary) +- [Acknowledgements](#acknowledgements) ## Motivation @@ -319,7 +316,7 @@ Benchmarking of OTLP vs other telemetry protocols was done using [reference impl ### Throughput - Sequential vs Concurrent -Using 20 concurrent requests shows the following throughput advantage in benchmarks compared to sequential for various values of network roundtrip latency: +Using 20 concurrent requests shows the following throughput advantage in benchmarks compared to sequential for various values of network roundtrip latency: ``` +-----------+-----------------------+ diff --git a/oteps/0038-version-semantic-attribute.md b/oteps/0038-version-semantic-attribute.md index 2c72fcf006f..9c7f28797b4 100644 --- a/oteps/0038-version-semantic-attribute.md +++ b/oteps/0038-version-semantic-attribute.md @@ -1,7 +1,5 @@ # Version Semantic Attribute -**Status:** `approved` - Add a standard `version` semantic attribute. 
## Motivation diff --git a/oteps/0049-metric-label-set.md b/oteps/0049-metric-label-set.md index bc1d97764ec..b2c7096aa69 100644 --- a/oteps/0049-metric-label-set.md +++ b/oteps/0049-metric-label-set.md @@ -1,7 +1,5 @@ # Metric `LabelSet` specification -**Status:** `proposed` - Introduce a first-class `LabelSet` API type as a handle on a pre-defined set of labels for the Metrics API. ## Motivation @@ -20,7 +18,7 @@ Metric instrument APIs which presently take labels in the form `{ Key: Value, .. var ( cumulative = metric.NewFloat64Cumulative("my_counter") gauge = metric.NewFloat64Gauge("my_gauge") -) +) ``` Use a `LabelSet` to construct multiple Handles: diff --git a/oteps/0059-otlp-trace-data-format.md b/oteps/0059-otlp-trace-data-format.md index bca74d813f4..eb72afb8630 100644 --- a/oteps/0059-otlp-trace-data-format.md +++ b/oteps/0059-otlp-trace-data-format.md @@ -1,8 +1,6 @@ # OTLP Trace Data Format -_Author: Tigran Najaryan, Splunk_ - -**Status:** `approved` +**Author**: Tigran Najaryan, Splunk OTLP Trace Data Format specification describes the structure of the trace data that is transported by OpenTelemetry Protocol (RFC0035). @@ -336,32 +334,34 @@ One of the original aspiring goals for OTLP was to _"support very fast pass-thro The following shows [benchmarking of encoding/decoding in Go](https://github.com/tigrannajaryan/exp-otelproto/) using various schemas. Legend: + - OpenCensus - OpenCensus protocol schema. - OTLP/AttrMap - OTLP schema using map for attributes. - OTLP/AttrList - OTLP schema using list of key/values for attributes and with reduced nesting for values. - OTLP/AttrList/TimeWrapped - Same as OTLP/AttrList, except using google.protobuf.Timestamp instead of int64 for timestamps. Suffixes: + - Attributes - a span with 3 attributes. - TimedEvent - a span with 3 timed events. 
``` -BenchmarkEncode/OpenCensus/Attributes-8 10 605614915 ns/op -BenchmarkEncode/OpenCensus/TimedEvent-8 10 1025026687 ns/op -BenchmarkEncode/OTLP/AttrAsMap/Attributes-8 10 519539723 ns/op -BenchmarkEncode/OTLP/AttrAsMap/TimedEvent-8 10 841371163 ns/op -BenchmarkEncode/OTLP/AttrAsList/Attributes-8 50 128790429 ns/op -BenchmarkEncode/OTLP/AttrAsList/TimedEvent-8 50 175874878 ns/op -BenchmarkEncode/OTLP/AttrAsList/TimeWrapped/Attributes-8 50 153184772 ns/op -BenchmarkEncode/OTLP/AttrAsList/TimeWrapped/TimedEvent-8 30 232705272 ns/op -BenchmarkDecode/OpenCensus/Attributes-8 10 644103382 ns/op -BenchmarkDecode/OpenCensus/TimedEvent-8 5 1132059855 ns/op -BenchmarkDecode/OTLP/AttrAsMap/Attributes-8 10 529679038 ns/op -BenchmarkDecode/OTLP/AttrAsMap/TimedEvent-8 10 867364162 ns/op -BenchmarkDecode/OTLP/AttrAsList/Attributes-8 50 228834160 ns/op -BenchmarkDecode/OTLP/AttrAsList/TimedEvent-8 20 321160309 ns/op -BenchmarkDecode/OTLP/AttrAsList/TimeWrapped/Attributes-8 30 277597851 ns/op -BenchmarkDecode/OTLP/AttrAsList/TimeWrapped/TimedEvent-8 20 443386880 ns/op +BenchmarkEncode/OpenCensus/Attributes-8 10 605614915 ns/op +BenchmarkEncode/OpenCensus/TimedEvent-8 10 1025026687 ns/op +BenchmarkEncode/OTLP/AttrAsMap/Attributes-8 10 519539723 ns/op +BenchmarkEncode/OTLP/AttrAsMap/TimedEvent-8 10 841371163 ns/op +BenchmarkEncode/OTLP/AttrAsList/Attributes-8 50 128790429 ns/op +BenchmarkEncode/OTLP/AttrAsList/TimedEvent-8 50 175874878 ns/op +BenchmarkEncode/OTLP/AttrAsList/TimeWrapped/Attributes-8 50 153184772 ns/op +BenchmarkEncode/OTLP/AttrAsList/TimeWrapped/TimedEvent-8 30 232705272 ns/op +BenchmarkDecode/OpenCensus/Attributes-8 10 644103382 ns/op +BenchmarkDecode/OpenCensus/TimedEvent-8 5 1132059855 ns/op +BenchmarkDecode/OTLP/AttrAsMap/Attributes-8 10 529679038 ns/op +BenchmarkDecode/OTLP/AttrAsMap/TimedEvent-8 10 867364162 ns/op +BenchmarkDecode/OTLP/AttrAsList/Attributes-8 50 228834160 ns/op +BenchmarkDecode/OTLP/AttrAsList/TimedEvent-8 20 321160309 ns/op 
+BenchmarkDecode/OTLP/AttrAsList/TimeWrapped/Attributes-8 30 277597851 ns/op +BenchmarkDecode/OTLP/AttrAsList/TimeWrapped/TimedEvent-8 20 443386880 ns/op ``` The benchmark encodes/decodes 1000 batches of 100 spans, each span containing 3 attributes or 3 timed events. The total uncompressed, encoded size of each batch is around 20KBytes. diff --git a/oteps/0066-separate-context-propagation.md b/oteps/0066-separate-context-propagation.md index f8f8dae0be6..d7cea02f35f 100644 --- a/oteps/0066-separate-context-propagation.md +++ b/oteps/0066-separate-context-propagation.md @@ -1,4 +1,4 @@ -# Context Propagation: A Layered Approach +# Context Propagation: A Layered Approach * [Motivation](#Motivation) * [OpenTelemetry layered architecture](#OpenTelemetry-layered-architecture) @@ -23,272 +23,274 @@ ![drawing](img/0066_context_propagation_overview.png) -A proposal to refactor OpenTelemetry into a set of separate cross-cutting concerns which +A proposal to refactor OpenTelemetry into a set of separate cross-cutting concerns which operate on a shared context propagation mechanism. -# Motivation +## Motivation This RFC addresses the following topics: -**Separation of concerns** +### Separation of concerns + * Cleaner package layout results in an easier to learn system. It is possible to understand Context Propagation without needing to understand Observability. -* Allow for multiple types of context propagation, each self contained with - different rules. For example, TraceContext may be sampled, while +* Allow for multiple types of context propagation, each self contained with + different rules. For example, TraceContext may be sampled, while CorrelationContext never is. -* Allow the Observability and Context Propagation to have different defaults. - The Observability systems ships with a no-op implementation and a pluggable SDK, +* Allow the Observability and Context Propagation to have different defaults. 
+ The Observability system ships with a no-op implementation and a pluggable SDK, the context propagation system ships with a canonical, working implementation. -**Extensibility** -* A clean separation allows the context propagation mechanisms to be used on - their own, so they may be consumed by other systems which do not want to +### Extensibility + +* A clean separation allows the context propagation mechanisms to be used on + their own, so they may be consumed by other systems which do not want to depend on an observability tool for their non-observability concerns. -* Allow developers to create new applications for context propagation. For +* Allow developers to create new applications for context propagation. For example: A/B testing, authentication, and network switching. +## OpenTelemetry layered architecture -# OpenTelemetry layered architecture +The design of OpenTelemetry is based on the principles of [aspect-oriented +programming](https://en.wikipedia.org/wiki/Aspect-oriented_programming), +adapted to the needs of distributed systems. -The design of OpenTelemetry is based on the principles of [aspect-oriented -programming](https://en.wikipedia.org/wiki/Aspect-oriented_programming), -adopted to the needs of distributed systems. - -Some concerns "cut across" multiple abstractions in a program. Logging -exemplifies aspect orientation because a logging strategy necessarily affects -every logged part of the system. Logging thereby "cross-cuts" across all logged -classes and methods. Distributed tracing takes this strategy to the next level, -and cross-cuts across all classes and methods in all services in the entire -transaction. This requires a distributed form of the same aspect-oriented +Some concerns "cut across" multiple abstractions in a program. Logging +exemplifies aspect orientation because a logging strategy necessarily affects +every logged part of the system. Logging thereby "cross-cuts" across all logged +classes and methods.
Distributed tracing takes this strategy to the next level, +and cross-cuts across all classes and methods in all services in the entire +transaction. This requires a distributed form of the same aspect-oriented programming principles in order to be implemented cleanly. -OpenTelemetry approaches this by separating it's design into two layers. The top -layer contains a set of independent **cross-cutting concerns**, which intertwine -with a program's application logic and cannot be cleanly encapsulated. All -concerns share an underlying distributed **context propagation** layer, for -storing state and accessing data across the lifespan of a distributed +OpenTelemetry approaches this by separating its design into two layers. The top +layer contains a set of independent **cross-cutting concerns**, which intertwine +with a program's application logic and cannot be cleanly encapsulated. All +concerns share an underlying distributed **context propagation** layer, for +storing state and accessing data across the lifespan of a distributed transaction. +## Cross-Cutting Concerns -# Cross-Cutting Concerns +### Observability API -## Observability API -Distributed tracing is one example of a cross-cutting concern. Tracing code is -interleaved with regular code, and ties together independent code modules which -would otherwise remain encapsulated. Tracing is also distributed, and requires +Distributed tracing is one example of a cross-cutting concern. Tracing code is +interleaved with regular code, and ties together independent code modules which +would otherwise remain encapsulated. Tracing is also distributed, and requires transaction-level context propagation in order to execute correctly. -The various observability APIs are not described here directly.
However, in this new
-design, all observability APIs would be modified to make use of the generalized
-context propagation mechanism described below, rather than the tracing-specific
+The various observability APIs are not described here directly. However, in this new
+design, all observability APIs would be modified to make use of the generalized
+context propagation mechanism described below, rather than the tracing-specific
propagation system it uses today.
-Note that OpenTelemetry APIs calls should *always* be given access to the entire
-context object, and never just a subset of the context, such as the value in a
-single key. This allows the SDK to make improvements and leverage additional
+Note that OpenTelemetry API calls should *always* be given access to the entire
+context object, and never just a subset of the context, such as the value in a
+single key. This allows the SDK to make improvements and leverage additional
data that may be available, without changes to all of the call sites.
The following are notes on the API, and not meant as final.
**`StartSpan(context, options) -> context`**
-When a span is started, a new context is returned, with the new span set as the
-current span.
+When a span is started, a new context is returned, with the new span set as the
+current span.
**`GetSpanPropagator() -> (HTTP_Extractor, HTTP_Injector)`**
-When a span is extracted, the extracted value is stored in the context seprately
+When a span is extracted, the extracted value is stored in the context separately
from the current span.
+### Correlations API
-## Correlations API
-
-In addition to trace propagation, OpenTelemetry provides a simple mechanism for
-propagating indexes, called the Correlations API. Correlations are
-intended for indexing observability events in one service with attributes
-provided by a prior service in the same transaction. This helps to establish a
-causal relationship between these events.
For example, determining that a -particular browser version is associated with a failure in an image processing +In addition to trace propagation, OpenTelemetry provides a simple mechanism for +propagating indexes, called the Correlations API. Correlations are +intended for indexing observability events in one service with attributes +provided by a prior service in the same transaction. This helps to establish a +causal relationship between these events. For example, determining that a +particular browser version is associated with a failure in an image processing service. -The Correlations API is based on the [W3C Correlation-Context specification](https://w3c.github.io/correlation-context/), -and implements the protocol as it is defined in that working group. There are -few details provided here as it is outside the scope of this OTEP to finalize +The Correlations API is based on the [W3C Correlation-Context specification](https://w3c.github.io/correlation-context/), +and implements the protocol as it is defined in that working group. There are +few details provided here as it is outside the scope of this OTEP to finalize this API. -While Correlations can be used to prototype other cross-cutting concerns, this -mechanism is primarily intended to convey values for the OpenTelemetry -observability systems. +While Correlations can be used to prototype other cross-cutting concerns, this +mechanism is primarily intended to convey values for the OpenTelemetry +observability systems. -For backwards compatibility, OpenTracing Baggage is propagated as Correlations -when using the OpenTracing bridge. New concerns with different criteria should -be modeled separately, using the same underlying context propagation layer as +For backwards compatibility, OpenTracing Baggage is propagated as Correlations +when using the OpenTracing bridge. New concerns with different criteria should +be modeled separately, using the same underlying context propagation layer as building blocks. 
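To make the intended behavior concrete, here is an editorial sketch in Python of how correlations might behave over an immutable context. The function names and the dict-based context are illustrative assumptions, not the proposed API surface:

```python
# Illustrative only: a plain dict stands in for the immutable context, and
# the function names mirror the example API below, not any final design.

def set_correlation(context, key, value):
    # Return a *new* context containing the label; the input is unchanged.
    new_context = dict(context)
    new_context[("correlations", key)] = value
    return new_context

def get_correlation(context, key):
    # Look up a label set by a prior event in the same transaction,
    # or None if no such label was propagated.
    return context.get(("correlations", key))

# An upstream service records the client version; a downstream
# observability event can then be indexed by it.
ctx = set_correlation({}, "client-version", "v2.0")
print(get_correlation(ctx, "client-version"))  # prints: v2.0
```

This mirrors the causal-indexing example above: the browser or client version set early in the transaction is available when a later service records a failure.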
The following is an example API, and not meant as final. **`GetCorrelation(context, key) -> value`** -To access the value for a label set by a prior event, the Correlations API -provides a function which takes a context and a key as input, and returns a +To access the value for a label set by a prior event, the Correlations API +provides a function which takes a context and a key as input, and returns a value. **`SetCorrelation(context, key, value) -> context`** -To record the value for a label, the Correlations API provides a function which -takes a context, a key, and a value as input, and returns an updated context +To record the value for a label, the Correlations API provides a function which +takes a context, a key, and a value as input, and returns an updated context which contains the new value. **`RemoveCorrelation(context, key) -> context`** -To delete a label, the Correlations API provides a function -which takes a context and a key as input, and returns an updated context which +To delete a label, the Correlations API provides a function +which takes a context and a key as input, and returns an updated context which no longer contains the selected key-value pair. **`ClearCorrelations(context) -> context`** -To avoid sending any labels to an untrusted process, the Correlation API +To avoid sending any labels to an untrusted process, the Correlation API provides a function to remove all Correlations from a context. 
**`GetCorrelationPropagator() -> (HTTP_Extractor, HTTP_Injector)`**
-To deserialize the previous labels set by prior processes, and to serialize the
-current total set of labels and send them to the next process, the Correlations
-API provides a function which returns a Correlation-specific implementation of
+To deserialize the previous labels set by prior processes, and to serialize the
+current total set of labels and send them to the next process, the Correlations
+API provides a function which returns a Correlation-specific implementation of
the `HTTPExtract` and `HTTPInject` functions found in the Propagation API.
-# Context Propagation
+## Context Propagation
-## Context API
+### Context API
-Cross-cutting concerns access data in-process using the same, shared context
-object. Each concern uses its own namespaced set of keys in the context,
+Cross-cutting concerns access data in-process using the same, shared context
+object. Each concern uses its own namespaced set of keys in the context,
containing all of the data for that cross-cutting concern.
The following is an example API, and not meant as final.
**`CreateKey(name) -> key`**
-To allow concerns to control access to their data, the Context API uses keys
-which cannot be guessed by third parties which have not been explicitly handed
-the key. It is recommended that concerns mediate data access via an API, rather
+To allow concerns to control access to their data, the Context API uses keys
+which cannot be guessed by third parties which have not been explicitly handed
+the key. It is recommended that concerns mediate data access via an API, rather
than provide direct public access to their keys.
**`GetValue(context, key) -> value`**
-To access the local state of an concern, the Context API provides a function
+To access the local state of a concern, the Context API provides a function
which takes a context and a key as input, and returns a value.
**`SetValue(context, key, value) -> context`** -To record the local state of a cross-cutting concern, the Context API provides a -function which takes a context, a key, and a value as input, and returns a -new context which contains the new value. Note that the new value is not present +To record the local state of a cross-cutting concern, the Context API provides a +function which takes a context, a key, and a value as input, and returns a +new context which contains the new value. Note that the new value is not present in the old context. **`RemoveValue(context, key) -> context`** -RemoveValue returns a new context with the key cleared. Note that the removed +RemoveValue returns a new context with the key cleared. Note that the removed value still remains present in the old context. +#### Optional: Automated Context Management -### Optional: Automated Context Management -When possible, the OpenTelemetry context should automatically be associated -with the program execution context. Note that some languages do not provide any -facility for setting and getting a current context. In these cases, the user is +When possible, the OpenTelemetry context should automatically be associated +with the program execution context. Note that some languages do not provide any +facility for setting and getting a current context. In these cases, the user is responsible for managing the current context. **`GetCurrent() -> context`** -To access the context associated with program execution, the Context API +To access the context associated with program execution, the Context API provides a function which takes no arguments and returns a Context. **`SetCurrent(context)`** -To associate a context with program execution, the Context API provides a +To associate a context with program execution, the Context API provides a function which takes a Context. 
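The properties described above, unguessable keys and value updates that never mutate prior contexts, can be sketched as follows. This is an editorial illustration assuming a plain-dict context, not the specified implementation:

```python
import uuid

def create_key(name):
    # A random component makes the key unguessable by third parties
    # that were never explicitly handed the key object.
    return (name, uuid.uuid4().hex)

def set_value(context, key, value):
    new_context = dict(context)  # copy-on-write: the old context is untouched
    new_context[key] = value
    return new_context

def get_value(context, key):
    return context.get(key)

def remove_value(context, key):
    new_context = dict(context)
    new_context.pop(key, None)   # the removed value remains in the old context
    return new_context

key = create_key("trace.currentSpanData")
old_ctx = {}
new_ctx = set_value(old_ctx, key, "span-1234")
assert get_value(new_ctx, key) == "span-1234"
assert get_value(old_ctx, key) is None           # prior context unchanged
assert get_value(remove_value(new_ctx, key), key) is None
```

Because contexts are immutable, handles to earlier contexts stay valid even after later updates, which is exactly the behavior the scope-of-current-context examples below rely on.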
-## Propagation API
+### Propagation API
-Cross-cutting concerns send their state to the next process via propagators:
-functions which read and write context into RPC requests. Each concern creates a
-set of propagators for every type of supported medium - currently only HTTP
+Cross-cutting concerns send their state to the next process via propagators:
+functions which read and write context into RPC requests. Each concern creates a
+set of propagators for every type of supported medium - currently only HTTP
requests.
The following is an example API, and not meant as final.
**`Extract(context, []http_extractor, headers) -> context`**
-In order to continue transmitting data injected earlier in the transaction,
-the Propagation API provides a function which takes a context, a set of
-HTTP_Extractors, and a set of HTTP headers, and returns a new context which
+In order to continue transmitting data injected earlier in the transaction,
+the Propagation API provides a function which takes a context, a set of
+HTTP_Extractors, and a set of HTTP headers, and returns a new context which
includes the state sent from the prior process.
**`Inject(context, []http_injector, headers) -> headers`**
-To send the data for all concerns to the next process in the transaction, the
-Propagation API provides a function which takes a context, a set of
-HTTP_Injectors, and adds the contents of the context in to HTTP headers to
+To send the data for all concerns to the next process in the transaction, the
+Propagation API provides a function which takes a context, a set of
+HTTP_Injectors, and a set of HTTP headers, and adds the contents of the context into those headers to
include an HTTP Header representation of the context.
**`HTTP_Extractor(context, headers) -> context`**
-Each concern must implement an HTTP_Extractor, which can locate the headers
-containing the http-formatted data, and then translate the contents into an
-in-memory representation, set within the returned context object.
+Each concern must implement an HTTP_Extractor, which can locate the headers +containing the http-formatted data, and then translate the contents into an +in-memory representation, set within the returned context object. **`HTTP_Injector(context, headers) -> headers`** -Each concern must implement an HTTP_Injector, which can take the in-memory -representation of its data from the given context object, and add it to an +Each concern must implement an HTTP_Injector, which can take the in-memory +representation of its data from the given context object, and add it to an existing set of HTTP headers. -### Optional: Global Propagators -It may be convenient to create a list of propagators during program -initialization, and then access these propagators later in the program. -To facilitate this, global injectors and extractors are optionally available. +#### Optional: Global Propagators + +It may be convenient to create a list of propagators during program +initialization, and then access these propagators later in the program. +To facilitate this, global injectors and extractors are optionally available. However, there is no requirement to use this feature. **`GetExtractors() -> []http_extractor`** -To access the global extractor, the Propagation API provides a function which +To access the global extractor, the Propagation API provides a function which returns an extractor. **`SetExtractors([]http_extractor)`** -To update the global extractor, the Propagation API provides a function which +To update the global extractor, the Propagation API provides a function which takes an extractor. **`GetInjectors() -> []http_injector`** -To access the global injector, the Propagation API provides a function which +To access the global injector, the Propagation API provides a function which returns an injector. 
**`SetInjectors([]http_injector)`**
-To update the global injector, the Propagation API provides a function which
+To update the global injector, the Propagation API provides a function which
takes an injector.
-# Prototypes
+## Prototypes
-**Erlang:** https://github.com/open-telemetry/opentelemetry-erlang-api/pull/4
-**Go:** https://github.com/open-telemetry/opentelemetry-go/pull/381
-**Java:** https://github.com/open-telemetry/opentelemetry-java/pull/655
-**Python:** https://github.com/open-telemetry/opentelemetry-python/pull/325
-**Ruby:** https://github.com/open-telemetry/opentelemetry-ruby/pull/147
-**C#/.NET:** https://github.com/open-telemetry/opentelemetry-dotnet/pull/399
+**Erlang:** <https://github.com/open-telemetry/opentelemetry-erlang-api/pull/4>
+**Go:** <https://github.com/open-telemetry/opentelemetry-go/pull/381>
+**Java:** <https://github.com/open-telemetry/opentelemetry-java/pull/655>
+**Python:** <https://github.com/open-telemetry/opentelemetry-python/pull/325>
+**Ruby:** <https://github.com/open-telemetry/opentelemetry-ruby/pull/147>
+**C#/.NET:** <https://github.com/open-telemetry/opentelemetry-dotnet/pull/399>
-# Examples
+## Examples
-It might be helpful to look at some examples, written in pseudocode. Note that
-the pseudocode only uses simple functions and immutable values. Most mutable,
-object-orient languages will use objects, such as a Span object, to encapsulate
+It might be helpful to look at some examples, written in pseudocode. Note that
+the pseudocode only uses simple functions and immutable values. Most mutable,
+object-oriented languages will use objects, such as a Span object, to encapsulate
the context object and hide it from the user in most cases.
-Let's describe
-a simple scenario, where `service A` responds to an HTTP request from a `client`
+Let's describe
+a simple scenario, where `service A` responds to an HTTP request from a `client`
with the result of a request to `service B`.
```
client -> service A -> service B
```
-Now, let's assume the `client` in the above system is version 1.0. With version
-v2.0 of the `client`, `service A` must call `service C` instead of `service B`
+Now, let's assume the `client` in the above system is version 1.0. With version
+v2.0 of the `client`, `service A` must call `service C` instead of `service B`
in order to return the correct data.
```
client -> service A -> service C
```
-In this example, we would like `service A` to decide on which backend service
-to call, based on the client version. We would also like to trace the entire
-system, in order to understand if requests to `service C` are slower or faster
+In this example, we would like `service A` to decide on which backend service
+to call, based on the client version. We would also like to trace the entire
+system, in order to understand if requests to `service C` are slower or faster
than `service B`. What might `service A` look like?
-## Global initialization
-First, during program initialization, `service A` configures correlation and tracing
-propagation, and include them in the global list of injectors and extractors.
-Let's assume this tracing system is configured to use B3, and has a specific
+### Global initialization
+
+First, during program initialization, `service A` configures correlation and tracing
+propagation, and includes them in the global list of injectors and extractors.
+Let's assume this tracing system is configured to use B3, and has a specific
propagator for that format. Initializing the propagators might look like this:
```php
@@ -303,26 +305,27 @@ func InitializeOpentelemetry() {
}
```
-## Extracting and injecting from HTTP headers
-These propagators can then be used in the request handler for `service A`. The
-tracing and correlations concerns use the context object to handle state without
+### Extracting and injecting from HTTP headers
+
+These propagators can then be used in the request handler for `service A`. The
+tracing and correlations concerns use the context object to handle state without
breaking the encapsulation of the functions they are embedded in.
```php
func ServeRequest(context, request, project) -> (context) {
-    // Extract the context from the HTTP headers.
Because the list of
-    // extractors includes a trace extractor and a correlations extractor, the
-    // contents for both systems are included in the request headers into the
+    // Extract the context from the HTTP headers. Because the list of
+    // extractors includes a trace extractor and a correlations extractor, the
+    // contents for both systems are extracted from the request headers into the
    // returned context.
    extractors = Propagation::GetExtractors()
    context = Propagation::Extract(context, extractors, request.Headers)
-    // Start a span, setting the parent to the span context received from
+    // Start a span, setting the parent to the span context received from
    // the client process. The new span will then be in the returned context.
    context = Tracer::StartSpan(context, [span options])
-    // Determine the version of the client, in order to handle the data
-    // migration and allow new clients access to a data source that older
+    // Determine the version of the client, in order to handle the data
+    // migration and allow new clients access to a data source that older
    // clients are unaware of.
    version = Correlations::GetCorrelation( context, "client-version")
@@ -344,7 +347,7 @@ func ServeRequest(context, request, project) -> (context) {
func FetchDataFromServiceB(context) -> (context, data) {
    request = NewRequest([request options])
-    // Inject the contexts to be propagated. Note that there is no direct
+    // Inject the contexts to be propagated. Note that there is no direct
    // reference to tracing or correlations.
    injectors = Propagation::GetInjectors()
    request.Headers = Propagation::Inject(context, injectors, request.Headers)
@@ -356,13 +359,14 @@ func FetchDataFromServiceB(context) -> (context, data) {
}
```
-## Simplify the API with automated context propagation
-In this version of pseudocode above, we assume that the context object is
-explicit, and is pass and returned from every function as an ordinary parameter.
-This is cumbersome, and in many languages, a mechanism exists which allows
+### Simplify the API with automated context propagation
+
+In the version of pseudocode above, we assume that the context object is
+explicit, and is passed and returned from every function as an ordinary parameter.
+This is cumbersome, and in many languages, a mechanism exists which allows
context to be propagated automatically.
-In this version of pseudocode, assume that the current context can be stored as
+In this version of pseudocode, assume that the current context can be stored as
a thread local, and is implicitly passed to and returned from every function.
```php
@@ -398,18 +402,19 @@ func FetchDataFromServiceB() -> (data) {
}
```
-## Implementing a propagator
-Digging into the details of the tracing system, what might the internals of a
-span context propagator look like? Here is a crude example of extracting and
+### Implementing a propagator
+
+Digging into the details of the tracing system, what might the internals of a
+span context propagator look like? Here is a crude example of extracting and
injecting B3 headers, using an explicit context.
```php
func B3Extractor(context, headers) -> (context) {
-    context = Context::SetValue( context,
-        "trace.parentTraceID",
+    context = Context::SetValue( context,
+        "trace.parentTraceID",
        headers["X-B3-TraceId"])
    context = Context::SetValue( context,
-        "trace.parentSpanID",
+        "trace.parentSpanID",
        headers["X-B3-SpanId"])
    return context
}
@@ -422,26 +427,27 @@ injecting B3 headers, using an explicit context.
}
```
-## Implementing a concern
-Now, have a look at a crude example of how StartSpan might make use of the
-context.
+### Implementing a concern
+
+Now, have a look at a crude example of how StartSpan might make use of the
+context.
Note that this code must know the internal details about the context +keys in which the propagators above store their data. For this pseudocode, let's assume again that the context is passed implicitly in a thread local. ```php func StartSpan(options) { spanData = newSpanData() - + spanData.parentTraceID = Context::GetValue( "trace.parentTraceID") spanData.parentSpanID = Context::GetValue( "trace.parentSpanID") - + spanData.traceID = newTraceID() spanData.spanID = newSpanID() - + Context::SetValue( "trace.parentTraceID", spanData.traceID) Context::SetValue( "trace.parentSpanID", spanData.spanID) - - // store the spanData object as well, for in-process propagation. Note that + + // store the spanData object as well, for in-process propagation. Note that // this key will not be propagated, it is for local use only. Context::SetValue( "trace.currentSpanData", spanData) @@ -449,21 +455,22 @@ assume again that the context is passed implicitly in a thread local. } ``` -## The scope of current context +### The scope of current context + Let's look at a couple other scenarios related to automatic context propagation. -When are the values in the current context available? Scope management may be -different in each language, but as long as the scope does not change (by -switching threads, for example) the current context follows the execution of -the program. This includes after a function returns. Note that the context -objects themselves are immutable, so explicit handles to prior contexts will not +When are the values in the current context available? Scope management may be +different in each language, but as long as the scope does not change (by +switching threads, for example) the current context follows the execution of +the program. This includes after a function returns. Note that the context +objects themselves are immutable, so explicit handles to prior contexts will not be updated when the current context is changed. 
```php
func Request() {
    emptyContext = Context::GetCurrent()
-    Context::SetValue( "say-something", "foo")
+    Context::SetValue( "say-something", "foo")
    secondContext = Context::GetCurrent()
    print(Context::GetValue("say-something")) // prints "foo"
@@ -480,17 +487,18 @@ func Request() {
}
func DoWork(){
-    Context::SetValue( "say-something", "bar")
+    Context::SetValue( "say-something", "bar")
}
```
-## Referencing multiple contexts
-If context propagation is automatic, does the user ever need to reference a
-context object directly? Sometimes. Even when automated context propagation is
-an available option, there is no restriction which says that concerns must only
-ever access the current context.
+### Referencing multiple contexts
-For example, if a concern wanted to merge the data between two contexts, at
+If context propagation is automatic, does the user ever need to reference a
+context object directly? Sometimes. Even when automated context propagation is
+an available option, there is no restriction which says that concerns must only
+ever access the current context.
+
+For example, if a concern wanted to merge the data between two contexts, at
least one of them will not be the current context.
```php
@@ -498,27 +506,28 @@ mergedContext = MergeCorrelations( Context::GetCurrent(), otherContext)
Context::SetCurrent(mergedContext)
```
-## Falling back to explicit contexts
-Sometimes, suppling an additional version of a function which uses explicit
-contexts is necessary, in order to handle edge cases. For example, in some cases
-an extracted context is not intended to be set as current context. An
+### Falling back to explicit contexts
+
+Sometimes, supplying an additional version of a function which uses explicit
+contexts is necessary, in order to handle edge cases. For example, in some cases
+an extracted context is not intended to be set as current context. An
alternate extract method can be added to the API to handle this.
```php
// Most of the time, the extract function operates on the current context.
Extract(headers)
-// When a context needs to be extracted without changing the current
+// When a context needs to be extracted without changing the current
// context, fall back to the explicit API.
otherContext = ExtractWithContext(Context::GetCurrent(), headers)
```
-
-# Internal details
+## Internal details
![drawing](img/0066_context_propagation_details.png)
-## Example Package Layout
+### Example Package Layout
+
```
Context
  ContextAPI
@@ -539,14 +548,16 @@ otherContext = ExtractWithContext(Context::GetCurrent(), headers)
    HttpExtractorInterface
```
-## Edge Cases
-There are some complications that can arise when managing a span context extracted off the wire and in-process spans for tracer operations that operate on an implicit parent. In order to ensure that a context key references an object of the expected type and that the proper implicit parent is used, the following conventions have been established:
+### Edge Cases
+
+There are some complications that can arise when managing a span context extracted off the wire and in-process spans for tracer operations that operate on an implicit parent. In order to ensure that a context key references an object of the expected type and that the proper implicit parent is used, the following conventions have been established:
### Extract
+
-When extracting a remote context, the extracted span context MUST be stored separately from the current span.
+When extracting a remote context, the extracted span context MUST be stored separately from the current span.
### Default Span Parentage
+
When a new span is created from a context, a default parent for the span can be assigned. The order is of assignment is as follows:
* The current span.
@@ -554,73 +565,75 @@ When a new span is created from a context, a default parent for the span can be
* The root span.
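A hedged sketch of the parentage convention above, in Python, using hypothetical context key names for the current and extracted spans (the real key names are an implementation detail of the tracer):

```python
def default_parent(context):
    # Priority order from the conventions above: the current span first,
    # then the extracted remote span context, then fall back to a root.
    for key in ("trace.currentSpan", "trace.extractedSpanContext"):
        if context.get(key) is not None:
            return context[key]
    return "root"  # no parent available: the new span becomes a root span

assert default_parent({"trace.currentSpan": "A",
                       "trace.extractedSpanContext": "B"}) == "A"
assert default_parent({"trace.extractedSpanContext": "B"}) == "B"
assert default_parent({}) == "root"
```

Storing the extracted span context under its own key (per the Extract convention above) is what makes this fallback order possible: an in-process span never shadows or is shadowed by the remote one.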
### Inject
-When injecting a span to send over the wire, the default order is of
+
+When injecting a span to send over the wire, the default order of
assignment is as follows:
* The current span.
* The extracted span.
-## Default HTTP headers
-OpenTelemetry currently uses two standard header formats for context propagation.
+### Default HTTP headers
+
+OpenTelemetry currently uses two standard header formats for context propagation.
Their properties and requirements are integrated into the OpenTelemetry APIs.
-**Span Context -** The OpenTelemetry Span API is modeled on the `traceparent`
-and `tracestate` headers defined in the [W3C Trace Context specification](https://www.w3.org/TR/trace-context/).
+**Span Context -** The OpenTelemetry Span API is modeled on the `traceparent`
+and `tracestate` headers defined in the [W3C Trace Context specification](https://www.w3.org/TR/trace-context/).
-**Correlation Context -** The OpenTelemetry Correlations API is modeled on the
-`Correlation-Context` headers defined in the [W3C Correlation Context specification](https://w3c.github.io/correlation-context/).
+**Correlation Context -** The OpenTelemetry Correlations API is modeled on the
+`Correlation-Context` headers defined in the [W3C Correlation Context specification](https://w3c.github.io/correlation-context/).
-## Context management and in-process propagation
+### Context management and in-process propagation
-In order for Context to function, it must always remain bound to the execution
-of code it represents. By default, this means that the programmer must pass a
-Context down the call stack as a function parameter. However, many languages
-provide automated context management facilities, such as thread locals.
-OpenTelemetry should leverage these facilities when available, in order to
+In order for Context to function, it must always remain bound to the execution
+of code it represents.
By default, this means that the programmer must pass a
+Context down the call stack as a function parameter. However, many languages
+provide automated context management facilities, such as thread locals.
+OpenTelemetry should leverage these facilities when available, in order to
provide automatic context management.
-## Pre-existing context implementations
+### Pre-existing context implementations
-In some languages, a single, widely used context implementation exists. In other
-languages, there many be too many implementations, or none at all. For example,
-Go has a the `context.Context` object, and widespread conventions for how to
-pass it down the call stack. Java has MDC, along with several other context
-implementations, but none are so widely used that their presence can be
+In some languages, a single, widely used context implementation exists. In other
+languages, there may be too many implementations, or none at all. For example,
+Go has the `context.Context` object, and widespread conventions for how to
+pass it down the call stack. Java has MDC, along with several other context
+implementations, but none are so widely used that their presence can be
guaranteed or assumed.
-In the cases where an extremely clear, pre-existing option is not available,
+In the cases where an extremely clear, pre-existing option is not available,
OpenTelemetry should provide its own context implementation.
+## FAQ
-# FAQ
-
-## What about complex propagation behavior?
+### What about complex propagation behavior?
-Some OpenTelemetry proposals have called for more complex propagation behavior.
-For example, falling back to extracting B3 headers if W3C Trace-Context headers
-are not found. "Fallback propagators" and other complex behavior can be modeled as
-implementation details behind the Propagator interface.
Therefore, the
-propagation system itself does not need to provide an mechanism for chaining
+Some OpenTelemetry proposals have called for more complex propagation behavior.
+For example, falling back to extracting B3 headers if W3C Trace-Context headers
+are not found. "Fallback propagators" and other complex behavior can be modeled as
+implementation details behind the Propagator interface. Therefore, the
+propagation system itself does not need to provide a mechanism for chaining
together propagators or other additional facilities.
-# Prior art and alternatives
+## Prior art and alternatives
Prior art:
+
* OpenTelemetry distributed context
* OpenCensus propagators
* OpenTracing spans
* gRPC context
-# Risks
+## Risks
-The Correlations API is related to the [W3C Correlation-Context](https://w3c.github.io/correlation-context/)
-specification. Work on this specification has begun, but is not complete. While
-unlikely, it is possible that this W3C specification could diverge from the
+The Correlations API is related to the [W3C Correlation-Context](https://w3c.github.io/correlation-context/)
+specification. Work on this specification has begun, but is not complete. While
+unlikely, it is possible that this W3C specification could diverge from the
design or guarantees needed by the Correlations API.
-# Future possibilities
+## Future possibilities
-Cleanly splitting OpenTelemetry into Aspects and Context Propagation layer may
-allow us to move the Context Propagation layer into its own, stand-alone
-project. This may facilitate adoption, by allowing us to share Context
+Cleanly splitting OpenTelemetry into Aspects and Context Propagation layers may
+allow us to move the Context Propagation layer into its own, stand-alone
+project. This may facilitate adoption, by allowing us to share Context
Propagation with gRPC and other projects.
diff --git a/oteps/0070-metric-bound-instrument.md b/oteps/0070-metric-bound-instrument.md index 0c730356695..357a24f63d5 100644 --- a/oteps/0070-metric-bound-instrument.md +++ b/oteps/0070-metric-bound-instrument.md @@ -1,7 +1,5 @@ # Rename metric instrument Handles to "Bound Instruments" -*Status: proposed 11/26/2019* - The OpenTelemetry metrics API specification refers to a concept known as ["metric handles"](0009-metric-handles.md), which is a metric instrument bound to a `LabelSet`. This OTEP proposes to change that diff --git a/oteps/0072-metric-observer.md b/oteps/0072-metric-observer.md index 643cab4073a..eeecb8537ab 100644 --- a/oteps/0072-metric-observer.md +++ b/oteps/0072-metric-observer.md @@ -44,7 +44,7 @@ purpose. If the simpler alternative suggested above--registering non-instrument-specific callbacks--were implemented instead, callers would demand a way to ask whether an instrument was "recording" or not, similar to the [`Span.IsRecording` -API](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/api-tracing.md#isrecording). +API](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/api.md#isrecording). Observer instruments are semantically equivalent to gauge instruments, except they support callbacks instead of a `Set()` operation. diff --git a/oteps/0088-metric-instrument-optional-refinements.md b/oteps/0088-metric-instrument-optional-refinements.md index b5f7a4598f2..47672a33eb9 100644 --- a/oteps/0088-metric-instrument-optional-refinements.md +++ b/oteps/0088-metric-instrument-optional-refinements.md @@ -54,11 +54,11 @@ refinements) use callbacks to capture measurements. All measurement APIs produce metric events consisting of [timestamp, instrument descriptor, label set, and numerical -value](api-metrics.md#metric-event-format). 
Synchronous instrument -events additionally have [Context](api-context.md), describing +value](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/metrics/api.md#metric-event-format). Synchronous instrument +events additionally have [Context](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/context/context.md), describing properties of the associated trace and distributed correlation values. -#### Terminology: Kinds of Aggregation +### Terminology: Kinds of Aggregation _Aggregation_ refers to the technique used to summarize many measurements and/or observations into _some_ kind of summary of the @@ -104,11 +104,11 @@ such as "what is the average last value of a metric at a point in time?". Observer instruments define the Last Value relationship without referring to the collection interval and without ambiguity. -#### Last-value and Measure instruments +### Last-value and Measure instruments Measure instruments do not define a Last Value relationship. One reason is that [synchronous events can happen -simultaneously](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/api-metrics.md#time). +simultaneously](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/metrics/api.md#time). For Measure instruments, it is possible to compute an aggregation that computes the last-captured value in a collection interval, but it is diff --git a/oteps/0091-logs-vocabulary.md b/oteps/0091-logs-vocabulary.md index 23c636dd331..6131ef1ee99 100644 --- a/oteps/0091-logs-vocabulary.md +++ b/oteps/0091-logs-vocabulary.md @@ -31,8 +31,8 @@ additional qualifiers should be used (e.g. `Log Record`). 
### Embedded Log
-`Log Records` embedded inside a [Span](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/api-tracing.md#span)
-object, in the [Events](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/api-tracing.md#add-events) list.
+`Log Records` embedded inside a [Span](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/api.md#span)
+object, in the [Events](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/api.md#add-events) list.
 ### Standalone Log
diff --git a/oteps/0092-logs-vision.md b/oteps/0092-logs-vision.md
index 7941e9dd30a..e7736f1f0ee 100644
--- a/oteps/0092-logs-vision.md
+++ b/oteps/0092-logs-vision.md
@@ -1,6 +1,6 @@
 # OpenTelemetry Logs Vision
-The following are high-level items that define our long-term vision for
+The following are high-level items that define our long-term vision for
 Logs support in OpenTelemetry project, what we aspire to achieve.
 This a vision document that reflects our current desires. It is not a commitment
@@ -9,38 +9,39 @@ document is to ensure all contributors work in alignment. As our vision
 changes over time maintainers reserve the right to add, modify, and remove
 items from this document.
-This document uses vocabulary introduced in https://github.com/open-telemetry/oteps/pull/91.
+This document uses vocabulary introduced in <https://github.com/open-telemetry/oteps/pull/91>.
-### First-class Citizen
+## First-class Citizen
 Logs are a first-class citizen in observability, along with traces and metrics.
 We will aim to have best-in-class support for logs at OpenTelemetry.
-### Correlation
+## Correlation
 OpenTelemetry will define how logs will be correlated with traces and metrics
 and how this correlation information will be stored.
Correlation will work across 2 major dimensions: + - To correlate telemetry emitted for the same request (also known as Request or Trace Context Correlation), - To correlate telemetry emitted from the same source (also known as Resource Context Correlation). -### Logs Data Model +## Logs Data Model We will design a Log Data model that will aim to correctly represent all types of logs. The purpose of the data model is to have a common understanding of what a log record is, what data needs to be recorded, transferred, stored and interpreted by a logging system. -Existing log formats can be unambiguously mapped to this data model. Reverse -mapping from this data model is also possible to the extent that the target log +Existing log formats can be unambiguously mapped to this data model. Reverse +mapping from this data model is also possible to the extent that the target log format has equivalent capabilities. We will produce mapping recommendations for commonly used log formats. -### Log Protocol +## Log Protocol Armed with the Log Data model we will aim to design a high performance protocol for logs, which will pursue the same [design goals](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/protocol/design-goals.md) @@ -55,7 +56,7 @@ The reason for this design is to have a single OpenTelemetry protocol that can deliver logs, traces and metrics via one connection and satisfy all design goals. -### Unified Collection +## Unified Collection We aim to have high-performance, unified [Collector](https://github.com/open-telemetry/opentelemetry-collector/) that @@ -67,6 +68,7 @@ The unified Collector will support multiple log protocols including the newly designed OpenTelemetry log protocol. Unified collection is important for the following reasons: + - One agent (or one collector) to deploy and manage. - One place of configuration for target endpoints, authentication tokens, etc. 
- Uniform tagging of all 3 types of telemetry data (enrichment by attributes @@ -109,7 +111,7 @@ system logs, infrastructure logs, third-party and first-party application logs. ### Standalone and Embedded Logs -OpenTelemetry will support both logs embedded inside [Spans](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/api-tracing.md#span) +OpenTelemetry will support both logs embedded inside [Spans](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/api.md#span) and standalone logs recorded elsewhere. The support of embedded logs is important for OpenTelemetry's primary use cases, where errors and exceptions need to be embedded in Spans. The support of standalone logs is important for