RFC - Pipeline Component Telemetry #11406

Open · wants to merge 13 commits into base: `main`
docs/rfcs/component-universal-telemetry.md (new file: 212 additions, 0 deletions)
# Pipeline Component Telemetry

## Motivation and Scope

The collector should be observable, and this must naturally include observability of its pipeline components. Pipeline components
are those components of the collector which directly interact with data, specifically receivers, processors, exporters, and connectors.

It is understood that each _type_ of component (`filelog`, `batch`, etc.) may emit telemetry describing its internal workings,
and that these internally derived signals may vary greatly based on the concerns and maturity of each component. Naturally,
though, there is much we can do to normalize the telemetry emitted from and about pipeline components.

Two major challenges in pursuit of broadly normalized telemetry are (1) consistent attributes, and (2) automatic capture.

This RFC represents an evolving consensus about the desired end state of component telemetry. It does _not_ claim
to describe the final state of all component telemetry, but rather seeks to document some specific aspects. It proposes a set of
attributes which are both necessary and sufficient to identify components and their instances. It also articulates one specific
mechanism by which some telemetry can be automatically captured. Finally, it describes some specific metrics and logs which should
be automatically captured for each kind of pipeline component.

## Goals

1. Define attributes that are (A) specific enough to describe individual component [_instances_](https://github.com/open-telemetry/opentelemetry-collector/issues/10534)
and (B) consistent enough for correlation across signals.
2. Articulate a mechanism which enables us to _automatically_ capture telemetry from _all pipeline components_.
3. Define specific metrics for each kind of pipeline component.
4. Define specific logs for all kinds of pipeline component.

## Attributes

All signals should use the following attributes:

### Receivers
- `otel.component.kind`: `receiver`
- `otel.component.id`: The component ID
- `otel.signal`: `logs`, `metrics`, `traces`, `profiles`

> **Reviewer comment:** I know this states "pipeline component telemetry", and that extensions aren't technically part of a pipeline, but it feels wrong to leave them out: `otel.component.kind` and `.id` could definitely apply to them as well.
>
> **Author reply:** The scope of this effort has been increased a lot already. Can we leave extensions for another proposal? Personally I don't feel I have enough expertise with extensions to author such details.

### Processors

- `otel.component.kind`: `processor`
- `otel.component.id`: The component ID
- `otel.pipeline.id`: The pipeline ID
- `otel.signal`: `logs`, `metrics`, `traces`, `profiles`

### Exporters

- `otel.component.kind`: `exporter`
- `otel.component.id`: The component ID
- `otel.signal`: `logs`, `metrics`, `traces`, `profiles`

### Connectors

- `otel.component.kind`: `connector`
- `otel.component.id`: The component ID
- `otel.signal`: `logs`, `metrics`, `traces`
- `otel.signal.output`: `logs`, `metrics`, `traces`, `profiles`

Note: The `otel.signal`, `otel.signal.output`, or `otel.pipeline.id` attributes may be omitted if the corresponding component instances
are unified by the component implementation. For example, the `otlp` receiver is a singleton, so its telemetry is not specific to a signal.
Similarly, the `memory_limiter` processor is a singleton, so its telemetry is not specific to a pipeline.

> **Reviewer comment:** I think this needs normative language. Setting the `otel.signal` for the OTLP receiver for its bootstrap operations is certainly misleading to people operating the collector who are unaware of the inner workings of this specific component (i.e., that it's a singleton). So:
>
> > attributes MUST be omitted if the corresponding component instances
>
> **Author reply:** This is a tricky topic and I'm not sure we can be so strict. It certainly makes sense to me that e.g. logs generated while initializing the singleton should not be attributed to one signal or pipeline. However, we can still attribute metrics to a particular signal (e.g. if the otlp receiver emits 10 logs and 20 metrics, do you want a count of "30 items", or "10 logs" and "20 metrics"?). Maybe this is a good argument for splitting the proposed metrics by signal type, e.g. `produced_metrics`, `produced_logs`, etc. This would allow those metrics to share the same set of attributes with other signals produced by the instance.
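For illustration only, here is a minimal Go sketch (assuming hypothetical helper names, not the collector's actual internals) of how the identifying attribute set for one processor instance could be assembled:

```go
package graph // hypothetical placement, for illustration only

import "go.opentelemetry.io/otel/attribute"

// processorInstanceAttrs sketches how the identifying attributes for one
// processor instance could be assembled. The helper name is hypothetical.
func processorInstanceAttrs(componentID, pipelineID, signal string) attribute.Set {
	return attribute.NewSet(
		attribute.String("otel.component.kind", "processor"),
		attribute.String("otel.component.id", componentID), // e.g. "batch"
		attribute.String("otel.pipeline.id", pipelineID),   // e.g. "logs"
		attribute.String("otel.signal", signal),            // e.g. "logs"
	)
}
```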

## Auto-Instrumentation Mechanism

The mechanism of telemetry capture should be _external_ to components. Specifically, we should observe telemetry at each point where a
component passes data to another component, and, at each point where a component consumes data from another component. In terms of the
component graph, every _edge_ in the graph will have two layers of instrumentation - one for the producing component and one for the
consuming component. Importantly, each layer generates telemetry ascribed to a single component instance, so by having two layers per
edge we can describe both sides of each handoff independently.

Telemetry captured by this mechanism should be associated with an instrumentation scope corresponding to the package which implements
the mechanism. Currently, that package is `service/internal/graph`, but this may change in the future. Notably, this telemetry is not
ascribed to individual component packages, both because the instrumentation scope is intended to describe the origin of the telemetry,
and because no mechanism is presently identified which would allow us to determine the characteristics of a component-specific scope.

### Auto-Instrumented Metrics

There are two straightforward measurements that can be made on any pdata:

1. A count of "items" (spans, data points, or log records). These are low cost but broadly useful, so they should be enabled by default.
2. A measure of size, based on [ProtoMarshaler.Sizer()](https://github.com/open-telemetry/opentelemetry-collector/blob/9907ba50df0d5853c34d2962cf21da42e15a560d/pdata/ptrace/pb.go#L11).
These are high cost to compute, so by default they should be disabled (and not calculated).
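For reference, a minimal sketch of how such a size measurement could be taken, reusing the pdata proto sizer linked above (whether the graph package would use this exact call is an assumption of this example):

```go
package graph // hypothetical placement, for illustration only

import "go.opentelemetry.io/collector/pdata/ptrace"

var tracesSizer = &ptrace.ProtoMarshaler{}

// tracesSize returns the marshaled (protobuf) size of a traces payload in bytes.
func tracesSize(td ptrace.Traces) int {
	return tracesSizer.TracesSize(td)
}
```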
> **Reviewer comment:** How costly? I remember talking to someone about this in the past, and they mentioned that it's not that expensive, given that it just delegates to what already exists in protobuf. It would be nice to have benchmarks to have data backing this (or the other) claim. I would definitely see it as very useful to have a histogram of item/batch sizes, and having it as optional means that people might only find out about it when they'd benefit from having historical data in the first place.
>
> **Author reply:** Suggested change:
>
> > These may be high cost to compute, so by default they should be disabled (and not calculated). This default setting may change in the future if it is demonstrated that the cost is generally acceptable.
>
> How's this wording?


The location of these measurements can be described in terms of whether the data is "consumed" or "produced", from the perspective of the
component to which the telemetry is attributed. Metrics which contain the term "produced" describe data which is emitted from the component,
while metrics which contain the term "consumed" describe data which is received by the component.

For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to
whether or not the corresponding function call returned an error. Specifically, consumed measurements will be recorded with `outcome` as
`failure` when a call from the previous component to the `ConsumeX` function returns an error, and `success` otherwise. Likewise, produced
measurements will be recorded with `outcome` as `failure` when a call to the next consumer's `ConsumeX` function returns an error, and
`success` otherwise.
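As a rough illustration of this mechanism (not the actual `service/internal/graph` implementation), the following Go sketch shows a hypothetical wrapper that records a produced-items counter with the `outcome` attribute around the next consumer's `ConsumeLogs` call:

```go
package graph // hypothetical placement, for illustration only

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// producedLogsConsumer is a hypothetical wrapper inserted on a graph edge,
// ascribing telemetry to the producing component instance.
type producedLogsConsumer struct {
	next          consumer.Logs
	producedItems metric.Int64Counter // e.g. otelcol.processor.produced.items
	attrs         attribute.Set       // otel.component.kind, otel.component.id, ...
}

func (c *producedLogsConsumer) ConsumeLogs(ctx context.Context, ld plog.Logs) error {
	// Count items before handing off, since the next consumer may mutate the data.
	count := int64(ld.LogRecordCount())
	err := c.next.ConsumeLogs(ctx, ld)

	outcome := "success"
	if err != nil {
		outcome = "failure"
	}
	c.producedItems.Add(ctx, count,
		metric.WithAttributeSet(c.attrs),
		metric.WithAttributes(attribute.String("outcome", outcome)))
	return err
}

func (c *producedLogsConsumer) Capabilities() consumer.Capabilities {
	return c.next.Capabilities()
}
```

The proposed metric definitions are listed below.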

```yaml
otelcol.receiver.produced.items:
  enabled: true
  description: Number of items emitted from the receiver.
  unit: "{items}"
  sum:
    value_type: int
    monotonic: true
otelcol.processor.consumed.items:
  enabled: true
  description: Number of items passed to the processor.
  unit: "{items}"
  sum:
    value_type: int
    monotonic: true
otelcol.processor.produced.items:
  enabled: true
  description: Number of items emitted from the processor.
  unit: "{items}"
  sum:
    value_type: int
    monotonic: true
otelcol.connector.consumed.items:
  enabled: true
  description: Number of items passed to the connector.
  unit: "{items}"
  sum:
    value_type: int
    monotonic: true
otelcol.connector.produced.items:
  enabled: true
  description: Number of items emitted from the connector.
  unit: "{items}"
  sum:
    value_type: int
    monotonic: true
otelcol.exporter.consumed.items:
  enabled: true
  description: Number of items passed to the exporter.
  unit: "{items}"
  sum:
    value_type: int
    monotonic: true

otelcol.receiver.produced.size:
  enabled: false
  description: Size of items emitted from the receiver.
  unit: "By"
  sum:
    value_type: int
    monotonic: true
otelcol.processor.consumed.size:
  enabled: false
  description: Size of items passed to the processor.
  unit: "By"
  sum:
    value_type: int
    monotonic: true
otelcol.processor.produced.size:
  enabled: false
  description: Size of items emitted from the processor.
  unit: "By"
  sum:
    value_type: int
    monotonic: true
otelcol.connector.consumed.size:
  enabled: false
  description: Size of items passed to the connector.
  unit: "By"
  sum:
    value_type: int
    monotonic: true
otelcol.connector.produced.size:
  enabled: false
  description: Size of items emitted from the connector.
  unit: "By"
  sum:
    value_type: int
    monotonic: true
otelcol.exporter.consumed.size:
  enabled: false
  description: Size of items passed to the exporter.
  unit: "By"
  sum:
    value_type: int
    monotonic: true
```

### Auto-Instrumented Logs

Metrics provide most of the observability we need, but there are some gaps which logs can fill. Although metrics describe the overall
item counts, it is helpful in some cases to record more granular events. For example, if a produced batch of 10,000 spans results in an
error, but 100 batches of 100 spans succeed, this may be a matter of batch size that can be detected by analyzing logs, while the
corresponding metric reports only that a 50% success rate is observed.

> **Reviewer comment:** This is clearly a tracing case for me :-) The rule of thumb to me is: is the information related to a particular transaction? Then it should go into a span.
>
> **Author reply:** Makes sense if we agree that we should capture a span for the consume function.
For security and performance reasons, it would not be appropriate to log the contents of telemetry.

It's very easy for logs to become too noisy. Even if errors are occurring frequently in the data pipeline, only the errors that are not
handled automatically will be of interest to most users.

With the above considerations, this proposal includes only that we add a DEBUG log for each individual outcome. This should be sufficient
for detailed troubleshooting but does not impact users otherwise.

In the future, it may be helpful to define triggers for reporting repeated failures at a higher severity level, e.g. N failures in a row,
or a moving average success rate. For now, the criteria and necessary configurability are unclear, so this is mentioned only as an example
of future possibilities.
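As a rough sketch of the proposed per-outcome DEBUG log (the message and field names here are illustrative assumptions, not a settled schema):

```go
package graph // hypothetical placement, for illustration only

import "go.uber.org/zap"

// logOutcome sketches the per-outcome DEBUG log that could accompany each
// ConsumeX call. Message and field names are illustrative only.
func logOutcome(logger *zap.Logger, itemCount int, err error) {
	if err != nil {
		logger.Debug("items handoff failed", zap.Int("item_count", itemCount), zap.Error(err))
		return
	}
	logger.Debug("items handoff succeeded", zap.Int("item_count", itemCount))
}
```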

### Auto-Instrumented Spans

It is not clear that any spans can be captured automatically with the proposed mechanism. We have the ability to insert instrumentation both
before and after processors and connectors. However, we generally cannot assume a 1:1 relationship between consumed and produced data.

> **Reviewer comment:** Context is passed down, isn't it? We can definitely instrument the ingress part of the component, and ask components to add span links if they are messing with the context. This way, the trace for a pipeline with a batch processor would end at the batching processor, but a new trace with a span link would be created pointing to the originating batch request.
>
> **Author reply:** I see, so instead of closing the span when data comes out, we close it when the consume func returns?
>
> I think the duration of the span would be meaningful only for some synchronous processors, and could be meaningful for synchronous connectors (e.g. if they create and link spans to represent the work associated with the incoming data). But what about asynchronous components? Do we accept that the span is just measuring a quick handoff to the internal state of the component? Is this going to be misleading to users?

## Additional Context
This proposal pulls from a number of issues and PRs:
- [Demonstrate graph-based metrics](https://github.com/open-telemetry/opentelemetry-collector/pull/11311)
- [Attributes for component instancing](https://github.com/open-telemetry/opentelemetry-collector/issues/11179)
- [Simple processor metrics](https://github.com/open-telemetry/opentelemetry-collector/issues/10708)
- [Component instancing is complicated](https://github.com/open-telemetry/opentelemetry-collector/issues/10534)