RFC - Pipeline Component Telemetry #11406

Open
wants to merge 13 commits into
base: main

djaglowski
Member

@djaglowski djaglowski commented Oct 9, 2024

This PR adds an RFC for normalized telemetry across all pipeline components. See #11343

edit by @mx-psi:

codecov bot commented Oct 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.56%. Comparing base (68f0264) to head (7925012).
Report is 100 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11406      +/-   ##
==========================================
- Coverage   92.15%   91.56%   -0.60%     
==========================================
  Files         432      441       +9     
  Lines       20291    23856    +3565     
==========================================
+ Hits        18700    21844    +3144     
- Misses       1228     1640     +412     
- Partials      363      372       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@djaglowski djaglowski marked this pull request as ready for review October 10, 2024 13:36
@djaglowski djaglowski requested a review from a team as a code owner October 10, 2024 13:36
@djaglowski djaglowski added the "Skip Changelog" (PRs that do not require a CHANGELOG.md entry) and "Skip Contrib Tests" labels Oct 10, 2024
Contributor

@codeboten codeboten left a comment

Thanks for opening this as an RFC @djaglowski!

docs/rfcs/component-universal-telemetry.md Outdated Show resolved Hide resolved
@djaglowski djaglowski changed the title from "RFC - Auto-instrumentation of pipeline components" to "RFC - Pipeline Component Telemetry" Oct 16, 2024
@djaglowski
Member Author

Based on some offline feedback, I've broadened the scope of the RFC, while simultaneously clarifying that it is intended to evolve as we identify additional standards.

Contributor

@jaronoff97 jaronoff97 left a comment

a few questions, I really like this proposal overall :)

docs/rfcs/component-universal-telemetry.md Outdated Show resolved Hide resolved
Co-authored-by: Damien Mathieu <[email protected]>
- `otel.output.signal`: `logs`, `metrics`, `traces`, `profiles`

Note: The `otel.signal`, `otel.output.signal`, or `otel.pipeline.id` attributes may be omitted if the corresponding component instances
are unified by the component implementation. For example, the `otlp` receiver is a singleton, so its telemetry is not specific to a signal.
Member

I think this needs normative language. Setting otel.signal for the OTLP receiver's bootstrap operations is certainly misleading to people operating the collector who are unaware of the inner workings of this specific component (i.e., that it's a singleton).

So:

attributes MUST be omitted if the corresponding component instances

Member Author

This is a tricky topic and I'm not sure we can be so strict. It certainly makes sense to me that e.g. logs generated while initializing the singleton should not be attributed to one signal or pipeline. However, we can still attribute metrics to a particular signal (e.g. if the otlp receiver emits 10 logs and 20 metrics, do you want a count of "30 items", or "10 logs" and "20 metrics"?). Maybe this is a good argument for splitting the proposed metrics by signal type, e.g. produced_metrics, produced_logs, etc. This would allow those metrics to share the same set of attributes with other signals produced by the instance.
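
To make the split-by-signal idea concrete, here is a minimal sketch using only the Go standard library. It is not the Collector's API: `Recorder`, `Attrs`, and the `produced_<signal>` naming are hypothetical stand-ins, and the attribute value `receiver` is assumed for illustration. The point is that when the signal lives in the metric name, a singleton such as the otlp receiver can use one attribute set (without `otel.signal`) and still report per-signal counts.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Signal mirrors the signal values named in the RFC text.
type Signal string

const (
	Logs    Signal = "logs"
	Metrics Signal = "metrics"
	Traces  Signal = "traces"
)

// Attrs is a hypothetical stand-in for an attribute set.
type Attrs map[string]string

// key renders the attributes in a stable order so they can index a map.
func (a Attrs) key() string {
	parts := make([]string, 0, len(a))
	for k, v := range a {
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// Recorder is a hypothetical in-memory counter store keyed by
// metric name plus attribute set.
type Recorder struct {
	counts map[string]int64
}

func NewRecorder() *Recorder {
	return &Recorder{counts: map[string]int64{}}
}

func seriesKey(sig Signal, attrs Attrs) string {
	return "produced_" + string(sig) + "{" + attrs.key() + "}"
}

// Produced adds n items to the per-signal counter (e.g. produced_logs).
// The signal is carried by the metric name, not by an attribute.
func (r *Recorder) Produced(sig Signal, attrs Attrs, n int64) {
	r.counts[seriesKey(sig, attrs)] += n
}

// Get returns the current value of the per-signal counter.
func (r *Recorder) Get(sig Signal, attrs Attrs) int64 {
	return r.counts[seriesKey(sig, attrs)]
}

func main() {
	// One attribute set for the whole singleton instance.
	attrs := Attrs{
		"otel.component.kind": "receiver", // assumed value
		"otel.component.id":   "otlp",
	}
	r := NewRecorder()
	r.Produced(Logs, attrs, 10)
	r.Produced(Metrics, attrs, 20)
	fmt.Println(r.Get(Logs, attrs), r.Get(Metrics, attrs)) // prints "10 20"
}
```

With this shape the example from the comment yields "10 logs" and "20 metrics" rather than "30 items", and the same attribute set can be reused for the instance's logs and spans.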


All signals should use the following attributes:

### Receivers
Member

I know this states "pipeline component telemetry", and that extensions aren't technically part of a pipeline, but it feels wrong to leave them out: otel.component.kind and .id could definitely apply to them as well.

Member Author

The scope of this effort has been increased a lot already. Can we leave extensions for another proposal? Personally I don't feel I have enough expertise with extensions to author such details.


1. A count of "items" (spans, data points, or log records). These are low cost but broadly useful, so they should be enabled by default.
2. A measure of size, based on [ProtoMarshaler.Sizer()](https://github.com/open-telemetry/opentelemetry-collector/blob/9907ba50df0d5853c34d2962cf21da42e15a560d/pdata/ptrace/pb.go#L11).
These are high cost to compute, so by default they should be disabled (and not calculated).
Member

How costly? I remember talking to someone about this in the past, and they mentioned that it's not that expensive, given that it just delegates to what already exists in protobuf:

It would be nice to have benchmarks providing data to back this (or the opposite) claim. I would definitely find a histogram of item/batch sizes very useful, and making it optional means that people might only find out about it at the point where they would have benefited from having historical data.

Member Author

Suggested change
- These are high cost to compute, so by default they should be disabled (and not calculated).
+ These may be high cost to compute, so by default they should be disabled (and not calculated). This default setting may change in the future if it is demonstrated that the cost is generally acceptable.

How's this wording?
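
For reference, "disabled (and not calculated)" could look roughly like the sketch below. Everything here is hypothetical: `Batch` stands in for a pdata payload, `Size` stands in for the real sizer call, and `SizeMetricsEnabled` is an invented toggle rather than an existing Collector setting. The item count is always recorded because it is cheap, while the size walk only runs when the toggle is on.

```go
package main

import "fmt"

// Batch is a hypothetical stand-in for a pdata payload.
type Batch struct {
	Items [][]byte
}

// ItemCount is cheap: it only reads the slice length.
func (b Batch) ItemCount() int { return len(b.Items) }

// Size visits every item, standing in for a marshaled-size computation
// whose cost grows with the payload.
func (b Batch) Size() int {
	total := 0
	for _, item := range b.Items {
		total += len(item)
	}
	return total
}

// Config holds a hypothetical opt-in toggle for size metrics.
type Config struct {
	SizeMetricsEnabled bool
}

// Measurement is what the instrumentation layer would record.
type Measurement struct {
	Items        int
	Size         int
	SizeRecorded bool
}

// Measure always counts items, but only computes the size when enabled,
// so the disabled path never pays for the size walk.
func Measure(cfg Config, b Batch) Measurement {
	m := Measurement{Items: b.ItemCount()}
	if cfg.SizeMetricsEnabled {
		m.Size = b.Size()
		m.SizeRecorded = true
	}
	return m
}

func main() {
	b := Batch{Items: [][]byte{[]byte("abc"), []byte("de")}}
	fmt.Printf("%+v\n", Measure(Config{}, b))
	fmt.Printf("%+v\n", Measure(Config{SizeMetricsEnabled: true}, b))
}
```

If benchmarks later show the size computation is generally acceptable, flipping the default would only change the zero value of the toggle, not the call sites.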


Metrics provide most of the observability we need but there are some gaps which logs can fill. Although metrics would describe the overall
item counts, it is helpful in some cases to record more granular events. e.g. If a produced batch of 10,000 spans results in an error, but
100 batches of 100 spans succeed, this may be a matter of batch size that can be detected by analyzing logs, while the corresponding metric
Member

This is clearly a tracing case for me :-) The rule of thumb to me is: is the information related to a particular transaction? Then it should go into a span.

Member Author

Makes sense if we agree that we should capture a span for the consume function.


### Auto-Instrumented Spans

It is not clear that any spans can be captured automatically with the proposed mechanism. We have the ability to insert instrumentation both
Member

Context is passed down, isn't it? We can definitely instrument the ingress part of the component, and ask components to add span links if they are messing with the context. This way, the trace for a pipeline with a batch processor would end at the batching processor, but a new trace with a span link would be created pointing to the originating batch request.

Member Author

I see, so instead of closing the span when data comes out, we close it when the consume func returns?

I think the duration of the span would be meaningful only for some synchronous processors, and could be meaningful for synchronous connectors (e.g. if they create and link spans to represent the work associated with the incoming data). But what about asynchronous components? Do we accept that the span is just measuring a quick handoff to the internal state of the component? Is this going to be misleading to users?
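
A rough sketch of the concern, using only the standard library and no real tracing API. `ConsumeFunc`, `Trace`, and `WithSpan` are hypothetical simplifications: the "span" is just a pair of ordered events that closes when the wrapped consume function returns. For a synchronous component the work falls inside the span; for a component that only enqueues, the span closes before the work happens.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// ConsumeFunc is a hypothetical, simplified consume function.
type ConsumeFunc func(items int) error

// Trace records events in order; it stands in for a span recorder.
type Trace struct {
	mu     sync.Mutex
	events []string
}

func (t *Trace) add(event string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.events = append(t.events, event)
}

func (t *Trace) String() string {
	t.mu.Lock()
	defer t.mu.Unlock()
	return strings.Join(t.events, " > ")
}

// WithSpan wraps next so the "span" ends when the consume func returns.
func WithSpan(t *Trace, next ConsumeFunc) ConsumeFunc {
	return func(items int) error {
		t.add("span start")
		defer t.add("span end")
		return next(items)
	}
}

// RunSync models a synchronous processor: the work happens inside consume.
func RunSync() string {
	t := &Trace{}
	consume := WithSpan(t, func(int) error {
		t.add("work done")
		return nil
	})
	_ = consume(100)
	return t.String()
}

// RunAsync models a component whose consume only hands data to a queue.
// The worker is started after consume returns to keep the order deterministic.
func RunAsync() string {
	t := &Trace{}
	queue := make(chan int, 1)
	consume := WithSpan(t, func(items int) error {
		queue <- items // quick handoff to internal state
		return nil
	})
	_ = consume(100)

	done := make(chan struct{})
	go func() {
		<-queue
		t.add("work done")
		close(done)
	}()
	<-done
	return t.String()
}

func main() {
	fmt.Println(RunSync())  // span start > work done > span end
	fmt.Println(RunAsync()) // span start > span end > work done
}
```

In the asynchronous case the span only covers the handoff, which is where a span link from a later span (as suggested above for the batch processor) would be needed to connect the actual work back to the incoming request.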

@jpkrohling
Member

Some of my comments might have been discussed before, in which case, feel free to ignore me and just mark the items as resolved.
