
prefix should be consistent for internal metrics #9315

Closed
codeboten opened this issue Jan 18, 2024 · 9 comments · Fixed by #9759

Comments

@codeboten
Contributor

Currently, metrics generated by the collector include the prefix otelcol as configured here:

otelprom.WithNamespace("otelcol"),

This is not the case when configuring metrics to be emitted via OTLP, as the prefix configuration is specific to prometheus. I'd like to propose the prefix be applied to the metric names, so that they remain consistent when users switch from one mechanism to another.
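For illustration, a minimal Go sketch of why the two paths diverge today (the wiring and the exporter_sent_spans instrument name are assumptions for the example, not the collector's actual setup):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Prometheus path: the "otelcol" prefix is added by the exporter option,
	// so an instrument shows up as e.g. otelcol_exporter_sent_spans.
	promExp, _ := otelprom.New(otelprom.WithNamespace("otelcol")) // error handling omitted

	// OTLP path: there is no equivalent namespace option, so the same
	// instrument is exported without any prefix (e.g. exporter_sent_spans).
	otlpExp, _ := otlpmetricgrpc.New(ctx)

	mp := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(promExp),
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(otlpExp)),
	)
	meter := mp.Meter("go.opentelemetry.io/collector/exporter/exporterhelper")

	// The proposal: put the prefix into the instrument name itself
	// ("otelcol_exporter_sent_spans") so both paths agree.
	sent, _ := meter.Int64Counter("exporter_sent_spans",
		metric.WithDescription("Number of spans successfully sent."))
	sent.Add(ctx, 1)
}
```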

@codeboten
Contributor Author

Note that there's also an OTEP with specific metrics that may require a rename later (open-telemetry/oteps#238), but adding a prefix would still be useful in the short term.

@bogdandrutu
Member

bogdandrutu commented Jan 18, 2024

My vote is NOT to do this and, if needed, to use relevant names for the meter that produces these metrics.
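Roughly, that alternative would look like this (a sketch only; the scope and instrument names are illustrative, not a settled convention):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// recordSend is a sketch only: a generic instrument name, with the producer
// identified by the meter (scope) name rather than an "otelcol" name prefix.
func recordSend(ctx context.Context) error {
	meter := otel.GetMeterProvider().Meter(
		"go.opentelemetry.io/collector/exporter/exporterhelper", // illustrative scope name
	)
	sent, err := meter.Int64Counter("exporter_sent_spans",
		metric.WithDescription("Spans successfully sent downstream."))
	if err != nil {
		return err
	}
	sent.Add(ctx, 1)
	// OTLP carries the scope name in ScopeMetrics; the Prometheus exporter can
	// surface it as an otel_scope_name label instead of a name prefix.
	return nil
}
```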

@codeboten
Contributor Author

My vote is NOT to do this and, if needed, to use relevant names for the meter that produces these metrics.

@bogdandrutu As an end-user, wouldn't this potentially cause a problem? In the sense that it would rely on my backend to interpret the meter name and apply it as a prefix if I wanted to keep using my dashboards? Or I would end up having to generate the telemetry twice, and then cut over once I'm satisfied with the level of historical data about my collector.

Would an alternative here be to strip the prefix from the Prometheus-generated metrics via a feature gate and move towards using target_info to provide a namespace? I'm not sure if this would work, maybe @dashpole has an idea here?

I guess ultimately, what I'm trying to achieve here is a recommended path forward for end-users who would like to transition from Prometheus metrics to OTLP metrics (or maybe even use both together, if there's a case for that, I don't know).

@dashpole
Contributor

dashpole commented Jan 23, 2024

Would an alternative here be to strip the prefix from the Prometheus-generated metrics via a feature gate and move towards using target_info to provide a namespace? I'm not sure if this would work, maybe @dashpole has an idea here?

The instrumentation scope is the recommended mechanism for namespacing metrics in Prometheus going forward. It will appear as a label on Prometheus metrics, and you will get an otel_scope_info metric if you add scope attributes.

But I would encourage us to try to move away from having an additional prefix only for prometheus metrics. I'm working to allow the same names to be used in otel + prometheus in the near future, at which point we should have the exact same names.

Edit: Note that I'm not arguing for or against a prefix for the metric. I'm just saying we should use the same name for Prometheus and other exporters, rather than having special naming for Prometheus.
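For example, a sketch of scope-based namespacing with the Prometheus exporter (the scope name, attribute, and exposition lines are illustrative approximations):

```go
package main

import (
	"go.opentelemetry.io/otel/attribute"
	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// newMeter builds a meter whose scope name and attributes act as the
// "namespace", instead of a name prefix added by the exporter.
func newMeter() (metric.Meter, error) {
	exp, err := otelprom.New() // no WithNamespace: no otelcol_ name prefix
	if err != nil {
		return nil, err
	}
	mp := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exp))

	return mp.Meter(
		"go.opentelemetry.io/collector", // illustrative scope name
		metric.WithInstrumentationAttributes(attribute.String("distro", "contrib")),
	), nil
}

// Roughly what the Prometheus endpoint then exposes (approximate):
//
//   exporter_sent_spans_total{otel_scope_name="go.opentelemetry.io/collector",...} 42
//   otel_scope_info{otel_scope_name="go.opentelemetry.io/collector",distro="contrib"} 1
```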

@jmacd
Contributor

jmacd commented Jan 23, 2024

I went to look for supporting evidence in the specification for metrics.
The SDK spec says that the meter name should be treated as a namespace for detecting duplicates, for example: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#meter
There's the idea that the meter name is the name of the instrumentation library, which may vary according to instrumentation setup (in my interpretation): https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/api.md#get-a-meter

The way I understand scope name, in any case, is that it is meant to distinguish alternative producers of a set of metric names. The reason I like this proposal, to have an "otelcol" prefix placed into the metric name, not the scope name, is that the OTel Collector is the one specifying the interpretation of the metrics. If I want to build my own alternate collector, let's say with a different mechanism for the obsreport functionality, I should be able to do so, and I would emit the specified metric names using a different scope name. I would not expect the consumer to interpret my alternate implementation of the standard metrics as different metrics.

The reason to support this issue, that we retain an otelcol prefix, is not only pragmatic. The most immediate source of confusion we get from not using a prefix here is that OTel SDKs require conventional metric names for all the same things that collectors do. If the collectors try to say that exporter.spans.sent means something specific, then what will the OTel SDK do for its own conventions about counting spans that were exported successfully? It's the same idea, but the meaning of SDK drops and Collector drops is sufficiently different.

That is why, in OTEP 238, I propose that we use otelcol as a prefix for Collector pipeline metrics and otelsdk as a prefix for the SDK pipeline metrics, where the interpretation of the metric names is consistent but the meaning (an SDK dropping vs a collector dropping) is different.

@bogdandrutu
Member

bogdandrutu commented Feb 5, 2024

I completely disagree with the name prefix, for the same reason we don't want a "cpu.usage" metric to be named "otelcol.cpu.usage" or "otelsdk.cpu.usage"; those are wrong names in my opinion, and we are moving backwards from an ideal goal. I still strongly believe that otelcol, in our case, should be part of the "scope/instrumentation", which signals that we have an otelcol instrumentation library that instruments this. For the same reason, if an http library (mylibrary) instruments its own metrics, we would not name them mylibrary.http.request.count but http.request.count, and the instrumentation scope would be mylibrary.
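As a sketch of that analogy (mylibrary and http.request.count are the hypothetical names from the comment above, not real semantic conventions):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

// countRequest shows the library keeping a generic metric name; the scope
// ("mylibrary") identifies the instrumentation, not a name prefix.
func countRequest(ctx context.Context) {
	meter := otel.Meter("mylibrary")
	requests, err := meter.Int64Counter("http.request.count")
	if err != nil {
		return // error handling omitted for the sketch
	}
	requests.Add(ctx, 1)
}
```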

@jmacd
Contributor

jmacd commented Feb 6, 2024

@bogdandrutu I would like you to be clear about why you think cpu.usage and http.request.count are the same as the specialized metric definitions used to characterize the behavior of an OpenTelemetry Collector.

Neither cpu.usage nor http.request.count is an existing semantic convention, so they're not great to use for this hypothetical discussion.

We specify utilization instead of CPU usage because the limit is governed as a rate, i.e., it is a function of time rather than an absolute quantity. Nevertheless, a cpu.utilization metric is too generic because it can be measured in multiple ways. If we're counting system utilization from the host's perspective, we'll use system.cpu.utilization, and if we're counting process utilization from the system's perspective, we'll use process.cpu.utilization, I suppose. The qualifier here is not a namespace; it's a descriptor saying what kind of CPU utilization you are measuring, because the generic cpu is not specific enough.

http.request.count is also not specified -- as you've written it, the .count suffix indicates an up-down-counter semantic: https://opentelemetry.io/docs/specs/semconv/general/metrics/#naming-rules-for-counters-and-updowncounters. The semantic convention we do have is http.{client,server}.request.duration, a histogram that defines the count you're referring to. Anyway, if we have a semantic convention for counting http requests, it has to be a very generic counter.

In the case of metrics designed to monitor the OpenTelemetry collector, I'm not sure a generic name like "exporter_spans_sent" is specific enough, because any piece of code that exports spans could reasonably use that metric name. I'm specifically looking for metric names that an SDK can emit (how many spans did you export?) which I can compare with metrics that a Collector can emit (how many spans did you export?). In my work on open-telemetry/oteps#238 (which is closed pending further work), I came to recommend that SDKs and Collectors not use the same name, even if the task is nearly the same ("exporting spans"), for different reasons.

Still, the name exporter_spans_sent is very generic, considering all the environments in which a Collector might run. If a vendor incorporates the OTel collector code into an agent of their own, or runs it alongside their own agents, they'll have to ensure a different metric name is used for this generic concept. So then I'll have myvendor_exporter_spans_sent and the OTC will have exporter_spans_sent? I don't think that's reasonable. If you have a very specific semantic convention, the name of the metric instrument should be specific too.

Further, you didn't respond to this point:

The way I understand scope name, in any case, is that it is meant to distinguish alternative producers of a set of metric names.

We can instrument an OTC pipeline by integrating with each of the helper libraries, connecting an OTel SDK, and generating metrics using a conventional approach. We leave the scope_name field in the data model free for use so that alternative approaches can generate the same metric names. I might, for example, imagine a single event log written by the collector describing every component invocation, including additional detail not present in the conventional instrumentation. I will be able to process that log and synthesize the same metrics that conventional instrumentation would produce, but I will need to use a different scope name, since a different instrumentation library produced the metrics.
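A sketch of that idea, with a hypothetical metric name and two hypothetical scope names:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

// Two producers emit the same (hypothetical) metric name, distinguished only
// by scope name. A consumer treats them as the same metric; the scope says
// which implementation produced it.
func emitFromBothProducers(ctx context.Context) {
	// Conventional in-process instrumentation.
	builtin := otel.Meter("go.opentelemetry.io/collector/exporter/exporterhelper")
	a, _ := builtin.Int64Counter("otelcol_exporter_sent_spans")
	a.Add(ctx, 10)

	// Alternative producer, e.g. metrics synthesized from an event log, using
	// a different scope name but the identical metric name.
	synth := otel.Meter("example.com/myvendor/eventlog-metrics")
	b, _ := synth.Int64Counter("otelcol_exporter_sent_spans")
	b.Add(ctx, 10)
}
```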

@andrzej-stencel
Member

I'm for the short term solution of adding the otelcol prefix to the metrics. If I understand correctly, the long term solution will be discussed in open-telemetry/oteps#238 / open-telemetry/oteps#249.

@codeboten
Contributor Author

As discussed on the SIG call on 6-Mar-2024, the consensus was that prefixes were better in the short term for consistency. Additionally, using _ instead of / will allow both metric outputs to be consistent.
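A sketch of what that direction could look like (the instrument and scope names are illustrative):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// The "otelcol" prefix and "_" separator live in the instrument name itself,
// so the Prometheus exporter no longer needs WithNamespace and the OTLP
// output carries the same prefixed name.
func recordUptime(ctx context.Context, seconds float64) {
	meter := otel.Meter("go.opentelemetry.io/collector/service")
	uptime, err := meter.Float64Counter("otelcol_process_uptime",
		metric.WithDescription("Uptime of the process."),
		metric.WithUnit("s"))
	if err != nil {
		return // error handling omitted for the sketch
	}
	uptime.Add(ctx, seconds)
	// Prometheus exposition (with the default counter suffix): otelcol_process_uptime_total{...}
	// OTLP: otelcol_process_uptime
}
```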
