Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Add type and unit metadata labels #39

Open
wants to merge 15 commits into
base: main
Choose a base branch
from
Open
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
154 changes: 154 additions & 0 deletions proposals/2024-09-25_metadata-labels.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,154 @@
## Type and Unit Metadata Labels

* **Owners:**
* David Ashpole [@dashpole](https://github.com/dashpole)

* **Implementation Status:** `Not implemented`

* **Related Issues and PRs:**
* https://github.com/open-telemetry/opentelemetry-specification/issues/2497

* **Other docs or links:**
* Survey Results: https://opentelemetry.io/blog/2024/prometheus-compatibility-survey/
* Slack thread: https://cloud-native.slack.com/archives/C01AUBA4PFE/p1726399373207819
* Doc with Options: https://docs.google.com/document/d/1t4ARkyOoI4lLNdKb0ixbUz7k7Mv_eCiq7sRKHAGZ9vg
* Prometheus PoC: https://github.com/prometheus/prometheus/compare/main...dashpole:prometheus:type_and_unit_labels

This document proposes adding the metric type and unit as labels on metrics.

## Why

Per [dev-summit consensus](https://docs.google.com/document/d/1uurQCi5iVufhYHGlBZ8mJMK_freDFKPG0iYBQqJ9fvA/edit#bookmark=id.q6upqm7itl24), we would like to avoid adding type and unit suffixes to metric names when translating from OpenTelemetry to Prometheus. These suffixes are currently required by the [compatibility specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/compatibility/prometheus_and_openmetrics.md). However, OpenTelemetry currently considers the metric type and unit identifying, whereas Prometheus does not. Simply removing suffixes might result in "collisions" between distinct OpenTelemetry metrics which have the same name, but different types (less commonly) or units.
dashpole marked this conversation as resolved.
Show resolved Hide resolved

### Pitfalls of the current solution

Roughly half of OpenTelemetry users surveyed preferred keeping metric names unmodified when translating to Prometheus. With our goal of being the default choice to store OpenTelemetry metrics, we should find a way to preserve metric names without introducing potential issues for users.

## Goals

Goals and use cases for the solution as proposed in [How](#how):

* [Required] OpenTelemetry users can query for the original names of their metrics.
* [Required] Users can filter by the metric type or unit in PromQL queries.
* [Required] PromQL returns a warning when querying across metrics with different units or types, or when using "inappropriate" operations for a metric type.
* [Nice to have] OpenTelemetry users can grep for the original names in the text exposition.

### Audience

This document is for Prometheus server maintainers, PromQL maintainers, and anyone interested in furthering compatibility between Prometheus and OpenTelemetry.

## Non-Goals

* Prometheus will not attempt to auto-convert between units when there is a conflict.
* Prometheus will not attempt to auto-convert between types (e.g. native histogram vs float series).
* Prometheus client libraries will not allow mixing metrics with the same name, but different type or unit in the exposition format. See potential future extensions.

## User Experience

When querying for a metric, users can filter for a type or unit by specifying a filter on the `__unit__` or `__type__` labels, which use the reserved `__` prefix to ensure they do not collide with user-provided labels.

For example:

* `my_metric{}` returns all series with any type or unit, including `__type__` and `__unit__` labels.
dashpole marked this conversation as resolved.
Show resolved Hide resolved
* `my_metric{__unit__="seconds", __type__="counter"}` returns only series with the specified type and unit.
dashpole marked this conversation as resolved.
Show resolved Hide resolved

When a query for a metric returns multiple metrics with a different `__type__` or `__unit__` label, but the same `__name__`, users see a warning in the UI.

Users don't see the `__type__` or `__unit__` labels in the Prometheus UI next to other labels, but the unit displayed next to the value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe the type next to the metric name and labels, but as a colored badge or something (cf. how it's done in the PGW, which utilizes the full exposition data model so it even has the help string, which is of course out of scope here).

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so this is UI, but what we envision in PromQL response? Should it give those labels back in JSON? If yes, in what order?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goes back to TSDB API too. I assume TSDB API provide those labels as normal labels (worth to clarify?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, the alternative explains it, should we mention that both API and PromQL contains those labels as-is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the TSDB API. Can you link to it?


Users see no difference to exposition formats.
dashpole marked this conversation as resolved.
Show resolved Hide resolved

## How

### PromQL Changes

When a query for a metric returns multiple metrics with a different `__type__` or `__unit__` label, but the same `__name__`, an info annotation will be returned with the PromQL response, which is otherwise unmodified.
dashpole marked this conversation as resolved.
Show resolved Hide resolved

### Prometheus UI Changes

When displaying a metric's labels in the table or in the graph views, the UI will hide labels starting with `__` (double underscore) by default, similar to the current handling of `__name__`. A "Show System Labels" check-box will be added, which shows hidden labels when checked.

### Prometheus Server Ingestion

When receiving OTLP or PRW 2.0, or when scraping the text, OM, or proto formats, the type and unit of the metric are added as the `__type__` and `__unit__` labels, if not already present.
ArthurSens marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For exposition formats, I guess it means this would allow:

TYPE metric counter
UNIT metric seconds
HELP ...
metric{__type__="gauge"}

Ideally I do opposite to what's written, so TYPE and UNIT wins here, or this could be blocked even by spec. Essentially I wonder about efficiency of parsers here.

For ingestion protocols, at least it feels metadata should override it (not what is written now) so it's easier for everyone.

Any further concerns around perf of parsing here for this @bboreham?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about what would be on the /federate endpoint if the Prometheus server scraped conflicting metrics from applications. Our choices are:

  1. [current] Only #UNIT metadata, and no __unit__ labels.
  2. Only __unit__ labels, and no #UNIT metadata.
  3. [proposal] Mixed __unit__ labels, and a #UNIT with one of the units the metric has.
  • 1 will result in collisions, but maybe that is acceptable for the /federate endpoint. It isn't a regression from what we have today.
  • 2 works, but would be breaking for existing users that are consuming #UNIT.
  • 3 is confusing because of the mixing, but I wanted to leave that open as an option.

I'm OK with disallowing mixing __unit__ and #UNIT, since we could relax that in the future if needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed offline, and:

  1. would epic to follow 1 and allow multiple the same named metric for different types and units in OM, let's try proposing that, it would help massively with exporters and federate
  2. type and unit from metadata is primary (in case of mix)
  3. Should we allow type and unit at all in exposition, SHOULD NOT maybe?

Copy link
Member

@ArthurSens ArthurSens Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too fond of adding __unit__ and __type__ as labels in the exposition format. I think they should stay as a single line at the top of the metric family like it is today.

The Unit/Type will be the same for all time series in a metric family. If they become labels in the exposition format, it will just be repeated information for all time series. It increases the payload, and we waste CPU time parsing it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have "Users see no difference to exposition formats." stated above, but if you put the type and especially the unit in the # UNIT part then you need them in the metric name as well, defeating the purpose. You're other option is to put them into the series on each line.

Looking at /metrics is for humans and follows what you see is what you get in terms of showing what you can query - implying that if you can query by __type__ or the API/UI returns the __type__ then you should put them on each line.

And the implication of that is that there needs to be at least one more efficient exposition format that is not so verbose for machines to read.

So I'm not sure this would work without either putting the new labels on every line or modifying OpenMetrics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course the above argument doesn't stand if we follow @beorn7 's advice and not show __type__, __unit__ on the query API (/query , /query_range) , although I would love to have it on the metadata queries endpoint like (/series) for Grafana to be able to know what's a float series and what's a native histogram.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should show __type__ and __unit__ in the query APIs. I'm just saying, in the zeroth step, we should remove those labels in PromQL operations (like the name is removed).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with showing __type__ and __unit__ in query APIs, just like we show __name__.

We have "Users see no difference to exposition formats." stated above, but if you put the type and especially the unit in the # UNIT part then you need them in the metric name as well, defeating the purpose. You're other option is to put them into the series on each line.

That's a really good catch! This is something we'll probably need to tackle with OM 2.0. Is this what you had in mind with this issue @bwplotka prometheus/OpenMetrics#280?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By "Users see no difference to exposition formats", I meant that this proposal does not change how metrics are currently exposed in exposition formats. I'll leave that for follow-up discussions in prometheus/OpenMetrics#280 and elsewhere. I updated the language in 0a545c8.

I am open to ways of querying for type and unit that align better with the existing #TYPE and #UNIT comments, if we want to preserve the "what you see is what you get" experience of Prometheus. We do support querying for __name__, so there is at least some precedence.


PRW 1.0 is omitted because metadata is sent separately from timeseries, making it infeasible to add the labels at ingestion time.

### Implementation Plan

#### Milestone 1: Feature flag for adding labels

Add a feature flag: `--enable-feature=metadata-labels`. When enabled `__type__` and `__unit__` labels are added when receiving OTLP or PRW 2.0, or when scraping the text, OM, or proto formats.
dashpole marked this conversation as resolved.
Show resolved Hide resolved

#### Milestone 2: UI and PromQL changes

During this stage, implement the UI and PromQL changes above. Iterate based on feedback. Changes should have no effect unless the metadata-labels feature flag is enabled.

#### Milestone 3: Add NoNameChanges option for OTLP translation

Add an option, `NoNameChanges` for the OTLP translation strategy. When enabled, it disables UTF-8 sanitization and the addition of suffixes. In the documentation for this option, recommend that `--enable-feature=metadata-labels` is enabled.

## Alternatives

### “Real” Type and Unit suffixes in **name**
dashpole marked this conversation as resolved.
Show resolved Hide resolved
dashpole marked this conversation as resolved.
Show resolved Hide resolved

Similar to suffixes of _<unit> and _<type/total> but make it an explicit suffix using a delimiter not currently permitted in metric names. Specifying suffixes is optional when querying for a metric. When the type or unit suffix is omitted from a query, it would (design TBD) return results which include any type or unit suffix which exists for that name.

NOTE: Dot for units and slash for type is just one example, there might be better operators/characters to use.
dashpole marked this conversation as resolved.
Show resolved Hide resolved

Writing queries that include the type and unit would be recommended as a best-practice by the community.

For example:

* Querying for foo/histogram would return results that include both foo.seconds/histogram and foo.milliseconds/histogram.
* Querying for foo.seconds would return results that include both foo.seconds/histogram and foo.seconds/counter.
* Querying for http_server_duration would return results that include both foo.seconds/histogram and foo.milliseconds/counter.
* Querying for an OpenTelemetry metric, such as http.server.duration, with suffixes would require querying for ”http.server.duration”.seconds/histogram. Note that suffixes are outside of quotes.

This solution is not chosen because:

* Requires PromQL changes (intrusive), touches on “dot” operator ideas.
dashpole marked this conversation as resolved.
Show resolved Hide resolved
* Adding suffixes outside of quotes looks strange: {“http.server.duration”.seconds/histogram}
dashpole marked this conversation as resolved.
Show resolved Hide resolved
* Rolling this out would be breaking for existing Prometheus users: E.g. {foo_seconds} becomes {foo.seconds/histogram}. Could this be part of OM 2.0?
dashpole marked this conversation as resolved.
Show resolved Hide resolved
* Mitigation: users just stay with {foo_seconds.seconds/histogram}
* Users might be surprised by, or dislike the additional suffixes and delimiters in the metric name results
* Mitigation: Opt-in for query engines?

### "Hide" __type__ and __unit__ labels in PromQL, instead of UI
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how safe this is. If PromQL does not hide the return, new labels will suddenly appear in users' queries.

I'm also not sure about the potential impact other than dashboards/alerts having new labels, could that be a problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm definitely open to this alternative. I like having the option of showing __type__ and __unit__ in the UI, but if it has to wait for Prom 4.0 because it is breaking, that seems like a significant downside.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it could be also merge of both, where PromQL only returns those labels based on something e.g. on conflict or when type or unit is selectors.

The downside is that we now have to really deeply design groupings and joins.. should they use those labels or not. With the current proposal the answer is trivial and straightforward. This helps with creating alerts and recording rules etc

I think I would be keen to try the current approach e.g. behind feature flag and see who will use it and how to tell more. cc @juliusv @roidelapluie WDYT from UX point of view for non-UI things?


Existing UIs don't handle the `__type__` and `__unit__` labels. To mitigate this, PromQL could omit the `__type__` and `__unit__` labels from the query response. Doing this would avoid requiring UIs to update to handle the new labels.

This solution is not chosen because:

* It deviates from the current handling of the `__name__` label.
* We expect metadata, like type and unit to be useful to display in the UI, and want to enable these use-cases.
* It should be a small amount of effort to hide these labels.

## Potential Future Extensions

### __type__ and __unit__ from client libraries

One obvious extension of this proposal would be for Prometheus clients to start sending `__type__` and `__unit__` labels with the exposition format. This would:

* Allow mixing metrics with the same name, but different types and units in the same endpoint (an explicit non-goal of this proposal).

This is excluded from this proposal because:

* OpenTelemetry users can use [Views](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view) to resolve collisions. That should be "good enough".
* Doing this would require some changes to client libraries, which is significantly more work, and is harder to experiment with.
* There can be conflicts between `# TYPE` and `# UNIT` metadata for a metric, and `__type__` and `__unit__` labels. This adds complexity to understanding the exposition format, and requires establishing rules for dealing with conflicts.
ArthurSens marked this conversation as resolved.
Show resolved Hide resolved

### More metadata labels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, I like the idea to model OTel's notion of "identifying metadata" as labels in Prometheus (to be distinguished from "non-identifying metadata", which is closer to Prometheus's original metadata idea).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately OTel resource attributes are all considered identifying today (correct me if I'm wrong David).

There is work in progress to replace resource attributes with a new thing called Entity. Entity will be able to distinguish, using otel lingo, Identifying from Descriptive attributes, but this will take some years to get popular in the industry

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OTel resource attributes are all considered identifying today, but we don't treat them all as identifying when translating from OTel -> Prometheus. Instead, we just treat service.name + service.instance.id + service.namespace as identifying. That is correct if the OTLP originates from an OTel-instrumented application, but not if it wasn't (e.g. comes from an otel collector receiver).

Anyways, what I had in mind was more things that users don't think of as "labels", like schema url, or maybe scope name + version. I think resource attributes make sense as "real" labels, rather than metadata.


OpenTelemetry has lots of other metadata that may be a good fit for this "metadata label" pattern. For example, OpenTelemetry's scope name and version, or the schema URL are technically identifying for a time series, but are unlikely to be something that we want to display prominently in the UI.

If the pattern of adding `__type__` and `__unit__` works well for this metadata, we could consider making the pattern more generic:

* UIs should hide _all_ labels starting with `__` by default, not just `__name__`, `__type__`, and `__unit__`.
* Introduce a mechanism to allow OTel libraries to provide additional metadata labels. However, this has the potential to introduce collisions, since `__` has been reserved thus far. Maybe a specific allowlist (e.g. `__otel`-prefixed labels) could work.

## Action Plan

* [ ] Feature flag for adding labels
* [ ] UI and PromQL changes
* [ ] Add NoNameChanges option for OTLP translation
Loading