Add metric advice to capture only a specific set of values for a given attribute, and normalize the rest to a value which represents "other" #3545

trask · 2023-06-08T04:15:47Z

This would be needed if we go with Limit HTTP request method cardinality: use one attribute (option C/E) as the solution for #3470.

The metric advice could take something like the following arguments:

attribute name (string)
attribute values to preserve (array of <value-type>)
attribute value representing "other" (<value-type>)

Open questions:

Should we have a dedicated "other" value?

For string attributes, the "other" value could be something obscure like __OTEL_CAPPED.

A downside to this is that it will not render as nicely as something like GET/POST/OTHER, but it would allow backends to reliably detect when an attribute value was capped (we could use something less obscure like OTHER but that seems like it could occur in instrumentation's natural habitat).

For numeric attributes, I'm not sure what could make sense as a dedicated "other" value across all use cases. One known numeric attribute which could possibly benefit from this feature is http.response.status_code.

It may be ok to have a dedicated "other" value only for string attributes.

How to address the correlation problem

The main argument against Limit HTTP request method cardinality: use one attribute (option C/E) is that you can't correlate http.request.method between metrics and spans for values that were capped (only) on the metric side.

But if you know the values that were preserved, and you know the "other" value, then you could still do this correlation (even if maybe less efficiently):

metrics with the "other" value correlate to spans with any value other than the list of preserved values

Potentially a backend could figure out the list of preserved values (at least those which would have correlated with any spans), by looking at the complete set of http.request.method values that were not capped in the given time period (and with given resource and instrumentation scope name, since the set of preserved values could vary across resources and instrumentation libraries).

But this feels very complex and brittle, so I'm not sure it's a great answer to the "correlation problem".

The text was updated successfully, but these errors were encountered:

reyang · 2023-06-08T05:04:56Z

The main argument against Limit HTTP request method cardinality: use one attribute (option C/E) is that you can't correlate http.request.method between metrics and spans for values that were capped (only) on the metric side.

I actually think the opposite - you can correlate things in an even better way. Here are two examples:

Example 1

Correlation with exemplars - exemplars will give the exact value that we've originally observed, rather than something like "OTHER".

During the time range (T₀, T₁]:
- method = GET, status = 200, duration = 50 (ms)
- method = GET, status = 200, duration = 100 (ms)
- method = Foo, status = 500, duration = 100 (ms)
- method = Bar, status = 500, duration = 100 (ms)

We could get metrics saying:

(T₀, T₁]
- attributes: {method = GET, status = 200}, count: 2, min: 50 (ms), max: 100 (ms)
- attributes: {verb = __OTEL_CAPPED, status = 500}, count: 2, min: 100 (ms), max: 100 (ms)
  - exemplars:
    - method = Bar, status = 500, duration = 100 (ms)

Now one could go to their logging system and find all the Bar entries.

Example 2

Correlation with logs and spans - __OTEL_CAPPED is a very specific signal, and the follow up would be:

query the metrics system to get all the method values for the given time range.
take the list of values from 1), remove the special __OTEL_CAPPED entry.
query the logging/tracing system "show me all the logs/spans (within the given time range) that have method value which doesn't belong to the 2) list".

All these steps can be automated and eventually baked into the systems, and the semantic/intention is very clear. I think the key here is that we don't collapse logs/spans to artificially "align/correlate" with metrics - e.g. by having "OTHER" in all logs/traces/metrics. If we have strong reason to collapse the data (e.g. dimension capping for metrics to avoid cardinality explosion) - we should certainly do it, but not in a way that we force logs and traces to "align" by collapsing their data (I think once the data collapsed, the information loss is hard to be restored), instead, we should find better way to correlate them while maintaining low information loss.

trask · 2023-06-08T14:53:32Z

Example 2

I think this is the same as what I tried to explain above ("this feels very complex and brittle").

I think the one complexity left out in your example is that different resources could have different lists of preserved values (not saying this is a great practice), which I think makes that approach a good bit more difficult.

To be clear, I think a valid decision here would be to say that the benefit of simplicity (single attribute) outweighs this use case of generically and easily correlating capped metric attributes with span attributes, given that there are still other options for this kind of correlation:

exemplars
simple-ish non-generic correlation is possible when you have knowledge of the preserved values in your own system
potentially complex generic correlation could still be written

jsuereth · 2023-06-08T16:58:10Z

So a few points:

Exemplars are not guaranteed to cover important o11y scenarios. These are sampled and limited values. They are used to debug a specific instance of a problem. They are very poor at understanding aggregate problems.
@trask mentions this, but having different attribute values selected across processes really hurts global observability. Not only would we be hurting x-signal journeys, we'd be hurting spatial-aggregation of metrics themselves. This can wreak havoc on trying to define stable definitions for alerts. Think about trying to create an alert on a metric that looks for non-standard attributes across various HTTP microservice versions. How would that work?
If we were to go with this route, you will need to also provide this functionality for Logs => Metrics + Span=>Metric features.

The main benefit I see here, is if users really want high cardinality metrics, they could override the "hint" API with a view preserving the raw values. That would need to be a company-wide decision given point #2 above.

I still prefer Option A, but wanted to add my thoughts on this one.

reyang · 2023-06-08T17:06:46Z

To be clear, I think a valid decision here would be to say that the benefit of simplicity (single attribute) outweighs this use case of generically and easily correlating capped metric attributes with span attributes, given that there are still other options for this kind of correlation:

More importantly, this pattern would apply to all semantic conventions, rather than having to deal with specific domains (e.g. Now it is HTTP method vs. "strict" canonical HTTP method (restricted to 9 values) vs. "loose" canonical HTTP method (restricted to 37 values)), later it will be SQL command vs. canonical SQL command. And as HTTP/SQL/etc. evolve, we'll keep inventing new field names and end up with a category of confusing names for anyone who doesn't understand the full history).

lmolkova · 2023-06-08T17:16:00Z

ather than having to deal with specific domains (e.g. Now it is HTTP method vs. "strict" canonical HTTP method (restricted to 9 values) vs. "loose" canonical HTTP method (restricted to 37 values)), later it will be SQL command vs. canonical SQL command

this is irrelevant - we need a default set and configurability with both options. They are only different in configuration mechanism (where Option C/E is superior).

trask added the spec:metrics Related to the specification/metrics directory label Jun 8, 2023

github-actions bot assigned arminru Jun 8, 2023

trask mentioned this issue Jun 8, 2023

Limit HTTP request method cardinality: use one attribute (option C/E) open-telemetry/semantic-conventions#35

Closed

reyang assigned reyang and unassigned arminru Jun 8, 2023

lmolkova mentioned this issue Jun 8, 2023

Add experimental metric attributes advice API #3546

Merged

trask mentioned this issue Jun 13, 2023

Limit HTTP request method cardinality: original_method (option A) open-telemetry/semantic-conventions#17

Merged

This was referenced Jun 15, 2023

Document general attribute normalization conventions open-telemetry/semantic-conventions#117

Open

Advice/hint APIs should be applied to all signals (when applicable) #3557

Open

austinlparker unassigned reyang May 14, 2024

austinlparker added the triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted label May 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add metric advice to capture only a specific set of values for a given attribute, and normalize the rest to a value which represents "other" #3545

Add metric advice to capture only a specific set of values for a given attribute, and normalize the rest to a value which represents "other" #3545

trask commented Jun 8, 2023

reyang commented Jun 8, 2023 •

edited

Loading

trask commented Jun 8, 2023

Example 2

jsuereth commented Jun 8, 2023

reyang commented Jun 8, 2023

lmolkova commented Jun 8, 2023 •

edited

Loading

Add metric advice to capture only a specific set of values for a given attribute, and normalize the rest to a value which represents "other" #3545

Add metric advice to capture only a specific set of values for a given attribute, and normalize the rest to a value which represents "other" #3545

Comments

trask commented Jun 8, 2023

Open questions:

Should we have a dedicated "other" value?

How to address the correlation problem

reyang commented Jun 8, 2023 • edited Loading

Example 1

Example 2

trask commented Jun 8, 2023

Example 2

jsuereth commented Jun 8, 2023

reyang commented Jun 8, 2023

lmolkova commented Jun 8, 2023 • edited Loading

reyang commented Jun 8, 2023 •

edited

Loading

lmolkova commented Jun 8, 2023 •

edited

Loading