Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metric advice to capture only a specific set of values for a given attribute, and normalize the rest to a value which represents "other" #3545

Open
trask opened this issue Jun 8, 2023 · 5 comments
Labels
spec:metrics Related to the specification/metrics directory triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted

Comments

@trask
Copy link
Member

trask commented Jun 8, 2023

This would be needed if we go with Limit HTTP request method cardinality: use one attribute (option C/E) as the solution for #3470.

The metric advice could take something like the following arguments:

  • attribute name (string)
  • attribute values to preserve (array of <value-type>)
  • attribute value representing "other" (<value-type>)

Open questions:

Should we have a dedicated "other" value?

For string attributes, the "other" value could be something obscure like __OTEL_CAPPED.

A downside to this is that it will not render as nicely as something like GET/POST/OTHER, but it would allow backends to reliably detect when an attribute value was capped (we could use something less obscure like OTHER but that seems like it could occur in instrumentation's natural habitat).

For numeric attributes, I'm not sure what could make sense as a dedicated "other" value across all use cases. One known numeric attribute which could possibly benefit from this feature is http.response.status_code.

It may be ok to have a dedicated "other" value only for string attributes.

How to address the correlation problem

The main argument against Limit HTTP request method cardinality: use one attribute (option C/E) is that you can't correlate http.request.method between metrics and spans for values that were capped (only) on the metric side.

But if you know the values that were preserved, and you know the "other" value, then you could still do this correlation (even if maybe less efficiently):

metrics with the "other" value correlate to spans with any value other than the list of preserved values

Potentially a backend could figure out the list of preserved values (at least those which would have correlated with any spans), by looking at the complete set of http.request.method values that were not capped in the given time period (and with given resource and instrumentation scope name, since the set of preserved values could vary across resources and instrumentation libraries).

But this feels very complex and brittle, so I'm not sure it's a great answer to the "correlation problem".

@reyang
Copy link
Member

reyang commented Jun 8, 2023

The main argument against Limit HTTP request method cardinality: use one attribute (option C/E) is that you can't correlate http.request.method between metrics and spans for values that were capped (only) on the metric side.

I actually think the opposite - you can correlate things in an even better way. Here are two examples:

Example 1

Correlation with exemplars - exemplars will give the exact value that we've originally observed, rather than something like "OTHER".

  • During the time range (T0, T1]:
    • method = GET, status = 200, duration = 50 (ms)
    • method = GET, status = 200, duration = 100 (ms)
    • method = Foo, status = 500, duration = 100 (ms)
    • method = Bar, status = 500, duration = 100 (ms)

We could get metrics saying:

  • (T0, T1]
    • attributes: {method = GET, status = 200}, count: 2, min: 50 (ms), max: 100 (ms)
    • attributes: {verb = __OTEL_CAPPED, status = 500}, count: 2, min: 100 (ms), max: 100 (ms)
      • exemplars:
        • method = Bar, status = 500, duration = 100 (ms)

Now one could go to their logging system and find all the Bar entries.

Example 2

Correlation with logs and spans - __OTEL_CAPPED is a very specific signal, and the follow up would be:

  1. query the metrics system to get all the method values for the given time range.
  2. take the list of values from 1), remove the special __OTEL_CAPPED entry.
  3. query the logging/tracing system "show me all the logs/spans (within the given time range) that have method value which doesn't belong to the 2) list".

All these steps can be automated and eventually baked into the systems, and the semantic/intention is very clear. I think the key here is that we don't collapse logs/spans to artificially "align/correlate" with metrics - e.g. by having "OTHER" in all logs/traces/metrics. If we have strong reason to collapse the data (e.g. dimension capping for metrics to avoid cardinality explosion) - we should certainly do it, but not in a way that we force logs and traces to "align" by collapsing their data (I think once the data collapsed, the information loss is hard to be restored), instead, we should find better way to correlate them while maintaining low information loss.

@reyang reyang assigned reyang and unassigned arminru Jun 8, 2023
@trask trask changed the title Add metric advice to capture only a specific set of values for a given attribute, and bucket the rest under a value which represents "other" Add metric advice to capture only a specific set of values for a given attribute, and normalize the rest to a value which represents "other" Jun 8, 2023
@trask
Copy link
Member Author

trask commented Jun 8, 2023

Example 2

I think this is the same as what I tried to explain above ("this feels very complex and brittle").

I think the one complexity left out in your example is that different resources could have different lists of preserved values (not saying this is a great practice), which I think makes that approach a good bit more difficult.

To be clear, I think a valid decision here would be to say that the benefit of simplicity (single attribute) outweighs this use case of generically and easily correlating capped metric attributes with span attributes, given that there are still other options for this kind of correlation:

  • exemplars
  • simple-ish non-generic correlation is possible when you have knowledge of the preserved values in your own system
  • potentially complex generic correlation could still be written

@jsuereth
Copy link
Contributor

jsuereth commented Jun 8, 2023

So a few points:

  1. Exemplars are not guaranteed to cover important o11y scenarios. These are sampled and limited values. They are used to debug a specific instance of a problem. They are very poor at understanding aggregate problems.
  2. @trask mentions this, but having different attribute values selected across processes really hurts global observability. Not only would we be hurting x-signal journeys, we'd be hurting spatial-aggregation of metrics themselves. This can wreak havoc on trying to define stable definitions for alerts. Think about trying to create an alert on a metric that looks for non-standard attributes across various HTTP microservice versions. How would that work?
  3. If we were to go with this route, you will need to also provide this functionality for Logs => Metrics + Span=>Metric features.

The main benefit I see here, is if users really want high cardinality metrics, they could override the "hint" API with a view preserving the raw values. That would need to be a company-wide decision given point #2 above.

I still prefer Option A, but wanted to add my thoughts on this one.

@reyang
Copy link
Member

reyang commented Jun 8, 2023

To be clear, I think a valid decision here would be to say that the benefit of simplicity (single attribute) outweighs this use case of generically and easily correlating capped metric attributes with span attributes, given that there are still other options for this kind of correlation:

More importantly, this pattern would apply to all semantic conventions, rather than having to deal with specific domains (e.g. Now it is HTTP method vs. "strict" canonical HTTP method (restricted to 9 values) vs. "loose" canonical HTTP method (restricted to 37 values)), later it will be SQL command vs. canonical SQL command. And as HTTP/SQL/etc. evolve, we'll keep inventing new field names and end up with a category of confusing names for anyone who doesn't understand the full history).

@lmolkova
Copy link
Contributor

lmolkova commented Jun 8, 2023

ather than having to deal with specific domains (e.g. Now it is HTTP method vs. "strict" canonical HTTP method (restricted to 9 values) vs. "loose" canonical HTTP method (restricted to 37 values)), later it will be SQL command vs. canonical SQL command

this is irrelevant - we need a default set and configurability with both options. They are only different in configuration mechanism (where Option C/E is superior).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:metrics Related to the specification/metrics directory triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted
Projects
None yet
Development

No branches or pull requests

6 participants