Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarification needed on how values for attributes filtered by a view are used #2905

Closed
MrAlias opened this issue Oct 27, 2022 · 25 comments
Closed
Assignees
Labels
question Question for discussion spec:metrics Related to the specification/metrics directory

Comments

@MrAlias
Copy link
Contributor

MrAlias commented Oct 27, 2022

Currently the specification states:

The SDK MUST provide the means to register Views with a MeterProvider. Here are the inputs:
...

  • The configuration for the resulting metrics stream:
    ...
    • A list of attribute keys (optional). If provided, the attributes that are not in the list will be ignored. If not provided, all the attribute keys will be used by default (TODO: once the Hint API is available, the default behavior should respect the Hint if it is available).

For asynchronous counter instruments (counter and up-down-counter), how should their sums be reported when this "filter" is defined?

For example, if there is an asynchronous counter that observes 1 for the attribute set version=1 and then, in the same callback, 2 for the attribute set version=2 and a list of attribute keys equaling {"foo"} is used in a view, both attribute sets (version=1 and version=2) should be ignored according to the specification. Therefore, they both become observations of the empty attribute set. How should their sums be combined? Should the SDK report 2, the last recorded value for the empty attribute set? Or, should it report 3 the combination?

I had assumed the latter would be expected, but when looking at the python implementation they use the former. This is based on the limit_num_of_attrs.py example:

import random
import time
from typing import Iterable

from opentelemetry.metrics import (
    CallbackOptions,
    Observation,
    get_meter_provider,
    set_meter_provider,
)
from opentelemetry.sdk.metrics import MeterProvider, ObservableGauge
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.metrics.view import View

view_with_attributes_limit = View(
    instrument_type=ObservableGauge,
    instrument_name="observable_gauge",
    attribute_keys={"foo"},
)

exporter = ConsoleMetricExporter()

reader = PeriodicExportingMetricReader(exporter, export_interval_millis=1_000)
provider = MeterProvider(
    metric_readers=[
        reader,
    ],
    views=[
        view_with_attributes_limit,
    ],
)
set_meter_provider(provider)

meter = get_meter_provider().get_meter("reduce-cardinality-with-view", "0.1.2")


def observable_gauge_func(options: CallbackOptions) -> Iterable[Observation]:
    yield Observation(1, {"version": 1})
    yield Observation(2, {"version": 2})


# Async gauge
observable_gauge = meter.create_observable_gauge(
    "observable_gauge",
    [observable_gauge_func],
)

while 1:
    time.sleep(1)

(caveat, my python understanding is limited so I could have missed something here)

When running this:

$ python limit_num_of_attrs.py
{"resource_metrics": [{"resource": {"attributes": {"telemetry.sdk.language": "python", "telemetry.sdk.name": "opentelemetry", "telemetry.sdk.version": "1.12.0", "service.name": "unknown_service"}, "schema_url": ""}, "scope_metrics": [{"scope": {"name": "reduce-cardinality-with-view", "version": "0.1.2", "schema_url": ""}, "metrics": [{"name": "observable_gauge", "description": "", "unit": "", "data": {"data_points": [{"attributes": {}, "start_time_unix_nano": 0, "time_unix_nano": 1666898388887952217, "value": 2}]}}], "schema_url": ""}], "schema_url": ""}]}
{"resource_metrics": [{"resource": {"attributes": {"telemetry.sdk.language": "python", "telemetry.sdk.name": "opentelemetry", "telemetry.sdk.version": "1.12.0", "service.name": "unknown_service"}, "schema_url": ""}, "scope_metrics": [{"scope": {"name": "reduce-cardinality-with-view", "version": "0.1.2", "schema_url": ""}, "metrics": [{"name": "observable_gauge", "description": "", "unit": "", "data": {"data_points": [{"attributes": {}, "start_time_unix_nano": 0, "time_unix_nano": 1666898389956287449, "value": 2}]}}], "schema_url": ""}], "schema_url": ""}]}
#...

The value reported is 2.

cc @jmacd @bogdandrutu @reyang @open-telemetry/python-approvers

@MrAlias MrAlias added question Question for discussion spec:metrics Related to the specification/metrics directory labels Oct 27, 2022
@MrAlias
Copy link
Contributor Author

MrAlias commented Oct 27, 2022

I would be interested to hear how Java, .NET, and other stable implementation behave here. Unfortunately, my ability to write examples there are non-existent.

@reyang
Copy link
Member

reyang commented Oct 27, 2022

The Python implementation sounds buggy to me.

For additive (e.g. Counter, UpDownCounter, Asynchronous Counter, Asynchronous UpDownCounter) values, removing a spatial dimension should result in sum.

@MrAlias
Copy link
Contributor Author

MrAlias commented Oct 27, 2022

@reyang can you comment on how .NET handles this? Does that project add them together?

@cijothomas
Copy link
Member

#1874 Old issue discussing this.

I can confirm that .NET does not handle this correctly, and has the same bug as python. (I'll create an issue to track)

@cijothomas
Copy link
Member

cijothomas commented Oct 27, 2022

#1874 Old issue discussing this.

I can confirm that .NET does not handle this correctly, and has the same bug as python. (I'll create an issue to track)

https://github.com/open-telemetry/opentelemetry-dotnet/blob/73f8d3cb0160f633661854f41866dcdb70b81069/test/OpenTelemetry.Tests/Metrics/MetricAPITest.cs#L800-L803 UnitTest showing the same bug in .NET. (I looked at it at that time and it was not trivial to fix, so just added test to come back to this later.)

@MrAlias
Copy link
Contributor Author

MrAlias commented Oct 27, 2022

#1874 Old issue discussing this.

I can confirm that .NET does not handle this correctly, and has the same bug as python. (I'll create an issue to track)

Gotcha, thanks for confirming 🙏

@MrAlias
Copy link
Contributor Author

MrAlias commented Oct 28, 2022

@jack-berg can you confirm the behavior of async counters in Java in the presence of an "attribute reducing" view? I'm wondering if there is any implementation that sums these values.

@jmacd
Copy link
Contributor

jmacd commented Oct 28, 2022

This is a good question. The Lightstep metrics SDK and the v0.31 and earlier OTel-Go metrics SDK did sum these instruments. As a case where this matters, potentially, is the Golang runtime/metrics package, which outputs cumulative Counter and UpDownCounter values in a form that is natural for use with asynchronous instruments.

For example, runtime/metrics outputs three metrics, one is a total of the other two:

/gc/cycles/automatic:gc-cycles
	Count of completed GC cycles generated by the Go runtime.

/gc/cycles/forced:gc-cycles
	Count of completed GC cycles forced by the application.

/gc/cycles/total:gc-cycles
	Count of all completed GC cycles.

In the instrumentation package I wrote (forked from the contrib repo), https://github.com/lightstep/otel-launcher-go/tree/main/lightstep/instrumentation/runtime, the pattern is used to discard the total, since the SDK or a downstream consumer can easily recompute the total using attribute removal and the natural merge function for the data point. So in the example above, a cumulative GC count for class=forced and class=automatic will be generated. If you remove class, you get the total back. There are three groups of metrics in the Go-1.20 runtime/metrics package that follow this pattern. For Go-1.20, we get hierarchical CPU timing measurements, so you could filter in more than one meaningful way.

@reyang
Copy link
Member

reyang commented Oct 28, 2022

A good way to think about this problem: whether the dimension is removed from the SDK (e.g. by configuring View), or it is removed by the Collector (by reaggregation), or it is removed at the backend/comsumption (via PromQL), the results should be the same.

@MrAlias
Copy link
Contributor Author

MrAlias commented Oct 28, 2022

For additive (e.g. Counter, UpDownCounter, Asynchronous Counter, Asynchronous UpDownCounter) values, removing a spatial dimension should result in sum.

@reyang should the Asynchronous Counter, Asynchronous UpDownCounter be added to the list of instrument at the end of that section that are treated as additive?

Also, should the API specification be updated to normatively require these instrument types be treated as additive?

@lalitb
Copy link
Member

lalitb commented Oct 28, 2022

otel-cpp implementation does the sum for the sync counters matching the "attribute reducing" view, but selects the last value for the async counter. Need to be fixed, will create a issue to track this.

@reyang
Copy link
Member

reyang commented Oct 28, 2022

For additive (e.g. Counter, UpDownCounter, Asynchronous Counter, Asynchronous UpDownCounter) values, removing a spatial dimension should result in sum.

@reyang should the Asynchronous Counter, Asynchronous UpDownCounter be added to the list of instrument at the end of that section that are treated as additive?

Which list?

Also, should the API specification be updated to normatively require these instrument types be treated as additive?

I think no. Additive property applies to Sums, and is covered by the SDK specification https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#default-aggregation. It is perfectly fine for the SDK user to configure an UpDownCounter as something different (e.g. gauge) and not apply additive property.

@MrAlias
Copy link
Contributor Author

MrAlias commented Oct 28, 2022

For additive (e.g. Counter, UpDownCounter, Asynchronous Counter, Asynchronous UpDownCounter) values, removing a spatial dimension should result in sum.

@reyang should the Asynchronous Counter, Asynchronous UpDownCounter be added to the list of instrument at the end of that section that are treated as additive?

Which list?

The list included in the sentence at the end of the linked section:

In OpenTelemetry, each Instrument implies whether it is additive or not. For example, Counter and UpDownCounter are additive while Asynchronous Gauge is non-additive.

@MrAlias
Copy link
Contributor Author

MrAlias commented Oct 28, 2022

Also, should the API specification be updated to normatively require these instrument types be treated as additive?

I think no. Additive property applies to Sums, and is covered by the SDK specification https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#default-aggregation.

Where is it covered? I don't see anywhere in that linked section defining sums as additive.

It also seems like the term additive is applied to instruments, not aggregations. This would make sense as that is where attributes are also defined. i.e.

each Instrument implies whether it is additive or not

@reyang
Copy link
Member

reyang commented Oct 28, 2022

The list included in the sentence at the end of the linked section:

In OpenTelemetry, each Instrument implies whether it is additive or not. For example, Counter and UpDownCounter are additive while Asynchronous Gauge is non-additive.

I think yes it can, not necessarily as it is only providing examples (without promising that it will be a complete list).

@reyang
Copy link
Member

reyang commented Oct 28, 2022

Where is it covered? I don't see anywhere in that linked section defining sums as additive.

It also seems like the term additive is applied to instruments, not aggregations. This would make sense as that is where attributes are also defined.

Yeah, I can see the challenge here.

The data model specification covered "additive" in the temporality section, and temporality is referenced by the sums section. I wrote the supplementary guideline after discussing with @jmacd, hoping it will provide some help for folks who struggle to understand the data model and API specification. Maybe we should improve the data model spec to bring more clarity by making it more streamlined.

@MrAlias

This comment was marked as outdated.

@MrAlias
Copy link
Contributor Author

MrAlias commented Oct 28, 2022

Where is it covered? I don't see anywhere in that linked section defining sums as additive.
It also seems like the term additive is applied to instruments, not aggregations. This would make sense as that is where attributes are also defined.

Yeah, I can see the challenge here.

The data model specification covered "additive" in the temporality section, and temporality is referenced by the sums section. I wrote the supplementary guideline after discussing with @jmacd, hoping it will provide some help for folks who struggle to understand the data model and API specification. Maybe we should improve the data model spec to bring more clarity by making it more streamlined.

Gotcha. Any help clarifying this in the specification would be appreciated 🙏

@reyang
Copy link
Member

reyang commented Oct 28, 2022

Just to get a sanity check: if we don't currently have normative requirements for this in the specification and stable releases of implementations don't do this, does OTel really require this behavior? Is it just considered undefined currently?

Can we add it in a backward-compatible way? Will having implementations change their behavior be backwards-compatible?

I think the spec does require it, although it is not expressed in a normative way.
Here is my understanding of the complex mental network:

  1. One has to first read the API specification to understand Asynchronous Counter.
  2. Then go to the SDK specification to understand that the default aggregation is Sum.
  3. Then go to the Data Model specification to understand temporality and realize that Sum means additive.
  4. Or by chance discovered the supplementary guideline and realized how additive property should be handled after reading the examples.

Given all these steps, 90% chance folks will run into a trap unless they asked someone who has been deeply engaged in all the three specs...

@jack-berg
Copy link
Member

Java doesn't do this correctly either. Opened open-telemetry/opentelemetry-java#4901 to track.

@MadVikingGod
Copy link
Contributor

I want to try and clarify the workings of the async instruments and their default aggregation.

In the case that was raised here, there is a filter and two observes will record to the same filtered counter, we expect the result is a sum of both observations.

What about other similar cases:

  1. If there is no filter, and two observers happen in separate callbacks?
  2. If there is no filter and two observers happen in the same callback?
  3. If there is a filter, and the same attribute sets are used? E.g. case 1, but there is a filter.
  4. What happens if there is a filter and the view says this should be a gauge?

Would it be a good idea to put other a set of edge cases, and the expected output?

@reyang
Copy link
Member

reyang commented Nov 1, 2022

the workings of the async instruments and their default aggregation.

In the case that was raised here, there is a filter and two observes will record to the same filtered counter, we expect the result is a sum of both observations.

What about other similar cases:

  1. If there is no filter, and two observers happen in separate callbacks?
  2. If there is no filter and two observers happen in the same callback?
  3. If there is a filter, and the same attribute sets are used? E.g. case 1, but there is a filter.
  4. What happens if there is a filter and the view says this should be a gauge?

@MadVikingGod could you give some concrete examples? I found that the cases you've described can be interpreted differently. Ideally something like this would help to provide clarity https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md#asynchronous-example.

If folks feel this area needs more examples to facilitate the understanding, I can add a section under https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md#additive-property to explain the spatial aggregation.

@MadVikingGod
Copy link
Contributor

That is the gist of what I was asking about, just more tightly focused on edge cases we have historically found to be non-obvious. Like the behavior in this issue.

If we were to ask "What should be the result in these 4 (+ the original) cases?" I think we would see different interpretations. While I think the supplementary guidelines are useful, they are verbose enough that they can only cover a small set of behaviors. I was thinking more along the lines of a document that easily be translated into tests to help with the validation of these interactions of features. This issue might look like this: https://gist.github.com/MadVikingGod/e3e65d6217d321d5de8ded159f4abc7e

@reyang
Copy link
Member

reyang commented Nov 3, 2022

That is the gist of what I was asking about, just more tightly focused on edge cases we have historically found to be non-obvious. Like the behavior in this issue.

If we were to ask "What should be the result in these 4 (+ the original) cases?" I think we would see different interpretations. While I think the supplementary guidelines are useful, they are verbose enough that they can only cover a small set of behaviors. I was thinking more along the lines of a document that easily be translated into tests to help with the validation of these interactions of features. This issue might look like this: https://gist.github.com/MadVikingGod/e3e65d6217d321d5de8ded159f4abc7e

The gist reminded of me https://github.com/w3c/trace-context/tree/main/test, which is a big commitment to introduce and maintain. I feel it is probably a separate topic on its own.

@tsloughter
Copy link
Member

Yes, I've wanted something like the tracecontext interop tests for tracing and metrics so bad. I even started on it at one point... Probably need a SIG for it :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Question for discussion spec:metrics Related to the specification/metrics directory
Projects
None yet
Development

No branches or pull requests

9 participants