
Metrics: Make StartTimeUnixNanos required for aggregation_temporality=CUMULATIVE #292

Closed
jmacd opened this issue Dec 4, 2020 · 5 comments · Fixed by #295
Labels
area:data-model release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:metrics

Comments

@jmacd
Contributor

jmacd commented Dec 4, 2020

What

All data points in the OTLP Metrics protocol include a StartTimeUnixNanos field. The question of whether it is required can be asked three ways: twice for points that have CUMULATIVE or DELTA aggregation temporality, and once for points that do not have aggregation temporality.

Points with CUMULATIVE aggregation temporality

Cumulative Sum data points MUST set StartTimeUnixNanos.

This is an important question because traditional metrics data formats, including Prometheus Remote Write, are specified as Cumulative/Monotonic and do not include a start time. When calculating a rate, Prometheus consumers are expected to apply a heuristic that assumes either that the monotonic value always decreases when a process restarts or that the client always emits a zero when resetting a cumulative series. The intention behind StartTimeUnixNanos for CUMULATIVE series, in particular, is to allow rate calculation without heuristics.

When we have input data that is notionally a Cumulative Sum but we do not know the start time, how should this point be treated? Making StartTimeUnixNanos required for CUMULATIVE points avoids a heuristic rate calculation: the implication is that Cumulative Sum data without known start time should be translated into Gauge data points, stripped of aggregation temporality.
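A minimal sketch of that rule, assuming a simplified point model (the `SumPoint`, `GaugePoint` and `import_cumulative` names are hypothetical, not part of OTLP or any SDK): a cumulative value keeps its temporality only when a start time is known, and otherwise becomes a Gauge.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class SumPoint:
    # Hypothetical, simplified stand-in for an OTLP Sum data point.
    time_unix_nano: int
    start_time_unix_nano: int
    value: float
    temporality: str = "CUMULATIVE"


@dataclass
class GaugePoint:
    # Gauge points carry no aggregation temporality and no start time.
    time_unix_nano: int
    value: float


def import_cumulative(time_unix_nano: int, value: float,
                      start_time_unix_nano: Optional[int]) -> Union[SumPoint, GaugePoint]:
    """Keep CUMULATIVE temporality only when the start time is known."""
    if start_time_unix_nano is None:
        # Without a start time a rate could only be computed heuristically,
        # so the value is downgraded to a Gauge.
        return GaugePoint(time_unix_nano, value)
    return SumPoint(time_unix_nano, start_time_unix_nano, value)
```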

When we have input data that is notionally a Cumulative Sum and we know the start time and that the stream was reset, how should this point be treated? This relates to UpDownCounter data exported as Cumulative, where Delta-to-Cumulative conversion is done by a proxy, see open-telemetry/opentelemetry-specification#1273. In this case, points should retain their Cumulative aggregation temporality but consumers may wish to note that the series was reset and avoid interpreting the cumulative value as a process-lifetime total.
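As an illustration, here is a hypothetical proxy-side accumulator (the `DeltaToCumulative` class is an assumption, not taken from #1273) that converts UpDownCounter deltas into a cumulative stream. Because its state lives in memory, a restarted proxy begins a new series with a later start time, which is the signal a consumer can use to notice the reset.

```python
from typing import Dict, Tuple


class DeltaToCumulative:
    """Hypothetical proxy-side accumulator for UpDownCounter deltas."""

    def __init__(self) -> None:
        # series key -> (start_time_unix_nano, running_total)
        self._state: Dict[str, Tuple[int, float]] = {}

    def add(self, key: str, start_time_unix_nano: int,
            time_unix_nano: int, delta: float) -> Tuple[int, int, float]:
        """Fold one delta point in; return (start, time, cumulative_value)."""
        if key not in self._state:
            # First delta seen by this proxy instance: the cumulative stream
            # starts here, so its start time marks a (possibly new) reset.
            self._state[key] = (start_time_unix_nano, 0.0)
        start, total = self._state[key]
        total += delta  # UpDownCounter deltas may be negative
        self._state[key] = (start, total)
        return (start, time_unix_nano, total)
```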

Points with DELTA aggregation temporality

Delta Sum data points SHOULD set StartTimeUnixNanos.

In this case, knowing StartTimeUnixNanos is beneficial (e.g., to detect duplicate points) but not semantically necessary to compute correct rate information. When Counter data arrives via Statsd, for example, it should be translated into Non-Monotonic Delta Sum data.
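A rough sketch of that translation, assuming the common Statsd counter line format `name:value|c` with an optional `|@rate` sample rate and a fixed flush interval; the function names and the dict-shaped output are hypothetical. The flush interval's start supplies StartTimeUnixNanos for each Delta Sum point.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_statsd_counter(line: str) -> Tuple[str, float]:
    """Parse 'name:value|c' or 'name:value|c|@rate' (counters only)."""
    name, rest = line.split(":", 1)
    fields = rest.split("|")
    if len(fields) < 2 or fields[1] != "c":
        raise ValueError(f"not a counter: {line!r}")
    value = float(fields[0])
    for extra in fields[2:]:
        if extra.startswith("@"):
            value /= float(extra[1:])  # scale up sampled counts
    return name, value


def flush_interval(lines: List[str], start_ns: int, end_ns: int) -> List[Dict]:
    """Sum counter increments seen in one flush interval into Delta Sum points."""
    totals: Dict[str, float] = defaultdict(float)
    for line in lines:
        name, value = parse_statsd_counter(line)
        totals[name] += value
    return [
        {"name": name, "start_time_unix_nano": start_ns, "time_unix_nano": end_ns,
         "value": total, "aggregation_temporality": "DELTA", "is_monotonic": False}
        for name, total in sorted(totals.items())
    ]
```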

Points with NO aggregation temporality

Points without aggregation temporality MAY set StartTimeUnixNanos.

  • For Summary points: StartTimeUnixNanos SHOULD be set to the earliest time covered by the summary.
  • For GaugeHistogram points: StartTimeUnixNanos SHOULD NOT be set.
  • For Gauge points: StartTimeUnixNanos SHOULD NOT be set.
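The requirement levels proposed above can be summarized as a small lookup. This is only a restatement of the proposal in code, and `start_time_requirement` is a hypothetical helper.

```python
from typing import Optional


def start_time_requirement(point_kind: str, temporality: Optional[str]) -> str:
    """Requirement level for StartTimeUnixNanos as proposed in this issue."""
    if temporality == "CUMULATIVE":
        return "MUST"
    if temporality == "DELTA":
        return "SHOULD"
    # No aggregation temporality: the per-kind recommendations apply.
    if point_kind == "Summary":
        return "SHOULD"  # set to the earliest time covered by the summary
    if point_kind in ("Gauge", "GaugeHistogram"):
        return "SHOULD NOT"
    return "MAY"
```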

@jsuereth
Contributor

jsuereth commented Apr 6, 2021

Action Items:

  • Deduplicate this issue w/ proto-based issue
  • Specify "importing monotonic cumulative" section of data model specification. @jmacd

@jmacd jmacd transferred this issue from open-telemetry/opentelemetry-specification Apr 6, 2021
@jmacd
Contributor Author

jmacd commented Apr 6, 2021

See the proposal here: #229 (comment)

@jmacd
Contributor Author

jmacd commented Apr 14, 2021

Note that in the sidecar PR lightstep/opentelemetry-prometheus-sidecar#190 I have revised the approach taken for resets with an unknown start time. Outputting a 0 value with StartTimeUnixNano == TimeUnixNano conveys the reset information exactly.

This leads to a valid stateful transform from a sequence of cumulative samples lacking a start time to an OTLP cumulative monotonic stream. When the first value in the sequence is seen, output a zero value at its timestamp, which becomes the new start (reset) time. Subsequent points, as long as they are non-decreasing, are output relative to that initial reset value, i.e., (new_value - reset_value). If a subsequent value ever decreases from the previous (known) value, reset_value is set to 0, indicating a known reset.

This transformation loses all information before the reset. What I like most about this outcome is that the backend will see a zero-width point at the reset time instead of that point being dropped (which the Stackdriver sidecar and lightstep/opentelemetry-prometheus-sidecar have been doing).
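A minimal sketch of the stateful transform described above for a single series, assuming in-memory state and a simplified point type (all names here are hypothetical, not the sidecar's actual code). The comment does not say which start time to use after a detected reset; this sketch assumes the previous sample's timestamp.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CumulativePoint:
    start_time_unix_nano: int
    time_unix_nano: int
    value: float


class StartTimeSynthesizer:
    """Turns one series of start-time-less cumulative samples into a
    cumulative monotonic stream with synthesized start times."""

    def __init__(self) -> None:
        self._reset_value: Optional[float] = None  # None until the first sample
        self._start_ns = 0
        self._prev_value = 0.0
        self._prev_time_ns = 0

    def observe(self, time_unix_nano: int, value: float) -> CumulativePoint:
        if self._reset_value is None:
            # First sample: everything before it is unknown, so emit a
            # zero-width zero point that marks the reset.
            self._reset_value = value
            self._start_ns = time_unix_nano
        elif value < self._prev_value:
            # The raw value went down: a known reset. Assumption: the new
            # series starts at the previous sample's timestamp.
            self._reset_value = 0.0
            self._start_ns = self._prev_time_ns
        self._prev_value = value
        self._prev_time_ns = time_unix_nano
        return CumulativePoint(self._start_ns, time_unix_nano,
                               value - self._reset_value)
```

Feeding the samples 50, 70, 5, 8 at times 100, 200, 300, 400 yields values 0 (zero-width at 100), 20, then 5 and 8 on a new series starting at 200.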

@jsuereth jsuereth added area:data-model release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:metrics labels Apr 20, 2021
@jsuereth
Contributor

Updates from the DataModel SIG discussion:

We had consensus on doing the following:

  • Proto documentation Updates [jmacd]
    • Document that start time is not required
    • Document that a start time of 0 means "unknown start time"
    • Make a "heavy recommendation" that start time be provided by all standard receivers and metric providers within OpenTelemetry, while acknowledging that this may be impractical at times.
  • DataModel Specification Updates
    • Add recommendations for how to build stateful receivers for start-time synthesis + reset detection.
      Here we suspect there are some algorithms we can recommend where a stateful receiver of metrics can track incoming
      sums and do counter-reset detection with synthesized start-times that lead to reasonable behavior in backends.
    • Add recommendations for resource-based start-time synthesis
      e.g., we suspect that if "resource" is lifted to a first-class concept, a "join" between the Resource signal and metrics could be used to obtain the start time in scenarios where it is missing.
  • We'd like to ask the Prometheus-WG how Prometheus/OpenMetrics handles missing start times currently and make sure our decisions align (or are compatible).
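One possible shape for the resource-based fallback, treating a start time of 0 as unknown. The attribute name `process.start_time_unix_nano` is a hypothetical placeholder, not an existing semantic convention, and the "join" is reduced to a plain dictionary lookup.

```python
from typing import Mapping, Optional

# Hypothetical attribute name; not a defined OpenTelemetry semantic convention.
PROCESS_START_ATTR = "process.start_time_unix_nano"


def resolve_start_time(point_start_ns: int,
                       resource_attrs: Mapping[str, object]) -> Optional[int]:
    """Return a usable start time, treating 0 as 'unknown start time'."""
    if point_start_ns != 0:
        return point_start_ns
    fallback = resource_attrs.get(PROCESS_START_ATTR)
    if isinstance(fallback, int) and not isinstance(fallback, bool) and fallback > 0:
        return fallback
    return None  # still unknown; a stateful receiver could synthesize one
```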

@jsuereth
Contributor

Regarding start-time synthesis and @jmacd's discussion on how we can better support Prometheus-style counters, see #289.
