Clarification needed for ".utilization" metrics convention #819

jmacd · 2020-08-17T20:42:08Z

What are you trying to achieve?

OTEP #119 specified a convention for metrics ending in ".utilization":
open-telemetry/oteps#119

It's not clear how to implement this in some cases, so clarification may be needed. For a metric such as process.cpu.time, which is emitted as a cumulative value (e.g., from a SumObserver), we can naturally compute a cumulative utilization score, i.e., the total CPU time used divided by the total elapsed time. This number, the lifetime utilization, may not be very useful. It would be perhaps more useful expressed as "Interval" temporality. The ".utilization" metrics for cumulative time have the same problem as Summary data points: they are rarely useful in cumulative form. Moreover, they can be derived in a backend.

Should we drop ".utilization" metrics for CPU usage? Should we specify they be conveyed as Interval summaries (i.e., Difference in cumulative usage divided by difference in time)? (@aabmass)
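
To make the difference between the two forms concrete, here is a minimal sketch in plain Python. It does not use any OpenTelemetry API; the function names and sample numbers are invented for illustration, and each sample is assumed to be a pair of (elapsed wall-clock seconds, cumulative CPU seconds).

```python
from typing import List, Tuple

# Hypothetical samples: (elapsed wall-clock seconds, cumulative CPU seconds).
Sample = Tuple[float, float]


def lifetime_utilization(samples: List[Sample]) -> float:
    """Total CPU time divided by total elapsed time (the cumulative form)."""
    elapsed, cpu = samples[-1]
    return cpu / elapsed


def interval_utilization(prev: Sample, curr: Sample) -> float:
    """Difference in cumulative usage divided by difference in time."""
    dt = curr[0] - prev[0]
    if dt <= 0:
        raise ValueError("samples must have increasing timestamps")
    return (curr[1] - prev[1]) / dt


# A process that was idle for 90 s and then fully busy for 10 s.
samples = [(90.0, 0.0), (100.0, 10.0)]
lifetime = lifetime_utilization(samples)   # 10 CPU s over 100 s
interval = interval_utilization(*samples)  # 10 CPU s over the last 10 s
```

With these made-up numbers the lifetime value is 0.1 while the most recent interval is 1.0, so the cumulative form hides the spike that the interval form shows.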


jmacd · 2020-08-18T06:09:01Z

@bogdandrutu and @open-telemetry/specs-metrics-approvers regarding open-telemetry/opentelemetry-proto#199

MrAlias · 2020-08-18T17:53:15Z

Should we specify they be conveyed as Interval summaries (i.e., Difference in cumulative usage divided by difference in time)?

That would make sense to me.

jmacd · 2020-08-18T18:12:27Z

To me, there is still a minor concern. We have argued that SumObserver and UpDownSumObserver should accept cumulative inputs so that they can remain stateless. Observer callbacks do not need to know the last time they were called or remember the last value.

Having instrumentation compute CPU utilization from an Observer callback breaks this rule. The callback has to remember the last timestamp it reported and the last value it recorded in order to output the current interval's utilization.

The final destination of a *.cpu.time metric can also just compute *.cpu.utilization itself, since it has presumably accumulated a series of measurements. Maybe we can specify that utilization metrics can be generated by a stateful Observer callback or can be generated by a stateful receiver downstream, to leave the possibilities open.
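
The downstream option could look roughly like the sketch below: a component that receives cumulative *.cpu.time points and keeps the last (timestamp, value) pair per series. The class name, method name, and the string series key are all assumptions made for this example, not part of any collector or backend API.

```python
from typing import Dict, Optional, Tuple


class UtilizationDeriver:
    """Hypothetical downstream component: turns cumulative *.cpu.time points
    into per-interval utilization, keeping one (timestamp, value) per series."""

    def __init__(self) -> None:
        self._last: Dict[str, Tuple[float, float]] = {}

    def add_point(
        self, series: str, timestamp: float, cpu_seconds: float
    ) -> Optional[float]:
        prev = self._last.get(series)
        self._last[series] = (timestamp, cpu_seconds)
        if prev is None:
            return None  # first point for this series: no interval yet
        dt = timestamp - prev[0]
        dcpu = cpu_seconds - prev[1]
        if dt <= 0 or dcpu < 0:
            return None  # out-of-order point or counter reset: skip interval
        return dcpu / dt


deriver = UtilizationDeriver()
first = deriver.add_point("process.cpu.time", 0.0, 1.0)
second = deriver.add_point("process.cpu.time", 10.0, 3.5)
```

The state lives in the receiver here, so the Observer callback that produces the cumulative points can stay stateless.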

aabmass · 2020-08-18T23:26:51Z

Should we drop ".utilization" metrics for CPU usage?

I think it would be great to keep it, but as you mentioned it could be added back in. Which way are you leaning @jmacd?

It would be perhaps more useful expressed as "Interval" temporality.
...
Should we specify they be conveyed as Interval summaries (i.e., Difference in cumulative usage divided by difference in time)?

Are we talking about the SDK instrument to use or the OTLP temporality? My takeaway from today's (Tuesday) meeting was that using a stateful ValueObserver (where the last value and call time are saved by the callback from its previous call) would be the easiest way to implement this with the SDK. This would send an OTLP gauge, which seems ok to me. Then once we have views, it would be best to calculate this from the system.cpu.time SumObserver directly.

This does seem like a common use case though; it's called out in the Metrics API spec a few times: "monotonic instruments are useful for monitoring rate information." Does "calculating" here mean with a view or in the backend? Someone also mentioned OTEP 88 had a proposal for this interval/delta, but no concrete use cases. Would something like request rate not be an equivalent synchronous example of this ( # requests in interval / time delta )?

jmacd · 2020-08-19T17:52:37Z

This was discussed in the 8/18 Metrics SIG (OTLP) meeting. We agreed to address this in the short term by using stateful Observer callbacks that track both their last CPU time measurement and their last timestamp.
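
A minimal sketch of such a stateful callback follows, written as plain Python rather than against any SDK. The class name and the injectable clock parameters are assumptions for illustration; `time.process_time` and `time.monotonic` are standard-library functions used here as stand-ins for whatever CPU-time source the instrumentation actually reads.

```python
import time
from typing import Callable, Optional


class CpuUtilizationCallback:
    """Sketch of a stateful observer callback (not tied to any SDK API)."""

    def __init__(
        self,
        cpu_clock: Callable[[], float] = time.process_time,
        wall_clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._cpu_clock = cpu_clock
        self._wall_clock = wall_clock
        # State the callback must carry between invocations.
        self._last_cpu = cpu_clock()
        self._last_wall = wall_clock()

    def __call__(self) -> Optional[float]:
        cpu, wall = self._cpu_clock(), self._wall_clock()
        dcpu, dwall = cpu - self._last_cpu, wall - self._last_wall
        self._last_cpu, self._last_wall = cpu, wall
        if dwall <= 0:
            return None  # no time elapsed: nothing meaningful to report
        return dcpu / dwall


callback = CpuUtilizationCallback()
utilization = callback()  # CPU seconds used per wall-clock second since creation
```

An SDK would invoke the callback once per collection interval and report the returned value as a gauge-style observation. Because process CPU time sums over all threads, the ratio can exceed 1.0 on a multi-core machine unless it is also divided by the CPU count.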

A side-note was raised relevant to OTLP: If we had a way to encode deltas from observer instruments, it would be natural to do so here. OTLP actually supports this concept, but we have not standardized any form of Delta Observer, and this may be such a special case that we continue to ignore this matter. However, if we had a Delta Observer then it would be natural to encode "CPU time elapsed" measurements. We compute *.cpu.utilization for an interval as the rate of CPU time elapsed.
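
If such delta points existed, the rate could be computed from each point alone, with no state kept between points. The sketch below assumes a hypothetical `DeltaPoint` shape; its field names are invented for this example and are not taken from the OTLP schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeltaPoint:
    """Hypothetical delta data point: CPU seconds used within [start, end)."""

    start: float
    end: float
    cpu_seconds: float


def utilization(point: DeltaPoint) -> float:
    """Rate of CPU time elapsed over the point's own interval."""
    width = point.end - point.start
    if width <= 0:
        raise ValueError("delta point must cover a positive interval")
    return point.cpu_seconds / width


point = DeltaPoint(start=60.0, end=70.0, cpu_seconds=2.5)
rate = utilization(point)
```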

jmacd added the spec:metrics label (Related to the specification/metrics directory) Aug 17, 2020

jmacd mentioned this issue Aug 18, 2020

Host metrics instrumentation open-telemetry/opentelemetry-go-contrib#231

Merged

jmacd mentioned this issue Aug 18, 2020

Converting Traces and Metrics into logs. open-telemetry/oteps#120

Closed

jmacd closed this as completed Aug 19, 2020

aabmass mentioned this issue Aug 19, 2020

Update opentelemetry-instrumentation-system-metrics open-telemetry/opentelemetry-python#1006

Closed

This was referenced Aug 26, 2020

Writing system metrics conventions into the specification #818

Closed

Update System Metrics open-telemetry/opentelemetry-python#1019

Merged

aabmass mentioned this issue Sep 11, 2020

System metrics semantic conventions #937

Merged
