Keep cumulative metrics monotonic across collection restarts #1624

david-crespo · 2022-08-19T16:51:13Z

Was getting some clarification from @bnaecker about metrics types and he mentioned this issue with cumulative metrics:

> timestamp exists for every Measurement, and it's the time the sample was taken. start_time exists for Cumulative and Histogram types, and is the zero-point or reference.
>
> One thing to keep in mind is that the same timeseries can have measurements with different start_times. E.g., if a service restarts, that start_time will be reset. I don't know if that matters much now, but it will at some point.

From an end-user point of view, a cumulative metric that resets in this way is wrong, or at least misleading: it looks like a sawtooth when it should be a monotonically increasing line. One can imagine various ways of fixing this:

1. In the database
2. In Nexus
3. On the client

The latter two options are not fully general, because to get the correct offset for data after restart N you need to know the last data point before each of restarts 1..N so you can add them all up. That means that if you're looking at data in a certain window of time, you also need to pull data from outside that window to do the correction. For these reasons, the database solution is likely best. It would avoid post-hoc corrections: fetching data in a given window would simply give you the right data for that window.
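To make the offset problem concrete, here is a minimal Python sketch of the correction. The `Sample` type and the `stitch` function are hypothetical names for illustration, not anything in oximeter or Nexus; it assumes a restart shows up as a change in `start_time`.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    start_time: int  # zero-point of the cumulative counter
    timestamp: int   # when the sample was taken
    value: int       # cumulative value since start_time


def stitch(samples):
    """Return (timestamp, adjusted_value) pairs that keep increasing across restarts."""
    adjusted = []
    offset = 0   # sum of the last values seen before each earlier restart
    prev = None
    for s in sorted(samples, key=lambda s: s.timestamp):
        if prev is not None and s.start_time != prev.start_time:
            # A new start_time means the counter was reset: carry the old total forward.
            offset += prev.value
        adjusted.append((s.timestamp, offset + s.value))
        prev = s
    return adjusted


samples = [
    Sample(0, 1, 5), Sample(0, 2, 9),      # before the first restart
    Sample(10, 11, 2), Sample(10, 12, 6),  # after restart 1
    Sample(20, 21, 1),                     # after restart 2
]

full = stitch(samples)           # corrected values using the whole history
windowed = stitch(samples[2:])   # same code, but only a later window of data
```

With the full history the last point comes out as 16, while the windowed call yields 7 for the same timestamp, because the offsets from before the window are missing. That is the reason a Nexus-side or client-side fix would need to fetch data outside the requested window.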

The only downsides I can think of to the DB approach are:

- We have to do some work we haven't already done (obviously)
- The telemetry data stored in the DB would lose some information, namely the fact that these restarts in data collection occurred. In my view, however, if this information is important to keep around, a cumulative metric is not the right place to store it.


bnaecker · 2022-08-19T16:59:11Z

A couple of thoughts on implementation.

I would recommend we keep the exact original data as reported by the producer. We can add one or more other columns that represent the shifted value. For example, keep a column that is the last value from the preceding time window, and another which includes the reported measurements added to that. The latter is what most people would probably want to see.
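A rough Python sketch of what that could look like, with the reported value left untouched and two derived fields alongside it. The `Row` type and the `offset` and `shifted` field names are made up for illustration and are not an actual schema; the input is assumed to be sorted by timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    start_time: int
    timestamp: int
    value: int    # exactly as reported by the producer
    offset: int   # hypothetical column: total carried over from earlier start_times
    shifted: int  # hypothetical column: value + offset


def annotate(reported):
    """reported: (start_time, timestamp, value) tuples, sorted by timestamp."""
    rows = []
    offset = 0
    for start_time, timestamp, value in reported:
        if rows and start_time != rows[-1].start_time:
            # The previous shifted value becomes the offset for the new start_time.
            offset = rows[-1].shifted
        rows.append(Row(start_time, timestamp, value, offset, offset + value))
    return rows


reported = [(0, 1, 5), (0, 2, 9), (10, 11, 2), (10, 12, 6)]
rows = annotate(reported)
```

Because `value` and `start_time` are stored as reported, the fact that a restart happened stays recoverable, while `shifted` gives the monotonic series most people would want to plot.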

Another point is that we should be pretty careful about how we do that for floating point types -- we need to avoid catastrophic cancellation.
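A small Python illustration of the kind of precision loss to watch for, using float64 and an artificially large offset chosen for the example. Once a large offset has been added to small reported values, subtracting neighbouring shifted values no longer recovers the original increments.

```python
offset = 1e16          # illustrative carry-over from before a restart
raw = [1.0, 2.0, 3.0]  # cumulative values reported after the restart

# Adding the offset rounds each result to the nearest representable float64.
shifted = [offset + v for v in raw]

# Differences taken from the shifted column vs. from the raw column.
deltas_from_shifted = [b - a for a, b in zip(shifted, shifted[1:])]
deltas_from_raw = [b - a for a, b in zip(raw, raw[1:])]

# The small value is lost entirely when it is added to the large offset.
lost = (offset + 1.0) - offset  # 0.0, not 1.0
```

Here `deltas_from_raw` is exact, while `deltas_from_shifted` is not, which is one more argument for keeping the reported values and the offset as separate columns rather than storing only their sum.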

bnaecker · 2024-03-19T17:47:43Z

If we decide to move forward with #5273, restarts in cumulative data are handled by always converting the data into deltas when we first retrieve it from a timeseries. We can close this if we integrate that PR.
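As a sketch of the general idea only (this is not the code from #5273, and `to_deltas` is a hypothetical name), converting cumulative samples to deltas could look like the following in Python. It assumes samples arrive sorted by timestamp and that a restart is visible as a change in `start_time`.

```python
from itertools import accumulate


def to_deltas(samples):
    """samples: (start_time, timestamp, value) tuples, sorted by timestamp."""
    deltas = []
    prev = None
    for start_time, timestamp, value in samples:
        if prev is None or start_time != prev[0]:
            # First sample of a start_time: everything counted since that zero-point.
            delta = value
        else:
            # Same start_time: the increase since the previous sample.
            delta = value - prev[2]
        deltas.append((timestamp, delta))
        prev = (start_time, timestamp, value)
    return deltas


samples = [(0, 1, 5), (0, 2, 9), (10, 11, 2), (10, 12, 6)]
deltas = to_deltas(samples)

# Re-accumulating the deltas gives a series that never drops at a restart.
monotonic = list(accumulate(d for _, d in deltas))
```

Deltas only depend on adjacent samples, so a query over a time window does not need the full history to stay consistent across restarts.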

bnaecker · 2024-05-15T18:58:43Z

Closing now, since OxQL automatically handles restarts.

david-crespo changed the title ~~Keep cumulative metrics monotonic across restarts~~ Keep cumulative metrics monotonic across collection restarts Aug 19, 2022

bnaecker closed this as completed Aug 19, 2022

bnaecker reopened this Aug 19, 2022

david-crespo mentioned this issue Sep 22, 2022

Collect metrics for allocated CPU, storage, and memory #1734

Closed

david-crespo mentioned this issue Jan 30, 2024

Dynamically render units on metrics tab oxidecomputer/console#1916

Merged

bnaecker closed this as completed May 15, 2024
