Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep cumulative metrics monotonic across collection restarts #1624

Closed
david-crespo opened this issue Aug 19, 2022 · 3 comments
Closed

Keep cumulative metrics monotonic across collection restarts #1624

david-crespo opened this issue Aug 19, 2022 · 3 comments

Comments

@david-crespo
Copy link
Contributor

Was getting some clarification from @bnaecker about metrics types and he mentioned this issue with cumulative metrics:

timestamp exists for every Measurement, and it's the time the sample was taken. start_time exists for Cumulative and Histogram types, and is the zero-point or reference.

One thing to keep in mind is that the same timeseries can have measurements with different start_times. E.g., if a service restarts, that start_time will be reset. I don't know if that matters much now, but it will at some point.

From an end-user point of view, a cumulative metric that resets in this way is wrong, or at least misleading: it looks like a sawtooth when it should be a monotonically increasing line. One can imagine various ways of fixing this:

  • In the database
  • In Nexus
  • On the client

The latter two options are not fully general because in order to get the correct offset for data after restart N, you need to know the last data point before restarts 1..N so you can add them all up. That means that if you're looking at data in a certain window of time, you also need to pull data from outside that window to do the correction. For these reason, the database solution is likely best. It would avoid post-hoc corrections — fetching data in a given window would simply give you the right data for that window.

The only downsides I can think of to the DB approach are:

  • We have to do some work we haven't already done (obviously)
  • The telemetry data stored in the DB would lose some information, namely the fact that these restarts in data collection occurred. In my view, however, if this information is important to keep around, a cumulative metric is not right the place to store it.
@david-crespo david-crespo changed the title Keep cumulative metrics monotonic across restarts Keep cumulative metrics monotonic across collection restarts Aug 19, 2022
@bnaecker
Copy link
Collaborator

A couple of thoughts on implementation.

I would recommend we keep the exact original data as reported by the producer. We can add one or more other columns that represent the shifted value. For example, keep a column that is the last value from the preceding time window, and another which includes the reported measurements added to that. The latter is what most people would probably want to see.

Another point is that we should be pretty careful about how we do that for floating point types -- we need to avoid catastrophic cancellation.

@bnaecker
Copy link
Collaborator

If we decide to move forward with #5273, restarts in cumulative data is handled by always converting them into deltas when we first retrieve data from a timeseries. We can close this if we integrate that PR.

@bnaecker
Copy link
Collaborator

Closing now, since OxQL automatically handles restarts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants