-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support missing samples in oximeter #4552
Conversation
Fixes #4311 |
d5095fd
to
e1f41f3
Compare
This is a pretty substantive change, so I'm including some manual testing notes in addition to the new tests I added as part of the PR. The big change here is that we've made most columns nullable in the ClickHouse tables that store individual measurements. The point is to store Testing overviewHere's the general test flow:
Starting from
|
- Add a `Datum::Missing` and `MissingDatum`, which records the intended datum type and an optional start time for a sample which could not be produced. - Database upgrades which make all scalar datum columns Nullable. Array fields are _not_ made Nullable, since ClickHouse doesn't support composite types like arrays inside a Nullable wrapper type. The empty array is used as a sentinel, which is OK since we can't have zero-length array histograms in Oximeter. Add a test which will fail if we ever change that. - Rework database serialization to handle Nullable types or empty arrays. This uses a new helper trait to convert a NULL (which has no type information) to the intended datum type, or an empty array to a histogram. - Add a test for each measurement type that we can recover a missing sample of that type -- NULLs for scalar values and empty arrays for histograms.
e1f41f3
to
8b5abc4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, only a few minor cleanup comments.
I can save this question for #4311 , but what's your idea on how this gets presented to end-users?
Put another way, if someone queries for samples between times A and B, and they see:
- "No samples"
vs - "We have five samples of missing data"
What's that supposed to mean?
I know you discussed the possibility of a "measurement failure" but I'm not really wrapping my head around the end-to-end distinction between these two cases. Is this something metric producers will need to start doing implicitly? (e.g., "no samples" actually means that "missing data" is recorded?)
To be transparent, I'm not yet sure how this will be communicated to customers. It actually may not be in the end, or maybe only to operators. But I do think it's important that we internally have a record of a missing sample. Right now, producers can generate an error, which is only logged in the UPDATE: I didn't quite answer your question :) The difference between "No samples" and "5 missing samples" is, "Hmm something isn't working correctly because it couldn't collect the data it expected". On a graph, this probably be "shown" as NaNs, which is to say "no value on the y-axis, but possibly shown in some other way". For example, this shows 4 valid data points at (0, 1, 3, 4), and one missing sample at x == 2. That's one possible way someone could see that there is missing data, rather than a different scenario such as when the system didn't even attempt to produce samples.
The goal was to be less implicit, not more so. Right now, if there is a failure to collect a sample, producers can send a Hope that helps provide some more context! Let me know your thoughts about all that. |
Got it, this makes sense, and answers a big part of my question: right now, if we merge this with no other changes, we wouldn't see any missing samples. But the next step would be finding out how producers could/should emit this information. Notably, this means that we still need connectivity between collector -> producer in order to log these metrics, correct? If a producer was running within a service on a sled that goes offline, it wouldn't actually produce any "missing" samples, right? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mechanically, I'm happy with this PR, and I'm okay merging this as-is to get the schema changes in-place.
Logistically, I still have some questions about:
- Best practices for emitting missing metrics (always producer-triggered? What if the producer's service crashes? It might be worthwhile discussing a collector-triggered missing metric, if the producer is not responding to us)
- How we're displaying this in the UI -- though your sample graph convinced me this is a tractable problem!
Correct. A missing sample still needs to be produced,
If
That is a good idea. I think "best effort" is the most accurate suggestion I have right now. Basically, in priority order I would suggest:
|
Datum::Missing
andMissingDatum
, which records the intended datum type and an optional start time for a sample which could not be produced.