Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

oximeter needs a way to represent missing samples #4311

Closed
bnaecker opened this issue Oct 21, 2023 · 3 comments
Closed

oximeter needs a way to represent missing samples #4311

bnaecker opened this issue Oct 21, 2023 · 3 comments

Comments

@bnaecker
Copy link
Collaborator

bnaecker commented Oct 21, 2023

Code that produces data for oximeter to collect is required to generate a Sample, in the form of the Producer trait. That trait allows reporting errors, but those errors are divorced from the timeseries underlying them. oximeter needs to grow the concept of a missing or failed sample specifically, to indicate that there should be data from a particular timeseries at some time, but that it could not be generated. To put a finer point on it, there's currently no way to distinguish whether a gap in a timeseries is due to a network partition between oximeter and the producer; or a failure to produce the sample at all.

One possible implementation is to make the Sample contain an enum like:

enum SampleData {
    Measurement(Measurement),
    Missing(MissingReason),
}

enum MissingReason {
    NoData,
    MeasurementFailed,
    // others?
}

I'm not sure how valuable the MissingCause is, or if we just want a String there for folks to put whatever messages they want.

One trickier part about this is the database representation. We'd like to show these missing samples as "interleaved" with the raw data, which raises the question of how to model that in ClickHouse tables. A dumb first idea might be to make the actual measurement value column nullable, and add another nullable column for the error. I'm not sure what goes in that column, but some context about the error might be nice, such as a short string or a small enum with a few possible causes for such errors.

@bnaecker
Copy link
Collaborator Author

The Datum enum currently looks like this:

pub enum Datum {
Bool(bool),
I8(i8),
U8(u8),
I16(i16),
U16(u16),

There are I think some useful tricks we can play with serde to make the evolution here pretty painless.

One simple (and maybe stupid) idea is to rewrite each tuple variant from Datum::Variant(T) to Datum::Variant(Option<T>). This has the really nice benefit of being easy to roll out as an API change. Consumers who are producing data now will transparently have that deserialized into Datum::Variant(Some(x)). The reason to make each variant contain an option rather than the whole enum an option is so that we can still have the datum_type represented.

Another path would be to update the Measurement struct to have a datum_type field and make the actual datum field an Option<Datum>. We could add a #[serde] tag that would populate it on deserialization using the actual datum, which would correctly work for all existing producers where the datum field always exists. Then we can add a way to construct a new missing value and put that in the Measurement type.

@bnaecker
Copy link
Collaborator Author

Idea (2) above isn't quite right, since there's no way to have serde populate a field from a function that takes &self or anything like that (which makes sense). But I think there is a similar approach that can work, using serde's facilities for deserializing as a different type. Concretely, I think this should work:

  • Create a new type, say TransitionalMeasurement, which is exactly the same as the existing Measurement
  • Implement Deserialize on Measurement by using the from trick above: deserializing as TransitionalMeasurement and converting. At this stage, the conversion is pointless and bit-by-bit identical.
  • Change the definition of Measurement to be:
    #[derive(Clone, Debug, PartialEq, JsonSchema, Serialize, Deserialize)]
    #[serde(from = "TransitionalMeasurement")]
    pub struct Measurement {
        timestamp: DateTime<Utc>,
        datum_type: DatumType,
        datum: Option<Datum>,
    }
    And impl From<TransitionalMeasurement> for Measurement. This can always be done, because all peers send oximeter data where the datum is Some(_) right now. This does entail a lot of other changes inside oximeter and definitely in the database as well.
  • Once all outside consumers have updated their own dependencies on oximeter where they produce data, we can remove the TransitionalMeasurement type and deserialize directly into Measurement.

This is probably the simplest overall approach. We put the Option at the right level, avoiding lots of duplication. We can record the fields and IDs of the timeseries, as well as keep the datum type, which is important for generating the timeseries key. One downside is that we would lose the start time for cumulative data. I think that might be inherent in the problem though, but I could be wrong.

@bnaecker
Copy link
Collaborator Author

bnaecker commented Dec 4, 2023

Closed by #4552

@bnaecker bnaecker closed this as completed Dec 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant