`oximeter` needs a way to represent missing samples #4311

bnaecker · 2023-10-21T02:09:12Z

Code that produces data for oximeter to collect is required to generate a Sample, in the form of the Producer trait. That trait allows reporting errors, but those errors are divorced from the timeseries underlying them. oximeter needs to grow the concept of a missing or failed sample specifically, to indicate that there should be data from a particular timeseries at some time, but that it could not be generated. To put a finer point on it, there's currently no way to distinguish whether a gap in a timeseries is due to a network partition between oximeter and the producer; or a failure to produce the sample at all.

One possible implementation is to make the Sample contain an enum like:

enum SampleData {
    Measurement(Measurement),
    Missing(MissingReason),
}

enum MissingReason {
    NoData,
    MeasurementFailed,
    // others?
}

I'm not sure how valuable the MissingCause is, or if we just want a String there for folks to put whatever messages they want.

One trickier part about this is the database representation. We'd like to show these missing samples as "interleaved" with the raw data, which raises the question of how to model that in ClickHouse tables. A dumb first idea might be to make the actual measurement value column nullable, and add another nullable column for the error. I'm not sure what goes in that column, but some context about the error might be nice, such as a short string or a small enum with a few possible causes for such errors.

The text was updated successfully, but these errors were encountered:

bnaecker · 2023-11-16T23:32:08Z

The Datum enum currently looks like this:

omicron/oximeter/oximeter/src/types.rs

Lines 344 to 349 in 9c33a62

    
           pub enum Datum { 
        
               Bool(bool), 
        
               I8(i8), 
        
               U8(u8), 
        
               I16(i16), 
        
               U16(u16),

There are I think some useful tricks we can play with serde to make the evolution here pretty painless.

One simple (and maybe stupid) idea is to rewrite each tuple variant from Datum::Variant(T) to Datum::Variant(Option<T>). This has the really nice benefit of being easy to roll out as an API change. Consumers who are producing data now will transparently have that deserialized into Datum::Variant(Some(x)). The reason to make each variant contain an option rather than the whole enum an option is so that we can still have the datum_type represented.

Another path would be to update the Measurement struct to have a datum_type field and make the actual datum field an Option<Datum>. We could add a #[serde] tag that would populate it on deserialization using the actual datum, which would correctly work for all existing producers where the datum field always exists. Then we can add a way to construct a new missing value and put that in the Measurement type.

bnaecker · 2023-11-17T01:00:58Z

Idea (2) above isn't quite right, since there's no way to have serde populate a field from a function that takes &self or anything like that (which makes sense). But I think there is a similar approach that can work, using serde's facilities for deserializing as a different type. Concretely, I think this should work:

Create a new type, say TransitionalMeasurement, which is exactly the same as the existing Measurement
Implement Deserialize on Measurement by using the from trick above: deserializing as TransitionalMeasurement and converting. At this stage, the conversion is pointless and bit-by-bit identical.
Change the definition of Measurement to be:
```
#[derive(Clone, Debug, PartialEq, JsonSchema, Serialize, Deserialize)]
#[serde(from = "TransitionalMeasurement")]
pub struct Measurement {
    timestamp: DateTime<Utc>,
    datum_type: DatumType,
    datum: Option<Datum>,
}
```
And impl From<TransitionalMeasurement> for Measurement. This can always be done, because all peers send oximeter data where the datum is Some(_) right now. This does entail a lot of other changes inside oximeter and definitely in the database as well.
Once all outside consumers have updated their own dependencies on oximeter where they produce data, we can remove the TransitionalMeasurement type and deserialize directly into Measurement.

This is probably the simplest overall approach. We put the Option at the right level, avoiding lots of duplication. We can record the fields and IDs of the timeseries, as well as keep the datum type, which is important for generating the timeseries key. One downside is that we would lose the start time for cumulative data. I think that might be inherent in the problem though, but I could be wrong.

bnaecker · 2023-12-04T21:43:00Z

Closed by #4552

bnaecker added oximeter Metrics labels Oct 21, 2023

This was referenced Nov 22, 2023

oximeter's byte array datum could use some thought #4551

Open

Support missing samples in oximeter #4552

Merged

bnaecker closed this as completed Dec 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`oximeter` needs a way to represent missing samples #4311

`oximeter` needs a way to represent missing samples #4311

bnaecker commented Oct 21, 2023 •

edited

Loading

bnaecker commented Nov 16, 2023

bnaecker commented Nov 17, 2023

bnaecker commented Dec 4, 2023

oximeter needs a way to represent missing samples #4311

oximeter needs a way to represent missing samples #4311

Comments

bnaecker commented Oct 21, 2023 • edited Loading

bnaecker commented Nov 16, 2023

bnaecker commented Nov 17, 2023

bnaecker commented Dec 4, 2023

`oximeter` needs a way to represent missing samples #4311

`oximeter` needs a way to represent missing samples #4311

bnaecker commented Oct 21, 2023 •

edited

Loading