Reset ObservableGauge for all cached attributes sets to 0 #1221

Closed
Matthias247 opened this issue Aug 22, 2023 · 4 comments
Labels: A-metrics (Area: issues related to metrics), question (Further information is requested)

Comments

@Matthias247

I'm using an ObservableGauge to track errors that entities monitored by the application might have encountered.
The application scans the state of those entities at periodic intervals, aggregates the errors, and then updates the gauges according to the latest state. This uses code along these lines:

let mut attrs: Vec<KeyValue> = attributes.to_vec();
attrs.push(KeyValue::new("error", "".to_string()));
let mut total_errs = 0;

for (error, &count) in m.errors_encountered.iter() {
    total_errs += count;
    attrs.last_mut().unwrap().value = error.to_string().into();
    self.errors_gauge
        .observe(otel_cx, count as u64, &attrs);
}

attrs.last_mut().unwrap().value = "any".to_string().into();
self.errors_gauge
    .observe(otel_cx, total_errs as u64, &attrs);

That works fine for all errors that have been recently encountered.
However, I noticed that once a scan no longer reports a certain error, the opentelemetry-rust/prometheus stack still reports the old value for it. It would need to be explicitly set to 0.

Is there any mechanism in opentelemetry-rust that allows resetting a gauge (with all variations of attributes) to 0 before updating it again?

If there were a well-defined set of errors, I could obviously manually set the values that are not part of errors_encountered to 0. But since those errors are dynamic strings received from another application, that isn't easily possible.
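One way around the dynamic strings could be to remember every error label that has been reported so far and emit an explicit 0 for the ones missing from the current scan. The following is only a minimal sketch under that assumption: `ZeroFillTracker` and `values_to_observe` are hypothetical names, not part of opentelemetry-rust, and the call to `observe` is left as a comment.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Remembers every error label reported so far, so labels missing from a
/// later scan can be reported as 0 instead of keeping a stale value.
/// `ZeroFillTracker` is a hypothetical helper, not part of opentelemetry-rust.
#[derive(Default)]
struct ZeroFillTracker {
    seen: BTreeSet<String>,
}

impl ZeroFillTracker {
    /// Returns the values to observe for this scan: the current counts plus
    /// an explicit 0 for every previously seen label that is now absent.
    fn values_to_observe(&mut self, current: &BTreeMap<String, u64>) -> BTreeMap<String, u64> {
        // Start with 0 for everything seen in earlier scans.
        let mut out: BTreeMap<String, u64> =
            self.seen.iter().map(|label| (label.clone(), 0)).collect();
        // Overwrite with the current counts and remember any new labels.
        for (label, &count) in current {
            out.insert(label.clone(), count);
            self.seen.insert(label.clone());
        }
        out
    }
}

fn main() {
    let mut tracker = ZeroFillTracker::default();

    let scan1: BTreeMap<String, u64> =
        BTreeMap::from([("ErrA".to_string(), 2), ("ErrB".to_string(), 1)]);
    let scan2: BTreeMap<String, u64> = BTreeMap::from([("ErrB".to_string(), 3)]);

    for scan in [scan1, scan2] {
        // In the real callback each pair would be passed to `observe`
        // with the "error" attribute set to `label`.
        for (label, value) in tracker.values_to_observe(&scan) {
            println!("error={label} value={value}");
        }
    }
}
```

The obvious downside is that `seen` only ever grows, so with unbounded dynamic error strings it would need some eviction policy.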

Maybe the right answer here is also "you are doing it wrong and shouldn't use gauges for it", which is certainly debatable :) But for this particular problem, where the exact number of entities in a certain state should be determined independently of the report frequency and without metric math, they seem much easier to use.

@Matthias247
Author

I think this might be the equivalent method in prometheus directly? https://docs.rs/prometheus/latest/prometheus/core/struct.MetricVec.html#method.reset

@Matthias247
Author

I did some experimenting and I'm now wondering whether this behavior changed in 0.20 towards what I expect.

In 0.19 I built the following unit-test:

#[test]
fn test_logging_setup() {
    let metrics_controller = metrics::controllers::basic(metrics::processors::factory(
        metrics::selectors::simple::histogram([1.0, 10.0]),
        aggregation::cumulative_temporality_selector(),
    ))
    .with_collect_period(std::time::Duration::from_secs(0))
    .build();

    let metrics_exporter = Arc::new(opentelemetry_prometheus::exporter(metrics_controller).init());

    let meter = metrics_exporter.meter_provider().unwrap().meter("myservice");
    let x = meter.u64_observable_gauge("mygauge").init();

    let state = KeyValue::new("state", "mystate");
    let p1 = vec![state.clone(), KeyValue::new("error", "ErrA")];
    let p2 = vec![state.clone(), KeyValue::new("error", "ErrB")];
    let p3 = vec![state.clone(), KeyValue::new("error", "ErrC")];

    let counter = Arc::new(AtomicUsize::new(0));

    meter.register_callback(move |cx| {
        let count = counter.fetch_add(1, Ordering::SeqCst);
        println!("Collection {}", count);
        if count % 2 == 0 {
            x.observe(&cx, 1, &p1);
        } else {
            x.observe(&cx, 1, &p2);
        }
        if count % 3 == 1 {
            x.observe(&cx, 1, &p3);
        }

    }).unwrap();
    
    for _ in 0..10 {
        let mut buffer = vec![];
        let encoder = TextEncoder::new();
        let metric_families = metrics_exporter.registry().gather();
        encoder.encode(&metric_families, &mut buffer).unwrap();
        println!("{}", String::from_utf8(buffer).unwrap());
    }

    panic!("failed");
}

This produces the following output (only the gauge lines are extracted for brevity):

Collection 0
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1

Collection 1
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1

Collection 2
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1

Collection 3
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1

Collection 4
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1

Collection 5
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1

Collection 6
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1

Collection 7
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1

Collection 8
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1

Collection 9
# HELP mygauge mygauge
# TYPE mygauge gauge
mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1

As the output shows, once each attribute set has been observed at least once, every scrape encodes all 3 metrics:

mygauge{error="ErrA",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrB",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
mygauge{error="ErrC",service_name="unknown_service",state="mystate",otel_scope_name="myservice",otel_scope_version=""} 1
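The 0.19 output is consistent with values being written into a cache that is never cleared between collections. The sketch below is a toy model of that assumption in plain Rust, not the SDK's actual code; `observed` and `export_retained` are hypothetical names that mirror the `count % 2` / `count % 3` branches of the test callback.

```rust
use std::collections::BTreeMap;

/// Error labels the test callback observes on collection `count`
/// (mirrors the `count % 2` / `count % 3` branches in the test above).
fn observed(count: usize) -> Vec<&'static str> {
    let mut labels = vec![if count % 2 == 0 { "ErrA" } else { "ErrB" }];
    if count % 3 == 1 {
        labels.push("ErrC");
    }
    labels
}

/// Toy model (an assumption, not the SDK's code): values are written into a
/// map that is never cleared, so every label seen so far is exported again.
fn export_retained(cache: &mut BTreeMap<&'static str, u64>, count: usize) -> Vec<&'static str> {
    for label in observed(count) {
        cache.insert(label, 1);
    }
    cache.keys().copied().collect()
}

fn main() {
    let mut cache = BTreeMap::new();
    for count in 0..4 {
        println!("Collection {count}: {:?}", export_retained(&mut cache, count));
    }
}
```

Under this model, collection 0 exports only ErrA and every later collection exports ErrA, ErrB and ErrC, which is the pattern in the listing above.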

With 0.20, the equivalent code snippet seems to be this:

#[test]
fn test_logging_setup() {
    let prometheus_registry = prometheus::Registry::new();
    let metrics_exporter = opentelemetry_prometheus::exporter()
        .with_registry(prometheus_registry.clone())
        .build().unwrap();
    let meter_provider = metrics::MeterProvider::builder()
        .with_reader(metrics_exporter)
        .build();

    let meter = meter_provider.meter("myservice");
    let x = meter.u64_observable_gauge("mygauge").init();

    let state = KeyValue::new("state", "mystate");
    let p1 = vec![state.clone(), KeyValue::new("error", "ErrA")];
    let p2 = vec![state.clone(), KeyValue::new("error", "ErrB")];
    let p3 = vec![state.clone(), KeyValue::new("error", "ErrC")];

    let counter = Arc::new(AtomicUsize::new(0));

    meter.register_callback(&[x.as_any()], move |observer| {
        let count = counter.fetch_add(1, Ordering::SeqCst);
        println!("Collection {}", count);
        if count % 2 == 0 {
            observer.observe_u64(&x, 1, &p1);
        } else {
            observer.observe_u64(&x, 1, &p2);
        }
        if count % 3 == 1 {
            observer.observe_u64(&x, 1, &p3);
        }

    }).unwrap();
    
    for _ in 0..10 {
        let mut buffer = vec![];
        let encoder = TextEncoder::new();
        let metric_families = prometheus_registry.gather();
        encoder.encode(&metric_families, &mut buffer).unwrap();
        println!("{}", String::from_utf8(buffer).unwrap());
    }

    panic!("failed");
}

This produces the following output (only the gauge lines are extracted for brevity):

Collection 0
# TYPE mygauge gauge
mygauge{error="ErrA",state="mystate",otel_scope_name="myservice"} 1

Collection 1
# TYPE mygauge gauge
mygauge{error="ErrB",state="mystate",otel_scope_name="myservice"} 1
mygauge{error="ErrC",state="mystate",otel_scope_name="myservice"} 1

Collection 2
# TYPE mygauge gauge
mygauge{error="ErrA",state="mystate",otel_scope_name="myservice"} 1

Collection 3
# TYPE mygauge gauge
mygauge{error="ErrB",state="mystate",otel_scope_name="myservice"} 1

Collection 4
# TYPE mygauge gauge
mygauge{error="ErrA",state="mystate",otel_scope_name="myservice"} 1
mygauge{error="ErrC",state="mystate",otel_scope_name="myservice"} 1

Collection 5
# TYPE mygauge gauge
mygauge{error="ErrB",state="mystate",otel_scope_name="myservice"} 1

Collection 6
# TYPE mygauge gauge
mygauge{error="ErrA",state="mystate",otel_scope_name="myservice"} 1

Collection 7
# TYPE mygauge gauge
mygauge{error="ErrB",state="mystate",otel_scope_name="myservice"} 1
mygauge{error="ErrC",state="mystate",otel_scope_name="myservice"} 1

Collection 8
# TYPE mygauge gauge
mygauge{error="ErrA",state="mystate",otel_scope_name="myservice"} 1

Collection 9
# TYPE mygauge gauge
mygauge{error="ErrB",state="mystate",otel_scope_name="myservice"} 1

Here the exported metrics match what I expect: only the gauge values for attribute sets that were submitted in the last callback are retained. I'm not sure what Prometheus will actually make of it (whether it resets non-submitted values to 0 or not), but I will figure it out.
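To make it clearer which series vanish between scrapes in the 0.20 output, here is a small plain-Rust sketch that replays the callback schedule and lists the labels that were exported in the previous collection but are missing from the current one. It models only the observed output and not the SDK itself; `observed` and `dropped` are hypothetical helper names.

```rust
use std::collections::BTreeSet;

/// Error labels the test callback observes on collection `count`
/// (mirrors the `count % 2` / `count % 3` branches in the test above).
fn observed(count: usize) -> BTreeSet<&'static str> {
    let mut labels = BTreeSet::new();
    labels.insert(if count % 2 == 0 { "ErrA" } else { "ErrB" });
    if count % 3 == 1 {
        labels.insert("ErrC");
    }
    labels
}

/// Labels present in `previous` but missing from `current`.
fn dropped(
    previous: &BTreeSet<&'static str>,
    current: &BTreeSet<&'static str>,
) -> Vec<&'static str> {
    previous.difference(current).copied().collect()
}

fn main() {
    let mut previous = BTreeSet::new();
    for count in 0..4 {
        // Each collection starts from an empty set: only what the callback
        // observed this time is exported.
        let current = observed(count);
        println!(
            "Collection {count}: exported {:?}, dropped {:?}",
            current,
            dropped(&previous, &current)
        );
        previous = current;
    }
}
```

As far as I know, Prometheus treats a series that disappears from a scrape as stale rather than as 0, so the dropped labels would simply stop returning values in queries; that is worth verifying against the actual deployment.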

Is this change in behavior expected, and was it a bugfix for 0.20? Or was the setup code for 0.19, which I mostly copied from examples, instructing the library to behave this way? The description of #1000 doesn't seem to explain such a behavior change.

@cijothomas
Member

0.20 is a near-complete rewrite of the Metrics API/SDK, as the older implementation was based on an OLD, experimental version of the spec itself. The new one matches the stable spec, so if you are getting what you need with the new one, that is awesome news!

@TommyCpp TommyCpp self-assigned this Sep 26, 2023
@TommyCpp TommyCpp added question Further information is requested A-metrics Area: issues related to metrics labels Sep 26, 2023
@TommyCpp
Contributor

TommyCpp commented Oct 1, 2023

It seems to be related to #955. I think the issue has been patched in v0.20, so I will close this out for now. Feel free to reopen if you have other questions/comments.

@TommyCpp TommyCpp closed this as completed Oct 1, 2023