
Remove stale gauges #955

Closed
rikonor opened this issue Jan 24, 2023 · 6 comments
Labels: A-metrics (Area: issues related to metrics) · question (Further information is requested)

Comments


rikonor commented Jan 24, 2023

Hi!

I have an example that exports a set of gauges (a rough sketch is below).

In some cases, my gauges become stale (i.e. the thing they track no longer exists), in which case I don't want them to be exported anymore.

Failing to do this just makes each gauge retain its last value forever, which makes my metric aggregations wrong.
Is there a way to remove these stale metrics from my meter or registry?
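
(The original example isn't reproduced here, so the following is a rough sketch of the setup being described, assuming the observable-gauge API of the opentelemetry crate around v0.18; names like `health-checker` and `node_up` are illustrative:)

```rust
use opentelemetry::{global, metrics::Result, Context, KeyValue};

// Hypothetical health-check metrics: one `node_up` series per node.
// API shapes are approximate for opentelemetry ~0.18 and changed in
// later releases.
fn register_node_gauges() -> Result<()> {
    let meter = global::meter("health-checker");
    let gauge = meter.u64_observable_gauge("node_up").init();

    meter.register_callback(move |cx: &Context| {
        // Only nodes in the current listing get observed here, but the
        // cumulative (Prometheus) exporter keeps exporting series that
        // were observed in earlier collections, at their last value.
        gauge.observe(cx, 1, &[KeyValue::new("node", "node-a")]);
    })
}
```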

TommyCpp (Contributor) commented

You can set `with_memory(false)` so that the metrics processor will not remember stale gauges. #946 will remove this option and always forget stale instruments.
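
(For reference, `with_memory` is set on the processor factory when building the pull controller; a sketch against the 0.18-era SDK, with builder names approximate:)

```rust
use opentelemetry::sdk::export::metrics::aggregation;
use opentelemetry::sdk::metrics::controllers::BasicController;
use opentelemetry::sdk::metrics::{controllers, processors, selectors};

// Where `with_memory` lives in the ~0.18 SDK; #946 removes it entirely.
fn build_controller() -> BasicController {
    controllers::basic(
        processors::factory(
            selectors::simple::histogram([0.5, 1.0, 2.0]),
            aggregation::cumulative_temporality_selector(),
        )
        // `false`: don't re-report instruments that were not updated in
        // the most recent collection interval.
        .with_memory(false),
    )
    .build()
}
```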


rikonor commented Jan 25, 2023

Hi @TommyCpp, thanks for your reply!

Can you elaborate a bit? The docs for `with_memory(...)` say the following:

Memory controls whether the processor remembers metric instruments and attribute sets that were previously reported.
When Memory is true, Reader::try_for_each will visit metrics that were not updated in the most recent interval.

Does this mean that there's some internally defined interval after which a metric expires and is removed from the registry? If so, what is that interval?

Also, what if I only want to drop a specific metric? `with_memory(...)` sounds like it makes that decision globally.

@TommyCpp added the question (Further information is requested) and A-metrics (Area: issues related to metrics) labels on Jan 26, 2023

TommyCpp commented Jan 26, 2023

Sorry, I was in a hurry this morning and didn't realize you are using the Prometheus exporter, which only supports Cumulative temporality. That means the SDK tracks state for each instrument and aggregates all data points since the application started. The `with_memory` setting only works with Delta temporality.

Looking at your example, I feel like the instrument you are looking for is a Histogram?


Does this mean that there's some internally defined interval after which a metric expires

For the Prometheus use case, where the exporter is pull-based, the interval is the time between two consecutive collect calls.

Also, what if only want to drop a specific metric?

That's part of the reason we are dropping the `with_memory` method. We are working on aligning the implementation with the new spec, which has Views to help you tweak and drop metrics from specific instruments.
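
(A sketch of what a "drop" View could look like; Views had not shipped in opentelemetry-rust when this was written, so every name below is hypothetical, loosely following the spec's drop aggregation and the rough shape the SDK later took:)

```rust
// Hypothetical only: `new_view`, `Instrument`, `Stream`, and
// `Aggregation::Drop` are illustrative, not a released API at the time
// of this thread.
use opentelemetry::sdk::metrics::{new_view, Aggregation, Instrument, Stream};

fn drop_node_up_view() -> opentelemetry::metrics::Result<()> {
    let view = new_view(
        Instrument::new().name("node_up"),            // match the instrument...
        Stream::new().aggregation(Aggregation::Drop), // ...and drop its data
    )?;
    // The view would then be passed to the MeterProvider builder.
    let _ = view;
    Ok(())
}
```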


rikonor commented Jan 26, 2023

Looking at your example, I feel like the instrument you are looking for is a Histogram?

I'm not sure that's correct. I'll try to give a better explanation of our use case.

We have a service that periodically obtains a listing of nodes and runs a health check on them. Among other metrics, it also exposes a gauge with a unique label for each node indicating 0 (for down) or 1 (for up).

Occasionally, a node that was previously in the listing is removed, so from that point on our service won't run any health checks against it. In that case, we don't want to retain the gauge with the label for that node, since it will just keep the last value it had (e.g. 0) forever.

It would be nice if there was a way to remove that specific instance of the gauge (while retaining all the other ones), but I haven't been able to find a way to do it with the SDK.

It's a bit hacky, but a short-term solution for now would be to manually remove those entries from the exporter's text output.
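
(For illustration, that workaround could look something like the following; `scrub_stale_gauges` and the `node` label name are hypothetical, and only the Rust standard library is used:)

```rust
use std::collections::HashSet;

// Hypothetical post-processing pass over the Prometheus exposition text:
// keep comment lines (# HELP / # TYPE) and any sample whose `node` label
// is still in the active set; drop samples for removed nodes.
fn scrub_stale_gauges(exposition: &str, active_nodes: &HashSet<String>) -> String {
    exposition
        .lines()
        .filter(|line| {
            line.starts_with('#')
                || !line.contains("node=\"")
                || active_nodes
                    .iter()
                    .any(|n| line.contains(&format!("node=\"{n}\"")))
        })
        .map(|line| format!("{line}\n"))
        .collect()
}
```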

gitlab-dfinity pushed a commit to dfinity/ic that referenced this issue Jan 26, 2023
feature(BN): control-plane metrics should not include stale gauges

There's an issue at the moment where if the NNS registry no longer includes a specific replica, the Boundary Node retains the last gauge value for that replica.
I.e. if the last health check before removal was a failure, our graphs will forever show a red line indicating that replica is down, when in reality it's simply no longer in the registry and we never check it again.
Instead, we should just exclude these stale gauges from the metrics we export.

Ideally, this is something we could do with the metrics library we use (opentelemetry and opentelemetry_prometheus), but from my impression it doesn't seem possible atm (See [opentelemetry-rust#955](open-telemetry/opentelemetry-rust#955)).

What we end up doing is very primitive: before handing the metrics response to Prometheus, we go over it and manually clean up any stale gauge entries.

See merge request dfinity-lab/public/ic!10377
TommyCpp (Contributor) commented

Yeah, I think you are right. The ideal solution would be unregister support.

I will create an issue to add unregister support. I think there are a few things that are worth discussing before we implement it. Sorry we don't have a better solution for you 😔

Closing with feature tracking issue #958.

@TommyCpp closed this as not planned on Jan 28, 2023

ostkrok commented Apr 20, 2023

Hi @TommyCpp,

We're also interested in the same behavior as @rikonor.

However, I don't think that unregister as it's currently defined does what we want. It only unregisters the callbacks that update the instruments, but doesn't remove the actual values from the registry.

So I think this issue should be reopened, as #958 won't solve it.
