
Remove stale gauges #955

Closed
rikonor opened this issue Jan 24, 2023 · 6 comments
Labels: A-metrics (Area: issues related to metrics) · question (Further information is requested)

Comments


rikonor commented Jan 24, 2023

Hi!

I have an example that exports a set of gauges (a rough sketch is below).

In some cases, my gauges become stale (i.e. the thing they track no longer exists), in which case I don't want them to be exported anymore.

Failing to do this just makes each gauge retain its last value forever, which makes my metric aggregations wrong.
Is there a way to remove these stale metrics from my meter or registry?
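
(The original example isn't reproduced here, so the following is a rough sketch of the setup being described, assuming the observable-gauge API of the opentelemetry crate around v0.18; names like `health-checker` and `node_up` are illustrative:)

```rust
use opentelemetry::{global, metrics::Result, Context, KeyValue};

// Hypothetical health-check metrics: one `node_up` series per node.
// API shapes are approximate for opentelemetry ~0.18 and changed in
// later releases.
fn register_node_gauges() -> Result<()> {
    let meter = global::meter("health-checker");
    let gauge = meter.u64_observable_gauge("node_up").init();

    meter.register_callback(move |cx: &Context| {
        // Only nodes in the current listing get observed here, but the
        // cumulative (Prometheus) exporter keeps exporting series that
        // were observed in earlier collections, at their last value.
        gauge.observe(cx, 1, &[KeyValue::new("node", "node-a")]);
    })
}
```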

TommyCpp (Contributor) commented

You can set `with_memory(false)` so that the metrics processor will not remember stale gauges. #946 will remove this option and always forget stale instruments.
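
(For reference, `with_memory` is set on the processor factory when building the pull controller; a sketch against the 0.18-era SDK, with builder names approximate:)

```rust
use opentelemetry::sdk::export::metrics::aggregation;
use opentelemetry::sdk::metrics::controllers::BasicController;
use opentelemetry::sdk::metrics::{controllers, processors, selectors};

// Where `with_memory` lives in the ~0.18 SDK; #946 removes it entirely.
fn build_controller() -> BasicController {
    controllers::basic(
        processors::factory(
            selectors::simple::histogram([0.5, 1.0, 2.0]),
            aggregation::cumulative_temporality_selector(),
        )
        // `false`: don't re-report instruments that were not updated in
        // the most recent collection interval.
        .with_memory(false),
    )
    .build()
}
```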


rikonor commented Jan 25, 2023

Hi @TommyCpp, thanks for your reply!

Can you elaborate a bit? The docs for `with_memory(...)` say the following:

Memory controls whether the processor remembers metric instruments and attribute sets that were previously reported.
When Memory is true, Reader::try_for_each will visit metrics that were not updated in the most recent interval.

Does this mean that there's some internally defined interval after which a metric expires and is removed from the registry? If so, what is that interval?

Also, what if I only want to drop a specific metric? `with_memory(...)` sounds like it makes that decision globally.

@TommyCpp added the question (Further information is requested) and A-metrics (Area: issues related to metrics) labels on Jan 26, 2023

TommyCpp commented Jan 26, 2023

Sorry, I was in a hurry this morning and didn't realize you are using the Prometheus exporter, which only supports Cumulative temporality. That means the SDK tracks state for each instrument and aggregates all data points since the application started. The `with_memory` setting only works with Delta temporality.

Looking at your example, I feel like the instrument you are looking for is a Histogram?


Does this mean that there's some internally defined interval after which a metric expires

For the Prometheus use case, where the exporter is pull-based, the interval is the time between two consecutive collect calls.

Also, what if only want to drop a specific metric?

That's part of the reason we are dropping the `with_memory` method. We are working on aligning the implementation with the new spec, which has Views to help you tweak and drop metrics from specific instruments.
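
(A sketch of what a "drop" View could look like; Views had not shipped in opentelemetry-rust when this was written, so every name below is hypothetical, loosely following the spec's drop aggregation and the rough shape the SDK later took:)

```rust
// Hypothetical only: `new_view`, `Instrument`, `Stream`, and
// `Aggregation::Drop` are illustrative, not a released API at the time
// of this thread.
use opentelemetry::sdk::metrics::{new_view, Aggregation, Instrument, Stream};

fn drop_node_up_view() -> opentelemetry::metrics::Result<()> {
    let view = new_view(
        Instrument::new().name("node_up"),            // match the instrument...
        Stream::new().aggregation(Aggregation::Drop), // ...and drop its data
    )?;
    // The view would then be passed to the MeterProvider builder.
    let _ = view;
    Ok(())
}
```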


rikonor commented Jan 26, 2023

Looking at your example, I feel like the instrument you are looking for is a Histogram?

I'm not sure that's correct. I'll try to give a better explanation of our use case.

We have a service that periodically obtains a listing of nodes and runs a health check on them. Among other metrics, it also exposes a gauge with a unique label for each node indicating 0 (for down) or 1 (for up).

Occasionally, a node that was previously in the listing is removed, so from that point on our service won't run any health checks against it. In that case, we don't want to retain the gauge with the label for that node, since it will just keep the last value it had (e.g. 0) forever.

It would be nice if there was a way to remove that specific instance of the gauge (while retaining all the other ones), but I haven't been able to find a way to do it with the SDK.

It's a bit hacky, but a short-term solution for now would be to manually remove those entries from the exporter's text output.
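
(For illustration, that workaround could look something like the following; `scrub_stale_gauges` and the `node` label name are hypothetical, and only the Rust standard library is used:)

```rust
use std::collections::HashSet;

// Hypothetical post-processing pass over the Prometheus exposition text:
// keep comment lines (# HELP / # TYPE) and any sample whose `node` label
// is still in the active set; drop samples for removed nodes.
fn scrub_stale_gauges(exposition: &str, active_nodes: &HashSet<String>) -> String {
    exposition
        .lines()
        .filter(|line| {
            line.starts_with('#')
                || !line.contains("node=\"")
                || active_nodes
                    .iter()
                    .any(|n| line.contains(&format!("node=\"{n}\"")))
        })
        .map(|line| format!("{line}\n"))
        .collect()
}
```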

gitlab-dfinity pushed a commit to dfinity/ic that referenced this issue Jan 26, 2023
feature(BN): control-plane metrics should not include stale gauges

There's an issue at the moment where if the NNS registry no longer includes a specific replica, the Boundary Node retains the last gauge value for that replica.
I.e. if the last health check before removal was a failure, our graphs will forever show a red line indicating that replica is down, when in reality it's simply no longer in the registry and we never check it again.
Instead, we should just exclude these stale gauges from the metrics we export.

Ideally, this is something we could do with the metrics library we use (opentelemetry and opentelemetry_prometheus), but from my impression it doesn't seem possible atm (See [opentelemetry-rust#955](open-telemetry/opentelemetry-rust#955)).

What we end up doing is very primitive: before handing the metrics response to Prometheus, we go over it and manually clean up any stale gauge entries.

See merge request dfinity-lab/public/ic!10377
TommyCpp (Contributor) commented

Yeah, I think you are right. The ideal solution would be unregister support.

I will create an issue to add unregister support. I think there are a few things that are worth discussing before we implement it. Sorry we don't have a better solution for you 😔

Closing with feature tracking issue #958.

@TommyCpp closed this as not planned on Jan 28, 2023

ostkrok commented Apr 20, 2023

Hi @TommyCpp,

We're also interested in the same behavior as @rikonor.

However, I don't think that unregister as it's currently defined does what we want. It only unregisters the callbacks that update the instruments, but doesn't remove the actual values from the registry.

So I think this issue should be reopened, as #958 won't solve it.
