Remove stale gauges #955
Comments
You can set |
Hi @TommyCpp, thanks for your reply! Can you elaborate a bit?
Does this mean that there's some internally defined interval after which a metric expires and is removed from the registry? If so, what is that interval? Also, what if I only want to drop a specific metric?
Sorry was in a hurry this morning so I didn't realize you are using prometheus exporter, which only supports Looking at your example, I feel like the instrument you are looking for is
For prometheus use case, where the exporter is pull based. The interval will be the time between two consecutive
That's part of the reason we are dropping |
I'm not sure that's correct. I'll try to give a better explanation of our use case. We have a service that periodically obtains a listing of nodes and runs a health check on them. Among other metrics, it also exposes a gauge with a unique label for each node indicating 0 (for down) or 1 (for up). Occasionally, a node that was previously in the listing is removed, so from that point on our service won't do any health checks against that removed node. In that case, we don't want to retain the gauge with the label for that node, since it will just keep the last value it had (e.g. 0) forever. It would be nice if there was a way to remove that specific instance of the gauge (while retaining all the other ones), but I haven't been able to find a way to do it with the SDK. It's a bit hacky, but a short-term solution for now would be to manually remove those entries from the exporter's text output.
feature(BN): control-plane metrics should not include stale gauges There's an issue at the moment where if the NNS registry no longer includes a specific replica, the Boundary Node retains the last gauge value for that replica. I.e., if the last health check was a failure, then our graphs will forever show a RED line indicating that replica is down. But in reality, it's because the replica is no longer in the registry and therefore we never check it again. Instead, we should just exclude these stale gauges from the metrics we export. Ideally, this is something we could do with the metrics library we use (opentelemetry and opentelemetry_prometheus), but from my impression it doesn't seem possible atm (See [opentelemetry-rust#955](open-telemetry/opentelemetry-rust#955)). What we end up doing is very primitive: before handing out the metrics response to Prometheus, we go over it and manually clean up any stale gauge entries. See merge request dfinity-lab/public/ic!10377
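For anyone who needs the same stopgap, here is a minimal sketch of that kind of post-processing, using only the standard library. It is not part of opentelemetry or opentelemetry_prometheus and is not the code from the merge request above. The function names, the `node_up` metric name and the `node` label used in the usage note are hypothetical. It assumes the exporter's output is the Prometheus text exposition format, with one sample per line, and that label values contain no commas, closing braces or escaped quotes.

```rust
use std::collections::HashSet;

/// Returns a copy of `exposition` without the samples of `metric` whose
/// `label` value is not in `live`. Comment lines (`# HELP`, `# TYPE`) and
/// samples of every other metric pass through unchanged.
/// Hypothetical helper for illustration only.
pub fn strip_stale_samples(
    exposition: &str,
    metric: &str,
    label: &str,
    live: &HashSet<&str>,
) -> String {
    let mut out = String::with_capacity(exposition.len());
    for line in exposition.lines() {
        if keep_line(line, metric, label, live) {
            out.push_str(line);
            out.push('\n');
        }
    }
    out
}

/// Decides whether a single exposition line should be kept.
fn keep_line(line: &str, metric: &str, label: &str, live: &HashSet<&str>) -> bool {
    // Comments and samples of other metrics do not start with the metric name.
    let Some(rest) = line.strip_prefix(metric) else { return true };
    // Require '{' right after the name so that e.g. `node_up_total` is not
    // mistaken for `node_up`; unlabeled samples are also left alone.
    let Some(rest) = rest.strip_prefix('{') else { return true };
    let Some(end) = rest.find('}') else { return true };

    // Naive label parsing (assumption: no commas or escaped quotes in values).
    for pair in rest[..end].split(',') {
        if let Some((name, value)) = pair.split_once('=') {
            if name.trim() == label {
                return live.contains(value.trim().trim_matches('"'));
            }
        }
    }
    // The label is absent on this sample, so there is nothing to judge it by.
    true
}
```

A caller would encode the registry to text as usual, then call something like `strip_stale_samples(&body, "node_up", "node", &live_nodes)` just before returning the response to Prometheus. A real implementation would need a proper parser for the exposition format rather than string splitting.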
Yeah, I think you are right. The ideal solution will be I will create an issue to add Close with feature tracking issue #958. |
Hi @TommyCpp, we're also interested in the same behavior as @rikonor. However, I don't think that unregister as it's currently defined does what we want. It only unregisters the callbacks that update the instruments, but doesn't remove the actual values from the registry. So I think this issue should be re-opened, as #958 won't solve it.
Hi!
I have the following example that exports a set of gauges.
In some cases, my gauges become stale (i.e., the thing they track no longer exists), in which case I don't want them to be exported anymore.
Failing to do this just makes the gauge retain its value forever, which makes my metric aggregations wrong.
Is there a way to remove these stale metrics from my meter or registry?