
Add Minimum Queue Size Threshold to Metrics API #284

Open
MHillyer opened this issue Sep 19, 2024 · 1 comment
Labels: enhancement (New feature or request)

MHillyer commented Sep 19, 2024

As a MailOps administrator I use the metrics API to monitor my servers looking to see if any queues are getting too large so I can investigate further.

My monitoring tool of choice is overwhelmed by the sheer number of metrics returned because I get data on each and every queue that is active, when I only care about larger queues.

If I can pass an argument to say what the smallest queue size is that I care about, I can significantly narrow down how much data the API returns, without losing any relevant information for my monitoring purposes.

For example, if I pass a minimum size of 1000, any queue holding fewer than 1000 messages would be omitted from the metrics API response.
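A minimal sketch of the requested behavior, assuming Prometheus-style samples represented as `(metric_name, labels, value)` tuples. The metric name `scheduled_count` is taken from the discussion below; the `queue` label and the function name are illustrative, not part of the actual kumod API:

```python
# Illustrative sketch: drop all metric samples for queues whose
# scheduled_count falls below a caller-supplied threshold.
# Metric/label names here are hypothetical, not the real kumod schema.

def filter_queue_metrics(samples, min_size):
    """samples: list of (metric_name, labels_dict, value) tuples."""
    # First pass: find queues whose scheduled_count meets the threshold.
    large_queues = {
        labels.get("queue")
        for name, labels, value in samples
        if name == "scheduled_count" and value >= min_size
    }
    # Second pass: keep only samples belonging to those queues.
    return [
        (name, labels, value)
        for name, labels, value in samples
        if labels.get("queue") in large_queues
    ]

samples = [
    ("scheduled_count", {"queue": "gmail.com"}, 2500),
    ("ready_count", {"queue": "gmail.com"}, 40),
    ("scheduled_count", {"queue": "example.com"}, 12),
    ("ready_count", {"queue": "example.com"}, 3),
]

# With min_size=1000, only the gmail.com samples survive.
filtered = filter_queue_metrics(samples, min_size=1000)
```

Note that the second pass is exactly where the difficulty described below arises: related metrics can only be excluded if something knows which queue each sample belongs to.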

@MHillyer MHillyer converted this from a draft issue Sep 19, 2024
@MHillyer MHillyer added the enhancement New feature or request label Sep 19, 2024
wez commented Sep 19, 2024

There is more nuance here than a simple minimum queue-size threshold: the metrics themselves don't know that they are associated with a queue, and the export logic has no concept of a queue, because metrics collection and export live in a separate crate that is independent of the queuing portion of kumod.

While we could fairly easily apply a minimum threshold to literally just the scheduled_count metric, the exporter cannot easily know that the other half-dozen or so related metrics should also be excluded when scheduled_count is below that threshold. Doing so would introduce quadratic complexity into the exporter: every metric would need to know its relationship with all the others, and resolve and evaluate the queue-size metric from that association.

We could shift the responsibility for exclusion to the client by requiring it to explicitly pass the list of metrics and their thresholds as part of the /metrics GET request, but there are already quite a few variations of metrics and rollups, and that list would quickly become cluttered and difficult to manage.

The way I'm leaning at the moment is that it might be best to leave that sort of filtering logic to the prometheus configuration, as discussed in https://grafana.com/blog/2022/10/20/how-to-manage-high-cardinality-metrics-in-prometheus-and-kubernetes/ and https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/, because the configuration is at least easier to see and understand in the prometheus config file than when it is all pushed into a giant HTTP URL.
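As a sketch of that Prometheus-side approach, a recording rule can use a comparison operator to keep only the series above a threshold. The metric name `scheduled_count` comes from the discussion above; the group and rule names are illustrative:

```yaml
# Hypothetical Prometheus recording rule: the `> 1000` filter drops
# every scheduled_count series at or below the threshold, and the
# surviving series are recorded under a new name for dashboards/alerts.
groups:
  - name: kumod_queue_rollups
    rules:
      - record: queue:scheduled_count:large
        expr: scheduled_count > 1000
```

This keeps the threshold visible and editable in one place in the Prometheus config, rather than encoded into the scrape URL.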

In discussion with a customer, I got the impression that the prometheus export doesn't really work as well as desired at scale because the cardinality is so high, and there are some operational states around understanding the various causes of throttling that cannot be expressed in the relatively limited numerical form that prometheus supports. What I'm exploring at the moment is a non-prometheus endpoint that can more easily be constrained with thresholds and can also convey textual and timestamp information; for example, we could indicate that the maintainer has reached a connection cap due to hitting a specific provider throttle, its name, and when that state came into effect.
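Purely to illustrate the kind of textual and timestamp information such an endpoint could carry (every field name here is hypothetical, not a committed design):

```json
{
  "queue": "example.com",
  "scheduled_count": 4200,
  "state": "connection_cap",
  "reason": "provider throttle: example-provider connection limit",
  "since": "2024-09-19T17:03:00Z"
}
```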

Projects
Status: Todo