As a MailOps administrator, I use the metrics API to monitor my servers, looking to see if any queues are getting too large so that I can investigate further.
My monitoring tool of choice is overwhelmed by the sheer number of metrics returned because I get data on each and every queue that is active, when I only care about larger queues.
If I can pass an argument to say what the smallest queue size is that I care about, I can significantly narrow down how much data the API returns, without losing any relevant information for my monitoring purposes.
For example, I could pass a minimum size of 1000, and any queue smaller than 1000 messages would not be returned by the metrics API.
The situation has more nuance than a simple minimum threshold for queue size. The metrics don't know that they are associated with a queue, and the logic for exporting metrics has no concept of a queue, because metrics and exporting live in a separate crate that is independent from the queuing portion of kumod.
While we can fairly easily set a minimum threshold for literally just the scheduled_count metric, the exporter cannot easily know that the other half-dozen or so related metrics should also be excluded when scheduled_count is below that threshold. Doing so would introduce quadratic complexity to the exporter: every metric would need to know about its relationship with the others, and resolve and evaluate the queue size metric through that association.
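To make the association problem concrete, here is a minimal client-side sketch of what such a filter has to do. It is not how kumod's exporter works. It assumes the metrics arrive in Prometheus text format and that related metrics share a `queue` label; that label name, the `ready_count` and `total_connections` metric names in the usage note, and the helper `filter_by_queue_size` are assumptions for illustration only. The filter needs two passes: one to learn each queue's size, and one to drop every sample tied to a small queue.

```python
import re

# Matches a labelled Prometheus text-format sample, for example:
#   scheduled_count{queue="example.com"} 1500
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
# Assumption: related metrics are tied together by a label named "queue".
QUEUE_RE = re.compile(r'queue="([^"]*)"')


def filter_by_queue_size(text, min_size, size_metric="scheduled_count"):
    """Return the sample lines whose queue has size_metric >= min_size.

    Samples without a queue label are kept unchanged. Comment lines
    (# HELP / # TYPE) are dropped to keep the sketch short. A queue that
    never reports size_metric is treated as having size 0.
    """
    samples = []  # (queue name or None, original line)
    sizes = {}    # queue name -> value of size_metric

    # Pass 1: remember every sample and record each queue's size.
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        queue = None
        match = SAMPLE_RE.match(line)
        if match:
            label = QUEUE_RE.search(match.group("labels"))
            if label:
                queue = label.group(1)
                if match.group("name") == size_metric:
                    sizes[queue] = float(match.group("value"))
        samples.append((queue, line))

    # Pass 2: drop every sample that belongs to a queue below the threshold.
    return [
        line
        for queue, line in samples
        if queue is None or sizes.get(queue, 0.0) >= min_size
    ]
```

Given samples for a queue with `scheduled_count` of 1500 and another with 3, calling `filter_by_queue_size(text, 1000)` keeps all samples for the first queue, drops all samples for the second, and leaves unlabelled samples such as a hypothetical `total_connections` untouched. The point of the sketch is that the decision for every related metric depends on first resolving one specific metric for the same queue, which is exactly the knowledge the exporter does not have.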
We could shift the responsibility for exclusion to the client by making them explicitly pass in the list of metrics and their thresholds as part of the /metrics GET request, but there are already quite a few different variations of metrics and rollups and that list would immediately become very cluttered and difficult to manage.
In discussion with a customer, I got the impression that the Prometheus export doesn't really work as well as desired at scale because the cardinality is so high, and there are some operational states around understanding the various causes of throttling that cannot be expressed in the relatively limited numerical form that Prometheus supports. What I'm exploring at the moment is a non-Prometheus endpoint that can more easily be constrained with thresholds and can also show textual and timestamp information; for example, we could indicate that the maintainer has reached a connection cap due to hitting a specific provider throttle, its name, and when that state came into effect.
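As a rough illustration of the kind of record such an endpoint could return, here is a sketch of a per-queue status that carries text and a timestamp alongside the count, with the threshold applied to whole records. Every name here (`QueueStatus`, its fields, `select_queues`) is hypothetical and does not describe an existing or planned kumod API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class QueueStatus:
    """Hypothetical per-queue record; all field names are assumptions."""
    name: str
    scheduled_count: int
    # Textual state that a purely numeric metric cannot express.
    state: Optional[str] = None
    # Name of the provider throttle that caused the state, if any.
    throttle_name: Optional[str] = None
    # When the state came into effect.
    since: Optional[datetime] = None


def select_queues(statuses: List[QueueStatus], min_size: int) -> List[QueueStatus]:
    """Keep only queues whose scheduled_count is at least min_size.

    Because each record is already grouped per queue, the threshold
    removes the queue's related data in one step.
    """
    return [s for s in statuses if s.scheduled_count >= min_size]


statuses = [
    QueueStatus(
        name="big.example",
        scheduled_count=1500,
        state="connection cap reached",
        throttle_name="example-provider-limit",
        since=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    ),
    QueueStatus(name="small.example", scheduled_count=3),
]
large = select_queues(statuses, min_size=1000)
```

Grouping by queue at the source is what makes the threshold cheap here: there is no need to resolve one metric in order to decide the fate of the others.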