Add KubePodEvictionRateHigh alert for elevated eviction rates #760
Conversation
Signed-off-by: Mac Chaffee <[email protected]>
```jsonnet
{
  alert: 'KubePodEvictionRateHigh',
  expr: |||
    sum(rate(kubelet_evictions[15m])) > %(highEvictionRateThreshold)s
```
Did you check kube_pod_status_reason? I think the reason label of this metric should be used to find whether the pod has been evicted.
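For reference, a rough sketch of what that suggestion might look like in the mixin's jsonnet, assuming kube-state-metrics exposes kube_pod_status_reason with a reason label; the alert name, duration, severity, and annotations below are placeholders, not anything from this PR:

```jsonnet
// Hypothetical sketch only -- not part of this PR.
// Assumes kube-state-metrics exposes kube_pod_status_reason{reason="Evicted"}.
{
  alert: 'KubePodEvicted',  // placeholder name
  expr: |||
    sum by (namespace, pod) (kube_pod_status_reason{reason="Evicted"}) > 0
  |||,
  'for': '5m',
  labels: { severity: 'warning' },
  annotations: {
    summary: 'Pod has been evicted.',
    description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} has been evicted.',
  },
}
```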
Hmm, it looks like that metric can't be used to determine the eviction rate, since the query returns results for hours/days after the eviction actually happened (those two evictions happened yesterday, but they still appear when searching for evictions in the last 15 minutes).
Maybe there's a way we could still use kube_pod_status_reason query results in the description? Or is the description limited to using results from the expr?
We need to figure out a solution where the responsible team gets the alert rather than a general KubeEvictionHigh alert. Example: if a namespace owned by team A sets limits too low and the pod gets evicted, team A should get the alert, not the infrastructure team :)
Sounds like we're stuck with the limitations of the kubelet_evictions metric, then. Also, that example only applies to clusters where resource limits are mandated, which is sadly not the norm. If limits aren't mandated, one pod using too many resources can cause a cascade of evictions across the whole node, something I think we should send alerts for sooner rather than later. You can always silence alerts, but the alert has to exist first :)
We can't add a namespace label, since kubelet_evictions doesn't carry one. So it seems the current implementation is the best we can do.
What about using kube-state-metrics and looking for Evicted Pods?
That has the same problem described above: evicted pods keep showing up in the query results long after the eviction actually happened, so it can't be used to measure an eviction rate.
Sounds like there just isn't enough data to make a good, actionable alert.

For anyone seeing this PR: I do still recommend using this alert, since higher-than-normal eviction rates are always good to catch. You would just have to tune the threshold for your own cluster and have other sources of data (like event logging) to cross-reference.
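As a rough sketch of what that per-cluster tuning could look like when consuming the mixin (the import path, the prometheusAlerts field, and the highEvictionRateThreshold name are assumptions based on this PR's placeholder, so double-check them against config.libsonnet):

```jsonnet
// Hypothetical consumer-side override; verify the field name and default
// against the mixin's config.libsonnet before relying on this.
local kubernetesMixin = (import 'kubernetes-mixin/mixin.libsonnet') + {
  _config+:: {
    // Cluster-wide evictions per second above which the alert fires.
    highEvictionRateThreshold: 0.1,
  },
};

// Render the alert groups to YAML, e.g. for a prometheus_alerts.yaml file.
{ 'prometheus_alerts.yaml': std.manifestYamlDoc(kubernetesMixin.prometheusAlerts) }
```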
Fixes #759
This PR adds an alert to detect high pod eviction rates. Since the underlying metric (kubelet_evictions) doesn't have many labels, this alert just detects the cluster-wide eviction rate rather than any particular namespace or workload.

This is very useful for workloads that may have RAM/ephemeral-storage limits set too low, especially for DaemonSets: DaemonSets will silently remove evicted pods, so the issue is invisible unless you look at the events and pod ages. It could also be an early-warning sign for a DDoS attack.
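To show the full shape of the rule, here is a sketch of the complete alert as it might look once merged; only the expression comes from this PR, while the for duration, severity, and annotation wording are placeholder assumptions:

```jsonnet
// Sketch based on the expression in this PR; the other fields are placeholders.
{
  alert: 'KubePodEvictionRateHigh',
  expr: |||
    sum(rate(kubelet_evictions[15m])) > %(highEvictionRateThreshold)s
  ||| % $._config,
  'for': '15m',
  labels: { severity: 'warning' },
  annotations: {
    summary: 'Cluster-wide pod eviction rate is elevated.',
    description: 'Pods are being evicted at {{ $value | humanize }} evictions/s over the last 15 minutes, which may indicate limits that are set too low or node resource pressure.',
  },
}
```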