Incorrect daemonset overhead with taints/tolerations #1749
Comments
This issue is currently awaiting triage. If Karpenter contributors determine this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Can you share your logs for this? This sounds unexpected to me, and it sounds like you have the logs readily available.
Sure, here's a sample of the daemonset overhead error from our existing logs. The scheduling error here is unrelated, I believe; it's just a result of work we were doing on our nodepools. What's unusual is the reported daemonset overhead. Notice that most nodepools show a daemonset overhead of 0 pods, which is not accurate. A few nodepools do detect some of the daemonset overhead, but those are cases where a daemonset is specifically injected with a toleration matching that nodepool, even though in all cases the daemonsets also have the `operator: Exists` toleration.

Again, just to be clear, the fact that this particular pod can't schedule is expected and not related to the error I originally mentioned. The issue I want to highlight is that most of these nodepools report an overhead of 0, despite the fact that several daemonsets are running on each. The overhead I'd expect for each of these nodepools would be something like:
Description
Observed Behavior:
When using daemonsets with a toleration of `operator: Exists`, and nodepools with taints configured, Karpenter doesn't appear to account for daemonset overhead accurately on the nodes, and logs the daemonset overhead as `{"pods":"0"}`. This behavior appears to extend to kube-system daemonsets as well, such as kube-proxy.
Seemingly as a result, in nodepools that allow it, Karpenter will provision smaller nodes than appropriate given the daemonset overhead, including nodes too small to fit even a single pod on top of the existing overhead, which causes dramatic churning behavior.
Expected Behavior:
Karpenter should recognize that a toleration of `operator: Exists` allows a daemonset to tolerate all taints, and that daemonsets with this toleration will therefore schedule on tainted nodepools. This should be accounted for in daemonset overhead calculations to ensure node sizes are decided correctly.

Reproduction Steps (Please include YAML):
To replicate the churning behavior:

Start with an EKS cluster with 4+ daemonsets with tolerations of `operator: Exists`. For reference, our clusters have the following EKS daemonsets, but I'd expect others with the same toleration to cause similar behavior: aws-node, ebs-csi-node, eks-pod-identity-agent, kube-proxy. Note: all of the listed daemonsets use the following toleration block.
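A minimal sketch of that toleration block, based on the `operator: Exists` toleration described above:

```yaml
tolerations:
  - operator: Exists   # matches every taint, so these daemonsets schedule on all nodes
```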
Apply a nodepool with some taint while also allowing for very small instance sizes (such as t3.nano):
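A sketch of the kind of nodepool involved, assuming the Karpenter v1 NodePool API; the nodepool name, taint key, and EC2NodeClass name are placeholders:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: tainted-pool                       # placeholder name
spec:
  template:
    spec:
      taints:
        - key: example.com/dedicated       # placeholder taint key
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["t3.nano", "t3.micro", "t3.small"]   # allows very small instance types
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                      # placeholder EC2NodeClass
```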
Apply a pod for scheduling on the given nodepool:
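A sketch of such a pod; the nodeSelector and toleration key mirror the placeholder nodepool above, and the requests are small enough that Karpenter can pick a t3.nano:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:
    karpenter.sh/nodepool: tainted-pool    # target the placeholder nodepool
  tolerations:
    - key: example.com/dedicated           # tolerate the placeholder taint
      operator: Exists
      effect: NoSchedule
  containers:
    - name: nginx
      image: nginx
      resources:
        requests:
          cpu: 50m
          memory: 32Mi
```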
A new node should be created for the pod, but if the pod's requests are small enough, Karpenter can select a t3.nano for it. This instance type only allows 4 pods, so it fills with daemonset pods and the nginx pod fails to schedule. After a short time Karpenter detects that the pod still hasn't scheduled and creates a new node, but it again selects a t3.nano, causing the same issue. In the meantime, it detects that the first node is empty and destroys it. Over time this causes significant churn and AWS Config costs, and the nginx pod never schedules.
I've also observed similar behavior at larger node sizes, presumably caused by the same root issue, where particular memory and CPU requests from daemonsets squeeze out a large scheduling pod. However, it's much easier to replicate at small pod and node sizes.
Versions:
Kubernetes Version (`kubectl version`): v1.30.4-eks-a737599