Failing to provision EC2 node even though matching EC2 node type exists #1212
Comments
I am currently experiencing the exact same issue and am attempting to troubleshoot it with AWS now. We are running chart version
Edit: We are using
Right, I didn't mention that our setup was just recently upgraded from k8s 1.27 and Karpenter 0.28.1, so this has existed for quite some time.
You might need to adjust the
Thanks a lot for providing so much context and related issues/pull requests! I'll double-check what kubelet reports for this instance type and see if the math makes sense with Karpenter's added overhead, but that seems to explain it. The core problem of kubelet reporting less memory than advertised may have different solutions; I could also imagine an optional field in the EC2NodeClass, but of course I don't know how the majority cuts these classes for their use cases and whether such a field would be of help to them. It would definitely help my use case, however.
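For what it's worth, if the knob in question is the chart's `vmMemoryOverheadPercent` setting (the fraction of advertised instance memory Karpenter assumes is lost before kubelet reports capacity), a minimal values sketch might look like this; the key path assumes a recent chart version and the value is purely illustrative:

```yaml
# values.yaml for the karpenter Helm chart (illustrative sketch)
settings:
  # Fraction of the advertised instance memory Karpenter subtracts when
  # estimating what kubelet will actually report (default is 0.075).
  # Tune it until Karpenter's estimate matches the allocatable memory
  # observed on a real node of the instance type in question.
  vmMemoryOverheadPercent: 0.065
```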
/remove-kind bug |
/triage accepted |
Also seeing this issue in 0.36.1. We're trying to use GPU nodes and setting
I even checked my node pool requirements against https://karpenter.sh/docs/reference/instance-types/#g5xlarge (for example), and they should match and use this instance type. But it doesn't seem to work.
Never mind, I think the problem was that AWS was out of all the instance types our node pool was selecting. When I broadened the node pool requirements, it was finally able to find a match. I also enabled the Prometheus metrics, which include some useful data on instance type availability. Maybe the error message in this case is a little too vague:
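In case it helps anyone else hitting the same capacity errors, broadening the requirements looked roughly like this (a sketch assuming the v1beta1 NodePool API; the instance families, capacity types, and node class name are examples, not our exact values):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      requirements:
        # Allow several GPU instance families instead of pinning one,
        # so Karpenter can fall back when a family has no capacity
        # in the selected zones.
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g4dn"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
      nodeClassRef:
        name: default
```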
This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity. |
Sorry to resurrect this issue, but I don't think this is related to
Is my understanding correct? Relevant code. If my understanding is correct, it is unclear why the Pods are not getting a Node allocated, and I think it would be helpful to surface a clearer message in the error about what did not match. If it is incorrect, then I also believe the error message is unclear, as this should trigger
Sample config to replicate: Nodepool
Workload:
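A rough illustrative stand-in for that kind of setup (resource values, labels, image, and instance constraints are all assumptions, not the actual manifests):

```yaml
# Illustrative NodePool and workload, not the exact config
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        name: default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: big-requests
spec:
  replicas: 1
  selector:
    matchLabels:
      app: big-requests
  template:
    metadata:
      labels:
        app: big-requests
    spec:
      containers:
        - name: app
          # Placeholder image; only the resource requests matter here.
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "40"
              memory: 250Gi
```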
Description
Observed Behavior:
Karpenter cannot schedule the pod, claiming that the scheduling requirements were not met.
The pod should run on an i3en.24xlarge EC2 node, and as far as I can tell the resource requests fit within the resources available on this node type (including the overhead of daemonset pods). If an i3en.24xlarge node is already running in the cluster, the Kubernetes scheduler successfully assigns this pod to it; the only problem is getting Karpenter to provision one. One way I was able to do that is to scale up another deployment whose CPU request is set to 40: the two pods together force Karpenter to provision the node I need, so that the Kubernetes scheduler can then assign my initial pod to it.
Setting the Karpenter log level to `debug` unfortunately didn't reveal anything at all (I have just set the `logLevel` Helm chart property to `debug`; is there some other place to increase it even more?). Here is what Karpenter says at the moment:

Overall it looks like something is wrong with how Karpenter calculates memory requirements, or maybe the problem is on the AWS side, in whatever it returns when Karpenter fetches information on available node types; so the issue could belong to Karpenter core or to the AWS provider. Maybe the reason lies in the fact that when such a node is running, Kubernetes reports that it consumes 97% of its memory, and there is some kind of threshold that Karpenter doesn't want to cross? I would still expect Karpenter to be able to provision a node, since the scheduler had no problem assigning that pod.
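For reference, that is just the chart's top-level value; a minimal values.yaml sketch (the key name assumes the standard karpenter Helm chart):

```yaml
# values.yaml for the karpenter Helm chart (illustrative)
logLevel: debug
```

Beyond the logs, Karpenter also tends to record its scheduling decision as events on the pending Pod, so `kubectl describe pod` can surface the same message.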
Expected Behavior:
Karpenter identifies the matching node type and provisions it.
Reproduction Steps (Please include YAML):
Karpenter node definition:
Sample pod:
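As an illustrative stand-in for the pod (the request values are assumptions, sized so that only something on the order of an i3en.24xlarge, 96 vCPUs and 768 GiB of memory, can hold them):

```yaml
# Illustrative pod, not the exact spec from this report
apiVersion: v1
kind: Pod
metadata:
  name: big-pod
spec:
  containers:
    - name: app
      # Placeholder image; only the resource requests matter here.
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          # Close to the full node: leaves only a small margin for
          # daemonsets plus kubelet/system reservations, which is exactly
          # where Karpenter's memory-overhead estimate starts to matter.
          cpu: "90"
          memory: 690Gi
        limits:
          memory: 690Gi
```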
Versions:
Kubernetes Version (`kubectl version`):