Support AWS Inferentia chips #3095
Comments
Could you tell us which versions of k8s and the CA you're running? Also, are there any Inf1 instances running at the point the CA produces this message? If so, could you provide the output (appropriately obfuscated) of describing one of those nodes?
@gjtempleton Thanks for the reply. The k8s version is 1.15 (the highest possible on EKS) and the CA version is 1.18.1. No, no Inf1 instances are running at that point; the nodes are not up because the CA can't find any node matching the requested resources, so none can be spun up. I can force one to scale up, though, and then copy-paste the output. Here's an Inf instance described:
Also, one thing to keep in mind is that the number of allocatable
@gjtempleton I've done a bit more testing on this matter and here's what I've found:
I suspect that when there is more than one Inf chip on a given instance, the CA doesn't know how to bring that instance up (and I get the error from my first message). Somehow it works when there's at least one node already running. In the same vein, I suspect it's the same case for

One important observation is that in my previous node description, the log should have described an

Maybe there's an incompatibility between the CA (version 1.18.1) and EKS (version 1.15) due to the mismatch in their version numbers. I know they ought to run on the same version (CA 1.15.x with EKS 1.15), but I really had to update to 1.18.1 to get support for Inf instances, and EKS can't be updated any higher because 1.15 is already the latest available. If #2550 were patched onto version 1.15.6 of the CA (to get a 1.15.7), then maybe this issue would go away. There's also the error I keep getting on 1.18.1 about CSI nodes not existing. Do you think that would solve anything? What's your take on this? What do you think there's to be done here?

Edit: I noticed that our cluster was originally running on version

Now I tried version 1.15.6, and in the
But when it tries to auto-scale, I get this:

What do you think is going wrong here?
I have an inkling about the 1.15.6 issue you've seen, but I need to double-check it in a cluster before I can say for sure. As for the main issue, can I ask what tags you have on the ASG when you're trying to scale from zero? Are there any tags on the ASG itself, at the AWS level, along the lines of

?
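(Editor's note: the tag convention being asked about here is the CA AWS provider's `k8s.io/cluster-autoscaler/node-template/resources/<resource-name>` scale-from-zero hint. Below is a minimal sketch of what such tags might look like for this thread's resources, expressed as an eksctl ClusterConfig fragment; the cluster, nodegroup names, and quantities are placeholders, and the resource names are taken from the scheduler event quoted in this issue, not verified values.)

```yaml
# Sketch only: node-template tags tell the CA which extended resources a
# new node would provide, so it can scale the group up from zero.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster        # placeholder
  region: us-west-2       # placeholder
nodeGroups:
  - name: inf1-workers    # placeholder
    instanceType: inf1.2xlarge
    minSize: 0
    maxSize: 4
    tags:
      # Resource names copied from the NotTriggerScaleUp event below;
      # quantities would need to match the actual instance type.
      k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/infa: "1"
      k8s.io/cluster-autoscaler/node-template/resources/hugepages-2Mi: "256Mi"
```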
@gjtempleton I have something along the lines of
I see there was a bug regarding the dynamically generated instance types in #3109. This may make my issue go away, since I could revert to

Is there a timeline for when this could make it in as a patch for
@gjtempleton Applying #3110 to version

I'm still testing, so I might be back with some more feedback. Until then, I wouldn't close the ticket.
Inf1 has not been officially supported on EKS yet; it will be ready pretty soon (next week). Thanks for reporting the issues here.
Hi, I'm also having this problem.
I'm using cluster-autoscaler:v1.15.6. Do you have a recommendation?
/assign @Jeffwan
Can you use this option? It will load all instance types at runtime (autoscaler/cluster-autoscaler/main.go, line 171 at f029a38).
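(Editor's note: the option referenced at that main.go line appears to be the AWS provider's `--aws-use-static-instance-list` flag; leaving it at `false` makes the CA fetch the full instance-type list at runtime instead of using the compiled-in static list, which is how newer types such as inf1 get picked up. A sketch of the relevant Deployment args follows, assuming that flag name; verify it against the flags in your CA version. The image tag and ASG name are placeholders.)

```yaml
# Fragment of a cluster-autoscaler Deployment (container spec only).
containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/cluster-autoscaler:v1.18.1   # placeholder tag
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      # false (the default) fetches all instance types at runtime;
      # true falls back to the static ec2_instance_types.go list.
      - --aws-use-static-instance-list=false
      - --nodes=0:4:my-inf1-asg                    # placeholder ASG name
```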
Check here for more details. Feel free to reopen the issue.
@Jeffwan: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/close
It appears that even though Inferentia instances have been added to cluster-autoscaler/cloudprovider/aws/ec2_instance_types.go#L655-L678, they still don't have support for the actual Inferentia chips.
This is what I'm getting when I try to scale:
1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"api-api-0-67b4b9665f-rw7zm", UID:"808d2f5b-4c66-47cc-af1b-d66144ce5959", APIVersion:"v1", ResourceVersion:"31577", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient hugepages-2Mi, 2 Insufficient aws.amazon.com/infa, 1 max node group size reached
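(Editor's note: the two `Insufficient` resources in that event match what the Neuron device plugin advertises on Inf1 nodes. For context, here is a minimal pod sketch that requests them; the resource names are copied from the event, while the pod name, image, and quantities are illustrative placeholders.)

```yaml
# Minimal pod requesting the extended resources named in the event above.
# hugepages requests must be set with limits equal to requests.
apiVersion: v1
kind: Pod
metadata:
  name: inf1-test                        # placeholder name
spec:
  containers:
    - name: app
      image: my-inference-image:latest   # placeholder image
      resources:
        requests:
          aws.amazon.com/infa: "1"
          hugepages-2Mi: 256Mi
          memory: 1Gi
        limits:
          aws.amazon.com/infa: "1"
          hugepages-2Mi: 256Mi
          memory: 1Gi
```

Unless the CA can learn, from a running node or from node-template tags on the ASG, that a node group supplies both resources, a pod like this produces exactly the NotTriggerScaleUp event shown above.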
For reference, here are the instructions to get the k8s setup for Neuron (Inferentia chips):
https://github.com/aws/aws-neuron-sdk/blob/master/docs/neuron-container-tools/tutorial-k8s.md#steps
Is there anyone who can help us with a patch? Or at least with a bit of info on which parts of the codebase have to be modified to accommodate Inferentia chips?