AWS: Can't scale up from 0 #2418
Comments
I should also mention that all of our workers do have the …, so I think we're satisfying the requirements from "Scaling a node group to 0" in the docs. FWIW, the cluster was created with …
I'm facing this as well. Additionally, I have 2 node groups (one is an on-demand group on AWS running the autoscaler); the other group is supposed to be a spot group where I want to deploy jobs, and it does not scale up from 0. I'm using the auto-discovery feature.
So I upgraded to …
Interesting. Hopefully if there's a fix for the 1.16 line it can be cherry-picked back to 1.14, etc., since the docs recommend matching your CA version with your k8s version (maybe that's why yours isn't working, @chnsh).
Aah, possibly; I'll try tomorrow and update the thread. So, I too am on v1.14.5 now and it did trigger autoscaling for me (that bit works fine), but I still don't see those nodes in …
Okay, so I am on CA …
Interesting. I just tried … As a workaround I guess I'll set the minimum on these guys to 1, but it would be great for one of the devs to take a look at this.
@mgalgs Can you share eksctl cluster config? I can help reproduce the issue on our side. |
I have three OnDemand instance groups, each with a handful of instances inside, and yes, CA is hosted there.
Yes
I tainted the spot nodes and added tolerations to workloads that can run on spots. It's all working as expected (spot-tolerant workloads are scheduled on spot nodes, non-spot-tolerant workloads avoid spot nodes) as long as I set the minSize of the group to 1.
I believe …
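For anyone landing here later, the taint-plus-toleration arrangement described in the replies above might look roughly like the sketch below. This is only an illustration; the `lifecycle: spot` label, the `spot` taint key, and the image are placeholders, not values from this cluster.

```yaml
# Hypothetical spot-tolerant pod: pinned to spot nodes via a nodeSelector
# and tolerating the taint that keeps other workloads off those nodes.
apiVersion: v1
kind: Pod
metadata:
  name: spot-tolerant-job
spec:
  nodeSelector:
    lifecycle: spot          # assumed node label on the spot group
  tolerations:
    - key: spot              # assumed taint key on the spot nodes
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: worker
      image: busybox
      command: ["sleep", "3600"]
```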
I get a similar error with kops provisioning mixed instance groups. I am scaling from zero.

Cluster-autoscaler: v1.14.5

Error:

Kops instance group config:

IAM role policy attached to the cluster autoscaler:
@mgalgs Checking your logs, did you use any node selectors? It seems to fail on GeneralPredicates.
@faheem-cliqz I have probably already fixed the issue you're hitting. Please check 58f3f23#diff-ade7b95627ea0dd6b6f4deee7f24fa7eR323-R331. We will have a release next week.
/assign @Jeffwan
/area provider/aws
@Jeffwan Regarding the logs:
This one is expected since I had a nodeSelector on this pod to force it onto a node from the spot group (and this isn't the spot group).
This one is strange because that nodeGroup should have had the necessary labels to allow that pod (with its nodeSelector) to be scheduled on that node... Again, I don't see this problem when the group's minSize is set to 1, and if this were a nodeSelector problem it seems like I'd still have an issue scheduling the pod... Are you testing with 1.14? If this is definitely fixed in 1.15 it might not even be worth troubleshooting here, since we have a workaround (setting the group's minSize to 1).
Cool, I'll get back to you with updates once you guys release :)
@mgalgs
Do you have the tags on your ASG? Check here: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws#scaling-a-node-group-to-0. If you still have the issue, I will try to see if anything is wrong in 1.14.
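For reference, the tag keys that doc describes look roughly like the following; the cluster name, label key, and values below are placeholders, not anyone's actual settings:

```yaml
# Hypothetical ASG tags (keys per the linked cluster-autoscaler AWS docs)
k8s.io/cluster-autoscaler/enabled: "true"                        # opts the ASG into auto-discovery
k8s.io/cluster-autoscaler/my-cluster: "owned"                    # per-cluster auto-discovery tag
k8s.io/cluster-autoscaler/node-template/label/lifecycle: "spot"  # label CA assumes new nodes will carry
k8s.io/cluster-autoscaler/node-template/taint/spot: "true:NoSchedule"  # taint, in "<value>:<effect>" form
```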
I see... So node labels and taints need to be applied to the ASGs themselves as well. Looking at my …
doh... This has already been raised on the eksctl project with a solution proposed. Thank you for your help!
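For anyone hitting the same thing before eksctl handles this automatically: one workaround is to declare the node-template tags yourself in the nodegroup config. A minimal sketch, assuming eksctl's `nodeGroups` schema with its `labels` and `tags` fields; the group name, label, and taint here are made up:

```yaml
# Hypothetical eksctl nodegroup fragment: the kubelet gets the label via `labels`,
# and the same label (plus a taint) is mirrored onto the ASG as node-template tags
# so CA can build a template node when scaling the group up from 0.
nodeGroups:
  - name: spot-workers
    minSize: 0
    maxSize: 10
    labels:
      lifecycle: spot
    tags:
      k8s.io/cluster-autoscaler/node-template/label/lifecycle: "spot"
      k8s.io/cluster-autoscaler/node-template/taint/spot: "true:NoSchedule"
```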
When the cluster-autoscaler adds a new node to a group, it grabs an existing node in the group and builds a "template" to launch a new node identical to the one it grabbed from the group. However, when scaling up from 0 there aren't any live nodes to reference to build this template. Instead, the cluster-autoscaler relies on tags in the ASG to build the new node template.

This can cause unexpected behavior if the pods triggering the scale-out are using node selectors or taints; CA doesn't have sufficient information to decide if a new node launched in the group will satisfy the request. The long and short of it is that for CA to do its job properly we must tag our ASGs corresponding to our labels and taints.

Add a note in the docs about this since scaling up from 0 is a fairly common use case.

References:
- kubernetes/autoscaler#2418
- eksctl-io#1066
I can't get this working with k8s.gcr.io/cluster-autoscaler :(
What autoscaler version are you using, and what eksctl version?
Autoscaler v1.13.8. We don't use eksctl; we manage all the infra using Terraform. Let me know if there is any additional info I can provide to help.
You may want to try upgrading CA to 1.15.
Scaling up from 0 needs tags on the ASG so CA can build a template node. Could someone with the problem share their ASG settings?
Hi, I have CA version 1.15.6 running on my kubeadm cluster. The ASG is tagged correctly, like below:
CA is configured with ASG auto-discovery. I have a sample deployment like below:
However, CA doesn't scale up the ASG from 0. The pod always remains in Pending status with the message below:
The CA logs show the following:
Tolerations exist in the deployment and the taints are applied on the ASG. I am not sure what's missing. Could you please help here?
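The actual tag listing and deployment didn't survive the copy above, so purely as an illustration: the taint encoded in the ASG tag has to be tolerated by the pod (matching key and effect, and value unless the toleration uses `Exists`), or CA will decide the group can't host the pending pod. The `dedicated`/`batch` names below are placeholders:

```yaml
# Hypothetical ASG tag, value in "<taint-value>:<effect>" form:
#   k8s.io/cluster-autoscaler/node-template/taint/dedicated: "batch:NoSchedule"
# ...and a matching toleration in the deployment's pod template:
tolerations:
  - key: dedicated
    operator: Equal
    value: batch
    effect: NoSchedule
```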
Hi @Jeffwan, thanks for your response. I did try this, however I got similar results.
@Jeffwan This has been tested against k8s version 1.15.10. We also tried with an older version of k8s and got a similar issue. The observation is that once I set minSize to 1, it does find the correct node and deploys; later, if I reduce minSize to 0, it works as expected. However, our problem is that we do not want to keep minSize at 1 initially. Could you please help us here?
@jvaibhav123 Do you have any other restrictions on the pod? Does it request other resources?
@Jeffwan No, there are no other restrictions. The example given above is the actual use case, except the image was different (related to our application). The rest of the parameters are the same.
@mgalgs thanks for documenting the solution to this. I just ran into this and you saved my day! 💖
Cherry pick the bug fix in #2418 onto 1.19
Cherry pick the bug fix in #2418 onto 1.18
Possibly related: #1754
I recently added three new node groups to my cluster using AWS spot instances. I initially set the `minSize` on each of the three new groups to 0, but CA was refusing to scale them up from 0. If I go into the EC2 console and manually force the ASG `minSize` up to 1 then CA gets unstuck and will continue scaling the group up as new requests come in.

I'm attaching the following files:

- … `minSize=0`, and thus CA refused to scale them up.
- … `minSize=1`. At this point CA starts scaling them up as expected.
- `get pod -o yaml` of my CA.

Is it not supported to have `minSize=0` on AWS?

I'm running CA v1.14.5.