Add support for AWS ASG mixed instances policy #1886
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
CLA signed
/check-cla
There seem to be some unnecessary commits in this PR. Please take a look.
* Enables the ASG instance type to be determined from the LaunchTemplate default
* This makes it possible to have a mixed-instance ASG, relying on AWS' native logic for things like on-demand launch priority, spot instances etc.
Force-pushed from bc77dd4 to a6ce5d9
Fixed :)
I am currently using this in the following manner:
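A minimal sketch of that kind of configuration, using the AWS SDK for Go (the group name, launch template name, subnets and instance-type overrides here are hypothetical, not taken from this PR):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Hypothetical mixed-instances ASG: the overrides use instance types
	// with equivalent CPU/memory so the nodes stay interchangeable for
	// Cluster Autoscaler.
	_, err := svc.CreateAutoScalingGroup(&autoscaling.CreateAutoScalingGroupInput{
		AutoScalingGroupName: aws.String("k8s-workers"),
		MinSize:              aws.Int64(1),
		MaxSize:              aws.Int64(10),
		VPCZoneIdentifier:    aws.String("subnet-aaaa,subnet-bbbb"),
		MixedInstancesPolicy: &autoscaling.MixedInstancesPolicy{
			LaunchTemplate: &autoscaling.LaunchTemplate{
				LaunchTemplateSpecification: &autoscaling.LaunchTemplateSpecification{
					LaunchTemplateName: aws.String("k8s-workers"),
					Version:            aws.String("$Default"),
				},
				Overrides: []*autoscaling.LaunchTemplateOverrides{
					{InstanceType: aws.String("m5.xlarge")},
					{InstanceType: aws.String("m5d.xlarge")},
					{InstanceType: aws.String("m4.xlarge")},
				},
			},
			InstancesDistribution: &autoscaling.InstancesDistribution{
				// Keep one on-demand 'base' instance, fill the rest with spot,
				// drawing from the cheapest pools per AZ.
				OnDemandBaseCapacity:                aws.Int64(1),
				OnDemandPercentageAboveBaseCapacity: aws.Int64(0),
				SpotInstancePools:                   aws.Int64(3),
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```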
CLA is still not signed.
/check-cla
@@ -388,6 +388,9 @@ func (m *asgCache) buildAsgFromAWS(g *autoscaling.Group) (*asg, error) {
func (m *asgCache) buildLaunchTemplateParams(g *autoscaling.Group) (string, string) {
	if g.LaunchTemplate != nil {
		return aws.StringValue(g.LaunchTemplate.LaunchTemplateName), aws.StringValue(g.LaunchTemplate.Version)
	} else if g.MixedInstancesPolicy != nil && g.MixedInstancesPolicy.LaunchTemplate != nil {
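The hunk is truncated in this view; given the condition it adds and the AWS SDK types involved, the completed function plausibly reads along these lines (a reconstruction, not a verbatim quote of the merged code):

```go
// buildLaunchTemplateParams returns the launch template name and version
// for the ASG, falling back to the launch template embedded in a
// MixedInstancesPolicy when the group has no top-level LaunchTemplate.
func (m *asgCache) buildLaunchTemplateParams(g *autoscaling.Group) (string, string) {
	if g.LaunchTemplate != nil {
		return aws.StringValue(g.LaunchTemplate.LaunchTemplateName), aws.StringValue(g.LaunchTemplate.Version)
	} else if g.MixedInstancesPolicy != nil && g.MixedInstancesPolicy.LaunchTemplate != nil {
		spec := g.MixedInstancesPolicy.LaunchTemplate.LaunchTemplateSpecification
		return aws.StringValue(spec.LaunchTemplateName), aws.StringValue(spec.Version)
	}
	return "", ""
}
```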
Hi @drewhemm, this is great! Can you also add some tests for this change?
Please keep in mind that Cluster Autoscaler in many places assumes that nodes within a single node pool are identical - they have the same labels, capacity, etc. If the nodes in the mixed pool are different, it may cause problems with scale-up.
Yes, that's why I choose instance types that are equivalent in terms of CPU and memory. I also include the next instance type up, because it is still cheaper as a spot instance than the on-demand 'base' instance. If I understand correctly, scaling out is based on available capacity, which is defined by Kubernetes, not by Cluster Autoscaler; please correct me if I am wrong on this. If, once capacity is full, CA looks at the 'base' instance type for the LT but actually spins up a bigger one, I think that is acceptable? My priority here is availability of spot instances, but others may view the issue from a different perspective.
@drewhemm The community has had lots of discussion on this topic. Basically, the way MixedInstancesPolicy or EC2 Fleet scales is not fully compatible with Cluster Autoscaler. As far as I know, most users are still on LaunchConfiguration (ASG) and stick to one instance type per ASG. MixedInstancesPolicy and EC2 Fleet require a Launch Template.
My use case is no. 2, which would indeed be quite complicated if the instance types were completely different and we expected CA to resolve that. If the instance types have the same number of CPU cores (and ideally the same processor speed) and amount of memory, then CA can safely increase the desired count regardless. If an instance gets added that has more CPU and memory (e.g. 2xlarge instead of xlarge), that won't be a problem, and if I am right CA won't scale it in while it is being utilised. Again, please correct me if my understanding of what CA is doing is incorrect.

I am quite happy to create multiple mixed-instance ASGs according to my needs and leave CA to do the simple task of scaling in and out whatever I have given it. It may not be 100% perfect, but it certainly works, and it is achieved with merely a couple of lines of code.

I took a look at the existing tests, but they seem quite limited. Can you give me some pointers on what the tests should do? I would be happy to add them.

P.S. I am on holiday until the end of the month, so while I can reply in discussions, I probably won't be doing any coding until then...
@drewhemm Thanks! Looks like the tests for this file don't give a good example. Maybe I can rewrite them later. Right now, let's just make sure the logic is OK. I have not tried MixedInstancesPolicy; my only concern is...
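For illustration, a unit test exercising the new MixedInstancesPolicy branch might look roughly like the sketch below (the test name, the bare asgCache value, and the template name are assumptions, not code from this PR):

```go
package aws

import (
	"testing"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func TestBuildLaunchTemplateParamsFromMixedInstancesPolicy(t *testing.T) {
	m := &asgCache{}

	// An ASG with no top-level LaunchTemplate, only a MixedInstancesPolicy.
	group := &autoscaling.Group{
		MixedInstancesPolicy: &autoscaling.MixedInstancesPolicy{
			LaunchTemplate: &autoscaling.LaunchTemplate{
				LaunchTemplateSpecification: &autoscaling.LaunchTemplateSpecification{
					LaunchTemplateName: aws.String("mixed-lt"),
					Version:            aws.String("$Default"),
				},
			},
		},
	}

	// The name and version should be taken from the policy's embedded template.
	name, version := m.buildLaunchTemplateParams(group)
	if name != "mixed-lt" || version != "$Default" {
		t.Errorf("buildLaunchTemplateParams() = (%q, %q), want (%q, %q)",
			name, version, "mixed-lt", "$Default")
	}
}
```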
@Jeffwan |
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: mwielgus. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Hello! I was looking for exactly this feature and I see it has already been merged into master. Would it be possible to include it in the 1.14.x release? Thanks!
Hi @danigrmartinez, for now I have just built my own image and deployed that, until a new release comes out.
Thank you, @drewhemm. We are indeed testing with our own build based on 1.14 and this change; we just wanted to know if we could avoid deploying custom builds. Do you know when the next release will go out?
Sorry for commenting late, I only spotted this now. I don't want to meddle with a cloudprovider I don't have experience with, but what you are doing is unsafe. I think at minimum the documentation should mention the dangers of using this feature.
It will actually be a problem for future scale-ups. The template is only used for scale-from-0: if you already have nodes in a given ASG, Cluster Autoscaler will use a random existing node as a template. So if you have a 2xlarge in an otherwise-xlarge group, CA can perform a scale-up on the assumption that all future nodes in this ASG will be 2xlarge. To answer the logical follow-up question: CA uses an existing node if possible, because predicting what a node will actually look like based on the template is just a (very) poor approximation. For more details you can check the discussion on #1021.
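Schematically, the selection works roughly like this (an illustrative sketch with simplified stand-in types, not the actual Cluster Autoscaler source):

```go
package sketch

// Node is a simplified stand-in for a Kubernetes node object.
type Node struct {
	InstanceType string
}

// NodeGroup is a simplified stand-in for Cluster Autoscaler's node-group
// abstraction (one ASG, in the AWS case).
type NodeGroup interface {
	Nodes() []Node                   // existing nodes in the group
	TemplateNodeInfo() (Node, error) // approximation built from the launch template
}

// templateFor picks the node that scale-up simulation assumes all new
// nodes in this group will look like.
func templateFor(g NodeGroup) (Node, error) {
	if nodes := g.Nodes(); len(nodes) > 0 {
		// A real node is preferred over the launch-template approximation,
		// and any existing node may be picked - so a lone 2xlarge in an
		// otherwise-xlarge group can make CA assume every future node in
		// the ASG is a 2xlarge.
		return nodes[0], nil
	}
	// Scale-from-0: fall back to the (rough) template-based approximation.
	return g.TemplateNodeInfo()
}
```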
Okay, so instance types should ideally have the same amount of memory and number of CPU cores. That will reduce the number of suitable instance types (and therefore spot diversity), but it sounds like it is worth it from the perspective of not messing with current/expected CA behaviour?
tl;dr: you'll probably be fine if you use instance types with the same CPU/memory. Technically speaking, any difference in the resulting node object is potentially problematic - a different set of system labels might cause a problem for pods that use nodeSelector/nodeAffinity on one of those labels. A different example is how we don't officially support multi-AZ ASGs - if you use any zone-aware scheduling feature on your pods, CA may make wrong decisions.
Quick note about this: Spot Instances in a mixed ASG are selected according to the lowest-priced instances per Availability Zone. The number of lowest-priced instance pools per AZ from which the ASG fulfils capacity is determined by the SpotInstancePools parameter: "The range is 1–20 and the default is 2."
Agree. I think at the least we should have some documentation to warn users of the risk. A feasible usage is for users to add similar nodes with the same CPU and memory, to improve the chance that spot requests can be fulfilled and to reduce the risk across different instance types. Even with this, some node-affinity cases will not work, since node labels are different. Users have to accept that risk if they use this approach.
I'll put something together in another PR to add to the Common notes and gotchas section for AWS.
I realised it needed a bit more than a bullet point in the CN&G, so I added a whole section for it.