
Add support for AWS ASG mixed instances policy #1886

Merged: 1 commit merged into kubernetes:master on Apr 23, 2019

Conversation

@drewhemm (Contributor)

No description provided.

@k8s-ci-robot (Contributor)

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please email the CNCF helpdesk: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot added the cncf-cla: no (indicates the PR's author has not signed the CNCF CLA) and size/XS (denotes a PR that changes 0-9 lines, ignoring generated files) labels on Apr 12, 2019
@drewhemm (Contributor, Author)

  • Enables the ASG instance type to be determined from the LaunchTemplate default
  • This makes it possible to have a mixed-instance ASG, relying on AWS's native logic for things like on-demand launch priority, spot instances, etc.

@drewhemm (Contributor, Author)

CLA signed

@drewhemm (Contributor, Author)

/check-cla

@mwielgus requested review from Jeffwan and removed the request for feiskyer and aleksandra-malinowska on April 12, 2019
@mwielgus (Contributor)

There seem to be some unnecessary commits in this PR. Please take a look.

* Enables the ASG instance type to be determined from the LaunchTemplate
default
* This makes it possible to have a mixed instance ASG, relying on AWS'
native logic for things like on-demand launch priority and spot
instances etc
@drewhemm force-pushed the mixed-instances-policy branch from bc77dd4 to a6ce5d9 on April 12, 2019
@drewhemm (Contributor, Author)

There seem to be some unnecessary commits in this PR. Please take a look.

Fixed :)

@drewhemm (Contributor, Author)

I am currently using this in the following manner (a rough sketch of the setup follows this list):

  • Create a Launch Template with an instance type, for example r5.2xlarge. Consider this the 'base' instance type. No spot config is defined here.
  • Create an ASG with a MixedInstancesPolicy that refers to the above LT.
  • Set LaunchTemplateOverrides to include the 'base' instance type and suitable alternatives. I have r5.2xlarge, i3.2xlarge, r5.4xlarge and i3.4xlarge. (I am not using r5a.2xlarge because of its slower processors, but it could be added for certain use-cases.)
  • Set InstancesDistribution to your desired settings. I am using 100% spot instances.
  • For on-demand instances, the ordering of the overrides determines which instance type will be chosen.
  • For spot instances, they are chosen at random, which is not perfect, but all of those instances' spot prices are cheaper than the on-demand price for the instance type defined in the LT, so I am still saving money.
  • Repeat by creating other LTs and ASGs, for example c5.18xlarge and c5n.18xlarge, or a bunch of similar burstable instances.
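
For illustration, here is a rough sketch of that setup using the AWS SDK for Go (v1). This is only an approximation of the API calls involved; the ASG and launch template names, subnet ID, sizes and placeholders below are assumptions, not values confirmed in this thread:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	_, err := svc.CreateAutoScalingGroup(&autoscaling.CreateAutoScalingGroupInput{
		AutoScalingGroupName: aws.String("spot-r5-2xlarge"),          // placeholder name
		MinSize:              aws.Int64(0),
		MaxSize:              aws.Int64(20),
		VPCZoneIdentifier:    aws.String("subnet-0123456789abcdef0"), // placeholder subnet(s)
		MixedInstancesPolicy: &autoscaling.MixedInstancesPolicy{
			LaunchTemplate: &autoscaling.LaunchTemplate{
				LaunchTemplateSpecification: &autoscaling.LaunchTemplateSpecification{
					LaunchTemplateName: aws.String("base-r5-2xlarge"), // LT whose default type is the 'base' r5.2xlarge
					Version:            aws.String("$Latest"),
				},
				// The 'base' type first, then equivalent or larger alternatives.
				Overrides: []*autoscaling.LaunchTemplateOverrides{
					{InstanceType: aws.String("r5.2xlarge")},
					{InstanceType: aws.String("i3.2xlarge")},
					{InstanceType: aws.String("r5.4xlarge")},
					{InstanceType: aws.String("i3.4xlarge")},
				},
			},
			// 100% spot: no on-demand base capacity and 0% on-demand above it.
			InstancesDistribution: &autoscaling.InstancesDistribution{
				OnDemandBaseCapacity:                aws.Int64(0),
				OnDemandPercentageAboveBaseCapacity: aws.Int64(0),
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```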

@mwielgus (Contributor)

CLA is still not signed.

@drewhemm (Contributor, Author)

/check-cla

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) and removed the cncf-cla: no label on Apr 15, 2019
@@ -388,6 +388,9 @@ func (m *asgCache) buildAsgFromAWS(g *autoscaling.Group) (*asg, error) {
func (m *asgCache) buildLaunchTemplateParams(g *autoscaling.Group) (string, string) {
if g.LaunchTemplate != nil {
return aws.StringValue(g.LaunchTemplate.LaunchTemplateName), aws.StringValue(g.LaunchTemplate.Version)
} else if g.MixedInstancesPolicy != nil && g.MixedInstancesPolicy.LaunchTemplate != nil {
A Contributor commented on the diff:
Hi @drewhemm, this is great! Can you also add some tests for this change?
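
For reference, a test exercising the new branch might look roughly like the sketch below. This is only a guess at what such a test could be; it assumes an asgCache value can be constructed directly in the package's tests and that testify's assert package is available, as elsewhere in cluster-autoscaler:

```go
package aws

import (
	"testing"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/stretchr/testify/assert"
)

func TestBuildLaunchTemplateParamsFromMixedInstancesPolicy(t *testing.T) {
	m := &asgCache{}

	// ASG with no top-level LaunchTemplate, only a MixedInstancesPolicy.
	g := &autoscaling.Group{
		MixedInstancesPolicy: &autoscaling.MixedInstancesPolicy{
			LaunchTemplate: &autoscaling.LaunchTemplate{
				LaunchTemplateSpecification: &autoscaling.LaunchTemplateSpecification{
					LaunchTemplateName: aws.String("mixed-instances-lt"),
					Version:            aws.String("1"),
				},
			},
		},
	}

	name, version := m.buildLaunchTemplateParams(g)
	assert.Equal(t, "mixed-instances-lt", name)
	assert.Equal(t, "1", version)
}
```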

@mwielgus (Contributor) left a comment

Please keep in mind that Cluster Autoscaler in many places assumes that nodes within a single node pool are identical - they have the same labels, capacity, etc. If the nodes in the mixed pool are different, it may cause problems with scale-up.

@drewhemm (Contributor, Author) commented Apr 16, 2019

Yes, that's why I choose instance types that are equivalent in terms of CPU and memory. I also include the next instance type up, because it is still cheaper as a spot instance than the on-demand 'base' instance. If I understand correctly, scaling out is based on available capacity, which is defined by Kubernetes, not by Cluster Autoscaler. Please correct me if I am wrong on this.

If, once capacity is full, CA looks at the 'base' instance type for the LT but actually spins up a bigger one, I think that is acceptable?

My priority here is availability of spot instances, but others may view the issue from a different perspective.

@Jeffwan (Contributor) commented Apr 16, 2019

@drewhemm The community has had lots of discussion on this topic. Basically, the way MixedInstancesPolicy or EC2 Fleet scales is not fully compatible with Cluster Autoscaler. As far as I know, most users are still on LaunchConfiguration (ASG) and stick to one instance type per ASG. MixedInstancesPolicy and EC2 Fleet require a Launch Template.
The major use cases are:

  1. Spot + on-demand in one group, with the same instance type (your case? This should be safe).
  2. Spot + on-demand with different instance types. As you said, CA checks the template and then brings up nodes; I don't think this will work very well with the current setting.

@drewhemm (Contributor, Author)

My use case is no. 2, which would indeed be quite complicated if the instance types were completely different and we expected CA to resolve that. If the instance types have the same number of CPU cores (and ideally the same processor speed) and amount of memory, then CA can safely increase the desired count regardless. If an instance gets added that has more CPU and memory (e.g. 2xlarge instead of xlarge), that won't be a problem, and if I am right, CA won't scale it in while it is being utilised. Again, please correct me if my understanding of what CA is doing is incorrect.

I am quite happy to create multiple mixed-instance ASGs according to my needs and leave CA to do the simple task of scaling in and out whatever I have given it. It may not be 100% perfect, but it certainly works and is achieved with merely a couple of lines of code.

I took a look at the existing tests, but they seem quite limited. Can you give me some pointers on what the tests should do? I would be happy to add them.

P.S. I am on holiday until the end of the month, so while I can reply in discussions, I probably won't be doing any coding until then...

@Jeffwan (Contributor) commented Apr 17, 2019

@drewhemm Thanks! It looks like the tests for this file don't give a good example; maybe I can rewrite them later. Right now, let's just make sure the logic is OK. I have not tried MixedInstancesPolicy; my only concern is: are MixedInstancesPolicy and LaunchTemplate mutually exclusive?
If that's not true, you probably want to move this logic in front of the LaunchTemplate check. If it is true, the current PR looks good to me.

@almariah

@Jeffwan MixedInstancesPolicy and LaunchTemplate are mutually exclusive according to the AWS docs:
... Required: Conditional. You must specify one of the following: InstanceId, LaunchConfigurationName, LaunchTemplate, or MixedInstancesPolicy...

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-as-group.html#cfn-as-group-launchtemplate

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Apr 23, 2019
@mwielgus (Contributor) left a comment

/lgtm
/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Apr 23, 2019
@k8s-ci-robot merged commit 956e0c5 into kubernetes:master on Apr 23, 2019
@danigrmartinez

Hello! I was looking for exactly this feature and I see it is already merged into master. Would it be possible to include it in a 1.14.x release? Thanks!

@drewhemm deleted the mixed-instances-policy branch on April 29, 2019
@drewhemm (Contributor, Author)

Hi @danigrmartinez, for now I just built an image and deployed that, until a new release comes out:

git clone git@github.com:kubernetes/autoscaler.git
cd autoscaler/cluster-autoscaler
make
# build the image, tag it for a private registry and push it
docker build -t cluster-autoscaler .
docker tag cluster-autoscaler PRIVATE_REPO_IMAGE
docker push PRIVATE_REPO_IMAGE

@danigrmartinez

Thank you, @drewhemm. We are definitely testing with our own build based on 1.14 plus this change; we just wanted to know if we could avoid deploying custom builds. Do you know when the next release will go out?

@MaciekPytel (Contributor)

Sorry for commenting late, I only spotted this now. I don't want to meddle with a cloudprovider I don't have experience with, but what you do is unsafe. I think at minimum the documentation should mention the dangers of using this feature.

If an instance gets added that has more CPU and memory (e.g. 2xlarge instead of xlarge), that won't be a problem

It will actually be a problem for future scale-ups. The template is only used for scale-from-0. If you already have nodes in a given ASG, Cluster Autoscaler will use a random existing node as a template. So if you have a 2xlarge in an otherwise xlarge group, CA can perform a scale-up on the assumption that all future nodes in this ASG will be 2xlarge.
This is especially painful with smaller machine types - if you have pending pods that don't fit on your 'base' machine but fit on a random larger one in your ASG, the autoscaler will keep adding nodes to that ASG. For example, imagine an ASG with 5 1-CPU machines and a single 2-CPU one. If you have a pending pod that requests 1.5 CPUs, roughly every 6 loops (not exactly every 6 loops, of course) CA will randomly pick the 2-CPU instance as a template and trigger a scale-up in that ASG.
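
To make the arithmetic in that example concrete, here is a small toy simulation (this is not Cluster Autoscaler code, just an illustration of how often a randomly chosen template node lets the pending pod 'fit'):

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	nodeCPUs := []float64{1, 1, 1, 1, 1, 2} // five 1-CPU nodes and a single 2-CPU node
	pendingPodCPU := 1.5
	scaleUps := 0

	for loop := 0; loop < 600; loop++ {
		// Each loop a random existing node in the ASG is used as the template.
		template := nodeCPUs[rand.Intn(len(nodeCPUs))]
		if pendingPodCPU <= template {
			// CA would conclude the pod fits on a new node and scale the ASG up.
			scaleUps++
		}
	}
	fmt.Printf("scale-ups triggered in 600 loops: %d (roughly 1 in 6)\n", scaleUps)
}
```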

To answer a logical follow-up question: CA uses an existing node if possible, because predicting how a node will actually look based on the template is just a (very) poor approximation. For more details you can check the discussion on #1021.

@drewhemm (Contributor, Author)

Okay, so instance types should ideally have the same amount of memory and number of CPU cores.

That will reduce the number of suitable instance types (and therefore spot diversity), but it sounds like it is worth it from the perspective of not messing with current/expected CA behaviour?

@MaciekPytel (Contributor)

tl;dr: you'll probably be fine if you use instance types with the same CPU/memory.

Technically speaking, any difference in the resulting node object is potentially problematic - a different set of system labels might cause a problem for pods that use nodeSelector/nodeAffinity on one of those labels. A different example is how we don't officially support multi-AZ ASGs - if you use any zone-aware scheduling feature on your pods, CA may make wrong decisions.
That being said, the more similar the nodes are, the more unlikely/convoluted the use case that breaks becomes. If the nodes are identical from the perspective of scheduling your pods, you should be 100% safe. My guess would be that if you use instances with identical cores and memory, it would be safe for the vast majority of users, but there is probably someone out there using some less common scheduling feature that would somehow differ between instance types.

@ranshn (Contributor) commented Apr 30, 2019

  • For spot instances, they are chosen at random, which is not perfect, but all of those instances' spot prices are cheaper than the on-demand price for the instance type defined in the LT, so I am still saving money.
    @drewhemm

A quick note about this: Spot Instances in a mixed ASG are selected according to the lowest-priced instances per Availability Zone. The number of lowest-priced instance pools per AZ that the ASG fulfils capacity from is determined by the SpotInstancePools parameter:

"SpotInstancePools
The number of Spot pools to use to allocate your Spot capacity. The Spot pools are determined from the different instance types in the Overrides array of LaunchTemplate.

The range is 1–20 and the default is 2."
from: https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_InstancesDistribution.html
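
For example (a sketch only, reusing the AWS SDK for Go imports from the earlier sketch in this thread; the value 4 is arbitrary), the number of pools can be widened when setting the distribution:

```go
// Consider up to 4 of the lowest-priced Spot pools per AZ instead of the default 2.
dist := &autoscaling.InstancesDistribution{
	SpotAllocationStrategy: aws.String("lowest-price"),
	SpotInstancePools:      aws.Int64(4),
}
```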

@Jeffwan (Contributor) commented Apr 30, 2019

Sorry for commenting late, I only spotted this now. I don't want to meddle with a cloudprovider I don't have experience with, but what you do is unsafe. I think at minimum the documentation should mention the dangers of using this feature.

Agree. I think at the very least we should have some documentation to warn users about the risk. A feasible usage is for users to add similar instance types with the same CPU and memory, to improve the chance that spot requests can be fulfilled and to reduce the risk from differing instance types. Even then, some node-affinity cases will not work since node labels differ. Users have to accept that risk if they use this approach.

@drewhemm (Contributor, Author) commented May 3, 2019

I'll put something together in another PR to add to the Common Notes and Gotchas section for AWS.

@drewhemm (Contributor, Author) commented May 3, 2019

I realised it needed a bit more than a bullet point in the CN&G, so I added a whole section for it.
