
EC2 Fleet Autoscaling Support? #838

Closed
zdoherty opened this issue May 9, 2018 · 29 comments
Labels: area/cluster-autoscaler, area/provider/aws, lifecycle/rotten

Comments

zdoherty commented May 9, 2018

Amazon recently released a feature called EC2 Fleets which appears to consolidate spot fleet requests with EC2 on-demand/auto-scaling group requests. Per their documentation, this appears to support a similar feature to desired capacity in auto-scaling:

You can modify the following parameters of an EC2 Fleet:

  • target-capacity – Increase or decrease the target capacity.

target-capacity appears to be very similar to Spot Fleet's weighted capacity, but you're able to change it over time. Being able to change the target-capacity parameter over time seems to align closely with changing the DesiredCapacity parameter of an auto-scaling group. Are there any plans to support EC2 Fleets with Kubernetes autoscaling?
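
For illustration, a minimal sketch of adjusting a fleet's target capacity with the AWS CLI (the fleet ID below is a placeholder):

# modify-fleet updates TotalTargetCapacity in place, much like changing an
# auto-scaling group's DesiredCapacity.
aws ec2 modify-fleet \
  --fleet-id fleet-12345678-90ab-cdef-1234-567890abcdef \
  --target-capacity-specification TotalTargetCapacity=10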

@itskingori (Member)

Amazon recently released a feature called EC2 Fleets which appears to consolidate spot fleet requests with EC2 on-demand/auto-scaling group requests.

According to this official blog post on EC2 Fleets, auto-scaling group support is still in progress. They say:

We plan to connect EC2 Fleet and EC2 Auto Scaling groups. This will let you create a single fleet that mixes instance types and Spot, Reserved and On-Demand, while also taking advantage of EC2 Auto Scaling features such as health checks and lifecycle hooks. This integration will also bring EC2 Fleet functionality to services such as Amazon ECS, Amazon EKS, and AWS Batch that build on and make use of EC2 Auto Scaling for fleet management.

Thought it worth mentioning since it greatly affects design/implementation.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Aug 18, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Sep 17, 2018
jfoy commented Oct 16, 2018

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Oct 16, 2018
geota commented Nov 16, 2018

@zdoherty @itskingori @jfoy
https://aws.amazon.com/about-aws/whats-new/2018/11/scale-instances-across-purchase-options-in-a-single-ASG/

I think this work is no longer blocked: ASGs now support EC2 Fleet functionality.

@cablespaghetti (Contributor)

@geota I believe this is solved by that release. It's now possible to have an ASG with spot instances in it...so in theory this should "just work" with the cluster autoscaler. I'm planning on doing some testing next week.

@disha1104

Any update on this? We are trying to achieve autoscaling with EC2-fleet.

@cablespaghetti (Contributor)

Spot Fleet and EC2 Fleet seem to have been superseded by this: https://aws.amazon.com/blogs/aws/new-ec2-auto-scaling-groups-with-multiple-instance-types-purchase-options/

It doesn't quite have all the same features, but it should fit most use cases. We've set it up with 100% spot instances of two different types using terraform (although you can have a mixture of spot and on-demand like in EC2 Fleet). The cluster autoscaler works very nicely with it. 😄
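
For reference, a sketch of such a group via the AWS CLI rather than terraform (the group name, launch template, subnet, and instance types are hypothetical); OnDemandPercentageAboveBaseCapacity=0 makes the group 100% spot, and raising OnDemandBaseCapacity gives you a base of on-demand instances:

# Mixed-instances ASG: assumes a launch template named "my-launch-template"
# already exists; Overrides lists the instance types the group may launch.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name spot-mixed-asg \
  --min-size 0 --max-size 10 \
  --vpc-zone-identifier subnet-0abc1234 \
  --mixed-instances-policy '{
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "my-launch-template",
        "Version": "$Latest"
      },
      "Overrides": [
        {"InstanceType": "m5.large"},
        {"InstanceType": "m4.large"}
      ]
    },
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 0,
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "lowest-price"
    }
  }'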

@disha1104

@cablespaghetti Thanks a lot. But does the present cluster autoscaler support spot instances (EC2 Fleet or Spot Fleet)? From what I know, it only supported on-demand instances.

Also, will there be service interruption with this approach?
Basically, what I am trying to achieve is an approach similar to "Spotinst" for managing K8s clusters on AWS. Can you suggest a similar approach for that?

@cablespaghetti (Contributor)

It doesn't know or care what kind of instances are behind the ASG. It just increases or decreases the "Desired Count", and the ASG config controls what instance types are used auto-magically. I'll upload my terraform config, which is 100% spot, but it's easy to have a base of X on-demand instances and use spot for the rest (see the linked blog post).
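
In other words, the scale-up reduces to a SetDesiredCapacity call; the manual CLI equivalent would be something like this (group name hypothetical):

# Cluster Autoscaler adjusts the group's desired capacity; the ASG's own
# mixed-instances configuration then decides which instance types to launch.
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name spot-mixed-asg \
  --desired-capacity 5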


@MaciekPytel (Contributor)

It doesn't know or care what kind of instances are behind the ASG

That is not true. CA works by performing careful simulations to see if adding more nodes would help pending pods. It needs to know exactly what a new node will look like to make good decisions.

The actuation is indeed done by increasing "desired count", so your setup will work to an extent. But it also means CA makes decisions based on incorrect data (it predicts what the next node will look like, and in your setup that prediction will be wrong most of the time).

Effectively you're going from "carefully choose how many nodes of each kind will be the best fit for your pods" to "let's add some nodes and hope for the best". It may be good enough for a very simple cluster (a single ASG where each pod can run on every available node type), but it's not something we recommend. As of now our official policy is that all nodes in a NodeGroup must be strictly identical, and things like multi-instance-type ASGs or multi-zonal ASGs are not officially supported by CA.

@disha1104

@MaciekPytel , so we can't use CA with spot instances (or EC2-fleets to be more precise)?

aleksandra-malinowska (Contributor) commented Dec 12, 2018

@MaciekPytel , so we can't use CA with spot instances (or EC2-fleets to be more precise)?

To clarify: Cluster Autoscaler cares very much about what the nodes (as in, Kubernetes Node objects) look like. It doesn't care about the underlying instance directly.

Most often, instance type will determine the node's scheduling properties, like allocatable resources. But if the nodes running on spot instances are indistinguishable from nodes running on regular instances, it should be OK to mix those.
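
As a quick sanity check (a sketch, not an official procedure), you can compare what the Node objects actually report across instance types:

# If spot-backed and on-demand-backed nodes show the same labels and
# allocatable resources, CA can treat them as interchangeable.
kubectl get nodes -L beta.kubernetes.io/instance-type
kubectl describe nodes | grep -A 6 'Allocatable:'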

@dominicgunn

Has there been any progress on this? Are we able to increment a spot fleet's desired count at will?

@gambol99

That is not true. CA works by performing careful simulations to see if adding more nodes would help pending pods. It needs to know exactly what a new node will look like to make good decisions.

So if you're using an ASG backed by a LaunchTemplate + MixedInstancePolicy, how does the CA determine the node capacity per ASG to simulate, given the group can be backed by a mixture of instance types?

@MaciekPytel (Contributor)

So if you're using an ASG backed by a LaunchTemplate + MixedInstancePolicy, how does the CA determine the node capacity per ASG to simulate, given the group can be backed by a mixture of instance types?

For precisely that reason, MixedInstancePolicy is not officially supported by CA (as of now). You can still set it up using the config provided by @cablespaghetti, or by creating a similar one of your own. If you do, CA will just pick one instance type and assume each node will look exactly like that. Depending on your exact setup and your luck, it may or may not result in correct scaling decisions.
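
For context, a sketch of registering such an ASG with CA statically (names hypothetical); CA then templates new nodes from a single representative instance type in the group:

# Static node-group registration takes min:max:asg-name. With a
# mixed-instances ASG, CA picks one type and assumes all new nodes match it.
cluster-autoscaler \
  --cloud-provider=aws \
  --nodes=0:10:spot-mixed-asg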

Has there been any progress on this? Are we able to increment spot fleets desired count at will?

There is an ongoing effort to make it work: #1473. Note that it still assumes all the instance types in the ASG are (roughly) the same size.

AndresPineros commented Mar 17, 2019

I think we should be able to simulate the behavior of projects like Spotinst. For me, the main requirement is to be able to use Spot instances BUT fall back to On-Demand whenever there are no Spot instances available, then replace the On-Demand instances whenever Spot is available again.

The current support for EC2 Fleet features in ASGs doesn't cover this (I think; please correct me if I'm wrong... and I hope I am). They just give you a base % for On-Demand and the rest for Spot, but if you don't have enough Spot instances because AWS interrupts them, the ASG won't replace them with On-Demand and then move back to Spot whenever possible. So we're screwed and still depending on luck.

I think this could be VERY easily solved by the cluster-autoscaler if it allowed assigning priorities to ASGs when scaling up. I could have two ASGs: one with a MixedPolicy pointing to multiple Spot instance pools, and another with my On-Demand instances. If I could configure the CA to always try to scale up using the Spot ASG first, and to use the On-Demand ASG only when that's not possible, we would have the same behavior as Spotinst.

EDIT: I think they are already working on this by adding a price-based expander. That would be even better, because prices would be weighed dynamically, but it is a much more complex feature than just letting a user pick priorities for the ASGs. I'd simply do something like:

autoscalingGroups:
  - name: nodes-spot-large # This one could be a MixedPolicy with m3, m4 and m5.large spot pools.
    minSize: 0
    maxSize: 10
    priority: 10
  - name: nodes-spot-xlarge # This one could be a MixedPolicy with m3, m4 and m5.xlarge spot pools.
    minSize: 0
    maxSize: 10
    priority: 20
  - name: nodes-on-demand-large # Try to use smaller instances before going to xlarge.
    minSize: 0
    maxSize: 10
    priority: 30
  - name: nodes-on-demand-xlarge
    minSize: 0
    maxSize: 10
    priority: 40

@Vlaaaaaaad

I'd really like to see native support for this in cluster-autoscaler too.

As a sidenote, there is also k8s-spot-rescheduler which might be of interest to the discussion.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 16, 2019
Jeffwan (Contributor) commented Jun 24, 2019

/remove-lifecycle stale

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 24, 2019
@tbarrella (Contributor)

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Jul 26, 2019
@jaypipes (Contributor)

Can we close this issue now? #1886 has merged and been cherry-picked back to 1.14. There is also documentation about how to use Spot + On-Demand in the same ASG. There are known limitations, including the roughly-same-sized-instances restriction that @MaciekPytel mentioned, but I believe the gist of this issue has been addressed. @zdoherty, can you share your thoughts? Thanks!

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Nov 25, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Dec 25, 2019
Jeffwan (Contributor) commented Dec 26, 2019

MixedInstancesPolicy instance pools now support a minimum of one instance type (it was two before). From a feature perspective, there's no difference compared to EC2 Fleet, so we can close this issue.

Feel free to reopen if anyone has questions.

https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_InstancesDistribution.html

Jeffwan (Contributor) commented Dec 26, 2019

/close

@k8s-ci-robot (Contributor)

@Jeffwan: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
