
EC2 Fleet Autoscaling Support? #838

Closed
zdoherty opened this issue May 9, 2018 · 29 comments
Labels: area/cluster-autoscaler, area/provider/aws, lifecycle/rotten

Comments

zdoherty commented May 9, 2018

Amazon recently released a feature called EC2 Fleets which appears to consolidate spot fleet requests with EC2 on-demand/auto-scaling group requests. Per their documentation, this appears to support a similar feature to desired capacity in auto-scaling:

You can modify the following parameters of an EC2 Fleet:

  • target-capacity – Increase or decrease the target capacity.

target-capacity appears to be very similar to Spot Fleet's weighted capacity, but you're able to change it over time. Being able to change the target-capacity parameter over time seems to align closely with changing the DesiredCapacity parameter of an auto-scaling group. Are there any plans to support EC2 Fleets with Kubernetes autoscaling?
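
For illustration, a minimal sketch of adjusting a fleet's target capacity with the AWS CLI (the fleet ID below is a placeholder):

# modify-fleet updates TotalTargetCapacity in place, much like changing an
# auto-scaling group's DesiredCapacity.
aws ec2 modify-fleet \
  --fleet-id fleet-12345678-90ab-cdef-1234-567890abcdef \
  --target-capacity-specification TotalTargetCapacity=10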

@itskingori (Member)

Amazon recently released a feature called EC2 Fleets which appears to consolidate spot fleet requests with EC2 on-demand/auto-scaling group requests.

According to this official blog post on EC2 Fleets, auto-scaling group support is still in progress. They say:

We plan to connect EC2 Fleet and EC2 Auto Scaling groups. This will let you create a single fleet that mixes instance types and Spot, Reserved and On-Demand, while also taking advantage of EC2 Auto Scaling features such as health checks and lifecycle hooks. This integration will also bring EC2 Fleet functionality to services such as Amazon ECS, Amazon EKS, and AWS Batch that build on and make use of EC2 Auto Scaling for fleet management.

Thought it worth mentioning since it greatly affects design/implementation.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Aug 18, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Sep 17, 2018
jfoy commented Oct 16, 2018

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Oct 16, 2018
geota commented Nov 16, 2018

@zdoherty @itskingori @jfoy
https://aws.amazon.com/about-aws/whats-new/2018/11/scale-instances-across-purchase-options-in-a-single-ASG/

I think this work is no longer blocked: ASGs now support EC2 Fleet functionality.

@cablespaghetti (Contributor)

@geota I believe this is solved by that release. It's now possible to have an ASG with spot instances in it...so in theory this should "just work" with the cluster autoscaler. I'm planning on doing some testing next week.

@disha1104

Any update on this? We are trying to achieve autoscaling with EC2-fleet.

@cablespaghetti (Contributor)

Spot Fleet and EC2 Fleet seem to have been superseded by this: https://aws.amazon.com/blogs/aws/new-ec2-auto-scaling-groups-with-multiple-instance-types-purchase-options/

It doesn't quite have all the same features, but it should fit most use cases. We've set it up with 100% spot instances of two different types using terraform (although you can have a mixture of spot and on-demand like in EC2 Fleet). The cluster autoscaler works very nicely with it. 😄
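
For reference, a sketch of such a group via the AWS CLI rather than terraform (the group name, launch template, subnet, and instance types are hypothetical); OnDemandPercentageAboveBaseCapacity=0 makes the group 100% spot, and raising OnDemandBaseCapacity gives you a base of on-demand instances:

# Mixed-instances ASG: assumes a launch template named "my-launch-template"
# already exists; Overrides lists the instance types the group may launch.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name spot-mixed-asg \
  --min-size 0 --max-size 10 \
  --vpc-zone-identifier subnet-0abc1234 \
  --mixed-instances-policy '{
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "my-launch-template",
        "Version": "$Latest"
      },
      "Overrides": [
        {"InstanceType": "m5.large"},
        {"InstanceType": "m4.large"}
      ]
    },
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 0,
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "lowest-price"
    }
  }'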

@disha1104

@cablespaghetti Thanks a lot. But does the present cluster autoscaler support spot instances (EC2 Fleet or Spot Fleet)? From what I know, it only supported on-demand instances.

Also, will there be service interruption with this approach?
Basically, what I am trying to achieve is an approach similar to "Spotinst" for managing K8s clusters on AWS. Can you suggest a similar approach for that?

@cablespaghetti (Contributor)

It doesn't know or care what kind of instances are behind the ASG. It just increases or decreases the "Desired Count", and the ASG config controls what instance types are used auto-magically. I'll upload my terraform config, which is 100% spot, but it's easy to have a base of X on-demand instances and use spot for the rest (see the linked blog post).
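
In other words, the scale-up reduces to a SetDesiredCapacity call; the manual CLI equivalent would be something like this (group name hypothetical):

# Cluster Autoscaler adjusts the group's desired capacity; the ASG's own
# mixed-instances configuration then decides which instance types to launch.
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name spot-mixed-asg \
  --desired-capacity 5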


@MaciekPytel (Contributor)

It doesn't know or care what kind of instances are behind the ASG

That is not true. CA works by performing careful simulations to see if adding more nodes would help pending pods. It needs to know exactly what a new node will look like to make good decisions.

The actuation is indeed done by increasing "desired count", so your setup will work to an extent. But it also means CA makes decisions based on incorrect data (it predicts what the next node will look like, and in your setup that prediction will be wrong most of the time).

Effectively you're going from "carefully choose how many nodes of each kind will be the best fit for your pods" to "let's add some nodes and hope for the best". It may be good enough for a very simple cluster (a single ASG where each pod can run on every available node type), but it's not something we recommend. As of now our official policy is that all nodes in a NodeGroup must be strictly identical, and things like multi-instance-type ASGs or multi-zonal ASGs are not officially supported by CA.

@disha1104

@MaciekPytel , so we can't use CA with spot instances (or EC2-fleets to be more precise)?

aleksandra-malinowska (Contributor) commented Dec 12, 2018

@MaciekPytel , so we can't use CA with spot instances (or EC2-fleets to be more precise)?

To clarify: Cluster Autoscaler cares very much about what the nodes (as in, Kubernetes Node objects) look like. It doesn't care about the underlying instance directly.

Most often, instance type will determine the node's scheduling properties, like allocatable resources. But if the nodes running on spot instances are indistinguishable from nodes running on regular instances, it should be OK to mix those.
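
As a quick sanity check (a sketch, not an official procedure), you can compare what the Node objects actually report across instance types:

# If spot-backed and on-demand-backed nodes show the same labels and
# allocatable resources, CA can treat them as interchangeable.
kubectl get nodes -L beta.kubernetes.io/instance-type
kubectl describe nodes | grep -A 6 'Allocatable:'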

@dominicgunn

Has there been any progress on this? Are we able to increment a spot fleet's desired count at will?

@gambol99

That is not true. CA works by performing careful simulations to see if adding more nodes would help pending pods. It needs to know exactly what a new node will look like to make good decisions.

So if you're using an ASG backed by a LaunchTemplate + MixedInstancePolicy, how does the CA determine the node capacity per ASG to simulate, given the group can be backed by a mixture of instance types?

@MaciekPytel (Contributor)

So if you're using an ASG backed by a LaunchTemplate + MixedInstancePolicy, how does the CA determine the node capacity per ASG to simulate, given the group can be backed by a mixture of instance types?

For precisely that reason, MixedInstancePolicy is not officially supported by CA (as of now). You can still set it up using the config provided by @cablespaghetti, or by creating a similar one of your own. If you do, CA will just pick one instance type and assume each node will look exactly like that. Depending on your exact setup and your luck, it may or may not result in correct scaling decisions.
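
For context, a sketch of registering such an ASG with CA statically (names hypothetical); CA then templates new nodes from a single representative instance type in the group:

# Static node-group registration takes min:max:asg-name. With a
# mixed-instances ASG, CA picks one type and assumes all new nodes match it.
cluster-autoscaler \
  --cloud-provider=aws \
  --nodes=0:10:spot-mixed-asg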

Has there been any progress on this? Are we able to increment spot fleets desired count at will?

There is an ongoing effort to make it work: #1473. Note that it still assumes all the instance types in the ASG are (roughly) the same size.

AndresPineros commented Mar 17, 2019

I think we should be able to simulate the behavior of projects like Spotinst. For me, the main requirement is to be able to use Spot instances BUT fall back to On-Demand whenever there are no Spot instances available, then replace the On-Demand instances whenever Spot is available again.

The current support for EC2 Fleet features in ASGs doesn't cover this (I think; please correct me if I'm wrong... and I hope I am). They just give you a base % for On-Demand and the rest for Spot, but if you don't have enough Spot instances because AWS interrupts them, the ASG won't replace them with On-Demand and then move back to Spot whenever possible. So we're screwed and still depending on luck.

I think this could be VERY easily solved by the cluster-autoscaler if it allowed assigning priorities to ASGs when scaling up. I could have two ASGs: one with a MixedPolicy pointing to multiple Spot instance pools, and another with my On-Demand instances. If I could configure the CA to always try to scale up using the Spot ASG first, and to use the On-Demand ASG only when that's not possible, we would have the same behavior as Spotinst.

EDIT: I think they are already working on this by adding a price-based expander. That would be even better, because prices would be weighed dynamically, but it is a much more complex feature than just letting a user pick priorities for the ASGs. I'd simply do something like:

autoscalingGroups:
  - name: nodes-spot-large # This one could be a MixedPolicy with m3, m4 and m5.large spot pools.
    minSize: 0
    maxSize: 10
    priority: 10
  - name: nodes-spot-xlarge # This one could be a MixedPolicy with m3, m4 and m5.xlarge spot pools.
    minSize: 0
    maxSize: 10
    priority: 20
  - name: nodes-on-demand-large # Try to use smaller instances before going to xlarge.
    minSize: 0
    maxSize: 10
    priority: 30
  - name: nodes-on-demand-xlarge
    minSize: 0
    maxSize: 10
    priority: 40

@Vlaaaaaaad

I'd really like to see native support for this in cluster-autoscaler too.

As a sidenote, there is also k8s-spot-rescheduler which might be of interest to the discussion.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 16, 2019
Jeffwan (Contributor) commented Jun 24, 2019

/remove-lifecycle stale

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 24, 2019
@tbarrella (Contributor)

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Jul 26, 2019
@jaypipes (Contributor)

Can we close this issue now? #1886 has merged and been cherry-picked back to 1.14. There is also documentation about how to use Spot + On-Demand in the same ASG. There are known limitations, including the roughly-same-sized-instances restriction that @MaciekPytel mentioned, but I believe the gist of this issue has been addressed. @zdoherty, can you share your thoughts? Thanks!

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Nov 25, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Dec 25, 2019
Jeffwan (Contributor) commented Dec 26, 2019

MixedInstancesPolicy instance pools now support a minimum of one instance type (it was two before). From a feature perspective, there's no difference compared to EC2 Fleet, so we can close this issue.

Feel free to reopen if anyone has questions.

https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_InstancesDistribution.html

Jeffwan (Contributor) commented Dec 26, 2019

/close

@k8s-ci-robot (Contributor)

@Jeffwan: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
