[ECS] [EC2]: better integration between service and instance autoscaling #76
Comments
We're doing the same thing, with a periodic Lambda that combines a strategy similar to the ...
Actually, it would be great if the cluster supported an "n + 1" configuration, always keeping at least one instance running so that new tasks can be placed when no other instance has enough resources.
@matthewcummings I would like to extend this with requiring a stand-by per AZ that the cluster is active in. The current behavior of ECS scheduling seems quite dangerous to me in the case where a single node is available with plenty of space: even with a placement strategy of spread across AZs, it will put ALL the tasks on the single instance with available space.
I just finished implementing this available-container-count scaling for our ECS clusters and would be happy to chat with someone from AWS if they've got questions. I was just now working on a public repo + blog post with my implementation. UPDATE: Since AWS is working on a solution for this, I'll probably just abandon the blog post. Here are some brief notes I had taken on the solution I've implemented: https://gist.github.com/jamiegs/296943b1b6ab4bdcd2a9d28e54bc3de0
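For anyone who wants the shape of that approach without reading the full gist, here is a minimal sketch of the idea described above: publish the number of additional copies of your largest task that the cluster could still place as a custom CloudWatch metric, and scale the ASG on it. The cluster name, task sizes, and metric name below are placeholders, not values from the gist.

```python
import boto3

ecs = boto3.client("ecs")
cloudwatch = boto3.client("cloudwatch")

CLUSTER = "my-cluster"  # placeholder cluster name
TASK_CPU = 1024         # CPU units needed by the largest task (assumed)
TASK_MEM = 2048         # memory (MiB) needed by the largest task (assumed)

def schedulable_task_count(cluster):
    """Count how many more copies of the largest task the cluster could place."""
    # Pagination omitted for brevity; this handles up to 100 instances.
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return 0
    instances = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns
    )["containerInstances"]
    total = 0
    for inst in instances:
        remaining = {r["name"]: r.get("integerValue", 0)
                     for r in inst["remainingResources"]}
        total += min(remaining.get("CPU", 0) // TASK_CPU,
                     remaining.get("MEMORY", 0) // TASK_MEM)
    return total

def handler(event, context):
    # Publish the headroom as a custom metric; ASG scaling policies can
    # then scale out when it drops below the desired buffer.
    cloudwatch.put_metric_data(
        Namespace="Custom/ECS",
        MetricData=[{
            "MetricName": "SchedulableContainers",
            "Dimensions": [{"Name": "ClusterName", "Value": CLUSTER}],
            "Value": schedulable_task_count(CLUSTER),
            "Unit": "Count",
        }],
    )
```

Run periodically (for example from a scheduled Lambda), a step-scaling policy on the ASG can then add instances when SchedulableContainers falls below the buffer you want and remove them when it is comfortably above it.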
It's good to see that this topic is gaining awareness. Actually, I was thinking of changing the metric I described in my blog post so that when the value increases, the cluster size should also increase (like a ContainerBufferFillRate). This would allow target tracking and make the configuration easier.
We currently scale out and in on reservation. We are starting to run into scenarios where very large tasks (16 GB of memory) are no longer placed after a scale-in. There is enough total space in the cluster to fit them, and we are below our 90% reservation, but there is not enough space on any single node to place the task. Events are published, but the only way to know whether a pending task is due to lack of space vs. a bad task definition is by parsing the service events per service.
@jamiegs are they? I'm planning/testing an EC2 cluster implementation that I would like to eventually autoscale, but everything I'm reading still suggests the type of workarounds described in posts linked from this issue - I can't find anything official.
Ahh sorry, I missed it - just checking if you had any inside info =) For anyone else who missed it, this is currently in the Research phase on the roadmap, so if you're trying to do this now it appears Lambda-based cluster scaling is the way to go.
I just ran across #121, which is similar to my request, if not a duplicate. At the end of the day, we all want a reliable way to ensure that there are always enough instances running to add additional tasks when they are needed.
You can work around this by using DAEMON tasks (instead of REPLICA) and doing all scaling at the ASG level (instead of application auto scaling). It works OK if you only have one service per cluster, but it is kind of an abuse of daemon sets.
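For illustration, a minimal sketch of this DAEMON workaround (all names are placeholders): the service runs exactly one task per container instance, so scaling the ASG effectively scales the service.

```python
import boto3

ecs = boto3.client("ecs")

# One task per container instance; DAEMON services take no desiredCount.
ecs.create_service(
    cluster="my-cluster",              # placeholder
    serviceName="my-daemon-service",   # placeholder
    taskDefinition="my-task:1",        # placeholder
    schedulingStrategy="DAEMON",
)

# All scaling then happens at the ASG level: raising the group's desired
# capacity adds an instance, and ECS automatically places one task on it.
autoscaling = boto3.client("autoscaling")
autoscaling.set_desired_capacity(
    AutoScalingGroupName="my-ecs-asg",  # placeholder
    DesiredCapacity=4,
)
```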
Hi everyone, we are actively researching this and have a proposed solution in mind. This solution would work as follows:
Thoughts on this proposal? Please let us know!
This would be awesome!
Sign me up
I love it.
This is a much-needed feature for ECS. Right now, users have to over-provision their cluster instances or implement custom engineering solutions using Lambdas/CloudWatch for scale-out and scale-in scenarios. Cluster autoscaling that is aware of the services/tasks is absolutely necessary. While this may not be applicable to Fargate, it is still needed for ECS use cases. I hope this gets prioritized and delivered; we have been waiting for this.
@coultn: I think your proposal will work just fine for clusters that start their tasks using ECS services. I have a few thoughts to keep in mind:
I can beta test some of your proposed changes at my work if you're interested.
Can you please clarify:
Does this mean that if I have a cluster with 10 services, then the new metric will be over 100% if it cannot fit 1 task for each service combined (additive), for a total overhead of the combined requirements of the 10 tasks? Or is it a "shared" overhead that essentially guarantees your service/task with the largest deployment can add one more?
Is the "full" metric CPU, Memory, connections, disk, something else? I feel like this type of metric makes sense for
Can someone explain how the metric would work for large, mixed-workload, multi-ASG clusters? If that's an anti-pattern for ECS, it would also be good to know where the product roadmap is headed.
I second @masneyb's third point. We use ECS in combination with the Jenkins ECS plug-in to start containers (tasks) for every Jenkins job. The ECS plug-in is smart enough to retry tasks that failed due to insufficient resources. But I don't see how this new metric could be of much help in this case, since it still only looks at the current resource usage and not the required resources. Setting a threshold < 100% is only a heuristic.
This sounds good. Will the scale-out policy also take care of AZ spread? That is, will the scaling activity start new instances in the AZs that the tasks need, or will instance placement be random?
@coultn sounds good overall, with the exception of one thing (which I may be misunderstanding).
To me the statement in bold implies that when the cluster "fullness" metric is at 100% there is still space for at least one more task, which is not what I would expect, especially since you are not allowed to set a target tracking metric of greater than 100%. What do you do if you actually want your cluster to be fully (efficiently) allocated? As an example, let's say my cluster consists of 5 nodes, each with 2 vCPUs, running a single service where each task requires 1 vCPU of capacity. My understanding of the current proposal is
My expectation of what the metric would be:
So ideally for me, at 10 tasks with 100% target tracking the ASG would be at steady state. If the ECS service tries to allocate an 11th task, the metric would go to 110% and target tracking would cause the ASG to start a 6th node. If I decide instead that I do want hot-spare behavior, then I would set my target fullness to 90%. To expound further on my use case: my intention would be to set target tracking at the ASG level to 100% allocation and then separately set target tracking at the ECS service level to a specific CPU utilization (30%, for example). Rather than having a spare node doing nothing, I would have all nodes active, but with sufficient CPU capacity to handle temporary spikes. If traffic gradually starts to climb and average CPU usage goes above 30%, then ECS would attempt to start more tasks and the ASG would start more nodes, and while the new nodes are starting up there is still sufficient CPU headroom. I definitely think you should make it easy for end users to determine the appropriate percentage for having one, two, or three hot spares, since the math won't always be as simple as my example. But I think 100% utilization should be an option, even if you don't think it should be the default. Perhaps in the console you could auto-calculate and pre-fill the "1 hot spare" percentage for users, or at least pre-calculate some examples.
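To make the arithmetic in that example concrete, here is the calculation under the commenter's preferred definition (allocated capacity over total capacity - an illustration of the expectation above, not necessarily the metric AWS ultimately shipped):

```python
nodes, vcpus_per_node, task_vcpus = 5, 2, 1
total_capacity = nodes * vcpus_per_node            # 10 vCPUs

for tasks in (9, 10, 11):
    fullness = 100 * tasks * task_vcpus / total_capacity
    print(f"{tasks} tasks -> {fullness:.0f}%")
# 9 tasks  -> 90%   (room for one hot spare at a 90% target)
# 10 tasks -> 100%  (steady state at a 100% target)
# 11 tasks -> 110%  (target tracking starts a 6th node)
```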
Thanks for the comments/questions everyone! Some clarifications and additional details:
@coultn YES. Currently we're investigating K8S because of this and other reasons.
I'm not sure if this is a bug or should be a feature request, but I noticed that if a capacity provider is created with Managed Scaling DISABLED and Managed Instance Protection ENABLED, neither managed instance protection nor the task "queuing" feature works (where a launched task goes into a provisioning state before running).
I stumbled across the very same problem. I worked around it by enabling managed scaling but then removing the scaling policy from the ASG (the one that is automatically created once you associate the ASG with the capacity provider). New tasks are then still "queued", but no automated scaling happens.
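A rough sketch of that workaround in code (the group name is a placeholder; the CAS-created policy is a target tracking policy, so this simply deletes any target tracking policy on the group - inspect the policy names on your ASG before doing this for real):

```python
import boto3

autoscaling = boto3.client("autoscaling")
ASG = "my-ecs-asg"  # placeholder ASG name

# Find and remove the target tracking policy that the capacity provider
# attached to the ASG; tasks keep queuing, but no automatic scaling runs.
for policy in autoscaling.describe_policies(
        AutoScalingGroupName=ASG)["ScalingPolicies"]:
    if policy["PolicyType"] == "TargetTrackingScaling":
        autoscaling.delete_policy(AutoScalingGroupName=ASG,
                                  PolicyName=policy["PolicyName"])
```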
Thanks for your suggestion.
For those who are curious about more details on how ECS cluster auto scaling works, we just published a blog post that dives deep on that topic: https://aws.amazon.com/blogs/containers/deep-dive-on-amazon-ecs-cluster-auto-scaling/
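In short, the metric the post describes works roughly like this (my paraphrase): if N is the number of instances currently running and M is the number of instances the cluster would need to run all current tasks, including provisioning ones, then CapacityProviderReservation = 100 × M / N. A small illustration:

```python
def capacity_provider_reservation(m_needed: int, n_running: int) -> float:
    """CapacityProviderReservation as described in the deep-dive post:
    100 * M / N, where M is the instance count needed and N the count running."""
    return 100.0 * m_needed / n_running

print(capacity_provider_reservation(10, 10))  # 100.0 -> steady state
print(capacity_provider_reservation(12, 10))  # 120.0 -> scale out
print(capacity_provider_reservation(8, 10))   # 80.0  -> scale in
```

With a target of 100, target tracking keeps N equal to M; a target below 100 keeps spare capacity, which is how the "hot spare" behavior discussed earlier falls out.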
@coultn Hey Nick, question for you, and apologies if this has already been covered - I'm just getting my feet wet with all this (admittedly I have not tried creating a test case yet). Does all this work about the same if my auto scaling group is based on weighted capacity instead of just instance count? For example, if my main concern is memory usage, I might define an ASG that can use several different instance types with 2/4/8 GB of memory etc. Instead of desired capacity representing an instance count, I would rather move the desired capacity along the lines of "move me from 50 GB to 60 GB total memory using any combination of these X types". Do the new capacity providers work with this type of ASG as well?
@dferretti-fig The short answer to your question is "no." Longer answer: you can use multiple instance types in a single ASG with ECS cluster auto scaling, BUT you must set all of the weights to 1 (or not use weights). The metric calculation detailed in the blog assumes that the ASG capacity is equal to the number of instances, otherwise the metric calculation will be incorrect. When using multiple instance types, the scale-out behavior falls back to a fixed scaling step size equal to the minimum scaling step size configured in the capacity provider. The reason for this is that there is no guarantee that your tasks will be constrained on the same resource dimension across all instance types - in other words, you may be CPU bound on one instance type but ENI bound on another, so trying to scale out using a single capacity metric isn't possible in a way that is guaranteed to always work.
@coultn Gotcha - thanks for the response!
@coultn Any plans to change it so it works with weighted ASGs? I see it as a valuable feature, especially in the case of spot instances, where I am willing to get a 2x instance for less money than a 1x. I might not fully understand the problem with the guarantee, since it scales out using the minimum scaling step anyway. ASG instances can report their weighted capacity to the metric, so the metric could be weight-based, couldn't it? Even if tasks are constrained on different resources, in the worst case scaling would take minimal steps and re-evaluate the metric.
@kkopachev you can use it with Spot, including with mixed instance types. The only caveat is that you must not set the weights (or set them to 1), and scaling will be a bit slower than if you used a single instance type. But you can use any combination of instances with ECS cluster auto scaling (as long as the instances can actually run your tasks, of course).
@coultn To be fair, not being able to set weights and falling back to step scaling are pretty big caveats. We were hoping that this feature would allow us to move off of our custom scaler Lambda solution, but the trade-offs we'd have to make in terms of responsiveness and cost optimization don't seem worthwhile at this point.
@coultn The blog article says that CAS falls back to min-step scaling even in the case of identical instance types but multiple AZs - is that correct? That basically means it will always use min-step scaling unless your ASG is restricted to one instance type in one AZ, right?
@kkopachev correct.
@coultn I want to know whether "ECS blue/green deployment" can be used with CAS. If it can't be used now, are there any plans for it? We need to deploy new tasks to scaled instances with blue/green, and we also want to use CAS.
TLDR: this feature does not work when using an ECS service with auto scaling, only for manually placed tasks. I have just set up a new cluster according to the tutorial. The event log is stuck on: It seems that ECS auto scaling only works when you place a task manually. Service auto scaling fails to place a task, thus no ECS auto scaling actually takes place... This is either a major bug, or I'm missing something in the configuration of my service. @coultn Can you please advise?
Service auto scaling works with ECS cluster auto scaling; that is not the issue. This appears to be a bug specifically with the
@coultn thanks! Please post the ticket ID here when you can.
Hi, I am not using a 'distinctInstance' constraint, nor do I have a service autoscaling policy, yet I encounter the same issue @regevbr describes. Tasks are also not placed if there is insufficient memory in the cluster, and I do not see them in a 'PROVISIONING' state. I have confirmed that my cluster uses a capacity provider that has 'managed scaling' enabled:
My ASG is multi-AZ, multi-instance type, using spot instances. What am I doing wrong?
Hi @ian-axelrod, thanks for surfacing the issue that you are facing. Would you be willing to share more details via email so that we can dive deep into it? Could you send me your log files, ClusterID, and AWS account ID at [email protected]? Happy to work with you to resolve the issue.
Hi @pavneeta
@kiwibel There is a known bug (https://github.com/aws/containers-roadmap/issues/7239) with using Capacity Providers and ECS Services with the distinctInstance placement constraint. The workaround for now is assigning a hostPort in the Task Definition and removing the constraint altogether. Also, an ECS Service with a Capacity Provider will never go into READY mode when it has a
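For reference, a minimal sketch of that hostPort workaround (all names, images, and ports here are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="my-task",                    # placeholder
    containerDefinitions=[{
        "name": "app",
        "image": "example/app:latest",   # placeholder
        "memory": 512,
        "portMappings": [{
            "containerPort": 8080,
            "hostPort": 8080,  # fixed host port instead of dynamic mapping (0)
            "protocol": "tcp",
        }],
    }],
)
# Note: a fixed hostPort means at most one task per instance can bind that
# port, so this trades away dynamic port mapping to sidestep the bug above.
```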
@kiwibel Could you please confirm whether you are using the distinct placement constraint? Would it be possible to share your placement configuration?
Hi Tore - yes, you are correct, this is a known issue and we are currently working on deploying a fix.
Thanks @toredash @pavneeta
@kiwibel I haven't tested with memberOf. For that specific use case I would use spread on AZ instead of using memberOf, but if you need to limit to specific AZs, that won't work. I would create a stripped-down busybox example service with that placementConstraint setting; if it does not work, file a support ticket with AWS with the example code. That's what I did to get confirmation that it was a bug.
Thanks again @toredash. It turned out autoscaling didn't work because existing services need to be updated to use a capacity provider strategy. I had the impression they would automatically pick up the default one from the cluster 😳 On a separate note, after configuring this in production we have faced an issue where the underlying ASG sometimes won't scale in down to the minimum number of instances even though the corresponding CloudWatch alarm went off. AWS support is currently investigating this. Cheers
What do you have to update in services to use the capacity provider? I am under the impression that if we are not using the launchType field then it will use the default cluster strategy.
I believe that is true if you create a new service. Updating a service from the GUI does not assign it the default capacity provider.
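For anyone hitting the same thing, a sketch of updating an existing service to use the capacity provider (names are placeholders; the strategy change takes effect with a new deployment):

```python
import boto3

ecs = boto3.client("ecs")

ecs.update_service(
    cluster="my-cluster",    # placeholder
    service="my-service",    # placeholder
    capacityProviderStrategy=[{
        "capacityProvider": "my-capacity-provider",  # placeholder
        "weight": 1,
    }],
    # Force a new deployment so existing tasks move to the new strategy.
    forceNewDeployment=True,
)
```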
Tell us about your request
Blog posts like these exist because it is difficult to coordinate service autoscaling with instance autoscaling:
https://engineering.depop.com/ahead-of-time-scheduling-on-ecs-ec2-d4ef124b1d9e
https://garbe.io/blog/2017/04/12/a-better-solution-to-ecs-autoscaling/
https://www.unicon.net/about/blogs/aws-ecs-auto-scaling
Which service(s) is this request for?
ECS and EC2
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I would love for ECS to provide a simple/easy way to tell a supporting EC2 ASG to scale up when a task cannot be placed on its cluster. I'd also love to see this concern addressed: #42
Are you currently working around this issue?
I'm doing something similar to this: https://garbe.io/blog/2017/04/12/a-better-solution-to-ecs-autoscaling/
Additional context
Yes - please note that I love Lambda and Fargate, but sometimes regular old ECS is a better fit. FWIW, Google Cloud has had cluster autoscaling for a long time now: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler. Also, I haven't tried EKS yet, but cluster autoscaling would be super helpful there.