
[ECS] [EC2]: better integration between service and instance autoscaling #76

Closed
matthewcummings opened this issue Dec 20, 2018 · 92 comments

@matthewcummings

Tell us about your request
Blog posts like this exist because it is difficult to coordinate service autoscaling with instance autoscaling:
https://engineering.depop.com/ahead-of-time-scheduling-on-ecs-ec2-d4ef124b1d9e
https://garbe.io/blog/2017/04/12/a-better-solution-to-ecs-autoscaling/
https://www.unicon.net/about/blogs/aws-ecs-auto-scaling

Which service(s) is this request for?
ECS and EC2

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I would love for ECS to provide a simple/easy way to tell a supporting EC2 ASG to scale up when a task cannot be placed on its cluster. I'd also love to see this concern addressed: #42

Are you currently working around this issue?
I'm doing something similar to this: https://garbe.io/blog/2017/04/12/a-better-solution-to-ecs-autoscaling/

Additional context
Yes, please note that I love Lambda and Fargate but sometimes regular old ECS is a better fit and fwiw, Google Cloud has had cluster autoscaling for a long time now: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler. Also, I haven't tried EKS yet but cluster autoscaling would be super helpful there.

@matthewcummings matthewcummings added the Proposed Community submitted issue label Dec 20, 2018
@idubinskiy

We're doing the same thing, with a periodic Lambda that combines a strategy similar to the garbe.io blog post with detection of pending tasks. We've continued to fine-tune the logic to try to strike a good balance between availability and cost, but it would be very convenient if ECS provided this functionality, or at least published metrics that allow scaling the cluster out, and especially in, based on actual service/task capacity.

@matthewcummings
Author

Actually, it would be great if the cluster supported an "n + 1" configuration, always keeping at least one spare instance running so that new tasks can be placed when no other instance has enough resources.

@jespersoderlund

@matthewcummings I would like to extend this by requiring a stand-by instance per AZ that the cluster is active in. The current behavior of ECS scheduling is quite dangerous in my mind in the case where only a single node is available with plenty of space. Even with a placement strategy of spread across AZs, it will put ALL the tasks on the single instance with available space.

@jamiegs

jamiegs commented Jan 10, 2019

I just finished implementing this available container count scaling for our ECS clusters and would be happy to chat with someone from AWS if they've got questions. I was just now working on a public repo + blog post with my implementation.

UPDATE: Since AWS is working on a solution for this, I'll probably just abandon the blog post. Here are some brief notes I took on the solution I've implemented: https://gist.github.com/jamiegs/296943b1b6ab4bdcd2a9d28e54bc3de0
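
For readers following the same approach, here is a minimal sketch (in Python/boto3) of what such a periodic Lambda might look like. It is only an illustration of the garbe.io-style idea referenced above (publish a "how many more copies of the largest task still fit" metric and let an ASG scaling policy act on it), not the implementation from the linked gist; the cluster name, metric name, and reference task size are assumed placeholders.

    import boto3

    ecs = boto3.client("ecs")
    cloudwatch = boto3.client("cloudwatch")

    CLUSTER = "my-cluster"            # assumption: your cluster name
    TASK_CPU, TASK_MEM = 1024, 2048   # assumption: CPU units / MiB of the largest task


    def lambda_handler(event, context):
        # List the container instances registered to the cluster (pagination omitted).
        instance_arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
        schedulable = 0
        if instance_arns:
            instances = ecs.describe_container_instances(
                cluster=CLUSTER, containerInstances=instance_arns
            )["containerInstances"]
            for instance in instances:
                remaining = {r["name"]: r.get("integerValue", 0) for r in instance["remainingResources"]}
                # How many more copies of the reference task still fit on this instance?
                schedulable += min(
                    remaining.get("CPU", 0) // TASK_CPU,
                    remaining.get("MEMORY", 0) // TASK_MEM,
                )

        # Publish the count so an ASG scaling policy (or alarm) can act on it.
        cloudwatch.put_metric_data(
            Namespace="Custom/ECS",
            MetricData=[{
                "MetricName": "SchedulableContainers",
                "Dimensions": [{"Name": "ClusterName", "Value": CLUSTER}],
                "Value": schedulable,
                "Unit": "Count",
            }],
        )
        return schedulable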

@pgarbe

pgarbe commented Jan 11, 2019

It's good to see that this topic is getting attention. Actually, I was thinking of changing the metric I described in my blog post so that when its value increases, the cluster size should also increase (something like a ContainerBufferFillRate). That would make it possible to use target tracking and would simplify the configuration.

@zbintliff

zbintliff commented Jan 30, 2019

We currently scale out and in based on reservation. We are starting to run into scenarios where very large tasks (16 GB of memory) are no longer placed after a scale-in. There is enough total space in the cluster to fit the task, and we are below our 90% reservation threshold, but there is not enough space on any single node to place the task.

Events are published, but the only way to know whether a pending task is due to lack of space versus a bad task definition is to parse the service events for each service.

@tabern tabern added the ECS Amazon Elastic Container Service label Jan 30, 2019
@hlarsen

hlarsen commented Feb 6, 2019

UPDATE: Since AWS is working on a solution for this

@jamiegs are they?

i'm planning/testing an ec2 cluster implementation that i would like to eventually autoscale, however everything i'm reading still suggests the type of workarounds described in posts linked from this issue - i can't find anything official.

@jamiegs

jamiegs commented Feb 6, 2019

@jamiegs are they?

@hlarsen well, I guess I assume they are, since they have this ticket to improve autoscaling under "Researching" on their roadmap.

@hlarsen

hlarsen commented Feb 6, 2019

ahh sorry, i missed it - just checking if you had any inside info =)

for anyone else who missed it, this is currently in the Research phase on the roadmap, so if you're trying to do this now it appears lambda-based cluster scaling is the way to go.

@matthewcummings
Author

I just ran across #121 which is similar to my request if not a duplicate. At the end of the day we all want a reliable way to ensure that there are always enough instances running to add additional tasks when they are needed.

@gabegorelick

You can work around this by using DAEMON tasks (instead of REPLICA) and doing all scaling at the ASG level (instead of application auto scaling). Works OK if you only have one service per cluster, but it is kind of an abuse of daemonsets.

@coultn

coultn commented Apr 19, 2019

Hi everyone, we are actively researching this and have a proposed solution in mind. This solution would work as follows:

  • ECS will compute a new CloudWatch metric for each ECS cluster. The new metric is a measure of how "full" your cluster is relative to the tasks you are running (and want to run). The metric will only be less than 100% if you are guaranteed to have space for at least 1 more task of each service and RunTask group already running in your cluster. It will be greater than or equal to 100% only if you have at least one service or RunTask group that can NOT place any additional tasks in the cluster. The metric accounts not only for tasks already running, but new tasks that have not been placed yet. This means that, for example, if a service is trying to scale out and needs 8 times as many instances as the cluster currently has, the metric will be 800%.
  • ECS will automatically set up a target tracking scaling policy on your ECS cluster using this new metric. You can set a target value for the metric less than or equal to 100%. A target value of 100% means the cluster will only scale out if there is no more space in your cluster for at least one service or RunTask group. A target value of less than 100% means that the cluster will keep some space in reserve for additional tasks to run. This will give you faster task scaling performance, with some additional cost due to unused capacity. The target tracking policy will scale your cluster out and in with the goal of maintaining the metric at or near your target value. When scaling out, the target tracking scaling policy can scale to the correct cluster size in one step, because the metric reflects all of the tasks that you want to run but aren't yet running. It can even scale out from zero running instances.
  • When scaling in, automated instance protection will ensure that ECS is more intelligent about which instances get terminated, and automated instance draining (also for Spot) will ensure that your tasks have the opportunity to shut down cleanly.

Thoughts on this proposal? Please let us know!
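
To make the idea above concrete, here is a rough sketch of how such a "fullness" metric could be computed for a single service. This is only an illustration of the proposal (instances required for all desired tasks versus instances currently in the cluster), not the actual ECS calculation, and the function and parameter names are made up.

    # fullness = 100 * (instances required to place every desired task)
    #                / (instances currently in the cluster)
    def fullness_percent(desired_tasks: int, tasks_per_instance: int, current_instances: int) -> float:
        required_instances = -(-desired_tasks // tasks_per_instance)  # ceiling division
        if current_instances == 0:
            # Scaling out from zero: any desired task means capacity is needed.
            return float("inf") if desired_tasks else 0.0
        return 100.0 * required_instances / current_instances


    # Example from the proposal: a service that needs 8x the current instance
    # count reports 800%, so target tracking can scale out in a single step.
    print(fullness_percent(desired_tasks=80, tasks_per_instance=10, current_instances=1))  # 800.0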

@kalpik

kalpik commented Apr 20, 2019

This would be awesome!

@samdammers

Sign me up

@matthewcummings
Author

I love it.

@geethaRam

This is a much needed feature for ECS. Right now, users have to over-provision their cluster instances or implement custom engineering solutions using Lambdas/CloudWatch for scale-out and scale-in scenarios. Making cluster autoscaling aware of the services/tasks is absolutely necessary. While this may not be applicable to Fargate, it is still needed for ECS on EC2 use cases. I hope this gets prioritized and delivered; we have been waiting for this.

@masneyb

masneyb commented Apr 21, 2019

@coultn: I think your proposal will work just fine for clusters that start their tasks using ECS services. I have a few thoughts to keep in mind:

  • Maybe this is out of scope, but since you brought up the automated EC2 instance protection bit, I think that you should also take into consideration changes to the EC2 launch configuration (i.e. new AMI, instance type, etc) to help make the management of the ECS clusters easier. I link at the bottom of my comment to a CloudFormation template that does this for a cluster that runs batch jobs. For the clusters that run web applications, we wouldn't want the instance protection bit to be in the way when the fleet is rolled by autoscaling.

  • We have some QA clusters that run ECS services with development git branches of work that is currently in progress. These environments usually stick around for 8 hours after the last commit. Most of these environments hardly receive any traffic unless automated performance testing is in progress. Let's assume that we currently have X ECS services, and that all X of them have the same requirements from ECS (memory/CPU) for simplicity. Will the new CloudWatch metric tell us that we can start one copy of a task on just one of those services? So if the metric says we can start one, and if two ECS services try to scale out at the same time, then we'll encounter a stall scaling out the second service? Or, will the new metric tell us if we can start one copy of every ECS service that is currently configured? Hopefully it is the former since scaling policies can be configured to handle the latter case if needed.

  • This proposal won't work for ECS scheduled tasks. We have a cluster that runs over 200 cron-style jobs as ECS scheduled tasks for a legacy application. It's a mix of small and large jobs, and our ECS cluster typically doubles its number of EC2 instances during the parts of the day when more of the larger jobs are running. These jobs aren't set up as ECS services. Initially we used CloudWatch event rules to start the ECS tasks directly; however, a large number of jobs wouldn't start during some parts of the day because the run-task API call failed due to insufficient capacity in the cluster. To fix this, we still use CloudWatch event rules, but they send a message to an SQS queue that a Lambda function is subscribed to. The function tries to start the task, and if it fails due to insufficient capacity, it increases the desired count of the autoscaling group and tries again later (a condensed sketch of this retry loop follows below). The tasks are bin packed to help make scaling in easier. The jobs have a finite duration, so scaling in involves looking for empty instances, draining them, and then terminating them. I have a CloudFormation template that implements this use case at https://github.com/MoveInc/ecs-cloudformation-templates/blob/master/ECS-Batch-Cluster.template; it's fairly well commented at the top with more details, including how we handle launch configuration changes (for AMI updates, new EC2 instance types, etc).

I can beta test some of your proposed changes at my work if you're interested.
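
A condensed sketch (in Python/boto3) of the SQS-driven retry loop described in the third bullet above; the full, commented version is in the linked CloudFormation template. The cluster, ASG, and queue wiring here are placeholders, and the sketch assumes an SQS batch size of 1.

    import boto3

    ecs = boto3.client("ecs")
    autoscaling = boto3.client("autoscaling")

    CLUSTER = "batch-cluster"        # placeholder
    ASG_NAME = "batch-cluster-asg"   # placeholder


    def lambda_handler(event, context):
        # One SQS record per cron-style job; the message body names the task definition.
        for record in event["Records"]:
            task_definition = record["body"]
            response = ecs.run_task(cluster=CLUSTER, taskDefinition=task_definition, count=1)

            failures = response.get("failures", [])
            if any(f.get("reason", "").startswith("RESOURCE") for f in failures):
                # Not enough CPU/memory anywhere in the cluster: add an instance and
                # fail the invocation so the message returns to the queue and the job
                # is retried after the visibility timeout.
                group = autoscaling.describe_auto_scaling_groups(
                    AutoScalingGroupNames=[ASG_NAME]
                )["AutoScalingGroups"][0]
                autoscaling.set_desired_capacity(
                    AutoScalingGroupName=ASG_NAME,
                    DesiredCapacity=group["DesiredCapacity"] + 1,
                )
                raise RuntimeError(f"Insufficient capacity for {task_definition}; will retry")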

@waffleshop

@coultn Looks good. My organization implemented a system very similar to this. Do you have any more details regarding how your solution computes the new scaling metric? We currently base it on the largest container (cpu and memory) in the ECS cluster -- similar to this solution.

@zbintliff

Can you please clarify:

The metric will only be less than 100% if you are guaranteed to have space for at least 1 more task of each service and RunTask group already running in your cluster.

Does this mean that if I have a cluster with 10 services, the new metric will be over 100% if the cluster can't fit 1 additional task for each service combined (additive), i.e. a total overhead equal to the combined requirements of the 10 tasks? Or is it a "shared" overhead that essentially guarantees the service/task with the largest deployment can add one more?

@rothgar

rothgar commented Apr 22, 2019

Is the "full" metric CPU, Memory, connections, disk, something else? I feel like this type of metric makes sense for

  • Single or similar constraint services. (e.g. all CPU bound or all Memory bound)
  • Similar workload types (services vs batch)
  • Many small clusters vs large (mixed workload) clusters
  • Single ASG/instance type in the cluster

Can someone explain how the metric would work for large, mixed workload, multi-ASG clusters? If that's an anti-pattern for ECS it would also be good to know where the product roadmap is headed.

@sithmein

I second @masneyb's third point. We use ECS in combination with the Jenkins ECS plug-in to start containers (tasks) for every Jenkins job. The ECS plug-in is smart enough to retry tasks that failed due to insufficient resources. But I don't see how this new metric could be of much help in this case, since it still only looks at the current resource usage and not the required resources. Setting a threshold < 100% is only a heuristic.
Ideally - and I get that this is a more fundamental change - ECS would have a queue of pending tasks (like any other "traditional" queueing system) instead of immediately rejecting them. The length of the queue and its items' resource requirements could then easily be used to scale in and out.

@vimmis

vimmis commented Apr 23, 2019

This sounds good. Will the scale-out policy also take care of scaling with respect to AZ spread? That is, will the scaling activity start the new instance based on the AZ spread its tasks are looking for, or will it be random?

@talawahtech

@coultn sounds good overall, with the exception of one thing (which I may be misunderstanding).

  • ECS will automatically set up a target tracking scaling policy on your ECS cluster using this new metric. You can set a target value for the metric less than or equal to 100%. A target value of 100% means the cluster will only scale out if there is no more space in your cluster for at least one service or RunTask group. A target value of less than 100% means that the cluster will keep some space in reserve for additional tasks to run.

To me the statement in bold implies that when the cluster "fullness" metric is at 100% then there is still space for at least one more task, which is not what I would expect, especially since you are not allowed to set a target tracking metric of greater than 100%. What do you do if you actually want your cluster to be fully (efficiently) allocated?

As an example, let's say my cluster consists of 5 nodes, each with 2 vCPUs, running a single service where each task requires 1 vCPU of capacity.

My understanding of the current proposal is

  • 9 tasks -> 100%
  • 10 tasks -> more than 100%

My expectation of what the metric would be:

  • 9 tasks -> 90%
  • 10 tasks -> 100%

So ideally for me, at 10 tasks with 100% target tracking the ASG would be at steady state. If the ECS service tries to allocate an 11th task then the metric would go to 110% and target tracking would cause the ASG to start a 6th node. Now if I decide instead that I do want hot spare behavior, then I would set my target fullness to 90%.

To expound further on my use case, my intention would be to set target tracking at the ASG level to 100% allocation and then separately set target tracking at the ECS service level to a specific CPU utilization (30%, for example). So rather than having a spare node not doing anything, I would have all nodes active, but with sufficient CPU capacity to handle temporary spikes. If traffic gradually starts to climb and average CPU usage goes above 30%, then ECS would attempt to start more tasks and the ASG would start more nodes, and while the new nodes are starting up, there is still sufficient CPU headroom.

I definitely think you guys should make it easy for end users to determine the appropriate percentage for having one, two or three hot spares, since the math won't always be as simple as my example. But I think 100% utilization should be an option, even if you don't think it should be the default. Perhaps in the console you could auto-calculate and pre-fill the "1 hot spare" percentage for users, or at least pre-calculate some examples.
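
As a small worked example of the "the math won't always be simple" point, here is a sketch that computes a hot-spare target percentage under the expected semantics described above (fullness = wanted task slots / total task slots); the numbers mirror the 5-node, 2-vCPU example. This illustrates the request, not the metric ECS actually implements.

    # Target value for N hot spares under the expected semantics above:
    # fullness = 100 * (slots in use or wanted) / (total slots), so reserving
    # whole nodes means targeting the share of slots you allow to be used.
    def hot_spare_target(nodes: int, task_slots_per_node: int, spare_nodes: int) -> float:
        total_slots = nodes * task_slots_per_node
        reserved_slots = spare_nodes * task_slots_per_node
        return 100.0 * (total_slots - reserved_slots) / total_slots


    # 5 nodes x 2 vCPUs with 1-vCPU tasks: one spare node -> 80%, two -> 60%.
    # With unevenly sized tasks the right percentage is less obvious.
    print(hot_spare_target(nodes=5, task_slots_per_node=2, spare_nodes=1))  # 80.0
    print(hot_spare_target(nodes=5, task_slots_per_node=2, spare_nodes=2))  # 60.0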

@coultn

coultn commented Apr 23, 2019

Thanks for the comments/questions everyone! Some clarifications and additional details:

  • The metric will account for both service tasks, and tasks run directly via RunTask. Right now, tasks started with RunTask either start or they don't. So, for example, if you try to run 10 copies of a task and there is only capacity for 5, then 5 will start and 5 will not. The remaining 5 will not be retried unless you call RunTask again. Would it be helpful to have an option for RunTask where it would keep trying until capacity is available? If this option were available with RunTask, then the new metric would scale appropriately to all tasks, both service tasks and RunTask tasks, and scheduled tasks.
  • The metric DOES account for not only already-running tasks, but tasks that are not yet running (via the desired count of each service, and the count of the RunTask invocation, assuming the retry logic mentioned above). It is a precise measurement of what your cluster can run relative to what you want to run.
  • Equal to 100% means 'exactly full' for at least one service or RunTask invocation, and less than full for the rest. In other words, there is at least one service, or one set of tasks started with RunTask, for which the cluster has no additional capacity, but is exactly at capacity. If you set your target value to 100%, then the cluster might not scale until it completely runs out of resources for a service, and there are additional tasks that cannot be run.
  • Greater than 100% means that there is at least one service or RunTask invocation that wants to run more tasks than there is capacity in the cluster. This will always trigger a scale-out event regardless of the target value used in the target tracking scaling policy (assuming the target value is between 0-100).
  • Less than 100% means that each service or RunTask invocation has room for at least one more task. This does not mean that they all could add one more task at the same time. If your services tend to scale up or down at the same time, then you would want to account for that when configuring the target value for the metric. If you set the target value to less than 100%, then you always have spare capacity; however, depending on how quickly your services scale (or how quickly you call RunTask) you may still fill up the cluster. You are less likely to do so if you use a smaller target value. (Because target tracking scaling effectively rounds up to the next largest integer value, any target value less than 100% means you have capacity for at least 1 extra task, regardless of the task sizes).
  • The metric will accommodate both single-service and multi-service clusters. It looks at the capacity across all services (and RunTask invocations) and computes the maximum value. The services and tasks do not need to have the same resource requirements or constraints.
  • The metric is not explicitly aiming to solve the rebalancing problem. That is a separate feature.

@lattwood

lattwood commented May 8, 2019

@coultn Would it be helpful to have an option for RunTask where it would keep trying until capacity is available? If this option were available with RunTask, then the new metric would scale appropriately to all tasks, both service tasks and RunTask tasks, and scheduled tasks

YES. Currently we're investigating K8S because of this and other reasons.

@ronaldour

I'm not sure if this is a bug or if this should be a feature request, but I noticed that if a capacity provider is created with Managed Scaling DISABLED and Managed Instance Protection ENABLED, neither the managed instance protection nor the task "queuing" feature works (where a launched task goes into a PROVISIONING state before running).
I was trying to set up a custom scaling solution while taking advantage of the managed instance protection feature of the capacity provider.
You can reproduce this behavior by creating a capacity provider with managed scaling disabled and managed instance protection enabled (see the sketch after this list) and:

  • Launch a new task with no container instances available (by the way, I didn't use a launch type, which means it should use the default capacity provider strategy of the cluster). You will notice that the run_task command fails.
  • Try to scale in the cluster. Because you have to enable scale-in protection on the ASG in order to enable managed instance protection in the capacity provider, all the instances have instance protection set to true, and when you try to scale in, the protection isn't removed from the instances, so the ASG is not able to scale in. This doesn't happen when managed scaling is enabled.
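
For reference, a minimal sketch (in Python/boto3) of the capacity provider configuration being described, with managed scaling disabled and managed termination protection enabled; the provider name and ASG ARN are placeholders.

    import boto3

    ecs = boto3.client("ecs")

    # Capacity provider with managed scaling DISABLED but managed termination
    # protection ENABLED; the name and ASG ARN are placeholders. The ASG itself
    # must have new-instance scale-in protection enabled for this call to succeed.
    ecs.create_capacity_provider(
        name="custom-scaling-cp",
        autoScalingGroupProvider={
            "autoScalingGroupArn": "arn:aws:autoscaling:region:account:autoScalingGroup:uuid:autoScalingGroupName/my-asg",
            "managedScaling": {"status": "DISABLED"},
            "managedTerminationProtection": "ENABLED",
        },
    )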

@sithmein

I stumbled across the very same problem. I worked around it by enabling managed scaling but then removing the scaling policy from the ASG (the one that is automatically created once you associate the ASG with the capacity provider). Then new tasks are still "queued" but no automated scaling happens.
If you have a custom scaling application (which we do have, too), you can easily remove the scale-in protection from selected instances, so that isn't a real problem.

@ronaldour

Thanks for your suggestion.
I also found a similar solution. When you create a capacity provider with managed scaling enabled, ECS creates an autoscaling plan (you can check it with aws autoscaling-plans describe-scaling-plans), so I deleted the scaling plan and used my own policies. Doing this allows us to use custom scaling with the "queuing" feature and the "managed termination protection" feature (we need to protect instances with running tasks).
Although this is working fine for now, it feels like a hack and might not work in the future, so I would like to hear some feedback before creating the bug/feature request issue.
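
A sketch (in Python/boto3) of that workaround using the AWS Auto Scaling Plans API; the plan name is a placeholder to be filled in from the describe call, and, as noted, this relies on undocumented behavior.

    import boto3

    plans = boto3.client("autoscaling-plans")

    # Find the scaling plan ECS created when the capacity provider was attached
    # (pagination omitted), then delete it and attach your own scaling policies.
    # This relies on undocumented behavior and may stop working.
    for plan in plans.describe_scaling_plans()["ScalingPlans"]:
        print(plan["ScalingPlanName"], plan["ScalingPlanVersion"], plan["StatusCode"])

    plans.delete_scaling_plan(
        ScalingPlanName="plan-created-for-the-capacity-provider",  # placeholder from the listing above
        ScalingPlanVersion=1,
    )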

@coultn

coultn commented Jan 3, 2020

For those who are curious about more details on how ECS cluster auto scaling works, we just published a blog post that dives deep on that topic: https://aws.amazon.com/blogs/containers/deep-dive-on-amazon-ecs-cluster-auto-scaling/

@dferretti

@coultn Hey Nick question for you, and apologies if this has already been covered - I'm just getting my feet wet with all this (admittedly have not tried creating a test case yet). Does all this work about the same if my auto scaling group is based on weighted capacity instead of just instance count? For example if my main concern is memory usage and I define an ASG that can use several different instance types of 2/4/8GB of memory etc. Instead of moving my desired capacity as representing an instance count, I would rather move the desired capacity more along the lines of "move me from 50GB to 60GB total memory using any combination of these X types". Do the new capacity providers work with this type of ASG as well?

@coultn

coultn commented Jan 8, 2020

@dferretti-fig The short answer to your question is "no."

Longer answer: you can use multiple instance types in a single ASG with ECS cluster auto scaling, BUT you must set all of the weights to 1 (or not use weights). The metric calculation detailed in the blog assumes that the ASG capacity is equal to the number of instances, otherwise the metric calculation will be incorrect. When using multiple instance types, the scale out behavior falls back to a fixed scaling step size equal to the minimum scaling step size configured in the capacity provider.

The reason for this is that there is no guarantee that your tasks will be constrained on the same resource dimension across all instance types - in other words, you may be CPU bound on one instance type but ENI bound on another, so trying to scale out using a single capacity metric isn't possible in a way that is guaranteed to always work.

@dferretti

@coultn Gotcha - thanks for the response!

@kkopachev

@coultn Any plans to change it so it works with a weighted ASG? I see it as a valuable feature, especially in the case of Spot instances where I am willing to get a 2x instance for less money than a 1x.

I might not fully understand the problem with the guarantee, since it scales out using the minimum scaling step anyway. ASG instances could report their weighted capacity to the metric, so the metric could be weight-based, couldn't it? Even if tasks are constrained on different resources, in the worst case scaling would take minimal steps and re-evaluate the metric.

@coultn

coultn commented Jan 10, 2020

@kkopachev you can use it with Spot, including with mixed instance types. The only caveat is that you must not set the weights (or set them to 1), and scaling will be a bit slower than if you used a single instance type. But you can use any combination of instances with ECS cluster auto scaling (as long as the instances can actually run your tasks, of course).

@idubinskiy

@coultn To be fair, not being able to set weights and falling back to step-scaling are pretty big caveats. We were hoping that this feature would allow us to move off of our custom scaler Lambda solution, but the trade-offs we'd have to make in terms of responsiveness and cost-optimization don't seem worthwhile at this point.

@kkopachev

@coultn The blog article says that CAS falls back to min-step scaling even in the case of a single instance type across multiple AZs? Is that correct? That basically means it will always use min-step scaling unless your ASG is restricted to one instance type in one AZ, right?

@coultn

coultn commented Jan 17, 2020

@kkopachev correct.

@thkong

thkong commented Jan 22, 2020

@coultn I want to know whether ECS blue/green deployment can be used with CAS. If it can't be used now, is there any plan for it? We need to deploy new tasks to scaled-out instances with blue/green, and we also want to use CAS.

@regevbr

regevbr commented Jan 25, 2020

TLDR: this feature does not work when using an ECS service with auto scaling, just for manually placed tasks.

I have just set up a new cluster according to the tutorial.
Everything worked like a charm.
I then went to add a real service to the cluster, after removing the tasks created in the tutorial.
The service is set to rolling update, has auto scaling rules, and has a placement constraint of 1 per host.
But when I added the new service, the auto scaling never happened...

The event log is stuck on:
service X was unable to place a task because no container instance met all of its requirements. Reason: No Container Instances were found in your cluster. For more information, see the Troubleshooting section.

It seems that ECS cluster auto scaling only works when you place a task directly. The service auto scaling fails to place a task, thus no cluster auto scaling actually takes place...

This is either a major bug, or I'm missing something in the configuration of my service.

@coultn Can you please advise?

@coultn

coultn commented Jan 25, 2020

Service auto scaling works with ECS cluster auto scaling; that is not the issue. This appears to be a bug specifically with the distinctInstance placement constraint: services that have a placement constraint of this type are not triggering scaling correctly. We will track this as a separate issue on GitHub.

@regevbr

regevbr commented Jan 25, 2020

@coultn thanks! Please post the ticket ID here when you can.

@coultn

coultn commented Jan 25, 2020

#723

@ian-axelrod

Hi,

I am not using a 'distinctInstance' constraint nor do I have a service autoscaling policy, yet I encounter the same issue @regevbr describes. Tasks are also not placed if there is insufficient memory in the cluster, and I do not see them in a 'PROVISIONING' state. I have confirmed that my cluster uses a capacity provider that has 'managed scaling' enabled:

                "managedScaling": {
                    "status": "ENABLED",
                    "targetCapacity": 100,
                    "minimumScalingStepSize": 1,
                    "maximumScalingStepSize": 10000
                },

My ASG is multi-AZ, multi-instance type, using spot instances. What am I doing wrong?

@pavneeta

I am not using a 'distinctInstance' constraint nor do I have a service autoscaling policy, yet I encounter the same issue @regevbr describes.

Hi @ian-axelrod, thanks for surfacing the issue that you are facing. Would you be willing to share more details via email so that we can dig deeper into it? Could you send me your log files, cluster ID, and AWS account ID at [email protected]? Happy to work with you to resolve the issue.

@pavneeta pavneeta self-assigned this Mar 13, 2020
@kiwibel

kiwibel commented Apr 28, 2020

Hi @pavneeta
Has this been resolved?
We have just faced a similar issue. New tasks are being placed on existing hosts based on the existing service auto scaling, but when there are not enough resources available we get:
unable to place a task because no container instance met all of its requirements
After that, cluster auto scaling never updates the CapacityProviderReservation metric and the desired capacity for the ASG never increases. No tasks go into the PROVISIONING state either.
The capacity provider is created and attached to the cluster OK.

@toredash

@kiwibel There is a known bug (https://github.com/aws/containers-roadmap/issues/7239) with using capacity providers and ECS services with the distinctInstance constraint. It does not work.

The workaround for now is assigning a hostPort in the task definition and removing the constraint altogether.

Also, an ECS service with a capacity provider will never go into READY mode when it has a distinctInstance constraint.
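
As an illustration of the hostPort workaround mentioned above: with bridge networking, a fixed hostPort means at most one copy of the task can run per instance, so the distinctInstance constraint can be dropped. The family, image, and ports in this sketch (Python/boto3) are placeholders.

    import boto3

    ecs = boto3.client("ecs")

    # Task definition using a fixed hostPort so at most one copy of the task can
    # run per instance; family, image, and ports are placeholders.
    ecs.register_task_definition(
        family="web-app",
        networkMode="bridge",
        containerDefinitions=[{
            "name": "web",
            "image": "nginx:latest",
            "memory": 512,
            "portMappings": [{
                "containerPort": 80,
                "hostPort": 8080,   # static host port: effectively one task per instance
                "protocol": "tcp",
            }],
        }],
    )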

@pavneeta

pavneeta commented Apr 28, 2020

@kiwibel Could you please confirm whether you are using the distinctInstance placement constraint? Would it be possible to share your placement configuration?

@kiwibel There is a known bug (https://github.com/aws/containers-roadmap/issues/7239) with using capacity providers and ECS services with the distinctInstance constraint. It does not work.

The workaround for now is assigning a hostPort in the task definition and removing the constraint altogether.

Also, an ECS service with a capacity provider will never go into READY mode when it has a distinctInstance constraint.

Hi Tore - Yes, you are correct, this is a known issue and we are currently working on deploying a fix.

@kiwibel

kiwibel commented Apr 28, 2020

Thanks @toredash @pavneeta
We actually don't use distinctInstance but memberOf. This does not seem to work either 🤷‍♂️
My other guess is that it has something to do with the mixed Spot/on-demand instances in the underlying ASG. Here's how the placement constraint/strategy looks for our service:

            "placementConstraints": [
                {
                    "type": "memberOf",
                    "expression": "attribute:ecs.availability-zone in [us-west-2a, us-west-2b, us-west-2c]"
                }
            ],
            "placementStrategy": [
                {
                    "type": "spread",
                    "field": "instanceId"
                }
            ],

@toredash

@kiwibel I haven't tested with memberOf. For that specific use case I would use spread on AZ instead of using memberOf. But if you need to limit to specific AZs, that won't work.

I would create a stripped-down busybox example service with that placementConstraint setting; if it does not work, file a support ticket with AWS with the example code. That's what I did to get confirmation that it was a bug.

@kiwibel

kiwibel commented May 12, 2020

Thanks again @toredash

It turned out autoscaling didn't work because existing services need to be updated to use a capacity provider strategy. I was under the impression they would automatically pick up the default one from the cluster 😳

On a separate note, after configuring this in production we have faced an issue where the underlying ASG sometimes won't scale in down to the minimum number of instances even though the corresponding CloudWatch alarm went off. AWS support is currently investigating this.

Cheers

@pramshar

Thanks again @toredash

It turned out autoscaling didn't work because existing services need to be updated to use a capacity provider strategy. I was under the impression they would automatically pick up the default one from the cluster 😳

On a separate note, after configuring this in production we have faced an issue where the underlying ASG sometimes won't scale in down to the minimum number of instances even though the corresponding CloudWatch alarm went off. AWS support is currently investigating this.

Cheers

What do you have to update in the services to use the capacity provider? I am under the impression that if we are not using the launchType field then it will use the default cluster strategy.

@toredash

Thanks again @toredash
It turned out autoscaling didn't work because existing services need to be updated to use a capacity provider strategy. I was under the impression they would automatically pick up the default one from the cluster 😳
On a separate note, after configuring this in production we have faced an issue where the underlying ASG sometimes won't scale in down to the minimum number of instances even though the corresponding CloudWatch alarm went off. AWS support is currently investigating this.
Cheers

What do you have to update in the services to use the capacity provider? I am under the impression that if we are not using the launchType field then it will use the default cluster strategy.

I believe that is true only when you create a new service. Updating an existing service from the GUI does not assign it the default capacity provider.
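
For anyone updating existing services outside the console, here is a sketch (in Python/boto3) of attaching a capacity provider strategy to a service explicitly; the cluster, service, and capacity provider names are placeholders, and the change takes effect with a forced new deployment.

    import boto3

    ecs = boto3.client("ecs")

    # Attach a capacity provider strategy to an existing service explicitly;
    # cluster, service, and capacity provider names are placeholders.
    ecs.update_service(
        cluster="my-cluster",
        service="my-service",
        capacityProviderStrategy=[{
            "capacityProvider": "my-capacity-provider",
            "weight": 1,
        }],
        # The strategy change only takes effect with a new deployment.
        forceNewDeployment=True,
    )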
