[AWS][Cluster Autoscale] Cluster Autoscaler is randomly adding and deleting nodes in the node groups, results in uneven node distribution across different zones #3082
Comments
We are seeing very similar behaviour in multiple clusters. Each cluster has 3 ASGs (3 AZs) with varying maximum and minimum instance numbers. Some current ASG numbers:
Command:
Digging a little into the CA source code, I've seen this
Meaning, if I understood correctly how CA works, that each ASG is a node group in itself, instead of grouping by tags or name pattern or similar... That said, the behaviour seen on
and I've observed these log entries.
reaching a more evenly distributed cluster across AZs using a quick test...
Set resource requests and limits. I'm using
Sit back and relax... in the end, the chosen ASG was scaled up evenly.
(Please bear in mind that we have only been running this CA version for a couple of hours... I'll repeat the test a couple of times over this week and see what happens.) At the moment we have 3 different ASGs with the same configuration, one per AZ, but the AWS docs say that
which makes specific reference to StatefulSets, but do the same rules apply to other workloads...? I hope this helps someone to throw some light on this issue.
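For anyone wanting to try the balancing behaviour described above, here is a minimal sketch of the relevant Cluster Autoscaler container arguments. The flag names are real Cluster Autoscaler flags; the image tag and `<CLUSTER_NAME>` are placeholders, not values from this thread:

```yaml
# Hypothetical excerpt from a cluster-autoscaler Deployment spec.
# Flag names are real CA flags; <CLUSTER_NAME> and the image tag are
# placeholders you would substitute with your own values.
containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.18.1
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      # Treat node groups with identical instance resources and labels as
      # one pool and keep their sizes balanced during scale-up:
      - --balance-similar-node-groups
      # Discover one ASG per AZ via tags instead of listing them by name:
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<CLUSTER_NAME>
      - --expander=least-waste
```

Note that `--balance-similar-node-groups` only balances scale-ups; it does not actively move existing nodes between zones.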
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Although I have the same labels for all ASGs and the same instance size, and have also enabled balance-similar-node-groups, Cluster Autoscaler does not balance or evenly distribute instances. We have 4 ASGs, one ASG per AZ.
There are 6 instances in ASG-A, 2 in ASG-B, 1 in ASG-C, and 1 in ASG-D.
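Part of the confusion in cases like this is what CA considers "similar". Below is a rough, simplified illustration of that comparison, not CA's actual implementation (which lives in the processors/nodegroupset package of the CA repo): node groups must have matching capacity and matching labels, except for a few zone/hostname labels that are deliberately ignored so that per-AZ groups can still match. The label names used are the standard Kubernetes topology labels; the ASG data is made up:

```python
# Simplified sketch of Cluster Autoscaler's node-group similarity check.
# NOT the real implementation; see processors/nodegroupset in the CA repo.

# Zone/hostname labels differ between per-AZ groups by design, so the
# real comparator excludes them; we mimic that here.
IGNORED_LABELS = {
    "kubernetes.io/hostname",
    "topology.kubernetes.io/zone",
    "failure-domain.beta.kubernetes.io/zone",
}

def similar(group_a: dict, group_b: dict) -> bool:
    """Return True if two node groups would be balanced together."""
    if group_a["capacity"] != group_b["capacity"]:
        return False
    labels_a = {k: v for k, v in group_a["labels"].items() if k not in IGNORED_LABELS}
    labels_b = {k: v for k, v in group_b["labels"].items() if k not in IGNORED_LABELS}
    return labels_a == labels_b

# Hypothetical per-AZ ASGs:
asg_a = {"capacity": {"cpu": 4, "memory_gib": 16},
         "labels": {"role": "worker", "topology.kubernetes.io/zone": "eu-west-1a"}}
asg_b = {"capacity": {"cpu": 4, "memory_gib": 16},
         "labels": {"role": "worker", "topology.kubernetes.io/zone": "eu-west-1b"}}
asg_c = {"capacity": {"cpu": 8, "memory_gib": 32},
         "labels": {"role": "worker", "topology.kubernetes.io/zone": "eu-west-1c"}}

print(similar(asg_a, asg_b))  # zones differ but are ignored -> True
print(similar(asg_a, asg_c))  # different instance size -> False
```

The takeaway: if any non-ignored label or the instance resources differ between the ASGs, CA will not group them for balancing, even with the flag enabled.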
Container Configuration:
containers:
Config map logs:
status: |+
Cluster-autoscaler status at 2020-04-23 09:44:25.020239986 +0000 UTC:
Cluster-wide:
Health: Healthy (ready=12 unready=0 notStarted=0 longNotStarted=0 registered=13 longUnregistered=0)
LastProbeTime: 2020-04-23 09:44:25.016382263 +0000 UTC m=+25908.679659720
LastTransitionTime: 2020-04-23 02:33:05.172110061 +0000 UTC m=+28.835387548
ScaleUp: NoActivity (ready=12 registered=13)
LastProbeTime: 2020-04-23 09:44:25.016382263 +0000 UTC m=+25908.679659720
LastTransitionTime: 2020-04-23 09:33:51.210503791 +0000 UTC m=+25274.873781228
ScaleDown: NoCandidates (candidates=0)
LastProbeTime: 2020-04-23 09:44:25.016382263 +0000 UTC m=+25908.679659720
LastTransitionTime: 2020-04-23 09:43:44.950793146 +0000 UTC m=+25868.614070533
NodeGroups:
Name: eks-travel-qa-subnet-1e46c654-workers-NodeGroup-PBB6IJ6ZNHKF
Health: Healthy (ready=0 unready=0 notStarted=0 longNotStarted=0 registered=0 longUnregistered=0 cloudProviderTarget=0 (minSize=0, maxSize=0))
LastProbeTime: 0001-01-01 00:00:00 +0000 UTC
LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
ScaleUp: NoActivity (ready=0 cloudProviderTarget=0)
LastProbeTime: 0001-01-01 00:00:00 +0000 UTC
LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
ScaleDown: NoCandidates (candidates=0)
LastProbeTime: 2020-04-23 09:44:25.016382263 +0000 UTC m=+25908.679659720
LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
Name: eks-travel-qa-subnet-4d4d8c2a-workers-NodeGroup-PC6EPZSXRVAT
Health: Healthy (ready=0 unready=0 notStarted=0 longNotStarted=0 registered=0 longUnregistered=0 cloudProviderTarget=0 (minSize=0, maxSize=0))
LastProbeTime: 0001-01-01 00:00:00 +0000 UTC
LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
ScaleUp: NoActivity (ready=0 cloudProviderTarget=0)
LastProbeTime: 0001-01-01 00:00:00 +0000 UTC
LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
ScaleDown: NoCandidates (candidates=0)
LastProbeTime: 2020-04-23 09:44:25.016382263 +0000 UTC m=+25908.679659720
LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
Name: eks-travel-qa-subnet-9e48b5c2-workers-NodeGroup-J61VTAEY3A7U
Health: Healthy (ready=0 unready=0 notStarted=0 longNotStarted=0 registered=0 longUnregistered=0 cloudProviderTarget=0 (minSize=0, maxSize=0))
LastProbeTime: 0001-01-01 00:00:00 +0000 UTC
LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
ScaleUp: NoActivity (ready=0 cloudProviderTarget=0)
LastProbeTime: 0001-01-01 00:00:00 +0000 UTC
LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
ScaleDown: NoCandidates (candidates=0)
LastProbeTime: 2020-04-23 09:44:25.016382263 +0000 UTC m=+25908.679659720
LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
Name: eks-travel-qa-subnet-faf608d4-workers-NodeGroup-NW2O6EXRYLO
Health: Healthy (ready=0 unready=0 notStarted=0 longNotStarted=0 registered=0 longUnregistered=0 cloudProviderTarget=0 (minSize=0, maxSize=0))
LastProbeTime: 0001-01-01 00:00:00 +0000 UTC
LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
ScaleUp: NoActivity (ready=0 cloudProviderTarget=0)
LastProbeTime: 0001-01-01 00:00:00 +0000 UTC
LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
ScaleDown: NoCandidates (candidates=0)
LastProbeTime: 2020-04-23 09:44:25.016382263 +0000 UTC m=+25908.679659720
LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
Any help would be appreciated. Thanks.