Cluster Autoscaler support for AWS EC2 attribute-based instance selection #5580

Closed
youwalther65 opened this issue Mar 9, 2023 · 17 comments
Labels
area/cluster-autoscaler, area/provider/aws, kind/feature, lifecycle/rotten

Comments

@youwalther65

youwalther65 commented Mar 9, 2023

Which component are you using?: Cluster Autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
AWS EC2 has a rich set of instance types. AWS attribute-based instance selection, described here, provides an easy way to specify instance selection for an Auto Scaling Group by stating, for example, the required number of vCPUs and amount of memory.
The following Terraform example uses this in an EKS module self-managed node group:

    smng-mixed = {
      name = "smng-mixed"

      use_mixed_instances_policy = true
      mixed_instances_policy = {
        instances_distribution = {
          on_demand_base_capacity                  = 0
          on_demand_percentage_above_base_capacity = 0
          spot_allocation_strategy                 = "price-capacity-optimized"
          # SpotInstancePools option is only available with the lowest-price allocation strategy
          #spot_instance_pools                      = 2
        }

        # does not work with Cluster Autoscaler because it can't build a proper template node :-(
        # this is a list so commas are mandatory
        override = [
          {
            # attribute-based instance selection
            # this is a map so commas are optional
            instance_requirements = {
                vcpu_count = {
                  min = 4
                  max = 4
                },
                memory_mib = {
                  min = 16384
                  max = 16384
                },
                burstable_performance = "excluded",
                excluded_instance_types = ["d*","g*","x*","z*"],
            }
          },
          # ...
        ]
      }
    }

Describe the solution you'd like.: At the moment Cluster Autoscaler is not able to create a node template and raises the following error in the leader's logs:

$ k logs -n kube-system cluster-autoscaler-aws-cluster-autoscaler-xxx

E0308 15:38:05.516606       1 mixed_nodeinfos_processor.go:151] Unable to build proper template node for smng-mixed-2023022810062797600000002d: ASG "smng-mixed-2023022810062797600000002d" uses the unknown EC2 instance type ""
E0308 15:38:05.516615       1 static_autoscaler.go:290] Failed to get node infos for groups: ASG "smng-mixed-2023022810062797600000002d" uses the unknown EC2 instance type ""

Describe any alternative solutions you've considered.: Develop a way to build a proper template node by either:

  • calling the AWS EC2 API GetInstanceTypesFromInstanceRequirements, or
  • using the vCPU, memory, and other information from InstanceRequirements in the LaunchTemplate or the Auto Scaling group's MixedInstancesPolicy object (see the CLI sketch after this list)
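
For illustration, a minimal AWS CLI sketch of the first alternative: resolving the requirement values from this issue into concrete instance types (assuming x86_64/hvm; the region is taken from the ASG ARN shown further below):

$ aws ec2 get-instance-types-from-instance-requirements \
    --region eu-west-1 \
    --architecture-types x86_64 \
    --virtualization-types hvm \
    --instance-requirements '{"VCpuCount":{"Min":4,"Max":4},"MemoryMiB":{"Min":16384,"Max":16384},"ExcludedInstanceTypes":["g*","d*","z*","x*"],"BurstablePerformance":"excluded"}' \
    --query 'InstanceTypes[].InstanceType' \
    --output text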

Additional context.: N/A

@youwalther65 youwalther65 added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 9, 2023
@bwagner5

This should have been implemented in PR #4588

What version of CAS are you using?

@bwagner5

Linking this issue since the feature appears to be working in @bpineau's case based on the PR's testing:

#5550

@youwalther65 (Author)

youwalther65 commented Mar 10, 2023

@bwagner5 When looking at the Terraform code, the instance requirements are in the ASG LT override, not the LT itself. Could this be the reason? This is the easy way to use the EKS module's self-managed node groups. If only the LT is queried, then one has to use either a custom LT or AWS provider resources instead.

@bwagner5

It should work both in the LT and as an LT ASG override. Are you able to try it with an LT instead of an LT override, though, just to see?
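
For reference, a minimal sketch of placing the requirements in the launch template itself rather than in an ASG override (the template name and AMI here are hypothetical placeholders):

$ aws ec2 create-launch-template \
    --launch-template-name smng-mixed-lt-with-requirements \
    --launch-template-data '{"ImageId":"ami-<redacted>","InstanceRequirements":{"VCpuCount":{"Min":4,"Max":4},"MemoryMiB":{"Min":16384,"Max":16384},"BurstablePerformance":"excluded"}}'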

@youwalther65 (Author)

youwalther65 commented Mar 11, 2023

@bwagner5 I saw that your PR was merged into CAS 1.25 and most probably not backported to older versions of CAS, can you please confirm?
I just checked and it seems I still had the v1.24.0 image of CAS (I just recently switched to EKS 1.25 after its release).
Now I switched to the CAS image v1.25.0 and the error is no longer visible.

Here is the data:

Latest Helm chart:

$ helm list -n kube-system | grep cluster-autoscaler
cluster-autoscaler              kube-system     4               2023-03-10 07:26:44.405431457 +0000 UTC deployed        cluster-autoscaler-9.26.0                      1.24.0

Image:

$ k get deploy -n kube-system cluster-autoscaler-aws-cluster-autoscaler -o yaml |  yq e '.spec.template.spec.containers[0].image'
registry.k8s.io/autoscaling/cluster-autoscaler:v1.25.0

EKS version:

$ k version --short
...
Server Version: v1.25.6-eks-48e63af

ASG info:

$ aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names smng-mixed-2023022810062797600000002d
{
    "AutoScalingGroups": [
        {
            "AutoScalingGroupName": "smng-mixed-2023022810062797600000002d",
            "AutoScalingGroupARN": "arn:aws:autoscaling:eu-west-1:<redacted>:autoScalingGroup:<redacted>:autoScalingGroupName/smng-mixed-2023022810062797600000002d",
            "MixedInstancesPolicy": {
                "LaunchTemplate": {
                    "LaunchTemplateSpecification": {
                        "LaunchTemplateId": "lt-090929890da8f991b",
                        "LaunchTemplateName": "smng-mixed-20230228100627260600000022",
                        "Version": "1"
                    },
                    "Overrides": [
                        {
                            "InstanceRequirements": {
                                "VCpuCount": {
                                    "Min": 4,
                                    "Max": 4
                                },
                                "MemoryMiB": {
                                    "Min": 16384,
                                    "Max": 16384
                                },
                                "ExcludedInstanceTypes": [
                                    "g*",
                                    "d*",
                                    "z*",
                                    "x*"
                                ],
                                "BurstablePerformance": "excluded"
                            }
                        }
                    ]
                },
...

@bwagner5

Yes, that is correct.

@spr-mweber3

Unfortunately, I am running into the same issue, even with the latest chart version and the latest 1.26.1 release of the autoscaler. I upgraded from 1.24.0 to see if the problem is gone now, but unfortunately that doesn't seem to be the case.

In my case I'm also using attribute-based selection of EC2 instance types.

static_autoscaler.go:290] Failed to get node infos for groups: ASG "eks1-euc1-stg-etc" uses the unknown EC2 instance type ""
mixed_nodeinfos_processor.go:151] Unable to build proper template node for ...

This issue only occurs if the ASG is scaled to 0 when the autoscaler is starting up. As soon as I scale up to 1 and restart the autoscaler, it will work.

Someone else also raised a question here: https://devops.stackexchange.com/questions/16833/cluster-autoscaler-crash-unable-to-build-proper-template-node

@youwalther65 (Author)

youwalther65 commented Mar 15, 2023

@spr-mweber3 It worked for me even on 1.25.0.
I used a self-managed node group with taints and tolerations and added those as ASG tags, as required for scale-from-0 (see the tagging sketch below). For managed node groups, CAS just needs the eks:DescribeNodegroup IAM permission to recognize labels and taints.
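
For illustration, a minimal sketch of such tags using CAS's documented k8s.io/cluster-autoscaler/node-template/ tag-key convention (the taint and label names here are hypothetical; the ASG name is the one from this issue):

$ aws autoscaling create-or-update-tags --tags \
    "ResourceId=smng-mixed-2023022810062797600000002d,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/taint/dedicated,Value=special:NoSchedule,PropagateAtLaunch=false" \
    "ResourceId=smng-mixed-2023022810062797600000002d,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/node-role,Value=special,PropagateAtLaunch=false"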

CAS leader log excerpt:
...
I0315 12:57:46.750288       1 expiration_cache.go:103] Entry smng-mixed-2023022810062797600000002d: {name:smng-mixed-2023022810062797600000002d instanceType:m4.xlarge} has expired
...
I0315 13:04:11.967399       1 scale_up.go:477] Best option to resize: smng-mixed-2023022810062797600000002d
I0315 13:04:11.967414       1 scale_up.go:481] Estimated 1 nodes needed in smng-mixed-2023022810062797600000002d
I0315 13:04:11.967440       1 scale_up.go:601] Final scale-up plan: [{smng-mixed-2023022810062797600000002d 0->1 (max: 3)}]
I0315 13:04:11.967461       1 scale_up.go:700] Scale-up: setting group smng-mixed-2023022810062797600000002d size to 1
I0315 13:04:11.967485       1 auto_scaling_groups.go:248] Setting asg smng-mixed-2023022810062797600000002d size to 1
I0315 13:04:11.967780       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"12800f36-344b-4f5e-8e32-1f39380d60db", APIVersion:"v1", ResourceVersion:"21128420", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group smng-mixed-2023022810062797600000002d size to 1 instead of 0 (max: 3)
I0315 13:04:12.118636       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"12800f36-344b-4f5e-8e32-1f39380d60db", APIVersion:"v1", ResourceVersion:"21128420", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: group smng-mixed-2023022810062797600000002d size set to 1 instead of 0 (max: 3)
I0315 13:04:12.125654       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"inflate-multi-az-system-comp-6b67d44f55-86jnd", UID:"344075cb-9762-4594-9f73-e9a3171b53f7", APIVersion:"v1", ResourceVersion:"21128440", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{smng-mixed-2023022810062797600000002d 0->1 (max: 3)}]
I0315 13:04:12.133217       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"inflate-multi-az-system-comp-6b67d44f55-gcxgr", UID:"64c3a596-3b19-45e1-91c2-26a049d64473", APIVersion:"v1", ResourceVersion:"21128436", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{smng-mixed-2023022810062797600000002d 0->1 (max: 3)}]

The SMNG uses the Terraform code I showed in the initial comment.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 13, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 13, 2023
@Joldnine

I have the same issue. May I know if there is any update on this thread? Is it resolved in later releases?

@Shubham82 (Contributor)

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 9, 2024
@Shubham82 (Contributor)

/area provider/aws

@k8s-ci-robot k8s-ci-robot added the area/provider/aws Issues or PRs related to aws provider label Jan 9, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 8, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 8, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 7, 2024
@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
