Cluster Autoscaler support for AWS EC2 attribute-based instance selection #5580

Closed
youwalther65 opened this issue Mar 9, 2023 · 17 comments
Labels
area/cluster-autoscaler, area/provider/aws, kind/feature, lifecycle/rotten

Comments

@youwalther65

youwalther65 commented Mar 9, 2023

Which component are you using?: Cluster Autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
AWS EC2 has a rich set of instance types. AWS attribute-based instance selection, described here, provides an easy way to specify instance selection for an Auto Scaling Group by stating, for example, the required number of vCPUs and amount of memory.
The following Terraform example uses this in an EKS module self-managed node group:

    smng-mixed = {
      name = "smng-mixed"

      use_mixed_instances_policy = true
      mixed_instances_policy = {
        instances_distribution = {
          on_demand_base_capacity                  = 0
          on_demand_percentage_above_base_capacity = 0
          spot_allocation_strategy                 = "price-capacity-optimized"
          # SpotInstancePools option is only available with the lowest-price allocation strategy
          #spot_instance_pools                      = 2
        }

        # does not work with Cluster Autoscaler because it can't build a proper template node :-(
        # this is a list so commas are mandatory
        override = [
          {
            # attribute-based instance selection
            # this is a map so commas are optional
            instance_requirements = {
                vcpu_count = {
                  min = 4
                  max = 4
                },
                memory_mib = {
                  min = 16384
                  max = 16384
                },
                burstable_performance = "excluded",
                excluded_instance_types = ["d*","g*","x*","z*"],
            }
          },
          # ...
        ]
      }
    }

Describe the solution you'd like.: At the moment Cluster Autoscaler is not able to create a node template and raises the following error in the leader's logs:

$ k logs -n kube-system cluster-autoscaler-aws-cluster-autoscaler-xxx

E0308 15:38:05.516606       1 mixed_nodeinfos_processor.go:151] Unable to build proper template node for smng-mixed-2023022810062797600000002d: ASG "smng-mixed-2023022810062797600000002d" uses the unknown EC2 instance type ""
E0308 15:38:05.516615       1 static_autoscaler.go:290] Failed to get node infos for groups: ASG "smng-mixed-2023022810062797600000002d" uses the unknown EC2 instance type ""

Describe any alternative solutions you've considered.: Develop a way to build a proper template node by either:

  • calling the AWS EC2 API GetInstanceTypesFromInstanceRequirements, or
  • using the vCPU, memory, and other information from InstanceRequirements in the LaunchTemplate or the Auto Scaling group's MixedInstancesPolicy object (see the CLI sketch after this list)
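
For illustration, a minimal AWS CLI sketch of the first alternative: resolving the requirement values from this issue into concrete instance types (assuming x86_64/hvm; the region is taken from the ASG ARN shown further below):

$ aws ec2 get-instance-types-from-instance-requirements \
    --region eu-west-1 \
    --architecture-types x86_64 \
    --virtualization-types hvm \
    --instance-requirements '{"VCpuCount":{"Min":4,"Max":4},"MemoryMiB":{"Min":16384,"Max":16384},"ExcludedInstanceTypes":["g*","d*","z*","x*"],"BurstablePerformance":"excluded"}' \
    --query 'InstanceTypes[].InstanceType' \
    --output text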

Additional context.: N/A

@youwalther65 youwalther65 added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 9, 2023
@bwagner5

This should have been implemented in PR #4588

What version of CAS are you using?

@bwagner5

Linking this issue since the feature appears to be working in @bpineau's case based on the PR's testing:

#5550

@youwalther65 (Author)

youwalther65 commented Mar 10, 2023

@bwagner5 When looking at the Terraform code, the instance requirements are in the ASG LT override, not the LT itself. Could this be the reason? This is the easy way to use the EKS module's self-managed node groups. If only the LT is queried, then one has to use either a custom LT or AWS provider resources instead.

@bwagner5

It should work both in the LT and as an LT ASG override. Are you able to try it with an LT instead of an LT override, though, just to see?
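
For reference, a minimal sketch of placing the requirements in the launch template itself rather than in an ASG override (the template name and AMI here are hypothetical placeholders):

$ aws ec2 create-launch-template \
    --launch-template-name smng-mixed-lt-with-requirements \
    --launch-template-data '{"ImageId":"ami-<redacted>","InstanceRequirements":{"VCpuCount":{"Min":4,"Max":4},"MemoryMiB":{"Min":16384,"Max":16384},"BurstablePerformance":"excluded"}}'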

@youwalther65 (Author)

youwalther65 commented Mar 11, 2023

@bwagner5 I saw that your PR was merged into CAS 1.25 and most probably not backported to older versions of CAS, can you please confirm?
I just checked and it seems I still had the v1.24.0 image of CAS (I just recently switched to EKS 1.25 after its release).
Now I switched to the CAS image v1.25.0 and the error is no longer visible.

Here is the data:

Latest Helm chart:

$ helm list -n kube-system | grep cluster-autoscaler
cluster-autoscaler              kube-system     4               2023-03-10 07:26:44.405431457 +0000 UTC deployed        cluster-autoscaler-9.26.0                      1.24.0

Image:

$ k get deploy -n kube-system cluster-autoscaler-aws-cluster-autoscaler -o yaml |  yq e '.spec.template.spec.containers[0].image'
registry.k8s.io/autoscaling/cluster-autoscaler:v1.25.0

EKS version:

$ k version --short
...
Server Version: v1.25.6-eks-48e63af

ASG info:

$ aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names smng-mixed-2023022810062797600000002d
{
    "AutoScalingGroups": [
        {
            "AutoScalingGroupName": "smng-mixed-2023022810062797600000002d",
            "AutoScalingGroupARN": "arn:aws:autoscaling:eu-west-1:<redacted>:autoScalingGroup:<redacted>:autoScalingGroupName/smng-mixed-2023022810062797600000002d",
            "MixedInstancesPolicy": {
                "LaunchTemplate": {
                    "LaunchTemplateSpecification": {
                        "LaunchTemplateId": "lt-090929890da8f991b",
                        "LaunchTemplateName": "smng-mixed-20230228100627260600000022",
                        "Version": "1"
                    },
                    "Overrides": [
                        {
                            "InstanceRequirements": {
                                "VCpuCount": {
                                    "Min": 4,
                                    "Max": 4
                                },
                                "MemoryMiB": {
                                    "Min": 16384,
                                    "Max": 16384
                                },
                                "ExcludedInstanceTypes": [
                                    "g*",
                                    "d*",
                                    "z*",
                                    "x*"
                                ],
                                "BurstablePerformance": "excluded"
                            }
                        }
                    ]
                },
...

@bwagner5

Yes, that is correct.

@spr-mweber3

Unfortunately, I am running into the same issue, even with the latest chart version and the latest 1.26.1 release of the autoscaler. I upgraded from 1.24.0 to see if the problem is gone now, but unfortunately that doesn't seem to be the case.

In my case I'm also using attribute-based selection of EC2 instance types.

static_autoscaler.go:290] Failed to get node infos for groups: ASG "eks1-euc1-stg-etc" uses the unknown EC2 instance type ""
mixed_nodeinfos_processor.go:151] Unable to build proper template node for ...

This issue only occurs if the ASG is scaled to 0 when the autoscaler is starting up. As soon as I scale up to 1 and restart the autoscaler, it will work.

Someone else also raised a question here: https://devops.stackexchange.com/questions/16833/cluster-autoscaler-crash-unable-to-build-proper-template-node

@youwalther65 (Author)

youwalther65 commented Mar 15, 2023

@spr-mweber3 It worked for me even on 1.25.0.
I used a self-managed node group with taints and tolerations and added those as ASG tags, as required for scale-from-0 (see the tagging sketch below). For managed node groups, CAS just needs the eks:DescribeNodegroup IAM permission to recognize labels and taints.
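
For illustration, a minimal sketch of such tags using CAS's documented k8s.io/cluster-autoscaler/node-template/ tag-key convention (the taint and label names here are hypothetical; the ASG name is the one from this issue):

$ aws autoscaling create-or-update-tags --tags \
    "ResourceId=smng-mixed-2023022810062797600000002d,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/taint/dedicated,Value=special:NoSchedule,PropagateAtLaunch=false" \
    "ResourceId=smng-mixed-2023022810062797600000002d,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/node-role,Value=special,PropagateAtLaunch=false"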

CAS leader log excerpt:
...
I0315 12:57:46.750288       1 expiration_cache.go:103] Entry smng-mixed-2023022810062797600000002d: {name:smng-mixed-2023022810062797600000002d instanceType:m4.xlarge} has expired
...
I0315 13:04:11.967399       1 scale_up.go:477] Best option to resize: smng-mixed-2023022810062797600000002d
I0315 13:04:11.967414       1 scale_up.go:481] Estimated 1 nodes needed in smng-mixed-2023022810062797600000002d
I0315 13:04:11.967440       1 scale_up.go:601] Final scale-up plan: [{smng-mixed-2023022810062797600000002d 0->1 (max: 3)}]
I0315 13:04:11.967461       1 scale_up.go:700] Scale-up: setting group smng-mixed-2023022810062797600000002d size to 1
I0315 13:04:11.967485       1 auto_scaling_groups.go:248] Setting asg smng-mixed-2023022810062797600000002d size to 1
I0315 13:04:11.967780       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"12800f36-344b-4f5e-8e32-1f39380d60db", APIVersion:"v1", ResourceVersion:"21128420", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group smng-mixed-2023022810062797600000002d size to 1 instead of 0 (max: 3)
I0315 13:04:12.118636       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"12800f36-344b-4f5e-8e32-1f39380d60db", APIVersion:"v1", ResourceVersion:"21128420", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: group smng-mixed-2023022810062797600000002d size set to 1 instead of 0 (max: 3)
I0315 13:04:12.125654       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"inflate-multi-az-system-comp-6b67d44f55-86jnd", UID:"344075cb-9762-4594-9f73-e9a3171b53f7", APIVersion:"v1", ResourceVersion:"21128440", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{smng-mixed-2023022810062797600000002d 0->1 (max: 3)}]
I0315 13:04:12.133217       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"inflate-multi-az-system-comp-6b67d44f55-gcxgr", UID:"64c3a596-3b19-45e1-91c2-26a049d64473", APIVersion:"v1", ResourceVersion:"21128436", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{smng-mixed-2023022810062797600000002d 0->1 (max: 3)}]

The SMNG uses the Terraform code I showed in the initial comment.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 13, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 13, 2023
@Joldnine

I have the same issue. May I know if there is any update on this thread? Is it resolved in later releases?

@Shubham82 (Contributor)

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 9, 2024
@Shubham82 (Contributor)

/area provider/aws

@k8s-ci-robot k8s-ci-robot added the area/provider/aws Issues or PRs related to aws provider label Jan 9, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 8, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 8, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 7, 2024
@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
