Support exposing scalable resources to Cluster Autoscaler in ClusterClass #5442

Closed
enxebre opened this issue Oct 19, 2021 · 20 comments
Labels
area/clusterclass Issues or PRs related to clusterclass
kind/feature Categorizes issue or PR as related to a new feature.
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@enxebre
Member

enxebre commented Oct 19, 2021

User Story

As a cluster operator, I would like to be able to enable Cluster Autoscaling for some node pools in managed topologies.

Detailed Description

ClusterClass provides a way to define a "stamp" to be used for creating many Clusters with a similar shape.

It would be great to have a way to expose scalable resources, i.e. MachineDeployments, to the Cluster Autoscaler in the ClusterClass, so that this is automatically included in the generated Clusters.
This has two separate parts:

  1. Expose scalable resources to autoscaling, i.e. set the Cluster Autoscaler min/max size annotations on MachineDeployments (see the sketch after this list).
  2. Deploy the Cluster Autoscaler, e.g. clusterClass.spec.autoscaling: true would let us manage a Deployment running the Cluster Autoscaler.
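
For illustration, a minimal sketch of what part 1 could set on a generated MachineDeployment, assuming the node-group size annotations used by the Cluster Autoscaler's clusterapi provider; the values are examples only:

# fragment of a MachineDeployment rendered from a ClusterClass
metadata:
  annotations:
    # Cluster Autoscaler (clusterapi provider) node-group size bounds; example values
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"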

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 19, 2021
@enxebre
Member Author

enxebre commented Oct 19, 2021

This should have common design considerations with #5125

@enxebre
Member Author

enxebre commented Oct 19, 2021

/area topology

@vincepri
Member

/milestone Next
/assign @enxebre
to discuss at the community meeting

@k8s-ci-robot k8s-ci-robot added this to the Next milestone Oct 22, 2021
@vincepri
Member

/assign @randomvariable

@elmiko
Contributor

elmiko commented Oct 28, 2021

i like this idea, but considering the MHC cluster class proposal as well, it starts to make me wonder if we should have a more generic mechanism for allowing the user to add components that get installed on a per-cluster basis. we have had users ask about this type of feature in the past, and i wonder if we could offer a way for users to include arbitrary manifests (or references) in their cluster class which could be deployed after the core cluster is running?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2022
@elmiko
Contributor

elmiko commented Jan 26, 2022

i think we should keep this open, but i'm not clear on the next steps. would it be appropriate to draft a proposal for this?

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 26, 2022
@elmiko
Contributor

elmiko commented May 27, 2022

i still think this is a nice idea, but i'm not sure how the rest of the community feels. is this worth bringing up again at a meeting, or should we consider rolling it into the lifecycle hooks work? (cc @enxebre)

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 27, 2022
@fabriziopandini
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label May 28, 2022
@sbueringer
Member

Regarding the above points:

  1. Is this already supported today, as we can set labels on MDs via Cluster.spec.topology...?
  2. I'm not sure if ClusterClass should also be responsible for deploying the Cluster Autoscaler. To my understanding, the Cluster Autoscaler would be deployed once in the mgmt cluster?

@elmiko
Contributor

elmiko commented May 31, 2022

Is this already supported today as we can set labels on MDs via Cluster.spec.topology...?

that's kinda what i was wondering too

i'm not sure if ClusterClass should also be responsible to deploy the ClusterAutoscaler.

agreed, i'm not sure either. i had thought we decided /not/ to include the autoscaler when we discussed this during the clusterclass enhancement process.

i wonder if this issue needs updating to fit with the changes we have proposed more recently?

@sbueringer
Member

Fabrizio wrote an interesting comment here: #5532 (comment)

@MaxFedotov
Contributor

Hello! What is the current state of this issue? The lack of autoscaling support seems like a major blocker for migrating to ClusterClass and managed topologies.

@elmiko
Contributor

elmiko commented Jul 20, 2022

@MaxFedotov as you can tell, we haven't discussed this in the past month or so. i know the group is split on whether ClusterClass should expose a way to deploy the cluster autoscaler, and we have decided not to include that functionality for now.

with that said, you could add the minimum and maximum scaling annotations to the MachineDeployments or MachineSets defined in the ClusterClass. that would at least allow the autoscaler to detect those scalable resources; you would just be responsible for deploying the autoscaler itself.
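
for example, a minimal sketch of where those annotations could live in a ClusterClass, assuming the worker class template metadata is propagated to the generated MachineDeployments (worth verifying against your Cluster API version); names and values are illustrative only:

apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: quick-start
spec:
  workers:
    machineDeployments:
    - class: default-worker
      template:
        metadata:
          annotations:
            # Cluster Autoscaler (clusterapi provider) node-group size bounds; example values
            # (assumes this metadata is propagated to the generated MachineDeployment)
            cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
            cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"
        # bootstrap and infrastructure template references omitted for brevity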

@elmiko
Contributor

elmiko commented Jul 20, 2022

we had a conversation about this on slack; it has some interesting details about problems that users might run into.

https://kubernetes.slack.com/archives/C8TSNPY4T/p1658302795387519

@MaxFedotov
Contributor

Performed some tests using CAPD.
If you create a Cluster and specify replicas for the controlPlane or machineDeployments:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: capd
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    serviceDomain: cluster.local
    services:
      cidrBlocks:
      - 10.128.0.0/12
  topology:
    class: quick-start
    controlPlane:
      metadata: {}
      replicas: 1
    variables:
    - name: imageRepository
      value: k8s.gcr.io
    - name: etcdImageTag
      value: ""
    - name: coreDNSImageTag
      value: ""
    - name: podSecurityStandard
      value:
        audit: restricted
        enabled: true
        enforce: baseline
        warn: restricted
    version: v1.24.0
    workers:
      machineDeployments:
      - class: default-worker
        name: md-0
        replicas: 1

you won't be able to scale them using:

kubectl scale kcp capi-quickstart-75wgz --replicas=3

or

k scale md capi-quickstart-md-0-d498t --replicas=3

They will always revert to the values specified in the Cluster spec, and the following log entries will appear in the capi-controller-manager logs:

I0720 19:09:21.173605       1 machineset_controller.go:443] "Too many replicas" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" machineSet="default/capi-quickstart-md-0-jl47c-55969c6fd8" namespace="default" name="capi-quickstart-md-0-jl47c-55969c6fd8" reconcileID=06e8abbb-8634-4b00-b780-624c337cf31c need=1 deleting=1
I0720 19:09:21.174082       1 machineset_controller.go:449] "Found delete policy" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" machineSet="default/capi-quickstart-md-0-jl47c-55969c6fd8" namespace="default" name="capi-quickstart-md-0-jl47c-55969c6fd8" reconcileID=06e8abbb-8634-4b00-b780-624c337cf31c delete-policy="Random"
I0720 19:09:21.199668       1 reconcile_state.go:56] "Reconciling state for topology owned objects" controller="topology/cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" cluster="default/capi-quickstart" namespace="default" name="capi-quickstart" reconcileID=49b8a80f-19f1-4aff-a6eb-a6dcd34fe82f
I0720 19:09:21.234217       1 machineset_controller.go:460] "Deleted machine" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" machineSet="default/capi-quickstart-md-0-jl47c-55969c6fd8" namespace="default" name="capi-quickstart-md-0-jl47c-55969c6fd8" reconcileID=06e8abbb-8634-4b00-b780-624c337cf31c machine="capi-quickstart-md-0-jl47c-55969c6fd8-vkmtb"

But if your Cluster spec does not include the replicas field for the controlPlane or machineDeployments:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: capi-quickstart
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    serviceDomain: cluster.local
    services:
      cidrBlocks:
      - 10.128.0.0/12
  topology:
    class: quick-start
    controlPlane:
      metadata: {}
    variables:
    - name: imageRepository
      value: k8s.gcr.io
    - name: etcdImageTag
      value: ""
    - name: coreDNSImageTag
      value: ""
    - name: podSecurityStandard
      value:
        audit: restricted
        enabled: true
        enforce: baseline
        warn: restricted
    version: v1.24.0
    workers:
      machineDeployments:
      - class: default-worker
        name: md-0

Then a KubeadmControlPlane with 1 replica and a MachineDeployment with 1 replica will be created, and the replica counts won't be managed by the topology controller, so they can be scaled by the user or by the cluster-autoscaler.
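
Putting these observations together, a minimal sketch of a topology entry that the cluster-autoscaler can manage might look like this (annotation keys from the Cluster Autoscaler's clusterapi provider; the metadata placement assumes topology metadata is propagated to the generated MachineDeployment):

    workers:
      machineDeployments:
      - class: default-worker
        name: md-0
        # replicas intentionally omitted so the topology controller leaves the count alone
        metadata:
          annotations:
            # Cluster Autoscaler (clusterapi provider) node-group size bounds; example values
            # (assumes this metadata is propagated to the generated MachineDeployment)
            cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
            cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"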

Many thanks to @elmiko and @killianmuldoon for helping me understand how to deal with this issue.

@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the Next milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini
Member

As per the comment above, this is already possible.
There is also documentation in the autoscaler repository, kubernetes/autoscaler#5053, thanks to @killianmuldoon
/close

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

As per the comment above, this is already possible.
There is also documentation in the autoscaler repository, kubernetes/autoscaler#5053, thanks to @killianmuldoon
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
