
Concepts/ClusterAdministration: Expand Node Autoscaling documentation #45802

Open
wants to merge 2 commits into main from jtuznik/node-autoscaling-docs

Conversation


@towca towca commented Apr 8, 2024

Change the Cluster Autoscaling concept page to Node Autoscaling, and expand it in accordance with the Cluster Autoscaler/Karpenter Alignment AEP: https://docs.google.com/document/d/1rHhltfLV5V1kcnKr_mKRKDC4ZFPYGP4Tde2Zy-LE72w/edit#heading=h.iof64m6gewln.

The documentation tentatively mentions one of the workload-level APIs we're supposed to align on (see kubernetes/autoscaler#6648). We can comment that section out until we align and implement the new version in both CA and Karpenter.

Tracking issue: kubernetes/autoscaler#6646

Staged preview: https://deploy-preview-45802--kubernetes-io-main-staging.netlify.app/docs/concepts/cluster-administration/node-autoscaling/

cc @jonathan-innis @MaciekPytel @gjtempleton WDYT?

@k8s-ci-robot k8s-ci-robot added the language/en Issues or PRs related to English language label Apr 8, 2024
@k8s-ci-robot k8s-ci-robot requested review from kbhawkey and sftim April 8, 2024 14:15
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/docs Categorizes an issue or PR as relevant to SIG Docs. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 8, 2024

netlify bot commented Apr 8, 2024

Pull request preview available for checking

Built without sensitive environment variables

Name Link
🔨 Latest commit fd0893e
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-io-main-staging/deploys/66f161fe2c859400081ff9db
😎 Deploy Preview https://deploy-preview-45802--kubernetes-io-main-staging.netlify.app

@towca
Author

towca commented Apr 8, 2024

/hold
/assign @jonathan-innis
/assign @gjtempleton
/assign @MaciekPytel

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 8, 2024
with a particular set of pods, lack of cloud provider capacity).

{{< note >}}
Only the scheduling constraints of pods (for example resource requests, node selectors) are taken into account when determining Nodes to


I think this statement "Only the scheduling constraints of pods... are taken into account when determining Nodes to provision." is not entirely correct for Karpenter. Maybe a better way to put it would be:

"The fundamental input for determining node provisioning is pod scheduling constraints."

This would leave open other node provisioning inputs (such as vm cost). I don't think we have to state this, but I think saying Only the scheduling constraints of pods... is too strong and excludes other node autoprovisioning considerations.

cc @jonathan-innis


I think the intent was to emphasize that real resource usage is not really considered and neither CA nor Karpenter will do anything if your pod is OOMing. I think that is worth emphasizing.
But I agree that the wording is misleading - both Karpenter and CA look at VM price (in case of CA only if you enable price expander). And I could imagine a lot of other signals - e.g. spot availability. Let's rephrase this.
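(Illustrative aside, not part of the review thread: the VM price comparison Maciek mentions is opt-in for Cluster Autoscaler via its expander flag. A minimal excerpt of a CA container spec, assuming a typical Deployment layout; the image tag is hypothetical.)

    # Cluster Autoscaler container (excerpt); values are illustrative only
    containers:
    - name: cluster-autoscaler
      image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0  # hypothetical tag
      command:
      - ./cluster-autoscaler
      - --expander=price   # compare candidate node groups by VM price when scaling up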


I think this is probably pretty accurate when it comes to provisioning. It differs a bit (in Karpenter at least) when trying to do Consolidation through VM price comparisons, and also preferences are maybe handled differently in Karpenter, but I think it's fair to say:
"pod scheduling and resource constraints are the primary drivers of schedulability"

Author

Good point, I like your suggestion! Is this ok?

to autoscale Nodes based on actual resource usage, you can [combine workload and Node autoscaling](#multi-scaling).
{{< /note >}}
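(Illustrative aside, not from the PR text: the "scheduling constraints" in question are things like resource requests and node selectors on a pending Pod. A hypothetical example whose constraints, rather than its actual usage, determine what kind of Node gets provisioned.)

    apiVersion: v1
    kind: Pod
    metadata:
      name: batch-worker            # hypothetical name
    spec:
      nodeSelector:
        example.com/pool: compute   # hypothetical label; only Nodes carrying it are candidates
      containers:
      - name: worker
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "4"                # the autoscaler sizes new Nodes from these requests,
            memory: 16Gi            # not from the Pod's real resource usage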

Depending on the choice of autoscaler, provisioning can also be triggered by


I think minimum (or maximum) node thresholds are more of an autoscaler configuration rather than a required behavior of a particular autoscaler. So I'd probably replace "Depending on the choice of autoscaler..." with

"Autoscaler configuration may also include other node provisioning triggers (for example the number of Nodes falling below a configured minimum limit)."

Author

This paragraph was only meant to highlight that there are "non-standard" triggers for provisioning that particular autoscalers might have, not trying to imply any required behavior for autoscalers in general - hence the "depending on the choice of autoscaler".

Your suggestion IMO conveys the same thing, changed.
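(Illustrative aside: for Cluster Autoscaler, one such additional trigger is the per-node-group minimum size, commonly set through the --nodes flag when static node group discovery is used; the bounds and group name below are hypothetical.)

    # Cluster Autoscaler args (excerpt); CA provisions Nodes if the group falls below the minimum
    command:
    - ./cluster-autoscaler
    - --nodes=3:10:example-node-group   # keep between 3 and 10 Nodes in "example-node-group"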

experience. Both will provision new Nodes for unschedulable Pods, and both will
consolidate the Nodes that are no longer optimally utilized.

Most of the difference between the autoscalers is in how they're set up and


I think we've already stated the first part of this sentence in the doc. We could add a simple statement here: "Different autoscalers may provide features outside the Node autoscaling scope described on this page, and those additional features may differ between them."

Author

Good catch, this indeed seems redundant, thanks!

Main differences vs Karpenter:
* Cluster Autoscaler (CA) provides features related to just Node autoscaling. Node lifecycle
beyond that is not in its scope.
* CA doesn't support auto-provisioning, the Node groups it can provision from have to be


@elmiko what do you think about this sentence? One could argue that "scale from zero" is functionally equivalent to "auto-provisioning".

Additionally, the varieties of "auto-provisioning" in Karpenter that exist beyond core CA also require pre-configuration. If "auto" is defined as "no user configuration required" I don't think either solution actually supports that in practice.

cc @jonathan-innis

Member
@Bryce-Soghigian Bryce-Soghigian Apr 15, 2024

Rather than "auto-provisioning", maybe "mixed SKU autoscaling" makes more sense.

We could also do a small write-up of the advantages of mixed SKU autoscaling. Here are some examples that came to mind when I tried to write up the advantages of having access to all SKUs; someone can perhaps word it better than me.


"75% of scale up failures for Azure's Cluster Autoscaler come from lack of configured quota for the 4 or so sizes customers configure, with less than 1% of customers configuring CAS with more than 4 vm sizes on aks." With mixed SKU autoscaling, you aren't limited to those 4 vm sizes.

The per-NodeClaim model vs the per-node-group model allows for different flexibility in the retry model.

Karpenter puts the responsibility on the cloud provider to select the SKU (after NodePool-level filtering), meaning it can use factors like zonal availability, quota, price, and allocation errors in its VM SKU selection decision.

CAS can't self-mitigate the quota errors seen in scale-up, but Karpenter can retry with a different SKU that has quota, or a different zone if there are zonal allocation failures. So continuous retrying in Karpenter is less likely to fail.

Retrying a single VM, as opposed to retrying the entire node group, also takes up less of your throttling quota.

These are some of the benefits of having a mixed-SKU model and a per-VM retry model: we can retry more intelligently by caching SKUs as unavailable for zonal or quota-related reasons.

Another large factor to consider is pricing selection: being given all SKUs on a given cloud provider allows for much better cost savings. CAS has the pricing expander, but its real weakness is that it doesn't have 1000 sizes to pick from; it can only pick among the VM sizes currently configured by the customer (as stated earlier, often no more than 4).

Contributor

great details @Bryce-Soghigian , i'm not so comfortable about using "SKU" so much, as many people are using on-prem and bare metal deployments which look different from public cloud offerings. that said, i totally get what you are talking about.

@jackfrancis this is a tough one for me, because at the core, CAS needs to be informed about the pools it can scale, whereas karpenter can be given some recipes on how to find instances to create pools. scale from zero might appear like autoprovisioning, but the big difference for me is that you still need to configure the node group and link it to an instance type. in cluster-api the user would need to create all the various infrastructure machine templates to be able to have those instance types represented. but on karpenter that expression would be much more compact.

for me, the part to clarify is that CAS will only be able to scale groups that it knows about but karpenter can discover new instances by the constraints it is given. assuming we want to clarify this more.

perhaps we could use a section that describes the difference between "scaling" as opposed to "provisioning" ?


IIUC, it seems the core difference between the two instance-shape calculations is that Karpenter has more attribute-based selection features, and CAS does more instance type matching. Karpenter's NodePools default to fully inclusive (default = *), whereas IIUC CAS defaults to fully exclusive: an instance type can only be used for a pod if a node group has been explicitly configured for it.


CAS has the pricing expander, but the real weakness is that it doesn't have 1000 sizes to pick from.

In GKE (which is where I have the data) we see many clusters with >100 nodepools (GKE nodepools; they can be thought of as the equivalent of CAS nodegroups). That's enough to effectively give you the full spectrum of shapes for optimization (GCE has fewer instance types than AWS; 100-150 is enough to cover the full spectrum of general-purpose instances). So even with CAS you can have the SKU selection and zonal handling.

A major difference is that you need to go through the process of setting up all those nodegroups, and this needs to be done outside of CAS. With auto-provisioning you no longer have to do any external setup, and you get all the different shapes with just a simple configuration of the autoscaler.

Member
@Bryce-Soghigian Bryce-Soghigian Apr 23, 2024

Yep, that makes a lot of sense. My point was more that from the Azure perspective we do not see many clusters creating that many nodepools/nodegroups. GKE has a wonderful CAS integration the rest of us need to strive towards. Maybe there is more we could be doing to improve the experience of integrating with the external setup to get all the different shapes into our autoscaler.

(Azure also has a limitation that one can only create 100 nodepools, but it hasn't been a problem because we don't have a lot of users configuring 100 nodepools.)

I would also argue that at a certain point having 1000 VM sizes that all do the same thing isn't a strength, and that 150 or so sizes is more than enough for all intents and purposes.

Author

Judging by this discussion, I think we should probably expand the "Node auto-provisioning" section under "Provisioning", and then this part will be clear?

@njtran's analogy is interesting, but I'm not sure if it maps onto CAS. I see the 2 modes like this, but perhaps I'm missing some Karpenter context:

  • Regular provisioning: The user defines a set of homogeneous cloud provider VM groups. The groups allow for adding and removing VMs. Based on the pending pods, the autoscaler chooses two parameters for provisioning: which group, and how many VMs to add. For the autoscaler, this is a well-scoped problem: among the VM options that the user pre-configured, figure out how many new VMs would be needed, then choose the "best" <group, node_count> option, and provision that many Nodes from that group.
  • Auto-provisioning: The user doesn't have to define an explicit cloud provider VM group for every possible configuration that could be needed for their workloads. The user only defines a set of constraints on the VMs that can be provisioned. Based on the pending pods and on the set of constraints, the autoscaler chooses two parameters: the VM config, and how many VMs to add. Choosing a VM config from the vast space of possible cloud provider configs is IMO a fundamentally different problem than just selecting between pre-configured configs.

One could argue that "scale from zero" is functionally equivalent to "auto-provisioning".

@jackfrancis I would disagree here, there's no reason why the regular provisioning model I described above shouldn't work with scale-from-zero. Even while scaling from >0, the autoscaler has to predict what a new Node would look like if a new VM was added to a given group. This prediction is usually straightforward if there are existing Nodes in the group - just take an existing Node and sanitize it. If there are no Nodes in a given group, the prediction has to come from the cloud provider. So some cloud provider integrations will support it while others won't, but IMO this is more of an implementation detail of a given cloud provider integration for regular provisioning.

From another perspective, the only thing that the autoscaler has to solve on top of scale-from-not-zero to have scale-from-zero is passing "VM template" information from the cloud provider VM group. This information should be available in the group in order to create new VMs, so this should only be a matter of some plumbing, and then it's the same as from-not-zero. Contrast this with auto-provisioning, where there's no pre-configured template that can be leveraged.
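(Illustrative aside to make the "set of constraints" model concrete: a heavily abbreviated Karpenter NodePool, where requirements bound the space of Nodes the autoscaler may create instead of enumerating explicit VM groups. The schema is simplified and the values are hypothetical; see karpenter.sh for the exact API.)

    apiVersion: karpenter.sh/v1beta1
    kind: NodePool
    metadata:
      name: general-purpose              # hypothetical name
    spec:
      template:
        spec:
          requirements:                  # constraints, not an explicit list of VM groups
          - key: kubernetes.io/arch
            operator: In
            values: ["amd64"]
          - key: karpenter.sh/capacity-type
            operator: In
            values: ["on-demand", "spot"]
      limits:
        cpu: "1000"                      # cap on the total CPU this NodePool may provision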

Contributor

one more point to add here about node groups and discovery. for CAS providers that integrate directly with an infrastructure provider (eg GKE, Azure) the mechanism for discovering instances that are available in node groups is a different process than for the clusterapi provider. in clusterapi the user is in control of the node groups by creating clusterapi CRs in the cluster.

this change in user experience places a different emphasis on how node groups are discovered in clusterapi. i realize this pattern is isolated to our provider, but i thought it useful to share here in the context of this discussion.

@jackfrancis

@Bryce-Soghigian @tallaxes because AKS supports both autoscalers it would be good to have your perspective on how well this document reflects those differences. Essentially, do the customer-facing docs that help AKS customers choose one or the other agree with the concepts in this document? (they should!)


Additional context:
* Documentation: https://karpenter.sh/
* Supported cloud providers: TBD
Member

Suggested change
* Supported cloud providers: TBD
* Supported cloud providers: https://github.com/kubernetes-sigs/karpenter?tab=readme-ov-file#supported-cloudproviders

Author

Done, thank you!

constraints for the provisioned Nodes instead of pre-configuring Node groups explicitly.

Additional context:
* Documentation: https://karpenter.sh/
Member

This documentation is still not very cloud-neutral. It's the best we have now, but it would be good to revisit this later.

cc: @jonathan-innis, @njtran, @dtzar


I think @chrisnegus and @dtzar are tackling this, we should definitely update this link with however this effort lands

Author

Ack, should I leave a TODO for somebody?


This is already being worked on. Personally, I don't think we need to have a TODO here, unless we think that it'll actually help make sure this is updated.


Also, I'd imagine that we may end up using the same domain down the line anyway, since I don't imagine that we want to shift all of the current users away from the current domain, just extend it.

Contributor

We are working on making most of the Karpenter.sh site generic, with separate, well-defined sections for supported cloud providers.

always selects specific Nodes to remove as opposed to just resizing the underlying VM group.

Main differences vs Karpenter:
* Cluster Autoscaler (CA) provides features related to just Node autoscaling. Node lifecycle
Member

Might be worth calling out the massive cloud provider support CAS has! Karpenter has a lot of work to do to catch up in that regard!

Author

Done, included it in the new extracted section!

that best fits the requests of pending Pods. When consolidating, Cluster Autoscaler
always selects specific Nodes to remove as opposed to just resizing the underlying VM group.

Main differences vs Karpenter:
Member
@Bryce-Soghigian Bryce-Soghigian Apr 15, 2024

Maybe this belongs in its own section, "CAS vs Karpenter" after the karpenter section?

Seems some of the points are repeated in both, e.g.

  • "CA doesn't support auto-provisioning" here, and then "Karpenter supports auto-provisioning" in the Karpenter section

Contributor

+1, would be nice to have a side-by-side comparison, perhaps with a feature/config matrix (if it makes sense).

Author

Yeah, very good point. Extracted this to a dedicated comparison section. Leaving the current format for now, but some sort of table sounds good, thanks for the suggestion!

Contributor
@elmiko elmiko left a comment

i think the content here looks nice, thanks for proposing it @towca . i have a few questions and comments.

This is especially useful if the number of Pods changes over time, for example
as a result of [combining workload and Node autoscaling](#multi-scaling).

Autoscalers provision the Nodes by creating and deleting cloud provider Virtual Machines (VMs) backing them.
Contributor

i think we could make this a little more generic to cover more cases, eg

Suggested change
Autoscalers provision the Nodes by creating and deleting cloud provider Virtual Machines (VMs) backing them.
Autoscalers provision the Nodes by creating and deleting cloud provider Virtual Machines (VMs), or other infrastructure, backing them.

Author

Split this into two sentences, WDYT?

Contributor

i don't think we need to split it, mainly i just wanted to expand on "virtual machines" to be wider in scope.

where would you split it?

Author

Hmm, have you seen this part after my latest update? I just meant that it's now 2 sentences:

Autoscalers provision the Nodes by creating and deleting cloud provider resources backing them. Most
commonly, the resources backing the Nodes are Virtual Machines (VMs).




Comment on lines 62 to 63
run all Pods in the cluster, the Pods should utilize as much of the Nodes'
resources as possible. From this perspective, the overall Node utilization in

maybe a bit nitty, but I could see some users not agreeing with this. Some users probably want to artificially prescribe some overhead on their requests/limits so as not to be OOMKilled or evicted when their resource usage spikes. I think it's up to the autoscaler to allow a user to prescribe a desired overhead.
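(Illustrative aside: the overhead described above is usually expressed by requesting more than typical usage, or by setting limits above requests; the values below are hypothetical.)

    resources:
      requests:
        cpu: 500m        # what the scheduler and the node autoscaler budget for
        memory: 512Mi
      limits:
        cpu: "1"         # headroom for spikes before throttling kicks in
        memory: 1Gi      # headroom before the container is OOMKilled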

Author

With which part exactly? I'm not sure how what you're saying is incompatible with this section. FWIW, this was only meant from the perspective of Node autoscalers, which mostly treat Pods as black boxes.

Choosing Pod requests correctly is indeed just as important to the overall cost-effectiveness of a cluster, but it's a completely separate problem (running pods->optimal pod requests), the solution to which is an input to the problem that Node autoscalers solve (pod requests->optimal Nodes).

Do you have an idea for how we could improve this section? Maybe another note to the effect of "choosing Pod requests optimally is as important to the overall cost-effectiveness as optimizing Node utilization, see vertical workload autoscaling"? I'm not a fan of adjacent notes though.

Contributor

You can use normal text, and you can use --- (on a line by itself) within one note as a divider.

Author

Since we don't have the other note here now, added this as a note. This really feels like one though, it has no relation to the rest of this section, so leaving it as just normal text reads really weird to me. WDYT?

Comment on lines 92 to 113
Consolidation is prone to certain race conditions. For example, if a new Pod is
created right after a Node is consolidated, it can get scheduled on an existing Node
in the cluster instead of the Pods recreated because of consolidation - leaving
these Pods pending. This scenario is usually automatically recovered by provisioning.
{{< /note >}}

This is reading a bit weird to me. What I think you're trying to describe is the case where a pod schedules on a node that's being consolidated, so the pod schedules and then is shortly after evicted. There are certain pods that shouldn't be evicted, and we're working on solving this race condition in Karpenter. Is there a case in CAS where pods that are blocked by PDBs or a "do-not-disrupt" pod (if there's a similar concept in CAS) are removed?

Author

This note was meant to highlight that "no pods should become pending as a result of consolidation" is not a hard guarantee.

What I think you're trying to describe is the case where a pod schedules on a node that's being consolidated, and so the pod schedules, then is shortly after evicted.

I was trying to describe a different scenario:

  • Pod A is running on node N.
  • Autoscaler consolidates node N, evicts pod A.
  • Pod B is created, and gets scheduled in the existing space where the autoscaler had thought pod A would fit.
  • Pod A is recreated, and is now pending as a result of consolidation.
  • Normally, the autoscaler should then provision another node for pod A.
  • If there are some constraints preventing the autoscaler from provisioning more Nodes (e.g. configured limits, cloud provider stockout/quota issues), pod A could stay pending indefinitely.

It sounds like a rare corner case, but we've had a surprising number of people in the CA community hit it over the years (usually in small clusters where some pods are "important", but the Node limits are configured quite low) - resulting in bug reports. I'd like to avoid setting a similar expectation in this documentation.

Any idea on how to make this read better?


Ah I got it now.

WDYT about this?

In consolidating nodes, autoscalers try to simulate how the kube-scheduler will assign pods that are rescheduling. Since autoscalers do not control the actual scheduling process, autoscalers cannot guarantee that "no pods should become pending as a result of consolidation". This can change based on the churn in the cluster, the complexity of scheduling constraints, and other autoscaler-specific constraints.

Author

I like the mention about not controlling the actual scheduling, but I'm missing a link between this and why the guarantee doesn't hold. WDYT about:

Autoscalers predict how a recreated Pod will likely be scheduled after a Node is provisioned or
consolidated, but they don't control the actual scheduling. Because of this, some pods might
actually become pending as a result of consolidation - if for example a completely new Pod appears
while consolidation is being performed.
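(Illustrative aside, not part of the proposed wording: users who cannot tolerate their "important" Pods being evicted and left pending by consolidation typically guard them with a PodDisruptionBudget; the name and label below are hypothetical.)

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: important-app
    spec:
      minAvailable: 1          # consolidation cannot evict the last remaining replica
      selector:
        matchLabels:
          app: important-app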



Comment on lines 205 to 207
You can use annotation `autoscaling.kubernetes.io/do-not-disrupt: true` on a Pod to
prevent it being disrupted by consolidation. Consolidation should never remove a Node
where a Pod annotated with `autoscaling.kubernetes.io/do-not-disrupt: true` is running.

Is this aspirational? Or is this the current CAS annotation that's respected? This is definitely not the annotation that we use in Karpenter today. Maybe we should include the karpenter.sh/do-not-disrupt annotation in here until we align on those if CAS has already made the migration?

Author

Yes, as mentioned in the PR description:

The documentation tentatively mentions one of the workload-level APIs we're supposed to align on (see kubernetes/autoscaler#6648). We can comment that section out until we align and implement the new version in both CA and Karpenter.
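(Illustrative aside: until the aligned autoscaling.kubernetes.io/do-not-disrupt API lands in both projects, the annotations respected today are project-specific. A hypothetical Pod carrying both for completeness:)

    apiVersion: v1
    kind: Pod
    metadata:
      name: do-not-disrupt-me                                    # hypothetical name
      annotations:
        karpenter.sh/do-not-disrupt: "true"                      # respected by Karpenter today
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"  # respected by Cluster Autoscaler today
    spec:
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9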

@towca towca force-pushed the jtuznik/node-autoscaling-docs branch from 0281f91 to 0ccbde7 Compare April 23, 2024 18:33
@towca
Author

towca commented Apr 23, 2024

Thanks for all the comments @njtran @elmiko @Bryce-Soghigian @jackfrancis @sftim! I think I've either addressed or responded to everything, please let me know if I missed anything.

Comment on lines 261 to 264
The [balancer](https://github.com/kubernetes/autoscaler/blob/master/balancer/proposals/balancer.md)
is a component providing horizontal workload autoscaling, with the goal of balancing the number of
workload replicas between topologies. It can be [combined with Node autoscaling](#multi-scaling) to
achieve highly-available, cost-efficient workloads.
Contributor

We avoid linking to KEPs in lieu of documentation, and I especially wouldn't link to https://github.com/kubernetes/autoscaler/blob/master/balancer/proposals/balancer.md, which says it's a KEP but doesn't look like what most contributors would call a KEP.

The best option is to add a new page about the topology balancer into the docs, and then link there.

Author

Sure, removed the link to the "KEP", just linking to the code now. I don't think there's a better link.

Documenting balancer on a new page is out of scope of this PR, I can also just remove this section if that's somehow better.

Contributor

I'd remove it TBH.

Author

Removed.

to be pre-configured. Karpenter supports auto-provisioning, so the user only has to configure a
set of constraints for the provisioned Nodes instead of fully configuring homogenous groups.
* Cluster Autoscaler provides integrations with numerous cloud providers, including smaller and less
popular providers. Karpenter currently only provides integrations with AWS and Azure.
Contributor

Maybe:

Suggested change
popular providers. Karpenter currently only provides integrations with AWS and Azure.
popular providers. The Kubernetes project does not directly publish any autoscaler solution based
on Karpenter, but third-party releases include integrations with AWS and Azure.

Author

The wording you're proposing is IMO very confusing in the context of the whole page - the rest of the page positions Karpenter as one of the autoscalers, while this sentence implies otherwise.

In any case, I'll leave this wording to the Karpenter stakeholders - @njtran @jonathan-innis WDYT?

Contributor

Karpenter, within Kubernetes, is a library you can use to build a node autoscaler. Have I got that right?


I can see what you're saying @sftim, but I think it's probably fair to just talk about Karpenter as it's been commonly implemented in cloud providers. Specifically in response to Tim's suggestion, here's what I'd propose:

* For Cluster Autoscaler, the Kubernetes project directly provides standalone Cloud Provider solutions. For Karpenter, the Kubernetes project publishes Karpenter as a core library of controllers with a common interface which is integrated and implemented by Cloud Providers to build a node autoscaler. Well known implementations include AWS and Azure.

Contributor

We need to make sure that the changes we merge don't conflict with requirements from our content guide.

Also, it's good to have content that is timeless. “Karpenter currently only provides integrations with AWS and Azure.” will go stale as soon as someone publishes a third integration.


I tweaked the text from #45802 (comment); @njtran WDYT?

* For Cluster Autoscaler, the Kubernetes project directly provides standalone cloud provider
  integrations.
  For Karpenter, the Kubernetes project publishes Karpenter as a core library of controllers with
  a common interface that cloud providers can integrate and extend to build a node autoscaler.
  The implementations of Karpenter node autoscaling include AWS and Azure.

We don't limit ourselves to mentioning well known implementations; either we mention no third parties, or even a tiny cloud provider with four customers is welcome to mention their tool.


As a personal opinion, it'd be nice to see the CNCF landscape gain the ability to list Kubernetes add-ons (perhaps after running a certification check to make sure they are compatible with stable Kubernetes?)


Perhaps to add to this suggestion, we link out to the AWS and Azure implementations that we mention here?

Author

Simplified the suggestions from @sftim and @njtran, added links as @jonathan-innis suggested. WDYT about the current version?

Contributor

There's still a “currently”. Can we make this timeless and also avoid the risk that this goes stale due to an external change, such as someone using Karpenter to make an autoscaler?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 23, 2024
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: MaciekPytel
Once this PR has been reviewed and has the lgtm label, please assign reylejano for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@towca
Author

towca commented Sep 23, 2024

@jonathan-innis We have an LGTM from the CA side now - just missing the Karpenter LGTM and a final pass from the docs approvers. I addressed all of your comments, could we close on this soon?

@sftim
Contributor

sftim commented Sep 23, 2024

There's quite a lot of pending feedback, too. See https://github.com/kubernetes/website/pull/45802/files.

@towca towca force-pushed the jtuznik/node-autoscaling-docs branch from b5a447f to 4db9e64 Compare September 23, 2024 12:24
@towca
Author

towca commented Sep 23, 2024

@sftim Ugh, I messed up the push and scrapped the latest round of changes. Sorry about that, they should be restored now.

@towca towca force-pushed the jtuznik/node-autoscaling-docs branch from 4db9e64 to fd0893e Compare September 23, 2024 12:41
@sftim
Contributor

sftim commented Sep 23, 2024

This is OK to unhold once tech LGTM(s) are in place from SIG Autoscaling.

@tengqm
Contributor

tengqm commented Oct 11, 2024

Why are we changing "cluster autoscaling" to "node autoscaling"?
My (previous and current) understanding is that cluster scaling deals with the scale of a cluster, while node scaling means adding/removing resources to/from a node. This could be wrong, although I believe many users may get confused as I do.

@sftim
Contributor

sftim commented Oct 11, 2024

Why are we changing "cluster autoscaling" to "node autoscaling"?

SIG Autoscaling had a brief to do this; see kubernetes/autoscaler#6646 for more context.

I don't object to that change. Horizontally autoscaling your nodes is analogous to horizontally autoscaling your VMs, Pods, etc. Or you can use a multi-dimensional autoscaler like Karpenter.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 24, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@towca
Author

towca commented Oct 29, 2024

@jonathan-innis @njtran Ping, could we close on this review? It's been open forever, and I think we've reached a consensus on the controversial parts now.

@njtran

njtran commented Oct 29, 2024

Apologies, didn't realize review was waiting on us. I'll try and take a look this week or the next. A lot on my plate at the moment.

@sftim
Contributor

sftim commented Dec 20, 2024

@towca could you rebase this against main? If this gets LGTMs, we can then merge it, provided there is no conflict.

Contributor

💭 We could mention that your cloud provider may offer managed node autoscaling, and that readers should consult their cloud provider's docs for more details.

The [descheduler](https://github.com/kubernetes-sigs/descheduler) is a component providing Node
consolidation functionality based on custom policies, as well as other features related to
optimizing Nodes and Pods (for example deleting frequently restarting Pods).

Contributor

Optionally

Suggested change
If you use a mechanism that
{{< glossary_tooltip text="taints" term_id="taint" >}} underutilized nodes, the descheduler can drain Pods
from those nodes so that your node autoscaler can eventually switch the node off (once nothing is
running there).
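(Illustrative aside, not part of the suggestion: a mechanism that taints underutilized nodes might mark a Node like this; the taint key is purely hypothetical and should match whatever your tooling and descheduler policy agree on.)

    apiVersion: v1
    kind: Node
    metadata:
      name: worker-3                     # hypothetical Node name
    spec:
      taints:
      - key: example.com/underutilized
        value: "true"
        effect: NoSchedule               # keeps new Pods off the Node while it drains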

@sftim
Contributor

sftim commented Dec 27, 2024

I am glad we merged #45197 already.

@towca, you might be able to make this change land sooner by splitting it into several smaller PRs that add up to the same thing. What do you think?

Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.)
language/en (Issues or PRs related to English language.)
needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
sig/autoscaling (Categorizes an issue or PR as relevant to SIG Autoscaling.)
sig/docs (Categorizes an issue or PR as relevant to SIG Docs.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)