Concepts/ClusterAdministration: Expand Node Autoscaling documentation #45802
Conversation
/hold
with a particular set of pods, lack of cloud provider capacity).

{{< note >}}
Only the scheduling constraints of pods (for example resource requests, node selectors) are taken into account when determining Nodes to
I think this statement "Only the scheduling constraints of pods... are taken into account when determining Nodes to provision." is not entirely correct for Karpenter. Maybe a better way to put it would be:
"The fundamental input for determining node provisioning is pod scheduling constraints."
This would leave open other node provisioning inputs (such as VM cost). I don't think we have to state this, but I think saying "Only the scheduling constraints of pods..." is too strong and excludes other node auto-provisioning considerations.
I think the intent was to emphasize that real resource usage is not really considered and neither CA nor Karpenter will do anything if your pod is OOMing. I think that is worth emphasizing.
But I agree that the wording is misleading - both Karpenter and CA look at VM price (in the case of CA, only if you enable the price expander). And I could imagine a lot of other signals - e.g. spot availability. Let's rephrase this.
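For reference, the price expander mentioned here is selected via a startup flag on the Cluster Autoscaler deployment. A minimal sketch, assuming a standard Deployment manifest and an illustrative image tag; only the relevant container args are shown:

```yaml
# Fragment of a Cluster Autoscaler Deployment (illustrative values).
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0  # example tag
    command:
      - ./cluster-autoscaler
      - --cloud-provider=gce      # price expander support varies by provider
      - --expander=price          # rank candidate node groups by estimated cost
```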
I think this is probably pretty accurate when it comes to provisioning. It differs a bit (in Karpenter at least) when trying to do Consolidation through VM price comparisons, and also preferences are maybe handled differently in Karpenter, but I think it's fair to say:
"pod scheduling and resource constraints are the primary drivers of schedulability"
Good point, I like your suggestion! Is this ok?
to autoscale Nodes based on actual resource usage, you can [combine workload and Node autoscaling](#multi-scaling).
{{< /note >}}
Depending on the choice of autoscaler, provisioning can also be triggered by
I think minimum (or maximum) node thresholds are more of an autoscaler configuration rather than a required behavior of a particular autoscaler. So I'd probably replace "Depending on the choice of autoscaler..." with
"Autoscaler configuration may also include other node provisioning triggers (for example the number of Nodes falling below a configured minimum limit)."
This paragraph was only meant to highlight that there are "non-standard" triggers for provisioning that particular autoscalers might have, not trying to imply any required behavior for autoscalers in general - hence the "depending on the choice of autoscaler".
Your suggestion IMO conveys the same thing, changed.
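For illustration, this is roughly what such a configured minimum looks like for Cluster Autoscaler with statically configured node groups (the `--nodes=<min>:<max>:<name>` flag; the group name is hypothetical and the exact flags vary by cloud provider):

```yaml
# Fragment of a Cluster Autoscaler Deployment (illustrative values).
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0  # example tag
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --nodes=3:10:example-node-group   # keep between 3 and 10 Nodes in this group
```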
experience. Both will provision new Nodes for unschedulable Pods, and both will
consolidate the Nodes that are no longer optimally utilized.
Most of the difference between the autoscalers is in how they're set up and
I think we've already stated the first part of this sentence in the doc. We could add a simple statement here: "Different autoscalers may provide features outside the Node autoscaling scope described on this page, and those additional features may differ between them."
Good catch, this indeed seems redundant, thanks!
Main differences vs Karpenter:
* Cluster Autoscaler (CA) provides features related to just Node autoscaling. Node lifecycle
beyond that is not in its scope.
* CA doesn't support auto-provisioning, the Node groups it can provision from have to be
@elmiko what do you think about this sentence? One could argue that "scale from zero" is functionally equivalent to "auto-provisioning".
Additionally, the varieties of "auto-provisioning" in Karpenter that exist beyond core CA also require pre-configuration. If "auto" is defined as "no user configuration required" I don't think either solution actually supports that in practice.
Rather than "auto-provisioning" maybe mixed sku autoscaling makes more sense.
We could also do a small writeup of the advantages of mixed sku autoscaling. Here are some examples that came to my mind when trying to write up some advantages to having access to all skus. Someone can perhaps word it better than me.
"75% of scale up failures for Azure's Cluster Autoscaler come from lack of configured quota for the 4 or so sizes customers configure, with less than 1% of customers configuring CAS with more than 4 vm sizes on aks." With mixed SKU autoscaling, you aren't limited to those 4 vm sizes.
The per Nodeclaim model vs the per node-group model allows for different flexibility in the retry model.
Karpenter puts the responsibility on the cloud provider to select the SKU(After nodepool level filtering). Meaning it can choose to use factors like zonal availability, quota, price, and allocation errors in its VM SKU selection decision.
CAS can't self mitigate the quota errors seen in scale up, but Karpenter can retry with a different SKU that has quota or a different zone if there are zonal allocation failures. So retrying continuously in Karpenter is less likely to fail.
The retry granularity of retrying a single VM, vs retrying the entire node-group takes up less of your throttling quota.
These are some of the benefits of having a mixed SKU model, and a per vm retry model. We can retry more intelligently caching skus as unavailable due to zonal, or quota related reasons.
Another large factor to consider is the pricing selection, being given all SKUs on a given cloudprovider, allows for much better cost savings. CAS has the pricing expander, but the real weakness is that it doesn't have 1000 sizes to pick from. It just can pick based on the current VM Sizes configured by the customer. (As stated earlier, is often no more than 4)
Great details @Bryce-Soghigian, I'm not so comfortable about using "SKU" so much, as many people are using on-prem and bare metal deployments which look different from public cloud offerings. That said, I totally get what you are talking about.
@jackfrancis this is a tough one for me, because at the core, CAS needs to be informed about the pools it can scale, whereas Karpenter can be given some recipes on how to find instances to create pools. Scale from zero might appear like auto-provisioning, but the big difference for me is that you still need to configure the node group and link it to an instance type. In cluster-api the user would need to create all the various infrastructure machine templates to be able to have those instance types represented, but on Karpenter that expression would be much more compact.
For me, the part to clarify is that CAS will only be able to scale groups that it knows about, but Karpenter can discover new instances by the constraints it is given. Assuming we want to clarify this more.
Perhaps we could use a section that describes the difference between "scaling" as opposed to "provisioning"?
IIUC, it seems the core difference between the two instance-shape calculations is that Karpenter has more attribute-based selection features, and CAS does more instance type matching. Karpenter's NodePools default to fully inclusive (default = *), whereas IIUC CAS defaults to fully exclusive: only instance types explicitly configured in a node group can be considered for a pod.
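A minimal NodePool sketch to illustrate the attribute-based selection (field names as in the karpenter.sh v1beta1 API to the best of my knowledge; the nodeClassRef and values are illustrative):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default   # cloud-provider-specific NodeClass (illustrative)
      requirements:
        # Anything not constrained here stays fully inclusive - the "*" default.
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
```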
CAS has the pricing expander, but the real weakness is that it doesn't have 1000 sizes to pick from.
In GKE (which is where I have data from) we see many clusters with >100 nodepools (GKE nodepools; they can be thought of as the equivalent of CAS nodegroups). That's enough to effectively give you the full spectrum of shapes for optimization (GCE has fewer instance types than AWS; 100-150 is enough to cover the full spectrum of general-purpose instances). So even with CAS you can have the SKU selection and zonal handling.
A major difference is that you need to go through the process of setting up all those nodegroups, and this needs to be done outside of CAS. With auto-provisioning you no longer have to do any external setup, and you get all the different shapes with just a simple configuration of the autoscaler.
Yep, that makes a lot of sense. My point was more that, from the Azure perspective, we do not see many clusters creating that many nodepools/nodegroups. GKE has a wonderful CAS integration the rest of us need to strive towards. Maybe there is more we could be doing to make that experience of integrating with the external setup to get all the different shapes into our autoscaler.
(Azure also has a limitation that one can only create 100 nodepools, but it hasn't been a problem because we don't have a lot of users configuring 100 nodepools.)
I would also argue that at a certain point having 1000 VM sizes that all do the same thing isn't a strength, and that 150 or so sizes are more than enough for all intents and purposes.
Judging by this discussion, I think we should probably expand the "Node auto-provisioning" section under "Provisioning", and then this part will be clear?
@njtran's analogy is interesting, but I'm not sure if it maps onto CAS. I see the 2 modes like this, but perhaps I'm missing some Karpenter context:
- Regular provisioning: The user defines a set of homogeneous cloud provider VM groups. The groups allow for adding and removing VMs. Based on the pending pods, the autoscaler chooses two parameters for provisioning: which group, and how many VMs to add. For the autoscaler, this is a well-scoped problem: among the VM options that the user pre-configured, figure out how many new VMs would be needed, then choose the "best" <group, node_count> option, and provision that many Nodes from that group.
- Auto-provisioning: The user doesn't have to define an explicit cloud provider VM group for every possible configuration that could be needed for their workloads. The user only defines a set of constraints on the VMs that can be provisioned. Based on the pending pods, and on the set of constraints, the autoscaler chooses two parameters: the VM config, how many VMs to add. Choosing a VM config from the vast space of possible cloud provider configs is IMO a fundamentally different problem than just selecting between pre-configured configs.
One could argue that "scale from zero" is functionally equivalent to "auto-provisioning".
@jackfrancis I would disagree here, there's no reason why the regular provisioning model I described above shouldn't work with scale-from-zero. Even while scaling from >0, the autoscaler has to predict what a new Node would look like if a new VM were added to a given group. This prediction is usually straightforward if there are existing Nodes in the group - just take an existing Node and sanitize it. If there are no Nodes in a given group, the prediction has to come from the cloud provider. So some cloud provider integrations will support it while others won't, but IMO this is more of an implementation detail of a given cloud provider integration for regular provisioning.
From another perspective, the only thing that the autoscaler has to solve on top of scale-from-not-zero to have scale-from-zero is passing "VM template" information from the cloud provider VM group. This information should be available in the group in order to create new VMs, so this should only be the matter of some plumbing and then it's the same as from-not-zero. Contrast this with auto-provisioning where there's no pre-configured template that can be leveraged.
One more point to add here about node groups and discovery: for CAS providers that integrate directly with an infrastructure provider (e.g. GKE, Azure), the mechanism for discovering instances that are available in node groups is a different process than for the clusterapi provider. In clusterapi the user is in control of the node groups by creating clusterapi CRs in the cluster.
This change in user experience places a different emphasis on how node groups are discovered in clusterapi. I realize this pattern is isolated to our provider, but I thought it useful to share here in the context of this discussion.
@Bryce-Soghigian @tallaxes because AKS supports both autoscalers, it would be good to have your perspective on how well this document reflects those differences. Essentially, do the customer-facing docs that help AKS customers choose one or the other agree with the concepts in this document? (They should!)
Additional context:
* Documentation: https://karpenter.sh/
* Supported cloud providers: TBD
Suggested change:
- * Supported cloud providers: TBD
+ * Supported cloud providers: https://github.com/kubernetes-sigs/karpenter?tab=readme-ov-file#supported-cloudproviders
Done, thank you!
constraints for the provisioned Nodes instead of pre-configuring Node groups explicitly.

Additional context:
* Documentation: https://karpenter.sh/
This documentation is still not very cloud-neutral. It's the best we have now, but it would be good to revisit this later.
cc: @jonathan-innis, @njtran, @dtzar
I think @chrisnegus and @dtzar are tackling this, we should definitely update this link with however this effort lands
Ack, should I leave a TODO for somebody?
This is already being worked on. Personally, I don't think we need to have a TODO here, unless we think that it'll actually help make sure this is updated.
Also, I'd imagine that we may end up using the same domain down the line anyways since I don't imagine that we want to shift all of the current users away from the current domain, just extend it
We are working on making most of the Karpenter.sh site generic, with separate, well-defined sections for supported cloud providers.
always selects specific Nodes to remove as opposed to just resizing the underlying VM group.

Main differences vs Karpenter:
* Cluster Autoscaler (CA) provides features related to just Node autoscaling. Node lifecycle
Might be worth calling out the massive cloud provider support CAS has! Karpenter has a lot of work to do to catch up in that regard!
Done, included it in the new extracted section!
that best fits the requests of pending Pods. When consolidating, Cluster Autoscaler
always selects specific Nodes to remove as opposed to just resizing the underlying VM group.

Main differences vs Karpenter:
Maybe this belongs in its own section, "CAS vs Karpenter", after the Karpenter section?
It seems some of the points are repeated in both, e.g.:
- CA doesn't support auto-provisioning, then Karpenter supports "auto-provisioning"
+1, would be nice to have a side-by-side comparison, perhaps with a feature/config matrix (if it makes sense).
Yeah, very good point. Extracted this to a dedicated comparison section. Leaving the current format for now, but some sort of table sounds good, thanks for the suggestion!
I think the content here looks nice, thanks for proposing it @towca. I have a few questions and comments.
This is especially useful if the number of Pods changes over time, for example
as a result of [combining workload and Node autoscaling](#multi-scaling).

Autoscalers provision the Nodes by creating and deleting cloud provider Virtual Machines (VMs) backing them.
I think we could make this a little more generic to cover more cases, e.g.:
Suggested change:
- Autoscalers provision the Nodes by creating and deleting cloud provider Virtual Machines (VMs) backing them.
+ Autoscalers provision the Nodes by creating and deleting cloud provider Virtual Machines (VMs), or other infrastructure, backing them.
Split this into two sentences, WDYT?
I don't think we need to split it, mainly I just wanted to expand on "virtual machines" to be wider in scope.
Where would you split it?
Hmm, have you seen this part after my latest update? I just meant that it's now 2 sentences:
Autoscalers provision the Nodes by creating and deleting cloud provider resources backing them. Most
commonly, the resources backing the Nodes are Virtual Machines (VMs).
run all Pods in the cluster, the Pods should utilize as much of the Nodes'
resources as possible. From this perspective, the overall Node utilization in
Maybe a bit nitty, but I could see some users not agreeing with this. Some users probably want to artificially prescribe some overhead on their requests/limits so as not to be OOMKilled or evicted when their resource usage spikes. I think it's up to the autoscaler to allow a user to prescribe a desired overhead.
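For what it's worth, the headroom being described here is just a property of the requests/limits the user writes; something like this (hypothetical numbers):

```yaml
# Container resources with deliberate headroom above typical usage.
resources:
  requests:
    cpu: 500m     # roughly 2x the typical observed usage, to absorb spikes
    memory: 1Gi
  limits:
    memory: 2Gi   # extra buffer before the container is OOMKilled
```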
With which part exactly? I'm not sure how what you're saying is incompatible with this section. FWIW, this was only meant from the perspective of Node autoscalers, which mostly treat Pods as black boxes.
Choosing Pod requests correctly is indeed just as important to the overall cost-effectiveness of a cluster, but it's a completely separate problem (running pods -> optimal pod requests), the solution to which is an input to the problem that Node autoscalers solve (pod requests -> optimal Nodes).
Do you have an idea for how we could improve this section? Maybe another note to the effect of "choosing Pod requests optimally is as important to the overall cost-effectiveness as optimizing Node utilization, see vertical workload autoscaling"? I'm not a fan of adjacent notes though.
You can use normal text, and you can use ---
(on a line by itself) within one note as a divider.
Since we don't have the other note here now, added this as a note. This really feels like one though, it has no relation to the rest of this section, so leaving it as just normal text reads really weird to me. WDYT?
Consolidation is prone to certain race conditions. For example, if a new Pod is
created right after a Node is consolidated, it can get scheduled on an existing Node
in the cluster instead of the Pods recreated because of consolidation - leaving
these Pods pending. This scenario is usually automatically recovered by provisioning.
{{< /note >}}
This is reading a bit weird to me. What I think you're trying to describe is the case where a pod schedules on a node that's being consolidated, so the pod schedules and is shortly after evicted. There are certain pods that shouldn't be evicted, and we're working on solving this race condition in Karpenter. Is there a case in CAS where pods that are blocked by PDBs or a "do-not-disrupt" pod (if there's a similar concept in CAS) are removed?
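For reference, "blocked by PDBs" refers to a PodDisruptionBudget like the one below (hypothetical name and selector); both autoscalers are expected to respect it when evicting Pods during consolidation:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: important-app-pdb      # hypothetical name
spec:
  minAvailable: 1              # eviction must not take availability below this floor
  selector:
    matchLabels:
      app: important-app       # hypothetical label
```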
This note was meant to highlight that "no pods should become pending as a result of consolidation" is not a hard guarantee.
What I think you're trying to describe is the case where a pod schedules on a node that's being consolidated, and so the pod schedules, then is shortly after evicted.
I was trying to describe a different scenario:
- Pod A is running on node N.
- Autoscaler consolidates node N, evicts pod A.
- Pod B is created, and gets scheduled in the existing space where the autoscaler had thought pod A would fit.
- Pod A is recreated, and is now pending as a result of consolidation.
- Normally, the autoscaler should then provision another node for pod A.
- If there are some constraints preventing the autoscaler from provisioning more Nodes (e.g. configured limits, cloud provider stockout/quota issues), pod A could stay pending indefinitely.
It sounds like a rare corner case, but we've had a surprising number of people in the CA community hit it over the years (usually in small clusters where some pods are "important", but the Node limits are configured quite low) - resulting in bug reports. I'd like to avoid setting a similar expectation in this documentation.
Any idea on how to make this read better?
Ah I got it now.
WDYT about this?
In consolidating nodes, autoscalers try to simulate how the kube-scheduler will assign pods that are rescheduling. Since autoscalers do not control the actual scheduling process, autoscalers cannot guarantee that "no pods should become pending as a result of consolidation". This can change based on the churn in the cluster, the complexity of scheduling constraints, and other autoscaler-specific constraints.
I like the mention about not controlling the actual scheduling, but I'm missing a link between this and why the guarantee doesn't hold. WDYT about:
Autoscalers predict how a recreated Pod will likely be scheduled after a Node is provisioned or
consolidated, but they don't control the actual scheduling. Because of this, some pods might
actually become pending as a result of consolidation - if for example a completely new Pod appears
while consolidation is being performed.
You can use annotation `autoscaling.kubernetes.io/do-not-disrupt: true` on a Pod to
prevent it being disrupted by consolidation. Consolidation should never remove a Node
where a Pod annotated with `autoscaling.kubernetes.io/do-not-disrupt: true` is running.
Is this aspirational? Or is this the current CAS annotation that's respected? This is definitely not the annotation that we use in Karpenter today. Maybe we should include the `karpenter.sh/do-not-disrupt` annotation in here until we align on those, if CAS has already made the migration?
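For reference, the annotations the two autoscalers respect today differ (as far as I know): Karpenter uses `karpenter.sh/do-not-disrupt`, while Cluster Autoscaler uses `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"`. A sketch of a Pod opting out of both:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: do-not-disrupt-me   # hypothetical name
  annotations:
    karpenter.sh/do-not-disrupt: "true"                        # respected by Karpenter
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"    # respected by Cluster Autoscaler
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
```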
Yes, as mentioned in the PR description:
The documentation tentatively mentions one of the workload-level APIs we're supposed to align on (see kubernetes/autoscaler#6648). We can comment that section out until we align and implement the new version in both CA and Karpenter.
Thanks for all the comments @njtran @elmiko @Bryce-Soghigian @jackfrancis @sftim! I think I've either addressed or responded to everything, please let me know if I missed anything.
The [balancer](https://github.com/kubernetes/autoscaler/blob/master/balancer/proposals/balancer.md)
is a component providing horizontal workload autoscaling, with the goal of balancing the number of
workload replicas between topologies. It can be [combined with Node autoscaling](#multi-scaling) to
achieve highly-available, cost-efficient workloads.
We avoid linking to KEPs in lieu of documentation, and I especially wouldn't link to https://github.com/kubernetes/autoscaler/blob/master/balancer/proposals/balancer.md, which says it's a KEP but doesn't look like what most contributors would call a KEP.
The best option is to add a new page about the topology balancer into the docs, and then link there.
Sure, removed the link to the "KEP", just linking to the code now. I don't think there's a better link.
Documenting balancer on a new page is out of scope of this PR, I can also just remove this section if that's somehow better.
I'd remove it TBH.
Removed.
to be pre-configured. Karpenter supports auto-provisioning, so the user only has to configure a
set of constraints for the provisioned Nodes instead of fully configuring homogenous groups.
* Cluster Autoscaler provides integrations with numerous cloud providers, including smaller and less
popular providers. Karpenter currently only provides integrations with AWS and Azure.
Maybe:
Suggested change:
- popular providers. Karpenter currently only provides integrations with AWS and Azure.
+ popular providers. The Kubernetes project does not directly publish any autoscaler solution based
+ on Karpenter, but third-party releases include integrations with AWS and Azure.
The wording you're proposing is IMO very confusing in the context of the whole page - the rest of the page positions Karpenter as one of the autoscalers, while this sentence implies otherwise.
In any case, I'll leave this wording to the Karpenter stakeholders - @njtran @jonathan-innis WDYT?
Karpenter, within Kubernetes, is a library you can use to build a node autoscaler. Have I got that right?
I can see what you're saying @sftim, but I think it's probably fair to just talk about Karpenter as it's been commonly implemented in cloud providers. Specifically in response to Tim's suggestion, here's what I'd propose:
* For Cluster Autoscaler, the Kubernetes project directly provides standalone Cloud Provider solutions. For Karpenter, the Kubernetes project publishes Karpenter as a core library of controllers with a common interface which is integrated and implemented by Cloud Providers to build a node autoscaler. Well known implementations include AWS and Azure.
We need to make sure that the changes we merge don't conflict with requirements from our content guide.
Also, it's good to have content that is timeless. “Karpenter currently only provides integrations with AWS and Azure.” will go stale as soon as someone publishes a third integration.
I tweaked the text from #45802 (comment); @njtran WDYT?
* For Cluster Autoscaler, the Kubernetes project directly provides standalone cloud provider
integrations.
For Karpenter, the Kubernetes project publishes Karpenter as a core library of controllers with
a common interface that cloud providers can integrate and extend to build a node autoscaler.
The implementations of Karpenter node autoscaling include AWS and Azure.
We don't limit ourselves to mentioning well known implementations; either we mention no third parties, or even a tiny cloud provider with four customers is welcome to mention their tool.
As a personal opinion, it'd be nice to see the CNCF landscape gain the ability to list Kubernetes addons (perhaps after running a certification check to make sure they are compatible with stable Kubernetes?)
Perhaps to add to this suggestion, we link out to the AWS and Azure implementations that we mention here?
Simplified the suggestions from @sftim and @njtran, added links as @jonathan-innis suggested. WDYT about the current version?
There's still a “currently”. Can we make this timeless and also avoid the risk that this goes stale due to an external change, such as someone using Karpenter to make an autoscaler?
New changes are detected. LGTM label has been removed.
@jonathan-innis We have an LGTM from the CA side now - just missing the Karpenter LGTM and a final pass from the docs approvers. I addressed all of your comments, could we close on this soon?
There's quite a lot of pending feedback, too. See https://github.com/kubernetes/website/pull/45802/files.
@sftim Ugh, I messed up the push and scrapped the latest round of changes. Sorry for that, they should be restored now |
301 redirect added, and a bunch of references on other pages renamed.
This is OK to unhold once tech LGTM(s) are in place from SIG Autoscaling.
Why are we changing "cluster autoscaling" to "node autoscaling"?
SIG Autoscaling had a brief to do this; see kubernetes/autoscaler#6646 for more context. I don't object to that change. Horizontally autoscaling your nodes is analogous to horizontally autoscaling your VMs, Pods, etc. Or you can use a multi-dimensional autoscaler like Karpenter.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@jonathan-innis @njtran Ping, could we close on this review? It's been open forever, and I think we've reached a consensus on the controversial parts now.
Apologies, didn't realize review was waiting on us. I'll try and take a look this week or the next. A lot on my plate at the moment.
@towca could you rebase this against main? If this gets LGTMs, we can then merge it providing there is no conflict.
💭 We could mention that your cloud provider may provide managed node autoscaling, and suggest consulting your cloud provider's docs for more details.
The [descheduler](https://github.com/kubernetes-sigs/descheduler) is a component providing Node
consolidation functionality based on custom policies, as well as other features related to
optimizing Nodes and Pods (for example deleting frequently restarting Pods).
Optionally:
If you use a mechanism that
{{< glossary_tooltip text="taints" term_id="taint" >}} underutilized nodes, the descheduler can drain Pods
from those nodes so that your node autoscaler can eventually switch the node off (once nothing is
running there).
Change Cluster Autoscaling concept page to Node Autoscaling, and expand it in accordance with the Cluster Autoscaler/Karpenter Alignment AEP: https://docs.google.com/document/d/1rHhltfLV5V1kcnKr_mKRKDC4ZFPYGP4Tde2Zy-LE72w/edit#heading=h.iof64m6gewln.
The documentation tentatively mentions one of the workload-level APIs we're supposed to align on (see kubernetes/autoscaler#6648). We can comment that section out until we align and implement the new version in both CA and Karpenter.
Tracking issue: kubernetes/autoscaler#6646
Staged preview: https://deploy-preview-45802--kubernetes-io-main-staging.netlify.app/docs/concepts/cluster-administration/node-autoscaling/
cc @jonathan-innis @MaciekPytel @gjtempleton WDYT?