Mega Issue: Manual node provisioning #749
Comments
One design option would be to introduce a Node Group custom resource that maintains a group of nodes with a node template + replica count. This CR would be identical to the Provisioner CR, except TTLs are replaced with replicas.
|
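As a hedged illustration of that idea only: `NodeGroup`, `replicas`, and the layout below are hypothetical (no such resource exists in Karpenter today); the template portion simply mirrors the fields a Provisioner/NodePool already carries.

apiVersion: karpenter.sh/v1alpha1   # hypothetical group/version, illustration only
kind: NodeGroup
metadata:
  name: static-pool
spec:
  replicas: 3                       # fixed node count instead of TTL-driven scale-down
  template:                         # same shape as a Provisioner/NodePool node template
    nodeClassRef:
      name: default
    requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]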
Most of us have AWS managed node groups or ASGs with at least 2 nodes to handle this |
I agree that many of these cases are handled by simply using ASG or MNG. Still worth collating these requests to see if this assumption is bad for some cases. |
I would love to have karpenter handle it all... But we still need a place to run karpenter from. The only dirty way I see to do that is: deploy an EC2 instance, deploy karpenter there with two replicas and hostname anti-affinity; karpenter would then provision a second node (now managed by karpenter), the second replica lands there, and we kill off the first manually deployed VM. Or we could just tag it with something that makes karpenter manage it until it hits its TTL. |
We're considering running karpenter and coredns on Fargate, with karpenter then provisioning capacity for everything else. I believe there was some documentation about this somewhere. Also, it was an AWS SA who suggested running coredns on Fargate as well (we were originally thinking about just running karpenter on Fargate). |
For this use case, wouldn't changing the minReplicas on desired application HPAs work better? That's what we do, so that there is no delay in spinning up more Pods for a rapid surge in traffic. |
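For reference, that suggestion is a one-field change on a standard autoscaling/v2 HorizontalPodAutoscaler; the workload name and replica numbers below are placeholders.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                  # placeholder workload
  minReplicas: 5               # raised floor keeps warm Pods around for traffic surges
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70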
I tried that a couple of weeks ago and it was a very frustrating experience, with huge deployment times and many, many timeouts and failures. |
Does this not work: https://docs.aws.amazon.com/eks/latest/userguide/fargate-getting-started.html? |
We are a terraform house, so the steps are slightly different. |
In some cases, especially for critical workloads that always need some node up and running (like autoscaling itself), it would be great if we could specify that a minimum of X nodes is always available for scheduling. |
Another use case is when you'd like to prewarm nodes at scheduled times. Currently, customers have to wait for Karpenter to provision nodes when pods are pending, or create dummy pods that trigger scaling before the production workload begins. Reactive scaling is slow, and the alternative seems like a workaround. Ideally, customers should be able to create a provisioner schedule that creates and deletes nodes on a defined schedule. Alternatively, Karpenter could have a CRD that customers can manipulate themselves to pre-create nodes (without having pending pods). |
Our use case is that we need to increase our node count during version upgrades, which can take hours or days. During that time we cannot have any scale-downs, so being able to have our upgrading app manually control what's going on would be ideal. (For context, our case isn't a web app, but an app that maintains a large in-memory state that needs to be replicated during an upgrade before being swapped out.) |
Following on from the "reserve capacity" or Ocean "headroom" issue here: aws/karpenter-provider-aws#987. Our specific use case is that we have a vendored controller that polls an API for workloads and then schedules pods to execute workloads as they come in. The vendored controller checks whether nodes have the resources to execute the workload before creating a pod for it. Because of this, no pods are ever created once the cluster is considered "full" by the controller. We've put in a feature request to the vendor to put a feature flag on this behavior, but I still think there could be a benefit to having some headroom functionality, as described in the Ocean docs here: https://docs.spot.io/ocean/features/headroom, for speedier scaling. Maybe headroom could be implemented at a per-provisioner level? The provisioners already know exactly how much CPU/memory they provision, and with the recent consolidation work I'd assume there's already some logic for knowing how utilized the nodes themselves are. |
FYI we add headroom to our clusters by scheduling low-priority pods, but it sounds like that would not work for your case either.
|
FYI, here is a rough draft of how I think this feature could look. |
We would also like Karpenter to support a warm pool. Right now it takes 6 minutes to spin up a node, which is too long for us. We would like a warm pool feature similar to the one in ASG. |
+1 to warm pool. I opened another issue that is more focused on the warm pool options. @crazywill, feel free to chime in there if you have a chance. |
Adding some thoughts here. We could emulate cluster-autoscaler's
One obvious difference between cluster-autoscaler and karpenter is that by design "number of nodes" is not a first class operational attribute in terms of describing cluster node capacity (because nodes are not homogeneous). So using "minimum number of nodes" to express desired configuration for solving some of the stories here (specifically the warm pool story) isn't sufficient by itself: you would also need "type/SKU of node". With both "number" + "SKU" you can deterministically guarantee a known capacity, and now you're sort of copying the cluster-autoscaler approach. However, the above IMO isn't really super karpenter-idiomatic. It would seem better to express "guaranteed minimum capacity" in a way that was closer to the operational reality that actually informs the karpenter provisioner. Something like:
etc. Basically, some sufficient amount of input that karpenter could use to simulate a "pod set" to then sort of "dry run" into the scheduler:
It gets trickier when you consider the full set of critical criteria that folks use in the pod/scheduler ecosystem: GPU, confidential compute, etc. But I think it's doable. |
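One possible reading of that "simulated pod set" idea, sketched as a hypothetical resource: nothing like `CapacityReservation`, `replicas`, or `podTemplate` below exists in Karpenter; it only illustrates expressing guaranteed minimum capacity in pod-shaped terms that the scheduler simulation could dry-run.

apiVersion: karpenter.sh/v1alpha1   # hypothetical, illustration only
kind: CapacityReservation
metadata:
  name: warm-capacity
spec:
  replicas: 4                       # how many simulated pods to keep schedulable at all times
  podTemplate:
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: on-demand
      containers:
        - name: placeholder
          resources:
            requests:
              cpu: "2"
              memory: 4Gi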
/remove-lifecycle stale |
When migrating from Cluster Auto Scaler to Karpenter, we would like Karpenter to provision a node beforehand, before we perform a drain on the old node. It takes time for Karpenter to provision a new node based on the unscheduled workloads, and due to this, the pods are kept in a Pending state for too long. |
@chamarakera would just requesting a node via a NodeClaim be enough in this case? No managed solution is required if you just want to create a node; from my testing, applying NodeClaims with a reference to a valid NodePool is enough to create a node ahead of time in karpenter. Note this example uses an instance type size from Azure.
k apply -f nodeclaim.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodeClaim
metadata:
  name: temporary-capacity
  labels:
    karpenter.sh/nodepool: general-purpose
  annotations:
    karpenter.sh/do-not-disrupt: "true"
spec:
  nodeClassRef:
    name: default
  requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64
    - key: kubernetes.io/os
      operator: In
      values:
        - linux
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - on-demand
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
        - Standard_D32_v3
  resources:
    requests:
      cpu: 2310m
      memory: 725280Ki
      pods: "7"
status: {} |
@chamarakera how bad is the pod latency you're describing here? Do you have a long bootstrap/startup time on your instance? What would be an acceptable amount of pod latency to not need prewarmed instances? |
@Bryce-Soghigian - To use NodeClaims, we would need to generate several NodeClaim resources depending on the cluster's size. I would like to see a configurable option within the NodePool itself to provision nodes prior to the migration. I think having a way to configure min size/max size parameters in the NodePool itself would be a good solution. @njtran - Usually it takes 1-2 minutes for startup, which is OK in non-prod, but in production I would like to have pending pods scheduled on a node as soon as possible (within a few seconds). |
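Purely to illustrate the ask (the min/max fields below are not part of Karpenter's NodePool API today), the request amounts to something like:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-purpose
spec:
  minNodes: 3                  # hypothetical: never disrupt below this count
  maxNodes: 20                 # hypothetical: never provision beyond this count
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]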
This is almost like implementing an overprovisioning pod with a low PriorityClass attached, so it gets bumped but avoids having to wait for resources.
|
After chatting a bit in Slack on the Karpenter channel with awiesner4, I have some thoughts around this problem. First I want to bring up Karpenter's primary objective, then break down Karpenter's current responsibilities; maybe this will help drive the design choice. I think Karpenter's main objective is to be an efficient cluster autoscaler, so leaving around nodes that aren't doing any work goes against what Karpenter is trying to achieve. Adding something that works around that would likely be problematic, because you would have to work around everything Karpenter is built to do. However, at the moment Karpenter is doing more than just autoscaling, which is where I think the problems and usability issues arise. It autoscales, but it also manages nodes: it takes over how nodes are configured and the lifecycle of nodes, and it closes the door for other things to manage nodes. What does this mean? I agree with those who are saying NodeClaim should be able to create nodes essentially outside of Karpenter's main autoscaling logic, so we don't change what Karpenter is trying to do (save money). At this time NodePools hold the logic and concept of how Karpenter tries to optimize the cluster, so I don't think manual provisioning should live there. On max/min nodes on NodePool: it was pointed out to me that a NodePool wouldn't know what instance types those min nodes should be, and in order to protect those min nodes Karpenter would have to cut through most of its disruption logic to support "don't drop node count below min". That isn't to say supporting node management or different autoscaling priorities isn't possible, but I think the entity that contains that logic should not be NodePool in its current form. If Karpenter makes NodePool extensible, e.g. _karpenter_nodepool_provider_N, that would allow users to group NodeClaims with intentions that differ from the primary objective of Karpenter. Users could create NodePool variants where keeping a minimum makes sense, where the scheduling logic is different, and so on. From my view, it's important to keep Karpenter's main focus clear and clean because it's complicated enough, but if we allow for more extension on the base concepts, we may be able to support use cases that fall outside of what Karpenter is trying to do. |
There are mentions of node auto-healing, using budgets to manage upgrades, etc. It seems Karpenter is moving toward being a node lifecycle management tool; it has more value than just autoscaling. So I am for a static provisioning CR that helps manage the lifecycle of static pools. |
Hi. I am really looking forward to this feature too. Is there an open PR for it? Or is it still in the discussions phase? |
We have 9 minimum nodes in our ASG for a batch job workload that gets kicked off by users through a UI. The users find the EC2 spin-up time unacceptable and expect their pod to spin up quicker than the EC2 can start. We must have a way to run a minimum number of nodes in a nodepool. |
You can already do that (run low-priority placeholder Pods), but AFAIK there's no controller that does exactly this. |
Same case here: we need a minimum set of nodes, evenly spread among the AZs (AWS). We want to always have extra capacity available for workload peaks, and we can't wait for the spin-up/down dance most of the time. @sftim, not sure, but maybe the cluster autoscaler provides something similar? About the low-prio placeholder Pods, have you seen any good guide for doing it? Sounds like a hack, though. |
Cluster Autoscaler documents how to overprovision a cluster to offset node provisioning time by running preemptible pods: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler At the same time, this solution isn't exclusive to Cluster Autoscaler; it works just fine with Karpenter and any other potential autoscaler. I wouldn't consider it a hack; it's implemented via stable Kubernetes resources using common practices. There is a ready-to-use Helm chart at https://github.com/deliveryhero/helm-charts/tree/master/stable/cluster-overprovisioner. |
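A minimal version of that overprovisioning pattern, with placeholder names and sizes: a negative-priority PriorityClass plus a pause-container Deployment whose Pods reserve headroom and are preempted as soon as real workloads need the space.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                     # below the default priority of 0, so these Pods are preempted first
globalDefault: false
description: "Placeholder Pods that keep warm capacity available"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 3                  # tune to the amount of headroom you want
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"         # sized to the headroom each placeholder should reserve
              memory: 1Gi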
Cluster Autoscaler now also supports ProvisioningRequest CRD |
There is an additional financial impact caused by the de facto solution of overprovisioning for a warm pool of capacity. When high-priority pods are scheduled, preempting the |
Does Karpenter have a plan to implement this? It's really helpful for AI workloads. |
+1 |
Cross-referencing the ProvisioningRequest ask here: #742 (comment) |
cc @raywainman who is tracking warm replicas stories on behalf of WG Serving here: https://docs.google.com/document/d/1QsN4ubjerEqo5L4bQamOFFS2lmCv5zNPis2Z8gcIITg |
For everyone's context, we did a little bit of ideating and came up with an API that we were pretty happy with from the Karpenter side (see https://github.com/jonathan-innis/karpenter-headroom-poc). We're having an open discussion with the CAS folks about the difference between how we are thinking about the Headroom API and the ProvisioningRequest API; feel free to take a look and comment on the doc if you have any thoughts: https://docs.google.com/document/d/1SyqStWUt407Rcwdtv25yG6MpHdNnbfB3KPmc4zQuz1M/edit?usp=sharing |
Hi guys, how's it going? I have been looking for a fixed-provisioning solution with Karpenter. We need a minimum number of on-demand nodes (I'd say 2, one per AZ) and then to scale the rest of the workload up/down with spot instances. I haven't found a solution yet; we tried to play with an "On-Demand/Spot Ratio Split", but it didn't work and kept spreading Spot instances across the whole workload. Any workaround or thoughts about how to solve it? We really want to fully use Karpenter for our workload. |
Hi @balmha, based on Karpenter we developed a feature to ensure a minimum number of non-spot replicas for each workload, as illustrated below. Under the hood, it's a webhook component that monitors the distribution of each workload and modifies the pods' affinity to prefer spot instances while requiring some to run on-demand. You can check it out here: CloudPilot Console. This is not a promotion, just a technical communication. |
You don't need this issue resolved for your use case, @balmha.
Even better, you can have your workloads that can't tolerate interruptions set nodeSelectors or affinity to run on on-demand nodes. That's something that is cleaner to do with Karpenter than with CA. |
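Concretely, pinning an interruption-sensitive workload to on-demand capacity is just a nodeSelector on Karpenter's well-known capacity-type label; the Deployment name and image below are placeholders.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-service                       # placeholder
spec:
  replicas: 2
  selector:
    matchLabels:
      app: critical-service
  template:
    metadata:
      labels:
        app: critical-service
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: on-demand  # keep these Pods off spot nodes
      containers:
        - name: app
          image: nginx:1.27                    # placeholder image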
@jonathan-innis I really like the look of the headroom APIs, I think they'd cover the requirements I was talking about in #993. Is there a separate issue to track the headroom APIs? Google docs are blocked from our corporate machines. |
Is this still planned to be implemented? Any kind of ETA? Our use case would also greatly benefit from it: we scale up GitLab Runners for our organization, but the cold starts (60-90 seconds) are not ideal for CI/CD. Having 1-2 nodes always available as headroom would make sure that every pipeline coming in gets started directly. Ideally, the minimum number of nodes could be set for a specific schedule (e.g. office hours only), so that during the night or on weekends it ramps down to zero to minimize costs. |
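One way to approximate schedule-aware headroom with existing tools, assuming the placeholder overprovisioning Deployment sketched earlier in this thread, is a KEDA cron trigger that scales it up during office hours and back to zero outside them (timezone and hours are placeholders):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: overprovisioning-schedule
spec:
  scaleTargetRef:
    name: overprovisioning      # the placeholder Deployment from the earlier sketch
  minReplicaCount: 0            # no headroom outside the schedule
  maxReplicaCount: 3
  triggers:
    - type: cron
      metadata:
        timezone: Europe/Berlin # placeholder timezone
        start: 0 7 * * 1-5      # scale up at 07:00 on weekdays
        end: 0 19 * * 1-5       # scale back down at 19:00
        desiredReplicas: "3"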
Our team uses blue/green deployments and always needs to wait a few minutes for the new nodes to come up. Then Karpenter disrupts nodes again (after the deployment there is a temporary overuse of resources). |
One suggestion for always having a fixed set of nodes available is a deployment with proper topologySpreadConstraints + a PDB to spread the pods one node per replica (set very low resource requirements). Don't set a lower priority, so it won't get preempted (as in the prewarm/headroom scenario mentioned before). If you combine this with KEDA you can even do that dynamically based on external factors, e.g. a cron schedule; see the sketch after this comment. |
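A sketch of that suggestion, assuming one tiny placeholder Pod per node is enough to keep the node around; names and counts are placeholders.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-keeper              # placeholder
spec:
  replicas: 3                    # one replica per node you want to keep
  selector:
    matchLabels:
      app: node-keeper
  template:
    metadata:
      labels:
        app: node-keeper
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname  # spread the Pods one per node
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: node-keeper
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 10m           # very low requests, as suggested
              memory: 16Mi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: node-keeper
spec:
  minAvailable: 3                # blocks voluntary eviction of the placeholder Pods
  selector:
    matchLabels:
      app: node-keeper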
Tell us about your request
What do you want us to build?
I'm seeing a number of feature requests to launch nodes separately from pending pods. This issue is intended to broadly track this discussion.
Use Cases:
Community Note