feat: cluster consolidation #2123
Conversation
✅ Deploy Preview for karpenter-docs-prod ready!
Help: "Number of nodes created in total by consolidation.", | ||
}, | ||
) | ||
var consolidationNodesTerminatedCounter = prometheus.NewCounter( |
This collides with Jason's PR. Worth syncing with him.
This counts nodes terminated for consolidation purposes; there's a separate counter for nodes created by consolidation. The idea was that you can track over time what consolidation specifically is doing to your cluster.
I wonder if we should be marking the node with a deletion reason (e.g., emptiness, consolidation, expiry) and tracking that with a single GaugeVec.
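A sketch of what tracking by reason could look like, purely illustrative (the metric name and label are assumptions, and it uses a CounterVec rather than a GaugeVec since terminations only ever increase):

import "github.com/prometheus/client_golang/prometheus"

// illustrative only: one counter vector keyed by termination reason,
// instead of a separate counter per reason
var nodesTerminatedCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "karpenter",
		Subsystem: "nodes",
		Name:      "terminated_total",
		Help:      "Number of nodes terminated, partitioned by reason.",
	},
	[]string{"reason"}, // e.g. emptiness, consolidation, expiry
)

// in the termination path:
// nodesTerminatedCounter.WithLabelValues("consolidation").Inc()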
There is a consolidationActionsPerformedCounter that tracks the consolidation actions performed (shrink/delete/delete empty) by action. If someone wants to rework metrics in a follow-up PR, I'm fine with that, but I want to limit the scope of this one.
SGTM. Let's cut a followup issue.
@tzneal thanks for the quick fix, the new version has reduced the CPU usage to about 50% of what it was.
pods, err := c.getNodePods(ctx, n.Node.Name)
if err != nil {
	logging.FromContext(ctx).Errorf("Determining node pods, %s", err)
Can you provide some sample log output from consolidation executing over a few minutes?
Sure, this is the result of two scale-downs that ended up deleting two nodes and replacing one. The first scale-down only allowed deleting two nodes, so I waited and scaled down a few more pods to make a replacement more likely:
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:31.214Z INFO controller.consolidation Consolidating via Delete, terminating 1 nodes ip-192-168-166-214.us-west-2.compute.internal/c6g.4xlarge {"commit": "63daf4b"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:31.248Z INFO controller.termination Cordoned node {"commit": "63daf4b", "node": "ip-192-168-166-214.us-west-2.compute.internal"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:31.577Z INFO controller.termination Deleted node {"commit": "63daf4b", "node": "ip-192-168-166-214.us-west-2.compute.internal"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:41.344Z INFO controller.consolidation Consolidating via Delete, terminating 1 nodes ip-192-168-190-133.us-west-2.compute.internal/t4g.xlarge {"commit": "63daf4b"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:41.385Z INFO controller.termination Cordoned node {"commit": "63daf4b", "node": "ip-192-168-190-133.us-west-2.compute.internal"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:41.678Z INFO controller.termination Deleted node {"commit": "63daf4b", "node": "ip-192-168-190-133.us-west-2.compute.internal"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:45:12.882Z INFO controller.consolidation Consolidating via Replace, terminating 1 nodes ip-192-168-191-127.us-west-2.compute.internal/c6g.4xlarge and replacing with a node from types a1.2xlarge, t4g.2xlarge, c6g.2xlarge, c7g.2xlarge, t3a.2xlarge and 37 other(s) {"commit": "63daf4b"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:45:15.281Z INFO controller.consolidation.cloudprovider Launched instance: i-01882e52c50acf61b, hostname: ip-192-168-31-116.us-west-2.compute.internal, type: t4g.2xlarge, zone: us-west-2d, capacityType: on-demand {"commit": "63daf4b", "provisioner": "default"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:45:15.286Z INFO controller.consolidation Created node with 5 pods requesting {"cpu":"5125m","pods":"10"} from types a1.2xlarge, t4g.2xlarge, c6g.2xlarge, c7g.2xlarge, t3a.2xlarge and 37 other(s) {"commit": "63daf4b", "provisioner": "default"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:46:19.967Z INFO controller.termination Deleted node {"commit": "63daf4b", "node": "ip-192-168-191-127.us-west-2.compute.internal"}
Is this including debug?
No, I can collect another log with debug.
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:30.491Z INFO controller.consolidation Consolidating via Delete (empty node), terminating 1 nodes ip-192-168-189-114.us-west-2.compute.internal/c6g.8xlarge {"commit": "ece47b1"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:30.491Z DEBUG controller.events Normal {"commit": "ece47b1", "object": {"kind":"Node","name":"ip-192-168-189-114.us-west-2.compute.internal","uid":"1e1bc41d-e1c4-402a-a4b0-22ee79d8c882","apiVersion":"v1","resourceVersion":"52764954"}, "reason": "ConsolidateTerminateNode", "message": "Consolidating node via Delete (empty node), terminating 1 nodes ip-192-168-189-114.us-west-2.compute.internal/c6g.8xlarge"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:30.542Z INFO controller.termination Cordoned node {"commit": "ece47b1", "node": "ip-192-168-189-114.us-west-2.compute.internal"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:30.726Z INFO controller.termination Deleted node {"commit": "ece47b1", "node": "ip-192-168-189-114.us-west-2.compute.internal"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:40.815Z INFO controller.consolidation Consolidating via Replace, terminating 1 nodes ip-192-168-81-216.us-west-2.compute.internal/c4.8xlarge and replacing with a node from types c6g.8xlarge, c7g.8xlarge, c6a.8xlarge, c6gd.8xlarge, m6g.8xlarge and 10 other(s) {"commit": "ece47b1"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:41.036Z DEBUG controller.consolidation.cloudprovider Discovered security groups: [sg-05c364fe5ccfc49b4 sg-05f88204faa8cf2d4] {"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:41.038Z DEBUG controller.consolidation.cloudprovider Discovered kubernetes version 1.21 {"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:41.087Z DEBUG controller.consolidation.cloudprovider Discovered ami-064ac13bc8e3c2ee1 for query "/aws/service/eks/optimized-ami/1.21/amazon-linux-2-arm64/recommended/image_id" {"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:41.113Z DEBUG controller.consolidation.cloudprovider Discovered ami-00cf63b12c53803a5 for query "/aws/service/eks/optimized-ami/1.21/amazon-linux-2/recommended/image_id" {"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:43.191Z INFO controller.consolidation.cloudprovider Launched instance: i-0d136b6c50015d855, hostname: ip-192-168-187-186.us-west-2.compute.internal, type: c6g.8xlarge, zone: us-west-2a, capacityType: on-demand {"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:43.250Z INFO controller.consolidation Created node with 22 pods requesting {"cpu":"22125m","pods":"27"} from types c6g.8xlarge, c7g.8xlarge, c6a.8xlarge, c6gd.8xlarge, m6g.8xlarge and 10 other(s) {"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:43.250Z DEBUG controller.events Normal {"commit": "ece47b1", "object": {"kind":"Node","name":"ip-192-168-187-186.us-west-2.compute.internal","uid":"0dbb680e-2c7c-4fe0-aef6-e625e95e1eef","apiVersion":"v1","resourceVersion":"52765122"}, "reason": "ConsolidateLaunchNode", "message": "Launching node for Replace, terminating 1 nodes ip-192-168-81-216.us-west-2.compute.internal/c4.8xlarge and replacing with a node from types c6g.8xlarge, c7g.8xlarge, c6a.8xlarge, c6gd.8xlarge, m6g.8xlarge and 10 other(s)"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:43.250Z DEBUG controller.events Normal {"commit": "ece47b1", "object": {"kind":"Node","name":"ip-192-168-187-186.us-west-2.compute.internal","uid":"0dbb680e-2c7c-4fe0-aef6-e625e95e1eef","apiVersion":"v1","resourceVersion":"52765122"}, "reason": "ConsolidateWaiting", "message": "Waiting on readiness to continue consolidation"}
Force-pushed from 1a8948b to fb9fb5d.
Discussed offline: Cloud Provider API
TTLAfterUnderutilized
Really nice work! I had some comments and questions, but excited to see this get merged!
@@ -62,6 +64,10 @@ func (s *ProvisionerSpec) validateTTLSecondsAfterEmpty() (errs *apis.FieldError)
	if ptr.Int64Value(s.TTLSecondsAfterEmpty) < 0 {
		return errs.Also(apis.ErrInvalidValue("cannot be negative", "ttlSecondsAfterEmpty"))
	}
	// TTLSecondsAfterEmpty and consolidation are mutually exclusive
Thoughts on using a structural schema in the Provisioner for this? https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#specifying-a-structural-schema
As far as I know, controller-gen doesn't support this.
Great suggestion, but I'm cautious of schema-level validation. It feels a bit half-baked, and it's challenging to model enough of your validation with it for it to be worth it. Given that we're already using a webhook, I'd lean towards making our validation logic explicit and avoiding schema-level validation (one thing to think about vs. two).
	} {
		p.removeTopologySpreadScheduleAnyway}

	if p.ToleratePreferNoSchedule {
Is this just to make the logs less noisy?
It does reduce logs, but also eliminates extra work. The other functions don't perform any action if they can't do something (e.g. if there is no preferred pod affinity term, they can't remove it). This one always added a prefer no schedule taint even if it did nothing.
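A minimal sketch of the guarded behavior described above (the struct and method here are illustrative stand-ins, not the PR's actual code):

import v1 "k8s.io/api/core/v1"

// Preferences is an illustrative stand-in for the scheduling preferences type.
type Preferences struct {
	ToleratePreferNoSchedule bool
}

// toleratePreferNoSchedule only mutates the pod when the option is enabled,
// mirroring how the other relaxation functions are no-ops when there is nothing to change.
func (p *Preferences) toleratePreferNoSchedule(pod *v1.Pod) bool {
	if !p.ToleratePreferNoSchedule {
		return false // nothing to do: no extra work, no extra log line
	}
	pod.Spec.Tolerations = append(pod.Spec.Tolerations, v1.Toleration{
		Operator: v1.TolerationOpExists,
		Effect:   v1.TaintEffectPreferNoSchedule,
	})
	return true
}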
@@ -112,6 +112,5 @@ func (e *EvictionQueue) evict(ctx context.Context, nn types.NamespacedName) bool
	logging.FromContext(ctx).Error(err)
	return false
}
logging.FromContext(ctx).Debug("Evicted pod")
I don't think we want to remove this. Have you found that these logs are way too spammy? If we don't have the logs here, we should have a record somewhere else that we're evicting these pods.
You already get notice that the node is being acted on for consolidation or expiration. During large scale-downs with consolidation, my screen was a stream of this log line.
Makes sense. Some thoughts:
- Do we want to maybe investigate a different level of logging? I know we've only been using debug and info, but we're emitting a lot of logging now, so it may be useful.
- Maybe a good in-between is, when we decide to delete a node, to just log the list of pods we're going to evict, so at least we can find this in the logs tied to the node. Another idea is to export these as events somewhere. Mostly just thinking out loud.
Do we want to maybe investigate a different level of logging? I know we've been only using debug and info, but we're making a lot of logging now, so may be useful.
I'm cautious of this, as it adds subjectivity around when to make something trace vs. debug. You'll notice that many libraries in Go only allow error/warn/info/debug. In our codebase, we only use error/info/debug. There are numerous blogs that argue different approaches, but I'm quite happy with our simple use of error/info/debug.
Adding trace adds an additional thing for customers to think about (e.g. when would a customer choose debug vs. trace?). Logging is notoriously subjective and requires judgement on the part of the author. In this case, I think there's enough logging via kube events re: eviction that it seems reasonable to drop this log if it's spammy.
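For reference, a sketch of the kube-event route mentioned above, using client-go's event recorder (the recorder wiring and reason string are assumptions, not the PR's code):

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// recordEviction emits a Kubernetes event tied to the evicted pod instead of a per-pod Debug log.
func recordEviction(recorder record.EventRecorder, pod *v1.Pod) {
	recorder.Eventf(pod, v1.EventTypeNormal, "Evicted",
		"Evicting pod %s/%s for node termination", pod.Namespace, pod.Name)
}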
func (c *Controller) stabilizationWindow(ctx context.Context) time.Duration {
	// no pending pods, and all replica sets/replication controllers are reporting ready so quickly consider another consolidation
	if !c.hasPendingPods(ctx) && c.replicaSetsReady(ctx) &&
		c.replicationControllersReady(ctx) && c.statefulSetsReady(ctx) {
What is the motivation for this check, and for the 5 min. delay if it doesn't pass?
I worry that a single broken deployment can block/delay consolidation, which could be quite expensive if you have a huge scale-up and later a big scale-down leading to many nodes needing to be consolidated.
In our environment many teams deploy to a single cluster, and some teams may have a broken deployment that isn't getting fixed because there is no impact from their perspective. Of course, this could be considered a problem in itself, but having it block/delay consolidation is probably a side effect that is not obvious to those teams/users.
The check for pending pods seems reasonable, as, to my understanding, it doesn't consider pods that are unschedulable. 👍
I imagine this is also a problem if you have rolling updates of deployments. We have many teams deploying all the time during the day, so if a rolling update would lead to a 5 min. delay that wouldn't be ideal.
We are trying to be conservative during consolidation so that we don't disrupt the cluster too much but still drive towards a lower cost. This looks at pending pods, deployments, etc. in an attempt to speed up consolidation if the overall cluster state looks good with respect to the workloads running on it.
Even with the 5 minute delay, you're still looking at consolidating ~12 nodes per hour or ~288 nodes per day. We also consolidate empty nodes in a single batch which should help.
There may be tunables in the future, but in the initial version we are trying to provide a simple interface.
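For reference, a minimal sketch of that stabilization check as described (the helper names come from the diff above; the concrete durations are illustrative, not necessarily the PR's values):

func (c *Controller) stabilizationWindow(ctx context.Context) time.Duration {
	// no pending pods, and all replica sets/replication controllers/stateful sets report ready,
	// so quickly consider another consolidation
	if !c.hasPendingPods(ctx) && c.replicaSetsReady(ctx) &&
		c.replicationControllersReady(ctx) && c.statefulSetsReady(ctx) {
		return 30 * time.Second // illustrative fast path
	}
	// otherwise back off so an in-flight rollout or broken workload isn't churned further
	return 5 * time.Minute
}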
We currently can scale 100s of nodes within minutes and expect them to be gone/consolidated soon after. Waiting hours/days for complete consolidation seems like a lot :)
This hardcoded 5 min. also has the effect that, from my understanding, if nodes become empty within this time they have to wait the full window before being consolidated. With TTLSecondsAfterEmpty you could reduce this to seconds, but that is no longer possible with consolidation enabled.
I can also fully understand starting with a simple interface, just trying to give a bit of feedback from a user perspective as we're eagerly awaiting this consolidation feature for likely replacing the cluster-autoscaler. For context we run 200+ Kubernetes clusters (test+prod) with up to 1000 nodes per cluster (scaling a lot during the day), so reducing the time nodes are running empty has a big compounding cost efficiency effect. :)
I'm really impressed with what karpenter has become by now with this feature, and the potential that it has compared to the cluster-autoscaler. Great work! 👍
Thanks! There's definitely some trade-off in just about everything. I hope to get another snapshot out tomorrow (they're linked on the issue at #1091). If you've got a dev/test cluster available that isn't critical, we would welcome some feedback on how it performs for you.
"context" | ||
|
||
v1 "k8s.io/api/core/v1" | ||
policyv1 "k8s.io/api/policy/v1beta1" |
Just not to lose this: #2123 (comment)
What's the strategy for supporting newer k8s versions where this has been removed?
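(For reference, a sketch of the eventual forward-looking change, assuming the minimum supported Kubernetes version allows it: the PodDisruptionBudget/Eviction types graduated to the policy/v1 group, and policy/v1beta1 is removed in Kubernetes 1.25.)

import (
	policyv1 "k8s.io/api/policy/v1" // GA group; policy/v1beta1 is removed in k8s 1.25+
)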
Hello @tzneal, great job, I am really looking forward to this improvement. Are you able to share an estimate of how much effort is left before releasing this awesome feature?
Implements cluster consolidation via:
- removing nodes if their workloads can run on other nodes
- replacing nodes with cheaper instances

Fixes aws#1091
Nice! LGTM
@@ -70,6 +71,10 @@ func (s *ProvisionerSpec) validateTTLSecondsAfterEmpty() (errs *apis.FieldError)
	if ptr.Int64Value(s.TTLSecondsAfterEmpty) < 0 {
		return errs.Also(apis.ErrInvalidValue("cannot be negative", "ttlSecondsAfterEmpty"))
	}
	// TTLSecondsAfterEmpty and consolidation are mutually exclusive
	if s.Consolidation != nil && aws.BoolValue(s.Consolidation.Enabled) && s.TTLSecondsAfterEmpty != nil {
ptr is probably better to use here so we're not depending on the aws-sdk in Karpenter core. Feel free to change in a follow-up.
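A sketch of that follow-up without the aws-sdk helper (a plain nil check avoids the dependency; the specific FieldError used here is illustrative):

// TTLSecondsAfterEmpty and consolidation are mutually exclusive
if s.Consolidation != nil && s.Consolidation.Enabled != nil && *s.Consolidation.Enabled &&
	s.TTLSecondsAfterEmpty != nil {
	return errs.Also(apis.ErrMultipleOneOf("ttlSecondsAfterEmpty", "consolidation.enabled"))
}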
@tzneal we're experiencing a problem with the new consolidation feature while using the consolidation flag set to true in the provisioner.yaml. This is our provisioner file:
and the error we're getting is:
Is there any design doc for this feature?
@francis-jjk https://github.com/aws/karpenter/blob/main/designs/consolidation.md
@OrIlani055 perhaps you need to update the CRDs on your cluster as stated in the upgrade guide (https://karpenter.sh/v0.15.0/upgrade-guide/#custom-resource-definition-crd-upgrades); also keep in mind that you need to delete
Fixes #1091
Description
Implements cluster consolidation via:
- removing nodes if their workloads can run on other nodes
- replacing nodes with cheaper instances
How was this change tested?
Unit testing, e2e tests, and deployment to EKS.
Does this change impact docs?
Release Note
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.