
feat: cluster consolidation #2123

Merged · 1 commit into aws:main · Aug 10, 2022

Conversation

@tzneal (Contributor) commented Jul 13, 2022

Fixes #1091

Description

Implements cluster consolidation via:

  • removing nodes if their workloads can run on other nodes
  • replacing nodes with cheaper instances

How was this change tested?

Unit tests, e2e tests, and a deployment to EKS.

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

Release Note


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@tzneal tzneal requested a review from a team July 13, 2022 19:11
@tzneal tzneal requested a review from a team as a code owner July 13, 2022 19:11
@tzneal tzneal requested a review from dewjam July 13, 2022 19:11
@netlify bot commented Jul 13, 2022

Deploy Preview for karpenter-docs-prod ready!

🔨 Latest commit: c0e6270
🔍 Latest deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/62f41749cd5fd00009b8818d
😎 Deploy Preview: https://deploy-preview-2123--karpenter-docs-prod.netlify.app

@tzneal (Contributor, Author) commented Jul 14, 2022

From @offzale

I will leave it running and keep an eye on it to gather some feedback. What I have noticed so far is an increase in CPU usage of about +640%, compared to the resources it normally takes.

This should be resolved with the 010e04f snapshot; I updated the install instructions on the issue.

Help: "Number of nodes created in total by consolidation.",
},
)
var consolidationNodesTerminatedCounter = prometheus.NewCounter(
Contributor:

This collides with Jason's PR. Worth syncing with him.

Contributor (Author):

This counts nodes terminated for consolidation purposes; there is another counter for nodes created by consolidation. The idea is that you can track over time what consolidation specifically is doing to your cluster.
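
For illustration, a minimal sketch of the created/terminated counter pair being discussed. The package, metric names, and registration here are assumptions based on the excerpt above; the PR's actual code may differ.

package consolidation

import (
	"github.com/prometheus/client_golang/prometheus"
	crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// Hypothetical metric names; the PR's actual definitions may differ.
var (
	consolidationNodesCreatedCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "karpenter",
		Subsystem: "consolidation",
		Name:      "nodes_created",
		Help:      "Number of nodes created in total by consolidation.",
	})
	consolidationNodesTerminatedCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "karpenter",
		Subsystem: "consolidation",
		Name:      "nodes_terminated",
		Help:      "Number of nodes terminated in total by consolidation.",
	})
)

func init() {
	// register with the controller-runtime metrics registry so both series are scraped
	crmetrics.Registry.MustRegister(consolidationNodesCreatedCounter, consolidationNodesTerminatedCounter)
}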

Contributor:

I wonder if we should be marking the node with a deletion reason (e.g. emptiness, consolidation, expiry) and tracking that with a single GaugeVec.

Contributor (Author):

There is a consolidationActionsPerformedCounter that tracks the consolidation actions performed (shrink/delete/delete empty) by action. If someone wants to rework metrics in a follow-up PR, I'm fine with that, but I want to limit the scope of this one.
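
For illustration, a minimal sketch of what such an action-labelled counter could look like; the metric name, label, and action values are assumptions, not the PR's exact code.

package consolidation

import "github.com/prometheus/client_golang/prometheus"

// A sketch of an action-labelled counter along the lines of consolidationActionsPerformedCounter.
var consolidationActionsPerformedCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
	Namespace: "karpenter",
	Subsystem: "consolidation",
	Name:      "actions_performed",
	Help:      "Number of consolidation actions performed, partitioned by action.",
}, []string{"action"})

func recordConsolidationAction(action string) {
	// action would be one of e.g. "replace", "delete", "delete-empty" (assumed values)
	consolidationActionsPerformedCounter.WithLabelValues(action).Inc()
}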

Contributor:

SGTM. Let's cut a followup issue.

@offzale (Contributor) commented Jul 15, 2022

@tzneal thanks for the quick fix, the new version has reduced the CPU usage to about 50%.


pods, err := c.getNodePods(ctx, n.Node.Name)
if err != nil {
	logging.FromContext(ctx).Errorf("Determining node pods, %s", err)
@ellistarn (Contributor) commented Jul 17, 2022

Can you provide some sample log output from consolidation executing over a few minutes?

@tzneal (Contributor, Author) commented Jul 18, 2022

Sure, this is the result of two scale-downs that ended up deleting two nodes and replacing one. The first scale-down only allowed deleting two nodes, so I waited and scaled down a few more pods to make a replacement more likely:

karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:31.214Z	INFO	controller.consolidation	Consolidating via Delete, terminating 1 nodes ip-192-168-166-214.us-west-2.compute.internal/c6g.4xlarge	{"commit": "63daf4b"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:31.248Z	INFO	controller.termination	Cordoned node	{"commit": "63daf4b", "node": "ip-192-168-166-214.us-west-2.compute.internal"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:31.577Z	INFO	controller.termination	Deleted node	{"commit": "63daf4b", "node": "ip-192-168-166-214.us-west-2.compute.internal"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:41.344Z	INFO	controller.consolidation	Consolidating via Delete, terminating 1 nodes ip-192-168-190-133.us-west-2.compute.internal/t4g.xlarge	{"commit": "63daf4b"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:41.385Z	INFO	controller.termination	Cordoned node	{"commit": "63daf4b", "node": "ip-192-168-190-133.us-west-2.compute.internal"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:41.678Z	INFO	controller.termination	Deleted node	{"commit": "63daf4b", "node": "ip-192-168-190-133.us-west-2.compute.internal"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:45:12.882Z	INFO	controller.consolidation	Consolidating via Replace, terminating 1 nodes ip-192-168-191-127.us-west-2.compute.internal/c6g.4xlarge and replacing with a node from types a1.2xlarge, t4g.2xlarge, c6g.2xlarge, c7g.2xlarge, t3a.2xlarge and 37 other(s)	{"commit": "63daf4b"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:45:15.281Z	INFO	controller.consolidation.cloudprovider	Launched instance: i-01882e52c50acf61b, hostname: ip-192-168-31-116.us-west-2.compute.internal, type: t4g.2xlarge, zone: us-west-2d, capacityType: on-demand	{"commit": "63daf4b", "provisioner": "default"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:45:15.286Z	INFO	controller.consolidation	Created node with 5 pods requesting {"cpu":"5125m","pods":"10"} from types a1.2xlarge, t4g.2xlarge, c6g.2xlarge, c7g.2xlarge, t3a.2xlarge and 37 other(s)	{"commit": "63daf4b", "provisioner": "default"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:46:19.967Z	INFO	controller.termination	Deleted node	{"commit": "63daf4b", "node": "ip-192-168-191-127.us-west-2.compute.internal"}

Contributor:

Is this including debug?

Contributor (Author):

No, I can collect another log with debug.

Contributor (Author):

karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:30.491Z	INFO	controller.consolidation	Consolidating via Delete (empty node), terminating 1 nodes ip-192-168-189-114.us-west-2.compute.internal/c6g.8xlarge	{"commit": "ece47b1"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:30.491Z	DEBUG	controller.events	Normal	{"commit": "ece47b1", "object": {"kind":"Node","name":"ip-192-168-189-114.us-west-2.compute.internal","uid":"1e1bc41d-e1c4-402a-a4b0-22ee79d8c882","apiVersion":"v1","resourceVersion":"52764954"}, "reason": "ConsolidateTerminateNode", "message": "Consolidating node via Delete (empty node), terminating 1 nodes ip-192-168-189-114.us-west-2.compute.internal/c6g.8xlarge"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:30.542Z	INFO	controller.termination	Cordoned node	{"commit": "ece47b1", "node": "ip-192-168-189-114.us-west-2.compute.internal"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:30.726Z	INFO	controller.termination	Deleted node	{"commit": "ece47b1", "node": "ip-192-168-189-114.us-west-2.compute.internal"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:40.815Z	INFO	controller.consolidation	Consolidating via Replace, terminating 1 nodes ip-192-168-81-216.us-west-2.compute.internal/c4.8xlarge and replacing with a node from types c6g.8xlarge, c7g.8xlarge, c6a.8xlarge, c6gd.8xlarge, m6g.8xlarge and 10 other(s)	{"commit": "ece47b1"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:41.036Z	DEBUG	controller.consolidation.cloudprovider	Discovered security groups: [sg-05c364fe5ccfc49b4 sg-05f88204faa8cf2d4]	{"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:41.038Z	DEBUG	controller.consolidation.cloudprovider	Discovered kubernetes version 1.21	{"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:41.087Z	DEBUG	controller.consolidation.cloudprovider	Discovered ami-064ac13bc8e3c2ee1 for query "/aws/service/eks/optimized-ami/1.21/amazon-linux-2-arm64/recommended/image_id"	{"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:41.113Z	DEBUG	controller.consolidation.cloudprovider	Discovered ami-00cf63b12c53803a5 for query "/aws/service/eks/optimized-ami/1.21/amazon-linux-2/recommended/image_id"	{"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:43.191Z	INFO	controller.consolidation.cloudprovider	Launched instance: i-0d136b6c50015d855, hostname: ip-192-168-187-186.us-west-2.compute.internal, type: c6g.8xlarge, zone: us-west-2a, capacityType: on-demand	{"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:43.250Z	INFO	controller.consolidation	Created node with 22 pods requesting {"cpu":"22125m","pods":"27"} from types c6g.8xlarge, c7g.8xlarge, c6a.8xlarge, c6gd.8xlarge, m6g.8xlarge and 10 other(s)	{"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:43.250Z	DEBUG	controller.events	Normal	{"commit": "ece47b1", "object": {"kind":"Node","name":"ip-192-168-187-186.us-west-2.compute.internal","uid":"0dbb680e-2c7c-4fe0-aef6-e625e95e1eef","apiVersion":"v1","resourceVersion":"52765122"}, "reason": "ConsolidateLaunchNode", "message": "Launching node for Replace, terminating 1 nodes ip-192-168-81-216.us-west-2.compute.internal/c4.8xlarge and replacing with a node from types c6g.8xlarge, c7g.8xlarge, c6a.8xlarge, c6gd.8xlarge, m6g.8xlarge and 10 other(s)"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:43.250Z	DEBUG	controller.events	Normal	{"commit": "ece47b1", "object": {"kind":"Node","name":"ip-192-168-187-186.us-west-2.compute.internal","uid":"0dbb680e-2c7c-4fe0-aef6-e625e95e1eef","apiVersion":"v1","resourceVersion":"52765122"}, "reason": "ConsolidateWaiting", "message": "Waiting on readiness to continue consolidation"}

@tzneal force-pushed the consolidation branch 2 times, most recently from 1a8948b to fb9fb5d, on July 18, 2022 13:40
@ellistarn (Contributor) commented:

Discussed offline:

Cloud Provider API

  • We can apply the spot assumption (do not consolidate spot) to all vendors, until it becomes a problem.
  • We can solve the prioritization problem by defaulting families/sizes in our vendor-specific webhook. This improves UX in other ways. Non-default instance types may still be configured at the provisioner level.

TTLAfterUnderutilized

  • Currently we're losing the ability to wait after nodes become empty. The cluster drains very quickly.
  • There are algorithmic complexities to applying this rule to nodes with >0 pods
  • Having a TTL for 0 pods on a node, but no TTL for 1 pod on a node is a very jarring UX
  • We'll take this one as a fast follow, since it's nontrivial work that needs to be explored.

@njtran (Contributor) left a comment:

Really nice work! I had some comments and questions, but excited to see this get merged!

charts/karpenter/templates/clusterrole.yaml (outdated, resolved)
@@ -62,6 +64,10 @@ func (s *ProvisionerSpec) validateTTLSecondsAfterEmpty() (errs *apis.FieldError)
if ptr.Int64Value(s.TTLSecondsAfterEmpty) < 0 {
return errs.Also(apis.ErrInvalidValue("cannot be negative", "ttlSecondsAfterEmpty"))
}
// TTLSecondsAfterEmpty and consolidation are mutually exclusive

Contributor (Author):

As far as I know, controller-gen doesn't support this.

Contributor:

Great suggestion, but I'm cautious of schema-level validation. It feels a bit half-baked, and it's challenging to model enough of your validation with it for it to be worth it. Given that we're already using a webhook, I'd lean towards making our validation logic explicit and avoiding schema-level validation (one thing to think about vs two).

pkg/controllers/consolidation/controller.go (resolved)
pkg/controllers/consolidation/controller.go (outdated, resolved)
pkg/controllers/consolidation/controller.go (resolved)
pkg/controllers/counter/controller.go (resolved)
} {
p.removeTopologySpreadScheduleAnyway}

if p.ToleratePreferNoSchedule {
Contributor:

Is this just to make the logs less noisy?

Contributor (Author):

It does reduce logging, but it also eliminates extra work. The other functions don't perform any action if there is nothing for them to do (e.g. if there is no preferred pod affinity term, they can't remove it), but this one always added a PreferNoSchedule toleration even if it did nothing.
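
For illustration, a minimal sketch of the guard being described; the type, field, and method names are assumptions inferred from the excerpt, not taken from the PR.

package scheduling

import v1 "k8s.io/api/core/v1"

// Preferences is a stand-in for the type the excerpt above belongs to.
type Preferences struct {
	ToleratePreferNoSchedule bool
}

// toleratePreferNoSchedule only mutates the pod when the relaxation is enabled, so
// nothing is changed (and nothing is logged) when there is no work to do.
func (p *Preferences) toleratePreferNoSchedule(pod *v1.Pod) {
	if !p.ToleratePreferNoSchedule {
		return
	}
	pod.Spec.Tolerations = append(pod.Spec.Tolerations, v1.Toleration{
		Operator: v1.TolerationOpExists,
		Effect:   v1.TaintEffectPreferNoSchedule,
	})
}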

@@ -112,6 +112,5 @@ func (e *EvictionQueue) evict(ctx context.Context, nn types.NamespacedName) bool
logging.FromContext(ctx).Error(err)
return false
}
logging.FromContext(ctx).Debug("Evicted pod")
Contributor:

I don't think we want to remove this. Have you found that these logs are way too spammy? If we don't have the log here, we should have a record somewhere else that we're evicting them.

Contributor (Author):

You already get notice that the node is being acted on for consolidation or expiration. During large scale-downs with consolidation, my screen was a stream of this log.

Contributor:

Makes sense. Some thoughts:

  • Do we want to maybe investigate a different level of logging? I know we've only been using debug and info, but we're producing a lot of logging now, so it may be useful.
  • Maybe a good in-between is, when we decide to delete a node, to just log the list of pods we're going to evict, so at least we can find this in the logs tied to the node. Another idea is to export these as events somewhere. Mostly just thinking out loud.
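
For illustration, a hypothetical helper along the lines of that suggestion; none of these names come from the PR.

package termination

import (
	"context"
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
	"knative.dev/pkg/logging"
)

// logPlannedEvictions logs the full list of pods once when a node is chosen for
// deletion, instead of emitting one debug line per eviction.
func logPlannedEvictions(ctx context.Context, nodeName string, pods []*v1.Pod) {
	names := make([]string, 0, len(pods))
	for _, p := range pods {
		names = append(names, fmt.Sprintf("%s/%s", p.Namespace, p.Name))
	}
	logging.FromContext(ctx).Infof("Terminating node %s, evicting pods: %s", nodeName, strings.Join(names, ", "))
}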

@ellistarn (Contributor) commented Jul 19, 2022

Do we want to maybe investigate a different level of logging? I know we've been only using debug and info, but we're making a lot of logging now, so may be useful.

I'm cautious of this, as it adds subjectivity about when to make something trace vs debug. You'll notice that many Go libraries only allow error/warn/info/debug. In our codebase, we only use error/info/debug. There are numerous blogs that argue different approaches, but I'm quite happy with our simple use of error/info/debug.

Adding trace adds an additional thing for customers to think about (e.g. when would a customer choose debug vs trace?). Logging is notoriously subjective and requires judgment on the part of the author. In this case, I think there's enough logging via kube events re: eviction that it seems reasonable to drop this log if it's spammy.

pkg/events/dedupe.go (outdated, resolved)
pkg/controllers/consolidation/controller.go (resolved)
func (c *Controller) stabilizationWindow(ctx context.Context) time.Duration {
	// no pending pods, and all replica sets/replication controllers are reporting ready so quickly consider another consolidation
	if !c.hasPendingPods(ctx) && c.replicaSetsReady(ctx) &&
		c.replicationControllersReady(ctx) && c.statefulSetsReady(ctx) {


What is the motivation for this check and to have a 5 min. delay if it doesn't pass?

I worry that a single broken deployment can block/delay consolidation, which could be quite expensive if you have a huge scale up and later a big scale down leading to many nodes needing to be consolidated.

In our environment we have many teams deploying to a single cluster, and some teams may have a broken deployment that isn't getting fixed because, from their perspective, there is no impact. Of course this could be considered a problem in itself, but the fact that it can block/delay consolidation is probably a side effect that is not obvious to those teams/users.

The check for pending pods seems reasonable as it doesn't consider pods that are unschedulable to my understanding. 👍


I imagine this is also a problem if you have rolling updates of deployments. We have many teams deploying all the time during the day, so a rolling update leading to a 5 min. delay wouldn't be ideal.

Contributor (Author):

We are trying to be conservative during consolidation so that we don't disrupt the cluster too much but still drive towards a lower cost. This looks at pending pods, deployments, etc. in an attempt to speed up consolidation if the overall cluster state looks good with respect to the workloads running on it.

Even with the 5 minute delay, you're still looking at consolidating ~12 nodes per hour, or ~288 nodes per day. We also consolidate empty nodes in a single batch, which should help.

There may be tunables in the future, but in the initial version we are trying to provide a simple interface.
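
For illustration, a hedged completion of the stabilizationWindow excerpt quoted earlier: the 5 minute fallback matches the discussion above, while the short-window value and the helper methods on the receiver are assumptions reused from that excerpt.

func (c *Controller) stabilizationWindow(ctx context.Context) time.Duration {
	// no pending pods, and all replica sets/replication controllers/stateful sets are
	// reporting ready, so quickly consider another consolidation action
	if !c.hasPendingPods(ctx) && c.replicaSetsReady(ctx) &&
		c.replicationControllersReady(ctx) && c.statefulSetsReady(ctx) {
		return 15 * time.Second // assumed value for the "settled cluster" case
	}
	// otherwise wait for the cluster to settle before consolidating again
	return 5 * time.Minute
}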


We currently can scale 100s of nodes within minutes and expect them to be gone/consolidated soon after. Waiting hours/days for complete consolidation seems like a lot :)

This hardcoded 5 min. also has the effect that if nodes become empty within this window, they have to wait out the full window before being consolidated, from my understanding. With TTLSecondsAfterEmpty you could reduce this to seconds, but that is no longer possible with consolidation enabled.

I can also fully understand starting with a simple interface; I'm just trying to give a bit of feedback from a user perspective, as we're eagerly awaiting this consolidation feature as a likely replacement for the cluster-autoscaler. For context, we run 200+ Kubernetes clusters (test+prod) with up to 1000 nodes per cluster (scaling a lot during the day), so reducing the time nodes run empty has a big compounding cost-efficiency effect. :)

I'm really impressed with what Karpenter has become with this feature, and the potential it has compared to the cluster-autoscaler. Great work! 👍

@tzneal (Contributor, Author) commented Jul 19, 2022

Thanks! There's definitely some trade-off in just about everything. I hope to get another snapshot out tomorrow (they're linked on the issue at #1091). If you've got a dev/test cluster available that isn't critical, we would welcome some feedback on how it performs for you.

"context"

v1 "k8s.io/api/core/v1"
policyv1 "k8s.io/api/policy/v1beta1"
Contributor:

Just not to lose this: #2123 (comment)

What's the strategy for supporting newer k8s versions where this has been removed?

@psuj commented Aug 4, 2022

Hello @tzneal, great job, I am really looking forward to this improvement - are you able to share any estimate of how much work is left before this awesome feature is released?

Implements cluster consolidation via:
- removing nodes if their workloads can run on other nodes
- replacing nodes with cheaper instances

Fixes aws#1091
@bwagner5 (Contributor) left a comment:

Nice! LGTM

@@ -70,6 +71,10 @@ func (s *ProvisionerSpec) validateTTLSecondsAfterEmpty() (errs *apis.FieldError)
if ptr.Int64Value(s.TTLSecondsAfterEmpty) < 0 {
return errs.Also(apis.ErrInvalidValue("cannot be negative", "ttlSecondsAfterEmpty"))
}
// TTLSecondsAfterEmpty and consolidation are mutually exclusive
if s.Consolidation != nil && aws.BoolValue(s.Consolidation.Enabled) && s.TTLSecondsAfterEmpty != nil {
Contributor:

ptr is probably better to use here, so we're not depending on the aws-sdk in Karpenter core. Feel free to change in a follow-up.
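
For illustration, a sketch of the same check from the diff above without the aws-sdk dependency, using a plain nil-guard on the *bool; whether a ptr.BoolValue-style helper is available here is an assumption, and the exact FieldError returned in the PR may differ.

// TTLSecondsAfterEmpty and consolidation are mutually exclusive
if s.Consolidation != nil && s.Consolidation.Enabled != nil && *s.Consolidation.Enabled &&
	s.TTLSecondsAfterEmpty != nil {
	return errs.Also(apis.ErrMultipleOneOf("ttlSecondsAfterEmpty", "consolidation.enabled"))
}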

@tzneal tzneal merged commit 473324d into aws:main Aug 10, 2022
@tzneal tzneal deleted the consolidation branch August 10, 2022 22:26
@OrIlani055 commented:

@tzneal we're experiencing a problem with the new consolidation feature while setting the consolidation flag to true in provisioner.yaml.

this is our provisioner file:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: provisioner
spec:
  consolidation:
    enabled: true
  requirements:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: {{ .Values.karpenter.zones }}
    - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["t3a.2xlarge", "t3a.xlarge","c5.2xlarge"]
  limits:
    resources:
      cpu: 1000
  provider:
    launchTemplate: {{ .Values.karpenter.lt }}
    subnetSelector:
      Name: {{ .Values.karpenter.subnets }}
    tags:
      Name: {{ .Values.karpenter.tagname }}
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 86400

and the error we're getting is:
one or more objects failed to apply, reason: admission webhook "defaulting.webhook.provisioners.karpenter.sh" denied the request: mutation failed: cannot decode incoming new object: json: unknown field "consolidation".

@francis-jjk commented:

Is there any design doc for this feature?

@FernandoMiguel (Contributor) commented:

Is there any design doc for this feature?

@francis-jjk https://github.com/aws/karpenter/blob/main/designs/consolidation.md

@psuj commented Aug 18, 2022

@OrIlani055 perhaps you need to update CRDs on your cluster as stated in the upgrade guide - https://karpenter.sh/v0.15.0/upgrade-guide/#custom-resource-definition-crd-upgrades

Also keep in mind that you need to remove ttlSecondsAfterEmpty if consolidation is enabled.


Successfully merging this pull request may close these issues.

Karpenter workload consolidation/defragmentation