
feat: cluster consolidation #2123

Merged · 1 commit into aws:main · Aug 10, 2022

Conversation

@tzneal (Contributor) commented Jul 13, 2022

Fixes #1091

Description

Implements cluster consolidation via:

  • removing nodes if their workloads can run on other nodes
  • replacing nodes with cheaper instances

How was this change tested?

Unit tests, e2e tests, and a deployment to EKS.

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

Release Note


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@tzneal tzneal requested a review from a team July 13, 2022 19:11
@tzneal tzneal requested a review from a team as a code owner July 13, 2022 19:11
@tzneal tzneal requested a review from dewjam July 13, 2022 19:11
@netlify bot commented Jul 13, 2022

Deploy Preview for karpenter-docs-prod ready!

🔨 Latest commit: c0e6270
🔍 Latest deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/62f41749cd5fd00009b8818d
😎 Deploy Preview: https://deploy-preview-2123--karpenter-docs-prod.netlify.app

@tzneal (Contributor, Author) commented Jul 14, 2022

From @offzale

I will leave it running and keep an eye on it to gather some feedback. What I have noticed so far is an increase in CPU usage of about +640%, compared to the resources it normally takes.

This should be resolved with the 010e04f snapshot; I updated the install instructions on the issue.

Help: "Number of nodes created in total by consolidation.",
},
)
var consolidationNodesTerminatedCounter = prometheus.NewCounter(
Contributor:

This collides with Jason's PR. Worth syncing with him.

Contributor (Author):

This counts nodes terminated for consolidation purposes; there is another counter for nodes created by consolidation. The idea is that you can track over time what consolidation specifically is doing to your cluster.
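
For illustration, a minimal sketch of the created/terminated counter pair being discussed. The package, metric names, and registration here are assumptions based on the excerpt above; the PR's actual code may differ.

package consolidation

import (
	"github.com/prometheus/client_golang/prometheus"
	crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// Hypothetical metric names; the PR's actual definitions may differ.
var (
	consolidationNodesCreatedCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "karpenter",
		Subsystem: "consolidation",
		Name:      "nodes_created",
		Help:      "Number of nodes created in total by consolidation.",
	})
	consolidationNodesTerminatedCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "karpenter",
		Subsystem: "consolidation",
		Name:      "nodes_terminated",
		Help:      "Number of nodes terminated in total by consolidation.",
	})
)

func init() {
	// register with the controller-runtime metrics registry so both series are scraped
	crmetrics.Registry.MustRegister(consolidationNodesCreatedCounter, consolidationNodesTerminatedCounter)
}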

Contributor:

I wonder if we should be marking the node with a deletion reason (e.g. emptiness, consolidation, expiry) and tracking that with a single GaugeVec.

Contributor (Author):

There is a consolidationActionsPerformedCounter that tracks the consolidation actions performed (shrink/delete/delete empty) by action. If someone wants to rework metrics in a follow-up PR, I'm fine with that, but I want to limit the scope of this one.
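
For illustration, a minimal sketch of what such an action-labelled counter could look like; the metric name, label, and action values are assumptions, not the PR's exact code.

package consolidation

import "github.com/prometheus/client_golang/prometheus"

// A sketch of an action-labelled counter along the lines of consolidationActionsPerformedCounter.
var consolidationActionsPerformedCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
	Namespace: "karpenter",
	Subsystem: "consolidation",
	Name:      "actions_performed",
	Help:      "Number of consolidation actions performed, partitioned by action.",
}, []string{"action"})

func recordConsolidationAction(action string) {
	// action would be one of e.g. "replace", "delete", "delete-empty" (assumed values)
	consolidationActionsPerformedCounter.WithLabelValues(action).Inc()
}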

Contributor:

SGTM. Let's cut a followup issue.

@offzale (Contributor) commented Jul 15, 2022

@tzneal thanks for the quick fix, the new version has reduced the CPU usage to about 50%.


pods, err := c.getNodePods(ctx, n.Node.Name)
if err != nil {
	logging.FromContext(ctx).Errorf("Determining node pods, %s", err)
@ellistarn (Contributor) commented Jul 17, 2022

Can you provide some sample log output from consolidation executing over a few minutes?

@tzneal (Contributor, Author) commented Jul 18, 2022

Sure, this is the result of two scale-downs that ended up deleting two nodes and replacing one. The first scale-down only allowed deleting two nodes, so I waited and scaled down a few more pods to make a replacement more likely:

karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:31.214Z	INFO	controller.consolidation	Consolidating via Delete, terminating 1 nodes ip-192-168-166-214.us-west-2.compute.internal/c6g.4xlarge	{"commit": "63daf4b"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:31.248Z	INFO	controller.termination	Cordoned node	{"commit": "63daf4b", "node": "ip-192-168-166-214.us-west-2.compute.internal"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:31.577Z	INFO	controller.termination	Deleted node	{"commit": "63daf4b", "node": "ip-192-168-166-214.us-west-2.compute.internal"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:41.344Z	INFO	controller.consolidation	Consolidating via Delete, terminating 1 nodes ip-192-168-190-133.us-west-2.compute.internal/t4g.xlarge	{"commit": "63daf4b"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:41.385Z	INFO	controller.termination	Cordoned node	{"commit": "63daf4b", "node": "ip-192-168-190-133.us-west-2.compute.internal"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:44:41.678Z	INFO	controller.termination	Deleted node	{"commit": "63daf4b", "node": "ip-192-168-190-133.us-west-2.compute.internal"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:45:12.882Z	INFO	controller.consolidation	Consolidating via Replace, terminating 1 nodes ip-192-168-191-127.us-west-2.compute.internal/c6g.4xlarge and replacing with a node from types a1.2xlarge, t4g.2xlarge, c6g.2xlarge, c7g.2xlarge, t3a.2xlarge and 37 other(s)	{"commit": "63daf4b"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:45:15.281Z	INFO	controller.consolidation.cloudprovider	Launched instance: i-01882e52c50acf61b, hostname: ip-192-168-31-116.us-west-2.compute.internal, type: t4g.2xlarge, zone: us-west-2d, capacityType: on-demand	{"commit": "63daf4b", "provisioner": "default"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:45:15.286Z	INFO	controller.consolidation	Created node with 5 pods requesting {"cpu":"5125m","pods":"10"} from types a1.2xlarge, t4g.2xlarge, c6g.2xlarge, c7g.2xlarge, t3a.2xlarge and 37 other(s)	{"commit": "63daf4b", "provisioner": "default"}
karpenter-68d787db56-6x6q8 controller 2022-07-18T02:46:19.967Z	INFO	controller.termination	Deleted node	{"commit": "63daf4b", "node": "ip-192-168-191-127.us-west-2.compute.internal"}

Contributor:

Is this including debug?

Contributor (Author):

No, I can collect another log with debug.

Contributor (Author):

karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:30.491Z	INFO	controller.consolidation	Consolidating via Delete (empty node), terminating 1 nodes ip-192-168-189-114.us-west-2.compute.internal/c6g.8xlarge	{"commit": "ece47b1"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:30.491Z	DEBUG	controller.events	Normal	{"commit": "ece47b1", "object": {"kind":"Node","name":"ip-192-168-189-114.us-west-2.compute.internal","uid":"1e1bc41d-e1c4-402a-a4b0-22ee79d8c882","apiVersion":"v1","resourceVersion":"52764954"}, "reason": "ConsolidateTerminateNode", "message": "Consolidating node via Delete (empty node), terminating 1 nodes ip-192-168-189-114.us-west-2.compute.internal/c6g.8xlarge"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:30.542Z	INFO	controller.termination	Cordoned node	{"commit": "ece47b1", "node": "ip-192-168-189-114.us-west-2.compute.internal"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:30.726Z	INFO	controller.termination	Deleted node	{"commit": "ece47b1", "node": "ip-192-168-189-114.us-west-2.compute.internal"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:40.815Z	INFO	controller.consolidation	Consolidating via Replace, terminating 1 nodes ip-192-168-81-216.us-west-2.compute.internal/c4.8xlarge and replacing with a node from types c6g.8xlarge, c7g.8xlarge, c6a.8xlarge, c6gd.8xlarge, m6g.8xlarge and 10 other(s)	{"commit": "ece47b1"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:41.036Z	DEBUG	controller.consolidation.cloudprovider	Discovered security groups: [sg-05c364fe5ccfc49b4 sg-05f88204faa8cf2d4]	{"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:41.038Z	DEBUG	controller.consolidation.cloudprovider	Discovered kubernetes version 1.21	{"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:41.087Z	DEBUG	controller.consolidation.cloudprovider	Discovered ami-064ac13bc8e3c2ee1 for query "/aws/service/eks/optimized-ami/1.21/amazon-linux-2-arm64/recommended/image_id"	{"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:41.113Z	DEBUG	controller.consolidation.cloudprovider	Discovered ami-00cf63b12c53803a5 for query "/aws/service/eks/optimized-ami/1.21/amazon-linux-2/recommended/image_id"	{"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:43.191Z	INFO	controller.consolidation.cloudprovider	Launched instance: i-0d136b6c50015d855, hostname: ip-192-168-187-186.us-west-2.compute.internal, type: c6g.8xlarge, zone: us-west-2a, capacityType: on-demand	{"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:43.250Z	INFO	controller.consolidation	Created node with 22 pods requesting {"cpu":"22125m","pods":"27"} from types c6g.8xlarge, c7g.8xlarge, c6a.8xlarge, c6gd.8xlarge, m6g.8xlarge and 10 other(s)	{"commit": "ece47b1", "provisioner": "default"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:43.250Z	DEBUG	controller.events	Normal	{"commit": "ece47b1", "object": {"kind":"Node","name":"ip-192-168-187-186.us-west-2.compute.internal","uid":"0dbb680e-2c7c-4fe0-aef6-e625e95e1eef","apiVersion":"v1","resourceVersion":"52765122"}, "reason": "ConsolidateLaunchNode", "message": "Launching node for Replace, terminating 1 nodes ip-192-168-81-216.us-west-2.compute.internal/c4.8xlarge and replacing with a node from types c6g.8xlarge, c7g.8xlarge, c6a.8xlarge, c6gd.8xlarge, m6g.8xlarge and 10 other(s)"}
karpenter-68d7cfc48c-rfdg2 controller 2022-07-19T16:17:43.250Z	DEBUG	controller.events	Normal	{"commit": "ece47b1", "object": {"kind":"Node","name":"ip-192-168-187-186.us-west-2.compute.internal","uid":"0dbb680e-2c7c-4fe0-aef6-e625e95e1eef","apiVersion":"v1","resourceVersion":"52765122"}, "reason": "ConsolidateWaiting", "message": "Waiting on readiness to continue consolidation"}

@tzneal force-pushed the consolidation branch 2 times, most recently from 1a8948b to fb9fb5d, on July 18, 2022 13:40
@ellistarn (Contributor) commented:

Discussed offline:

Cloud Provider API

  • We can apply the spot assumption (do not consolidate spot) to all vendors, until it becomes a problem.
  • We can solve the prioritization problem by defaulting families/sizes in our vendor-specific webhook. This improves UX in other ways. Non-default instance types may still be configured at the provisioner level.

TTLAfterUnderutilized

  • Currently we're losing the ability to wait after nodes become empty. The cluster drains very quickly.
  • There are algorithmic complexities to applying this rule to nodes with >0 pods
  • Having a TTL for 0 pods on a node, but no TTL for 1 pod on a node is a very jarring UX
  • We'll take this one as a fast follow, since it's nontrivial work that needs to be explored.

@njtran (Contributor) left a comment:

Really nice work! I had some comments and questions, but excited to see this get merged!

charts/karpenter/templates/clusterrole.yaml (outdated, resolved)
@@ -62,6 +64,10 @@ func (s *ProvisionerSpec) validateTTLSecondsAfterEmpty() (errs *apis.FieldError)
if ptr.Int64Value(s.TTLSecondsAfterEmpty) < 0 {
return errs.Also(apis.ErrInvalidValue("cannot be negative", "ttlSecondsAfterEmpty"))
}
// TTLSecondsAfterEmpty and consolidation are mutually exclusive

Contributor (Author):

As far as I know, controller-gen doesn't support this.

Contributor:

Great suggestion, but I'm cautious of schema-level validation. It feels a bit half-baked, and it's challenging to model enough of your validation with it for it to be worth it. Given that we're already using a webhook, I'd lean towards making our validation logic explicit and avoiding schema-level validation (one thing to think about vs two).

pkg/controllers/consolidation/controller.go (resolved)
pkg/controllers/consolidation/controller.go (outdated, resolved)
pkg/controllers/consolidation/controller.go (resolved)
pkg/controllers/counter/controller.go (resolved)
} {
p.removeTopologySpreadScheduleAnyway}

if p.ToleratePreferNoSchedule {
Contributor:

Is this just to make the logs less noisy?

Contributor (Author):

It does reduce logging, but it also eliminates extra work. The other functions don't perform any action if there is nothing for them to do (e.g. if there is no preferred pod affinity term, they can't remove it), but this one always added a PreferNoSchedule toleration even if it did nothing.
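
For illustration, a minimal sketch of the guard being described; the type, field, and method names are assumptions inferred from the excerpt, not taken from the PR.

package scheduling

import v1 "k8s.io/api/core/v1"

// Preferences is a stand-in for the type the excerpt above belongs to.
type Preferences struct {
	ToleratePreferNoSchedule bool
}

// toleratePreferNoSchedule only mutates the pod when the relaxation is enabled, so
// nothing is changed (and nothing is logged) when there is no work to do.
func (p *Preferences) toleratePreferNoSchedule(pod *v1.Pod) {
	if !p.ToleratePreferNoSchedule {
		return
	}
	pod.Spec.Tolerations = append(pod.Spec.Tolerations, v1.Toleration{
		Operator: v1.TolerationOpExists,
		Effect:   v1.TaintEffectPreferNoSchedule,
	})
}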

@@ -112,6 +112,5 @@ func (e *EvictionQueue) evict(ctx context.Context, nn types.NamespacedName) bool
logging.FromContext(ctx).Error(err)
return false
}
logging.FromContext(ctx).Debug("Evicted pod")
Contributor:

I don't think we want to remove this. Have you found that these logs are way too spammy? If we don't have the log here, we should have a record somewhere else that we're evicting them.

Contributor (Author):

You already get notice that the node is being acted on for consolidation or expiration. During large scale-downs with consolidation, my screen was a stream of this log.

Contributor:

Makes sense. Some thoughts:

  • Do we want to maybe investigate a different level of logging? I know we've only been using debug and info, but we're producing a lot of logging now, so it may be useful.
  • Maybe a good in-between is, when we decide to delete a node, to just log the list of pods we're going to evict, so at least we can find this in the logs tied to the node. Another idea is to export these as events somewhere. Mostly just thinking out loud.
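
For illustration, a hypothetical helper along the lines of that suggestion; none of these names come from the PR.

package termination

import (
	"context"
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
	"knative.dev/pkg/logging"
)

// logPlannedEvictions logs the full list of pods once when a node is chosen for
// deletion, instead of emitting one debug line per eviction.
func logPlannedEvictions(ctx context.Context, nodeName string, pods []*v1.Pod) {
	names := make([]string, 0, len(pods))
	for _, p := range pods {
		names = append(names, fmt.Sprintf("%s/%s", p.Namespace, p.Name))
	}
	logging.FromContext(ctx).Infof("Terminating node %s, evicting pods: %s", nodeName, strings.Join(names, ", "))
}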

@ellistarn (Contributor) commented Jul 19, 2022

Do we want to maybe investigate a different level of logging? I know we've been only using debug and info, but we're making a lot of logging now, so may be useful.

I'm cautious of this, as it adds subjectivity about when to make something trace vs debug. You'll notice that many Go libraries only allow error/warn/info/debug. In our codebase, we only use error/info/debug. There are numerous blogs that argue different approaches, but I'm quite happy with our simple use of error/info/debug.

Adding trace adds an additional thing for customers to think about (e.g. when would a customer choose debug vs trace?). Logging is notoriously subjective and requires judgment on the part of the author. In this case, I think there's enough logging via kube events re: eviction that it seems reasonable to drop this log if it's spammy.

pkg/events/dedupe.go (outdated, resolved)
pkg/controllers/consolidation/controller.go (resolved)
func (c *Controller) stabilizationWindow(ctx context.Context) time.Duration {
	// no pending pods, and all replica sets/replication controllers are reporting ready so quickly consider another consolidation
	if !c.hasPendingPods(ctx) && c.replicaSetsReady(ctx) &&
		c.replicationControllersReady(ctx) && c.statefulSetsReady(ctx) {


What is the motivation for this check and to have a 5 min. delay if it doesn't pass?

I worry that a single broken deployment can block/delay consolidation, which could be quite expensive if you have a huge scale up and later a big scale down leading to many nodes needing to be consolidated.

In our environment we have many teams deploying to a single cluster, and some teams may have a broken deployment that isn't getting fixed because, from their perspective, there is no impact. Of course this could be considered a problem in itself, but the fact that it can block/delay consolidation is probably a side effect that is not obvious to those teams/users.

The check for pending pods seems reasonable as it doesn't consider pods that are unschedulable to my understanding. 👍


I imagine this is also a problem if you have rolling updates of deployments. We have many teams deploying all the time during the day, so a rolling update leading to a 5 min. delay wouldn't be ideal.

Contributor (Author):

We are trying to be conservative during consolidation so that we don't disrupt the cluster too much but still drive towards a lower cost. This looks at pending pods, deployments, etc. in an attempt to speed up consolidation if the overall cluster state looks good with respect to the workloads running on it.

Even with the 5 minute delay, you're still looking at consolidating ~12 nodes per hour, or ~288 nodes per day. We also consolidate empty nodes in a single batch, which should help.

There may be tunables in the future, but in the initial version we are trying to provide a simple interface.
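
For illustration, a hedged completion of the stabilizationWindow excerpt quoted earlier: the 5 minute fallback matches the discussion above, while the short-window value and the helper methods on the receiver are assumptions reused from that excerpt.

func (c *Controller) stabilizationWindow(ctx context.Context) time.Duration {
	// no pending pods, and all replica sets/replication controllers/stateful sets are
	// reporting ready, so quickly consider another consolidation action
	if !c.hasPendingPods(ctx) && c.replicaSetsReady(ctx) &&
		c.replicationControllersReady(ctx) && c.statefulSetsReady(ctx) {
		return 15 * time.Second // assumed value for the "settled cluster" case
	}
	// otherwise wait for the cluster to settle before consolidating again
	return 5 * time.Minute
}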


We currently can scale 100s of nodes within minutes and expect them to be gone/consolidated soon after. Waiting hours/days for complete consolidation seems like a lot :)

This hardcoded 5 min. also has the effect that if nodes become empty within this window, they have to wait out the full window before being consolidated, from my understanding. With TTLSecondsAfterEmpty you could reduce this to seconds, but that is no longer possible with consolidation enabled.

I can also fully understand starting with a simple interface; I'm just trying to give a bit of feedback from a user perspective, as we're eagerly awaiting this consolidation feature as a likely replacement for the cluster-autoscaler. For context, we run 200+ Kubernetes clusters (test+prod) with up to 1000 nodes per cluster (scaling a lot during the day), so reducing the time nodes run empty has a big compounding cost-efficiency effect. :)

I'm really impressed with what Karpenter has become with this feature, and the potential it has compared to the cluster-autoscaler. Great work! 👍

@tzneal (Contributor, Author) commented Jul 19, 2022

Thanks! There's definitely some trade-off in just about everything. I hope to get another snapshot out tomorrow (they're linked on the issue at #1091). If you've got a dev/test cluster available that isn't critical, we would welcome some feedback on how it performs for you.

"context"

v1 "k8s.io/api/core/v1"
policyv1 "k8s.io/api/policy/v1beta1"
Contributor:

Just not to lose this: #2123 (comment)

What's the strategy for supporting newer k8s versions where this has been removed?

@psuj commented Aug 4, 2022

Hello @tzneal, great job, I am really looking forward to this improvement - are you able to share any estimate of how much work is left before this awesome feature is released?

Implements cluster consolidation via:
- removing nodes if their workloads can run on other nodes
- replacing nodes with cheaper instances

Fixes aws#1091
@bwagner5 (Contributor) left a comment:

Nice! LGTM

@@ -70,6 +71,10 @@ func (s *ProvisionerSpec) validateTTLSecondsAfterEmpty() (errs *apis.FieldError)
if ptr.Int64Value(s.TTLSecondsAfterEmpty) < 0 {
return errs.Also(apis.ErrInvalidValue("cannot be negative", "ttlSecondsAfterEmpty"))
}
// TTLSecondsAfterEmpty and consolidation are mutually exclusive
if s.Consolidation != nil && aws.BoolValue(s.Consolidation.Enabled) && s.TTLSecondsAfterEmpty != nil {
Contributor:

ptr is probably better to use here, so we're not depending on the aws-sdk in Karpenter core. Feel free to change in a follow-up.
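
For illustration, a sketch of the same check from the diff above without the aws-sdk dependency, using a plain nil-guard on the *bool; whether a ptr.BoolValue-style helper is available here is an assumption, and the exact FieldError returned in the PR may differ.

// TTLSecondsAfterEmpty and consolidation are mutually exclusive
if s.Consolidation != nil && s.Consolidation.Enabled != nil && *s.Consolidation.Enabled &&
	s.TTLSecondsAfterEmpty != nil {
	return errs.Also(apis.ErrMultipleOneOf("ttlSecondsAfterEmpty", "consolidation.enabled"))
}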

@tzneal tzneal merged commit 473324d into aws:main Aug 10, 2022
@tzneal tzneal deleted the consolidation branch August 10, 2022 22:26
@OrIlani055 commented:

@tzneal we're experiencing a problem with the new consolidation feature while setting the consolidation flag to true in provisioner.yaml.

this is our provisioner file:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: provisioner
spec:
  consolidation:
    enabled: true
  requirements:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: {{ .Values.karpenter.zones }}
    - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["t3a.2xlarge", "t3a.xlarge","c5.2xlarge"]
  limits:
    resources:
      cpu: 1000
  provider:
    launchTemplate: {{ .Values.karpenter.lt }}
    subnetSelector:
      Name: {{ .Values.karpenter.subnets }}
    tags:
      Name: {{ .Values.karpenter.tagname }}
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 86400

and the error we're getting is:
one or more objects failed to apply, reason: admission webhook "defaulting.webhook.provisioners.karpenter.sh" denied the request: mutation failed: cannot decode incoming new object: json: unknown field "consolidation".

@francis-jjk commented:

Is there any design doc for this feature?

@FernandoMiguel (Contributor) commented:

Is there any design doc for this feature?

@francis-jjk https://github.com/aws/karpenter/blob/main/designs/consolidation.md

@psuj commented Aug 18, 2022

@OrIlani055 perhaps you need to update CRDs on your cluster as stated in the upgrade guide - https://karpenter.sh/v0.15.0/upgrade-guide/#custom-resource-definition-crd-upgrades

Also keep in mind that you need to remove ttlSecondsAfterEmpty if consolidation is enabled.


Successfully merging this pull request may close these issues.

Karpenter workload consolidation/defragmentation