🐛 Prevent KCP to create many private keys for each reconcile #8617

Merged

fabriziopandini
Member

What this PR does / why we need it:
When connecting to etcd on the workload clusters, KCP creates a temporary certificate. One step of this operation, creating the private key for the new certificate, is CPU-intensive (see the data in #8602 for more details).

This PR adds a private key to the clusterAccessor in the ClusterCacheTracker. The key is created once for each cluster, so KCP can reuse it in every reconciliation and across all the reconcile methods.
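
To illustrate the idea, here is a minimal sketch of a per-cluster key cache. It is not the code from this PR: `keyCache`, `newKeyCache`, and `getOrCreate` are hypothetical names, and the choice of a 2048-bit RSA key is an assumption made for the example. It only shows the general pattern of generating an expensive key once per cluster and handing the same key back on later calls.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"fmt"
	"sync"
)

// keyCache is a hypothetical stand-in for per-cluster accessor state.
// It is NOT the actual ClusterCacheTracker or clusterAccessor type.
type keyCache struct {
	mu   sync.Mutex
	keys map[string]*rsa.PrivateKey // keyed by cluster name
}

func newKeyCache() *keyCache {
	return &keyCache{keys: make(map[string]*rsa.PrivateKey)}
}

// getOrCreate returns the cached key for the cluster and generates it
// only on the first request. The lock is held during generation, which
// keeps the sketch simple but serializes first-time callers.
func (c *keyCache) getOrCreate(cluster string) (*rsa.PrivateKey, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if key, ok := c.keys[cluster]; ok {
		return key, nil
	}

	// Assumption: 2048-bit RSA. This is the CPU-heavy step we want to
	// run once per cluster instead of once per reconcile.
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, fmt.Errorf("generating private key for cluster %q: %w", cluster, err)
	}
	c.keys[cluster] = key
	return key, nil
}

func main() {
	cache := newKeyCache()

	first, err := cache.getOrCreate("cluster-a")
	if err != nil {
		panic(err)
	}
	second, err := cache.getOrCreate("cluster-a")
	if err != nil {
		panic(err)
	}

	// Both calls return the same pointer, so no second key was generated.
	fmt.Println("same key reused:", first == second)
}
```

In this sketch the key never expires; only certificates derived from it would need to be short-lived.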

Which issue(s) this PR fixes:
Fixes #8602

cc @sbueringer @lentzi90

@k8s-ci-robot added the cncf-cla: yes and size/M labels May 5, 2023
@lentzi90
Contributor

lentzi90 commented May 8, 2023

Thanks for this, @fabriziopandini! I have validated the patch and it is a huge improvement! 🎉

Here is the dashboard view with the patch applied, idling at 10 clusters:
[image: after]
As you can see, the KCP controller is now at around 50 mCPU. Without the patch it was 200-300 mCPU!

The flame graph also confirms that we no longer have the 4 big blocks where it generated the private keys.
[image: after-patch-10-clusters]

Member

@sbueringer sbueringer left a comment

Nice.

I didn't expect it to be that straightforward. I assumed we would have to cache something that actually expires.

Two resolved review threads on controllers/remote/cluster_cache_tracker.go (outdated)

@fabriziopandini force-pushed the cache-etcd-client-key branch from bf35756 to 5a8e9e1 May 8, 2023 15:28
@fabriziopandini
Member Author

/cherry-pick release-1.4

@k8s-infra-cherrypick-robot

@fabriziopandini: once the present PR merges, I will cherry-pick it on top of release-1.4 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fabriziopandini
Member Author

/cherry-pick release-1.3

@k8s-infra-cherrypick-robot

@fabriziopandini: once the present PR merges, I will cherry-pick it on top of release-1.3 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.3

@fabriziopandini
Member Author

/test pull-cluster-api-e2e-main-full

@k8s-ci-robot
Contributor

@fabriziopandini: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-cluster-api-build-main
  • /test pull-cluster-api-e2e-main
  • /test pull-cluster-api-test-main
  • /test pull-cluster-api-test-mink8s-main
  • /test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-apidiff-main
  • /test pull-cluster-api-e2e-full-main
  • /test pull-cluster-api-e2e-informing-ipv6-main
  • /test pull-cluster-api-e2e-informing-main
  • /test pull-cluster-api-e2e-scale-main-experimental
  • /test pull-cluster-api-e2e-workload-upgrade-1-27-latest-main

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-apidiff-main
  • pull-cluster-api-build-main
  • pull-cluster-api-e2e-informing-ipv6-main
  • pull-cluster-api-e2e-informing-main
  • pull-cluster-api-e2e-main
  • pull-cluster-api-test-main
  • pull-cluster-api-test-mink8s-main
  • pull-cluster-api-verify-main

In response to this:

/test pull-cluster-api-e2e-main-full

@fabriziopandini
Member Author

/test pull-cluster-api-e2e-full-main

@sbueringer
Member

/lgtm
/approve

/hold
Feel free to /hold cancel when you want to merge.

@k8s-ci-robot added the do-not-merge/hold label May 8, 2023
@k8s-ci-robot added the lgtm label May 8, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 0e5f440696a1980bc912393a2b68cb8dd1ba15e7

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

@k8s-ci-robot added the approved label May 8, 2023
@fabriziopandini
Member Author

Also, the full E2E tests are green.
/hold cancel

@k8s-ci-robot removed the do-not-merge/hold label May 8, 2023
@k8s-ci-robot merged commit 299024f into kubernetes-sigs:main May 8, 2023
@k8s-ci-robot added this to the v1.5 milestone May 8, 2023
@k8s-infra-cherrypick-robot

@fabriziopandini: new pull request created: #8619

In response to this:

/cherry-pick release-1.4

@k8s-infra-cherrypick-robot

@fabriziopandini: #8617 failed to apply on top of branch "release-1.3":

Applying: cache the etcd client key
Using index info to reconstruct a base tree...
M	controllers/remote/cluster_cache_tracker.go
M	controlplane/kubeadm/internal/cluster.go
M	controlplane/kubeadm/internal/workload_cluster.go
Falling back to patching base and 3-way merge...
Auto-merging controlplane/kubeadm/internal/workload_cluster.go
Auto-merging controlplane/kubeadm/internal/cluster.go
Auto-merging controllers/remote/cluster_cache_tracker.go
CONFLICT (content): Merge conflict in controllers/remote/cluster_cache_tracker.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 cache the etcd client key
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.3

@johannesfrey
Contributor

/area provider/control-plane-kubeadm

@k8s-ci-robot added the area/provider/control-plane-kubeadm label Jun 5, 2023
@sbueringer mentioned this pull request Jun 12, 2023
27 tasks
Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/provider/control-plane-kubeadm: Issues or PRs related to KCP
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

KCP CPU hungry with many clusters
6 participants