🌱 Use cluster-level lock instead of global lock for cluster accessor initialization #6380
Conversation
Welcome @fgutmann!
Hi @fgutmann. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/ok-to-test
Would it be a problem to move this commit into a separate PR, so we can get this merged ASAP without waiting for the discussion on the other changes?
Thank you @fabriziopandini and @sbueringer very much for taking a look at this PR and providing feedback! I will address the suggested changes next week. I'm currently on vacation and don't have access to my regular work environment.
Sure, no rush! Enjoy your vacation!
@fgutmann Thanks for the PR! I looked through the code and the keyed lock is definitely a good improvement on the existing global mutex. Have we also thought about adding some sort of timeout for getting the cluster accessor when populating the caches? If there is no timeout today, we can still run into the same issue regardless of key-level locks, given that there is always a fixed number of workers that can reconcile requests.
@vincepri The dynamic rest mapper used by the discovery client gets a timeout from the rest config, which is 10 seconds per request. The discovery phase thus has a sensible timeout. However, the initial cache sync currently does not have a timeout.
14c7cc2 to 17cbd0d (Compare)
Before this commit, workload cluster client initialization required a global lock to be held. If initialization of a single workload cluster client took time, all other reconcile loops that require a workload cluster connection were blocked until initialization finished. Initialization of a workload cluster client can take a significant amount of time, because it requires initializing the discovery client, which sends multiple requests to the API server. With this change, initialization of a workload cluster client only requires holding a lock for the specific cluster. This means reconciliation for other clusters is not affected by a long-running workload cluster client initialization.
Pushed an updated version with the changes discussed above. It also now contains a timeout of 5 minutes for initially synchronizing the cache.
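For illustration only, here is a minimal sketch of what bounding the initial cache sync with a timeout can look like with controller-runtime; the constant and function names are assumptions, not the exact code from this change:

```go
package tracker

import (
	"context"
	"fmt"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// initialCacheSyncTimeout is an illustrative name; the comment above
// mentions a 5 minute timeout for the initial cache sync.
const initialCacheSyncTimeout = 5 * time.Minute

// waitForCacheSync blocks until the cache has synced or the timeout
// expires, so a slow or unreachable workload cluster cannot hold the
// caller forever.
func waitForCacheSync(ctx context.Context, c cache.Cache) error {
	syncCtx, cancel := context.WithTimeout(ctx, initialCacheSyncTimeout)
	defer cancel()

	if !c.WaitForCacheSync(syncCtx) {
		return fmt.Errorf("timed out after %s waiting for initial cache sync", initialCacheSyncTimeout)
	}
	return nil
}
```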
// keyedMutex is a mutex locking on the key provided to the Lock function.
// Only one caller can hold the lock for a specific key at a time.
type keyedMutex struct {
Was this copied from somewhere else? Can we reuse a library?
No, this was written for this CR. Did some research but didn't find any library that provides this functionality.
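For readers following along, a minimal sketch of a keyed mutex of this kind (simplified and not the exact code from this PR; a production version would also need to clean up unused entries to avoid unbounded growth):

```go
package tracker

import "sync"

// keyedMutex serializes callers per key: only one caller can hold the
// lock for a given key at a time, while callers using different keys
// do not block each other.
type keyedMutex struct {
	locksMtx sync.Mutex
	locks    map[string]*sync.Mutex
}

func newKeyedMutex() *keyedMutex {
	return &keyedMutex{locks: map[string]*sync.Mutex{}}
}

// Lock acquires the lock for the given key, lazily creating it on first use.
func (k *keyedMutex) Lock(key string) {
	k.locksMtx.Lock()
	m, ok := k.locks[key]
	if !ok {
		m = &sync.Mutex{}
		k.locks[key] = m
	}
	k.locksMtx.Unlock()
	m.Lock()
}

// Unlock releases the lock for the given key.
func (k *keyedMutex) Unlock(key string) {
	k.locksMtx.Lock()
	m, ok := k.locks[key]
	k.locksMtx.Unlock()
	if ok {
		m.Unlock()
	}
}
```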
a, err := t.newClusterAccessor(ctx, cluster, indexes...)
if err != nil {
	log.V(4).Info("error creating new cluster accessor")
Remove this? If it's an error, it shouldn't be an info?
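One way to address this comment, assuming the function returns the error to its caller and that errors.Wrap from github.com/pkg/errors is available in the project, would be to drop the info log and add context to the returned error instead:

```go
a, err := t.newClusterAccessor(ctx, cluster, indexes...)
if err != nil {
	// Let the caller log at the appropriate level; just wrap with context.
	return nil, errors.Wrap(err, "failed to create cluster accessor")
}
```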
@fgutmann Do you have time to address the findings?
Co-authored-by: Vince Prignano <[email protected]>
Updated the log messages and replied to the comments by @vincepri.
@fgutmann: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue or PR is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Superseded by #6380
@fgutmann thanks for this PR, it was really valuable to get #6380 merged
What this PR does / why we need it:
Currently, initialization of a cluster accessor requires a global lock to be held. Initializing an accessor includes creating the dynamic rest mapper for the workload cluster and waiting for caches to populate. On a high-latency connection to a workload cluster this can take a significant amount of time, because tens of requests are sent to the API server to initialize the dynamic rest mapper and populate the caches. During this time, all reconciliation loops that require an accessor for any workload cluster are fully blocked, effectively blocking reconciliation of all clusters.
This PR allows multiple accessors for different clusters to be initialized in parallel by splitting the global lock into one lock per cluster. The implemented locking mechanism ensures that (see the sketch after this list):
- only one caller at a time can initialize the accessor for a specific cluster, and
- initialization of an accessor for one cluster does not block reconciliation of other clusters.
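As a rough sketch of the approach described above (the types and helper names such as Index, loadAccessor, and storeAccessor are hypothetical, and the keyed lock is assumed to be keyed by the cluster's namespaced name):

```go
// getClusterAccessor returns the accessor for the given cluster, creating it
// if needed. The lock is scoped to this one cluster, so a slow initialization
// never blocks reconciliation of other clusters.
func (t *ClusterCacheTracker) getClusterAccessor(ctx context.Context, cluster client.ObjectKey, indexes ...Index) (*clusterAccessor, error) {
	// Fast path: accessor already exists.
	if a := t.loadAccessor(cluster); a != nil {
		return a, nil
	}

	// Slow path: take the per-cluster lock, then re-check before creating.
	t.clusterLock.Lock(cluster.String())
	defer t.clusterLock.Unlock(cluster.String())

	if a := t.loadAccessor(cluster); a != nil {
		return a, nil
	}

	a, err := t.newClusterAccessor(ctx, cluster, indexes...)
	if err != nil {
		return nil, errors.Wrap(err, "failed to create cluster accessor")
	}
	t.storeAccessor(cluster, a)
	return a, nil
}
```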