🌱 ClusterCacheTracker: use non-blocking per-cluster locking #7537
Conversation
Force-pushed from 37a60e5 to 59de0a9
/assign @fgutmann
@sbueringer: GitHub didn't allow me to assign the following users: fgutmann. Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 59de0a9 to 63ae233
Thank you for picking this up Stefan!
The approach overall looks good to me. One drawback of it is that there will be some unnecessary backoffs during normal operations. Generally, I'm not too worried about that, as I think it won't affect the overall system performance much.
One aspect I don't like about the current error handling, though, is that these errors are treated as regular reconcile errors and will show up in the error metrics, making it hard to properly alarm on them. Shall we update the callers to go into backoff without returning an error for this specific case?
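A minimal sketch of that idea in Go, assuming a sentinel error for the "lock is held" case (errClusterLocked and reconcileWithRequeue are illustrative names, not the actual cluster-api API): the caller requeues with a delay instead of returning a reconcile error, so this case stays out of the error metrics.

```go
package example

import (
	"context"
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// errClusterLocked is a hypothetical sentinel error standing in for whatever
// the tracker would return when another worker holds this cluster's lock.
var errClusterLocked = errors.New("cluster is locked already")

// reconcileWithRequeue wraps a reconcile call and turns the "lock is held"
// case into a plain requeue instead of a reconcile error.
func reconcileWithRequeue(ctx context.Context, doReconcile func(context.Context) (ctrl.Result, error)) (ctrl.Result, error) {
	res, err := doReconcile(ctx)
	if errors.Is(err, errClusterLocked) {
		// Another worker is currently creating the clusterAccessor for this
		// cluster: requeue with a delay instead of surfacing an error.
		return ctrl.Result{RequeueAfter: 20 * time.Second}, nil
	}
	return res, err
}
```

The delay here is arbitrary; handling this case outside the normal error path trades the queue's exponential backoff for a fixed RequeueAfter.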
Force-pushed from 4036110 to 1318edc
Force-pushed from 1318edc to 86ce450
/lgtm
@fgutmann: changing LGTM is restricted to collaborators. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: vincepri. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Co-authored-by: Florian Gutmann <[email protected]>
Force-pushed from 5fc0986 to 1348144
Thx for the reviews! Squashed!
I'll open a cherry-pick PR for release-1.2 once this PR is merged.
/lgtm
Co-authored-by: Florian Gutmann [email protected]
What this PR does / why we need it:
This PR builds on top of #6380. All credits to @fgutmann for the initial implementation.
As of today we have a global lock, which means that as soon as the clusterAccessor creation doesn't complete quickly, all workers of our reconcilers are blocked. This is most problematic for the Machine reconciler, as Machines usually have the highest cardinality. If the apiserver of a cluster is not reachable or slow, it can take up to 10 seconds until the global lock is released again. During this time, nobody else is able to create or retrieve a client for any workload cluster.
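Conceptually, the current behaviour looks like this minimal sketch (a simplified stand-in, not the actual ClusterCacheTracker code): a single mutex guards clusterAccessor creation for all clusters, so one slow or unreachable apiserver stalls every worker.

```go
package example

import "sync"

// clusterAccessor stands in for the per-cluster client/cache the tracker builds.
type clusterAccessor struct{}

// globalLockTracker mimics today's behaviour: a single mutex guards accessor
// creation for all clusters.
type globalLockTracker struct {
	lock      sync.Mutex
	accessors map[string]*clusterAccessor
}

func newGlobalLockTracker() *globalLockTracker {
	return &globalLockTracker{accessors: map[string]*clusterAccessor{}}
}

func (t *globalLockTracker) getClusterAccessor(cluster string) *clusterAccessor {
	t.lock.Lock() // every worker blocks here, even for healthy, unrelated clusters
	defer t.lock.Unlock()

	a, ok := t.accessors[cluster]
	if !ok {
		// Creating the accessor can take ~10s when the apiserver is unreachable;
		// the lock above stays held for the whole time.
		a = &clusterAccessor{}
		t.accessors[cluster] = a
	}
	return a
}
```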
Let's assume Cluster1 has 250 Machines and is not reachable, and Cluster2 has 10 Machines and is reachable. This leads to the following situation for the Machine reconciler: one worker blocks for up to 10 seconds trying to create the clusterAccessor for Cluster1 while holding the global lock, all other workers block waiting for that lock, and in the meantime none of the 10 Machines (Cluster2) can be reconciled.
The first step is to split the global lock into per-cluster locks. With e.g. 10 concurrent workers this leads to the following: one worker holds Cluster1's lock while trying to create its clusterAccessor, 8 workers are blocked actively waiting for that lock while reconciling Machines of Cluster1, and only a single worker remains for Cluster2, reconciling 1 Machine (Cluster2) at a time while the other 9 Machines (Cluster2) wait in the queue. While this improves the situation, we now have the problem that those 8 workers are actively waiting and their goroutines are blocked. This makes it very hard for the 10 Machines of Cluster2 to ever get reconciled. This of course gets a lot worse the more Clusters and Machines we have, and if multiple Clusters are unreachable.
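A sketch of that first step, under the same simplifying assumptions as above: one mutex per cluster instead of a single global one. Workers of different clusters no longer block each other, but workers of the same unreachable cluster still park in Lock and occupy their goroutines.

```go
package example

import "sync"

// perClusterLock splits the single global lock into one mutex per cluster key.
type perClusterLock struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newPerClusterLock() *perClusterLock {
	return &perClusterLock{locks: map[string]*sync.Mutex{}}
}

// Lock blocks until the lock for the given cluster becomes available.
func (p *perClusterLock) Lock(cluster string) {
	p.mu.Lock()
	l, ok := p.locks[cluster]
	if !ok {
		l = &sync.Mutex{}
		p.locks[cluster] = l
	}
	p.mu.Unlock()
	l.Lock() // workers of an unreachable cluster still wait here for seconds
}

// Unlock releases the lock for the given cluster.
func (p *perClusterLock) Unlock(cluster string) {
	p.mu.Lock()
	l := p.locks[cluster]
	p.mu.Unlock()
	l.Unlock()
}
```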
This is why we now don't let workers actively wait. If a worker can't get the lock for a Cluster, an error is returned and the Machine is moved back into the queue with an exponential backoff. In the example above, the workers that would have waited for Cluster1's lock requeue their Machines immediately instead, so apart from the one worker currently reconciling 1 Machine (Cluster2), the remaining 9 Machines (Cluster2) only have to wait for a free worker, not for Cluster1's lock. This keeps the worker goroutines free and increases the chance that all Machines get reconciled.
Of course there are limits, for example if far more clusters are unreachable than we have workers. But I think that case would require a larger refactoring, and with the new implementation plus exponential backoff we should be fine.
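The non-blocking variant could then look roughly like the following sketch (keyedMutex/TryLock are illustrative names, not necessarily the exact implementation in this PR): TryLock returns immediately, and a caller that doesn't get the lock returns an error such as the sentinel from the earlier sketch, so the Machine is requeued with exponential backoff.

```go
package example

import "sync"

// keyedMutex is a sketch of non-blocking per-cluster locking: TryLock never
// blocks, it only reports whether the lock for a cluster could be taken.
type keyedMutex struct {
	mu     sync.Mutex
	locked map[string]struct{}
}

func newKeyedMutex() *keyedMutex {
	return &keyedMutex{locked: map[string]struct{}{}}
}

// TryLock returns true if the lock for the given cluster was acquired and
// false if another worker already holds it.
func (k *keyedMutex) TryLock(cluster string) bool {
	k.mu.Lock()
	defer k.mu.Unlock()
	if _, held := k.locked[cluster]; held {
		return false
	}
	k.locked[cluster] = struct{}{}
	return true
}

// Unlock releases the lock for the given cluster.
func (k *keyedMutex) Unlock(cluster string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	delete(k.locked, cluster)
}
```

Compared to the blocking variant, a worker that loses the race simply hands the Machine back to the queue, which is what keeps the worker goroutines free.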
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #