-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🌱 Increase leader election lease values for KCP #3980
🌱 Increase leader election lease values for KCP #3980
Conversation
To improve a self-managed cluster resilience to temporary errors related to etcd leadership, this change increases the duration for all the lease times. The following are the most important values for leader election, we increase the amount the non-leader candidates wait (1m now) and we increase the renew deadline to 40s instead of 10, which should give enough time for etcd connectivity to be established again. - Lease duration is now 1 minute instead of 15s - Renew deadline has been increased to 40 seconds instead of 10 In addition: - Retry period has been increased to 5 seconds instead of 2 - Avoid overloading the API Server / etcd with lease retry requests Signed-off-by: Vince Prignano <[email protected]>
/milestone v0.4.0 |
/test pull-cluster-api-test-main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these changes going to affect timeouts that we have in place for operations in clusterctl
such as init and move operations that take place shortly after an init?
I suspect that there will be places that test timeouts need to be updated to take into account this change as well (or ensuring that leader election is disabled for tests)
Currently, clusterctl init does not wait for the controllers to be up and running. Same for clusterctl move, which assumes that the target cluster is already initialized |
/test pull-cluster-api-test-main |
@detiber These changes should not impact anything other than giving a bit more time for the controller to recover, they are a bit too aggressive for KCP, especially in the self-managed scenario which is going to be very common for a management cluster |
@vincepri the longer lease duration will affect startup times for controllers when a lock was previously present, but I'm good with dealing with any issues that may arise from that as they come up. |
Do we leverage the release on cancel feature of the leader election code? That could mitigate the long startup times when a new instance of the controller starts up |
@JoelSpeed I'd expect that to be in controller runtime, although we need to double check |
@vincepri It's an option in the |
Ah got it, it seems that option isn't in Controller Runtime v0.5.x (which was the one we're using in v0.3.x, or current stable release) — That said I think we should probably enable it in v1alpha4, let's open an issue to track it |
This PR seems to be for v1alpha4. That said, should we enable |
This commit was made to be backported @wfernandes |
/assign @CecileRobertMichon @detiber |
@CecileRobertMichon do you have some time to review these changes? |
Sorry I thought I had already approved this, I guess I never hit submit... /approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: CecileRobertMichon The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test pull-cluster-api-test-main |
1 similar comment
/test pull-cluster-api-test-main |
Signed-off-by: Vince Prignano [email protected]
What this PR does / why we need it:
To improve a self-managed cluster resilience to temporary errors related
to etcd leadership, this change increases the duration for all the lease
times.
The following are the most important values for leader election, we
increase the amount the non-leader candidates wait (1m now) and we
increase the renew deadline to 40s instead of 10, which should give
enough time for etcd connectivity to be established again.
In addition:
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Related to #3978