🌱 Increase leader election lease values for KCP #3980

vincepri
Member

@vincepri vincepri commented Dec 2, 2020

Signed-off-by: Vince Prignano [email protected]

What this PR does / why we need it:

To improve a self-managed cluster's resilience to temporary errors related
to etcd leadership, this change increases all of the leader election lease
durations.

The two most important values for leader election are the lease duration
and the renew deadline. We increase the time non-leader candidates wait
before trying to acquire the lease (now 1m), and we raise the renew deadline
from 10s to 40s, which should give etcd connectivity enough time to be
re-established.

  • Lease duration is now 1 minute instead of 15s
  • Renew deadline has been increased to 40 seconds instead of 10

In addition:

  • Retry period has been increased to 5 seconds instead of 2
    • Avoid overloading the API Server / etcd with lease retry requests

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Related to #3978

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 2, 2020
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Dec 2, 2020
@vincepri vincepri changed the title 🌱 Increase leader election lease values 🌱 Increase leader election lease values for KCP Dec 2, 2020
@vincepri
Member Author

vincepri commented Dec 2, 2020

/milestone v0.4.0

@k8s-ci-robot k8s-ci-robot added this to the v0.4.0 milestone Dec 2, 2020
@fabriziopandini
Member

/test pull-cluster-api-test-main

Member

@detiber detiber left a comment

Are these changes going to affect the timeouts that we have in place for clusterctl operations, such as init and any move operations that take place shortly after an init?

I suspect there will be places where test timeouts need to be updated to take this change into account (or where we need to ensure that leader election is disabled for tests).

@fabriziopandini
Member

Currently, clusterctl init does not wait for the controllers to be up and running. The same goes for clusterctl move, which assumes that the target cluster is already initialized.
There are timeouts in the E2E tests, so those might require some tuning (the e2e job on this PR passed without any change).

@vincepri
Member Author

vincepri commented Dec 3, 2020

/test pull-cluster-api-test-main

@vincepri
Member Author

vincepri commented Dec 3, 2020

@detiber These changes should not impact anything other than giving the controller a bit more time to recover. The current values are a bit too aggressive for KCP, especially in the self-managed scenario, which is going to be very common for a management cluster.

@detiber
Member

detiber commented Dec 3, 2020

@vincepri the longer lease duration will affect startup times for controllers when a lock was previously present, but I'm good with dealing with any issues that may arise from that as they come up.
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 3, 2020
@JoelSpeed
Contributor

Do we leverage the release-on-cancel feature of the leader election code? That could mitigate the long startup times when a new instance of the controller starts up.

@vincepri
Member Author

vincepri commented Dec 3, 2020

@JoelSpeed I'd expect that to be in controller runtime, although we need to double check

@JoelSpeed
Contributor

@vincepri It's an option in ctrl.Options that we could leverage. I'm not sure it's a blocker for this PR, but we could plumb it through in the KCP main.go, either as a flag or as an opinionated setting.
https://github.com/kubernetes-sigs/controller-runtime/blob/00e7f851401bb78389db24d6f25fbfbc5f8edbe1/pkg/manager/manager.go#L179-L184

@vincepri
Member Author

vincepri commented Dec 4, 2020

Ah, got it. It seems that option isn't in Controller Runtime v0.5.x (which is the one we're using in v0.3.x, our current stable release). That said, I think we should probably enable it in v1alpha4; let's open an issue to track it.

@wfernandes
Contributor

This PR seems to be for v1alpha4. That said, should we enable LeaderElectionReleaseOnCancel in a separate issue/PR now that we are using controller-runtime v0.7.0-alpha.8?

@vincepri
Member Author

vincepri commented Dec 15, 2020

This commit was made to be backported @wfernandes

@vincepri
Member Author

/assign @CecileRobertMichon @detiber
for approval

@vincepri
Member Author

vincepri commented Jan 4, 2021

@CecileRobertMichon do you have some time to review these changes?

@CecileRobertMichon
Contributor

Sorry I thought I had already approved this, I guess I never hit submit...

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 11, 2021
@fabriziopandini
Member

/test pull-cluster-api-test-main

@k8s-ci-robot k8s-ci-robot merged commit 52794b5 into kubernetes-sigs:master Jan 12, 2021
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.