Introduced leaderElection options including: --leader-elect-lease-duration, --leader-elect-renew-deadline, --leader-elect-retry-period #4158

Merged


yanfeng1992
Member

@yanfeng1992 yanfeng1992 commented Oct 20, 2023

What type of PR is this?
/kind feature

What this PR does / why we need it:

Adds leader election configuration options to karmada-scheduler.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

`karmada-scheduler`: Introduced leaderElection options including: `--leader-elect-lease-duration`, `--leader-elect-renew-deadline`, `--leader-elect-retry-period`. The default values are unchanged from the previous version.

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 20, 2023
@karmada-bot karmada-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 20, 2023
@yanfeng1992
Member Author

/assign @RainbowMango

@yanfeng1992 yanfeng1992 changed the title scheduler add LeaderElection config Introduced leaderElection options including: --leader-elect-lease-duration, --leader-elect-renew-deadline, --leader-elect-retry-period Oct 20, 2023
@RainbowMango
Member

If I remember correctly, they all have default values. Could you please explain why the defaults do not fit?

@yanfeng1992
Member Author

yanfeng1992 commented Oct 20, 2023

Because in our environment, karmada-scheduler has restarted many times after failing to renew its lease.

The default configuration is too short and too strict for our actual environment.


@RainbowMango
Member

Have you figured out why the instance can't renew the lease? Normally the default values are long enough, so simply raising the timeout might not help.

(I'm not saying I don't like this patch; I'm just trying to figure out the root cause.)

@yanfeng1992
Member Author

yanfeng1992 commented Oct 23, 2023

karmada-apiserver disruption can happen for multiple reasons, including:
1. karmada-apiserver rollout on a non-HA cluster (this does not exist in a production environment)
2. networking disruption on the host running the client
3. networking disruption on the host running the server

We have seen all of these cases, and more, disrupt connections. Many controllers and operators rely on the karmada-apiserver for leader election.

The karmada-apiserver downtime tolerance is floor(renewDeadline/retryPeriod)*retryPeriod-retryPeriod. With the default configuration, the tolerance is floor(10/2)*2-2 = 8s. In actual production, leader election needs to be able to tolerate 60s of interruption. Recommended defaults are LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s.

In addition, let me explain the origin of this PR and why we are so concerned about component restarts. In our large-scale environment, some CRs number in the tens of thousands.
Every time karmada-scheduler restarts, rb and crb of the Duplicated type are rescheduled as well, causing subsequent rb and crb to queue up.
Every time karmada-controller-manager restarts, all objects need to be reconciled. If the informer cache has not finished syncing within 30s, the apiserver of the member cluster is accessed frequently, which triggers client-side rate limiting and makes the execution_controller's overall syncWork very slow.
These are some of the problems I encountered in large-scale scenarios, along with my preliminary analysis of their causes. If anything is wrong or incomplete, I hope you can point it out.

@RainbowMango

Member

@RainbowMango RainbowMango left a comment


/lgtm
/approve

Thanks for the detailed clarification and feedback.

It looks good:

      --leader-elect                                                                                                                                                                               
                Enable leader election, which must be true when running multi instances. (default true)
      --leader-elect-lease-duration duration                                                                                                                                                       
                The duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively
                the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled. (default 15s)
      --leader-elect-renew-deadline duration                                                                                                                                                       
                The interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than or equal to the lease duration. This is only applicable
                if leader election is enabled. (default 10s)
      --leader-elect-resource-name string                                                                                                                                                          
                The name of resource object that is used for locking during leader election. (default "karmada-scheduler")
      --leader-elect-resource-namespace string                                                                                                                                                     
                The namespace of resource object that is used for locking during leader election. (default "karmada-system")
      --leader-elect-retry-period duration                                                                                                                                                         
                The duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled. (default 2s)

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 24, 2023
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango


@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 24, 2023
@karmada-bot karmada-bot merged commit 670d3c3 into karmada-io:master Oct 24, 2023
12 checks passed
3 participants