cluster: fix leader balancer using low level leadership transfer interface #8941
/ci-repeat 5 debug skip-unit
This makes it easier to see what happened in a test log if we unexpectedly fail to gracefully quiesce the archiver during leadership transfer.
This will call cluster::partition::transfer_leadership, unlike the existing raft API that calls directly into consensus::transfer_leadership. This RPC will be safe for transactions and tiered storage, which rely on extra preparation in cluster::partition before doing the raft transfer.
This gives it flexibility to use raft client or cluster client depending on which leadership transfer API it will use (new or legacy).
...with a fallback to the legacy RPC if the new RPC is not found. This is done with a try+fallback approach rather than a feature flag, so that the change is backportable without having to bump the cluster version on a stable branch.
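The try+fallback approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `rpc_result`, `transfer_fn`, `transfer_outcome` and `transfer_with_fallback` are hypothetical names, not the types or functions used in this PR, and the real code is asynchronous.

```cpp
#include <cassert>
#include <functional>

// Hypothetical outcome of a single RPC attempt (not Redpanda's real error type).
enum class rpc_result { success, method_not_found, failed };

// Hypothetical stand-in for "send a leadership transfer request to the leader".
using transfer_fn = std::function<rpc_result()>;

struct transfer_outcome {
    rpc_result result; // result of whichever RPC was used last
    bool used_legacy;  // true if we fell back to the legacy raft RPC
};

// Try the new partition-level RPC first. Fall back to the legacy raft RPC
// only when the peer reports that it does not know the new method, i.e. it
// is running an older version. Any other failure is returned as-is.
transfer_outcome transfer_with_fallback(
  const transfer_fn& new_rpc, const transfer_fn& legacy_rpc) {
    rpc_result r = new_rpc();
    if (r == rpc_result::method_not_found) {
        return {legacy_rpc(), true};
    }
    return {r, false};
}
```

Because the decision is made per request from the peer's response, no cluster-wide feature flag or version bump is involved, which is what keeps the change backportable.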
Previously we would drop out between writing segments and writing the manifest if we were transferring leadership. What we want during a transfer is for the node to finish its segment uploads, update the archival_metadata_stm so that the new leader does not forget about these uploaded segments, and upload the manifest to S3. That way the written segments become visible promptly, without waiting for the new leader to write the manifest as part of its next segment upload.
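A toy model of the before/after behavior, for illustration only. `upload_log`, `upload_iteration` and the `finish_on_transfer` switch are hypothetical; the step names simply mirror the three stages named in the commit message and are not the archiver's real function names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record of what one upload loop iteration did.
struct upload_log {
    std::vector<std::string> steps;
};

// One iteration of a simplified upload loop.
// finish_on_transfer == false models the old behavior: stop after the
// segment upload when a leadership transfer is in progress.
// finish_on_transfer == true models the new behavior: always run all
// three stages so the uploaded segments are recorded and visible.
void upload_iteration(
  upload_log& log, bool transferring, bool finish_on_transfer) {
    log.steps.push_back("upload_segments");
    if (transferring && !finish_on_transfer) {
        return; // old behavior: segments are in S3 but never recorded
    }
    log.steps.push_back("update_archival_metadata_stm");
    log.steps.push_back("upload_manifest");
}
```

In the old path the segment objects exist in S3 but are referenced by neither the archival_metadata_stm nor the manifest, which is how orphan objects can arise.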
/ci-repeat 5 debug skip-unit
/backport v22.3.x
Failed to run cherry-pick command. I executed the below command:
@jcsp, could you look into backporting this when you have a chance?
When I marked this for backport, I must have forgotten that in v22.3.x we have a different code structure that makes this fix impractical (archivers do not belong to cluster::partition objects, so they can't be cleanly handled during a leadership transfer). I think we're going to have to leave v22.3.x with the bad old behavior that leaks more orphan objects than we would like.
The behavior in #8560 relied on leadership transfers flowing through cluster::partition, but the leader balancer was using an RPC that calls directly into consensus::transfer_leadership. The cluster::partition path was only being taken on maintenance mode leadership drains and explicit admin API leader transfers.
This would also have meant that the leader balancer was skipping the graceful transfer of transaction state.
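The difference between the two paths can be pictured with a simplified model. The structs below are illustrative stand-ins that borrow the names `partition` and `consensus` from the description; the trace labels and the order of the two preparation steps are assumptions, and the real implementations are asynchronous and considerably more involved.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for the raft layer: it only performs the transfer.
struct consensus {
    std::vector<std::string>& trace;
    void transfer_leadership() { trace.push_back("raft_transfer"); }
};

// Simplified stand-in for the partition layer: it prepares tiered storage
// and transaction state, then delegates to raft.
struct partition {
    std::vector<std::string>& trace;
    consensus& raft;
    void transfer_leadership() {
        trace.push_back("prepare_archiver");     // assumed label
        trace.push_back("prepare_transactions"); // assumed label
        raft.transfer_leadership();
    }
};
```

A caller that goes straight to `consensus::transfer_leadership` in this model records only `raft_transfer`, which corresponds to the preparation the leader balancer was skipping.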
In this PR:
Fixes: #8745
Backports Required
UX Changes
None
Release Notes
Bug fixes