cluster: fix leader balancer using low level leadership transfer interface #8941

jcsp · 2023-02-16T21:58:59Z

The behavior in #8560 relied on leadership transfers flowing through cluster::partition, but the leader balancer was using an RPC that calls directly into consensus::transfer_leadership. The cluster::partition path was only being taken on maintenance mode leadership drains and explicit admin API leader transfers.

This would also have meant that the leader balancer was skipping the graceful transfer of transaction state.

In this PR:

Add debug logging to make it more obvious what is happening if the archiver part of leadership transfer doesn't appear to be working properly.
Create a new RPC in controller_service to do leadership transfer via cluster::partition.
Use this new RPC from the leader balancer, with a fallback to the old RPC if the new one is not available. This makes upgrades work without requiring a feature flag, which is important for backporting this change.
Tweak the logic in ntp_archiver to continue with uploading a manifest after uploading segments, even if a leadership transfer is in progress.

Fixes: #8745

Backports Required

UX Changes

None

Release Notes

Bug fixes

A bug is fixed where background leadership balancing could unexpectedly interrupt transactional workloads.

jcsp · 2023-02-16T22:01:28Z

/ci-repeat 5 debug skip-unit

The debug logging makes it easier to see what happened in a test log if we unexpectedly fail to gracefully quiesce the archiver during leadership transfer.

The new RPC calls cluster::partition::transfer_leadership, unlike the existing raft API that calls directly into consensus::transfer_leadership. This makes it safe for transactions and tiered storage, which rely on extra preparation in cluster::partition before doing the raft transfer.
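To illustrate the difference between the two paths, here is a minimal synchronous sketch. It is not the Redpanda implementation: `partition_sketch`, `raft_transfer`, `partition_transfer` and the step strings are hypothetical names, and the real code is asynchronous. The only point is that the partition-level path runs the extra preparation before the same raft transfer.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, synchronous stand-in for one partition replica.
// It records each step so the two transfer paths can be compared.
struct partition_sketch {
    std::vector<std::string> steps;

    // Low-level path: goes straight to the raft transfer, with no
    // preparation of the subsystems layered on top of raft.
    bool raft_transfer(int target_node) {
        steps.push_back("raft transfer to node " + std::to_string(target_node));
        return true;
    }

    // Partition-level path: prepare the subsystems first, then do the
    // same raft transfer. The order of the two preparation steps is
    // arbitrary in this sketch (an assumption, not taken from the PR).
    bool partition_transfer(int target_node) {
        steps.push_back("quiesce archiver");
        steps.push_back("prepare transaction state");
        return raft_transfer(target_node);
    }
};
```

Calling `raft_transfer(2)` on a fresh `partition_sketch` records a single step, while `partition_transfer(2)` records the two preparation steps followed by that same raft step.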

Passing connection_cache into the leader balancer gives it flexibility to use raft client or cluster client depending on which leadership transfer API it will use (new or legacy).

The leader balancer now uses the new transfer RPC, with a fallback to the legacy RPC if the new RPC is not found. This is done with a try+fallback approach rather than a feature flag, so that the change is backportable without having to bump the cluster version on a stable branch.
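The try+fallback approach could be sketched roughly as below. This is an illustration under stated assumptions, not the code in the PR: `rpc_result`, `transfer_outcome`, `transfer_fn` and `transfer_with_fallback` are hypothetical names, and the two RPCs are modelled as plain callables rather than real client calls.

```cpp
#include <cassert>
#include <functional>

// Hypothetical result codes; the real RPC layer has its own error type.
enum class rpc_result { ok, method_not_found, failed };

struct transfer_outcome {
    rpc_result result;
    bool used_legacy; // true if the legacy RPC was the one that ran
};

// Each RPC is modelled as a callable taking the target node id.
using transfer_fn = std::function<rpc_result(int)>;

// Try the new RPC first. Fall back to the legacy RPC only when the peer
// reports that it does not know the new method (e.g. a node that has not
// been upgraded yet). Any other failure is returned as-is.
inline transfer_outcome transfer_with_fallback(
  const transfer_fn& new_rpc, const transfer_fn& legacy_rpc, int target_node) {
    rpc_result r = new_rpc(target_node);
    if (r != rpc_result::method_not_found) {
        return {r, false};
    }
    return {legacy_rpc(target_node), true};
}
```

Because the decision is made per call from the peer's response, nothing in this sketch depends on a cluster-wide version or feature flag, which is the property that makes the change backportable.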

Previously we would drop out between writing segments and writing the manifest if we were transferring leadership. What we want during a transfer is for the node to finish its segment uploads, update the archival_metadata_stm so that the new leader does not forget about these uploaded segments, and upload the manifest to S3. That way the written segments become visible promptly, without having to wait for the new leader to write the manifest on its next segment upload.
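The change in ordering can be sketched as follows. This is a simplified model, not the ntp_archiver code: `transfer_policy` and `upload_round` are hypothetical names, and the old behavior is assumed to stop immediately after the segment uploads.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical switch between the old and new behavior.
enum class transfer_policy { drop_out_on_transfer, finish_round_on_transfer };

// Returns the steps one upload round performs, in order.
inline std::vector<std::string>
upload_round(bool transferring_leadership, transfer_policy policy) {
    std::vector<std::string> steps;
    steps.push_back("upload segments");

    // Old behavior (as modelled here): stop after the segments if a
    // leadership transfer is in progress, leaving them unrecorded.
    if (
      transferring_leadership
      && policy == transfer_policy::drop_out_on_transfer) {
        return steps;
    }

    // New behavior: always record the segments and publish the manifest.
    steps.push_back("update archival_metadata_stm");
    steps.push_back("upload manifest");
    return steps;
}
```

With `finish_round_on_transfer`, the round performs the same three steps whether or not a transfer is in progress. With `drop_out_on_transfer`, a round that overlaps a transfer ends after the segment upload.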

jcsp · 2023-02-17T10:57:46Z

/ci-repeat 5 debug skip-unit

jcsp · 2023-02-17T13:35:38Z

Test failures:

CI Failure (adjacent segment merger prevents compaction) in ShadowIndexingCompactedTopicTest.test_upload #8958
CI Failure can't fetch stable replicas in PartitionMoveInterruption.test_cancelling_partition_move_x_core #8908
Tiered storage upload might not upload manifest if restarted at wrong moment + no further data written #8959

vshtokman · 2023-02-21T20:50:09Z

/backport v22.3.x

vbotbuildovich · 2023-02-21T20:51:06Z

Failed to run cherry-pick command. I executed the below command:

git cherry-pick -x d7a66b8d74485ead58355f05442fadae7552614f fd8cae2ba84295e2b94235102f6abe22817071cb 42a224fdc1f11dad0e5b15cba50e57d9feb314e0 440a1976a324685473e07564de92d435f5eb08b1 c3b4218c87661ccdf9e96aa91d5535621579fbd4 6b4c45ae6fbd1b96d0838c30173c697f719503c4 4b0785deabb21da2f37d0ad5eb8291a0bca5ec62

Workflow run logs.

vshtokman · 2023-02-28T15:16:53Z

@jcsp, could you look into backporting this when you have a chance?

jcsp · 2023-03-08T10:35:11Z

When I marked this for backport, I must have forgotten that in v22.3.x we have a different code structure that makes this fix impractical (archivers do not belong to cluster::partition objects, so can't be cleanly handled during a leadership transfer). I think we're going to have to leave v22.3.x with the bad old behavior that leaks more orphan objects than we would like.

github-actions bot added the area/redpanda label Feb 16, 2023

jcsp force-pushed the debug-archival-leadership-transfer branch from 875220d to 5491c26 Compare February 17, 2023 10:48

jcsp added 7 commits February 17, 2023 10:56

cluster: debug logging around archival leadership transfer

d7a66b8

This makes it easier to see what happened in a test log if we unexpectedly fail to gracefully quiesce the archiver during leadership transfer.

raft: remove unused request_leadership function

fd8cae2

cluster: hook partition_manager into cluster::service

42a224f

cluster: pass connection_cache into leader balancer

c3b4218

This gives it flexibility to use raft client or cluster client depending on which leadership transfer API it will use (new or legacy).

cluster: use new transfer RPC in leader balancer

6b4c45a

...with a fallback to the legacy RPC if the new RPC is not found. This is done with a try+fallback approach rather than a feature flag, so that the change is backportable without having to bump the cluster version on a stable branch.

jcsp force-pushed the debug-archival-leadership-transfer branch from 5491c26 to 4b0785d Compare February 17, 2023 10:57

jcsp changed the title ~~cluster: debug logging around archival leadership transfer~~ cluster: fix leader balancer using low level leadership transfer interface Feb 17, 2023

jcsp requested a review from mmaslankaprv February 17, 2023 11:01

jcsp marked this pull request as ready for review February 17, 2023 11:01

mmaslankaprv approved these changes Feb 17, 2023

jcsp added area/controller, area/cloud-storage (Shadow indexing subsystem) and removed area/redpanda labels Feb 17, 2023

jcsp merged commit bafa0e5 into redpanda-data:dev Feb 17, 2023

jcsp deleted the debug-archival-leadership-transfer branch February 17, 2023 14:43
