Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

base: enable Raft CheckQuorum by default #104042

Merged
merged 1 commit into from
Jun 28, 2023

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented May 29, 2023

This patch enables Raft CheckQuorum by default. In etcd/raft, this also has the effect of fully enabling PreVote, such that followers won't grant prevotes if they've heard from a leader in the past election timeout interval.

This is more robust against partial and asymmetric network partitions. Otherwise, a partitioned node may be able to hold spurious elections and steal leadership away from an established leader. This can cause the leader to become unreachable by the leaseholder, resulting in permanent range unavailability.

We are still able to hold immediate elections, e.g. when unquiescing a range to find a dead leader. If a quorum of followers consider the leader dead and forget it (becoming leaderless followers), they will grant prevotes despite having seen the leader recently (i.e. before quiescing), and can hold an election immediately.

This is compatible with 23.1 in mixed-version clusters:

  • Leaders with mixed CheckQuorum settings is fine: they only apply the step-down logic to themselves, and register follower activity regardless of the followers' settings.

  • Voters with mixed CheckQuorum settings if fine: the leader recency criterion is only applied to their own vote, so either they'll enforce it or not.

  • Campaigning on leader removal is fine-ish: before 23.2 finalization, the first range replica will campaign -- if this replica is 23.2 it will bypass pre-vote and call an immediate election, if it is 23.1 then it will use pre-vote. However, upon receiving the 23.1 pre-vote request, 23.2 nodes will check if the leader is still in the descriptor, and if it isn't they will forget it and grant the pre-vote. A quorum will likely apply the leader removal before receiving pre-vote requests. Otherwise, we will recover after an election timeout.

  • Campaigning after unquiescing is fine: the logic remains unchanged, and 23.2 nodes will forget the leader and grant prevotes if they find the leader dead according to liveness.

  • Campaigning during lease acquisitions is fine: this is needed to steal leadership away from an active leader that can't itself acquire an epoch lease because it's failing liveness heartbeats. If a 23.2 node also finds the leader dead in liveness, it will forget it and grant the prevote.

Resolves #92088.
Touches #49220.
Epic: none.

Release note (bug fix): The Raft PreVote and CheckQuorum mechanisms are now fully enabled. These prevent spurious elections when followers already have an active leader, and cause leaders to step down if they don't hear back from a quorum of followers. This improves reliability under partial and asymmetric network partitions, by avoiding spurious elections and preventing unavailability where a partially partitioned node could steal leadership away from an established leaseholder who would then no longer be able to reach the leader and submit writes.

@erikgrinaker erikgrinaker self-assigned this May 29, 2023
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@erikgrinaker erikgrinaker force-pushed the raft-enable-checkquorum branch 3 times, most recently from 0ab119f to f5e105d Compare June 2, 2023 12:04
@erikgrinaker erikgrinaker requested review from pav-kv, tbg and a team June 2, 2023 12:05
@erikgrinaker erikgrinaker marked this pull request as ready for review June 2, 2023 12:08
@erikgrinaker erikgrinaker requested review from a team as code owners June 2, 2023 12:08
@erikgrinaker
Copy link
Contributor Author

Hm, there could be mixed-version concerns here. I don't think we can rely on version gates, since we have to configure this during Raft initialization.

@erikgrinaker
Copy link
Contributor Author

I think most of the mixed-version aspects here are fine.

Mixing leaders with different CheckQuorum settings if fine, because they'll only apply the step-down logic to themselves, and register follower activity regardless of the followers' settings.

Mixing voters with different CheckQuorum settings is also fine, because voters will only apply the leader recency criterion to their own vote. So if a majority of voters haven't enabled CheckQuorum yet then we'll simply revert to the old behavior of not enforcing leader recency.

The one annoying exception is elections following leader replica removals, see #104189 (comment).

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Jun 5, 2023

Following discussion in #104189 (comment), we'll need to version gate CheckQuorum and dynamically reinitialize the local Raft nodes. See #104322.

@erikgrinaker
Copy link
Contributor Author

we'll need to version gate CheckQuorum and dynamically reinitialize the local Raft nodes

This won't be necessary, see #104189 (comment).

I believe even the follower removal case should be fine in mixed-version clusters. In #104189 (i.e. 23.2), all replicas will campaign when the leader gets removed. If the first remaining replica is a 23.1 node, it will campaign, and can obtain prevotes either from other 23.1 nodes or from 23.2 nodes who themselves transition to pre-candidates. If the first remaining replica is as 23.2 node, it will be able to obtain prevotes both from 23.1 nodes and 23.2 nodes.

A similar argument holds e.g. for explicit campaigning during unquiescence. So I don't think there are any mixed-version concerns here, and we won't need a version gate (which is preferable anyway, so that operators can discover any issues before finalizing the upgrade).

@erikgrinaker erikgrinaker force-pushed the raft-enable-checkquorum branch 2 times, most recently from ed6e24f to ddc968f Compare June 9, 2023 08:02
@erikgrinaker erikgrinaker marked this pull request as draft June 19, 2023 09:35
@erikgrinaker
Copy link
Contributor Author

Marking as draft until we land #105126 and #105132.

@tbg tbg removed request for pav-kv and tbg June 20, 2023 09:04
@erikgrinaker erikgrinaker force-pushed the raft-enable-checkquorum branch 2 times, most recently from 56287e9 to 588a97e Compare June 26, 2023 09:12
This patch enables Raft CheckQuorum by default. In etcd/raft, this also
has the effect of fully enabling PreVote, such that followers won't
grant prevotes if they've heard from a leader in the past election
timeout interval.

This is more robust against partial and asymmetric network partitions.
Otherwise, a partitioned node may be able to hold spurious elections and
steal leadership away from an established leader. This can cause the
leader to become unreachable by the leaseholder, resulting in permanent
range unavailability.

We are still able to hold immediate elections, e.g. when unquiescing a
range to find a dead leader. If a quorum of followers consider the
leader dead and forget it (becoming leaderless followers), they will
grant prevotes despite having seen the leader recently (i.e. before
quiescing), and can hold an election immediately.

This is compatibile with 23.1 in mixed-version clusters:

* Leaders with mixed `CheckQuorum` settings is fine: they only apply
  the step-down logic to themselves, and register follower activity
  regardless of the followers' settings.

* Voters with mixed `CheckQuorum` settings if fine: the leader recency
  criterion is only applied to their own vote, so either they'll
  enforce it or not.

* Campaigning on leader removal is fine-ish: before 23.2 finalization,
  the first range replica will campaign -- if this replica is 23.2 it will
  bypass pre-vote and call an immediate election, if it is 23.1 then it
  will use pre-vote. However, upon receiving the 23.1 pre-vote request,
  23.2 nodes will check if the leader is still in the descriptor, and if
  it isn't they will forget it and grant the pre-vote. A quorum will
  likely apply the leader removal before receiving pre-vote requests.
  Otherwise, we will recover after an election timeout.

* Campaigning after unquiescing is fine: the logic remains unchanged,
  and 23.2 nodes will forget the leader and grant prevotes if they
  find the leader dead according to liveness.

* Campaigning during lease acquisitions is fine: this is needed to
  steal leadership away from an active leader that can't itself acquire
  an epoch lease because it's failing liveness heartbeats. If a 23.2 node
  also finds the leader dead in liveness, it will forget it and grant
  the prevote.

Epic: none
Release note (bug fix): The Raft PreVote and CheckQuorum mechanisms are
now fully enabled. These prevent spurious elections when followers
already have an active leader, and cause leaders to step down if they
don't hear back from a quorum of followers. This improves reliability
under partial and asymmetric network partitions, by avoiding spurious
elections and preventing unavailability where a partially partitioned
node could steal leadership away from an established leaseholder who
would then no longer be able to reach the leader and submit writes.
@erikgrinaker erikgrinaker force-pushed the raft-enable-checkquorum branch from 588a97e to de9b2b2 Compare June 27, 2023 11:25
@erikgrinaker erikgrinaker marked this pull request as ready for review June 27, 2023 11:25
@erikgrinaker erikgrinaker requested a review from tbg June 27, 2023 11:25
@erikgrinaker
Copy link
Contributor Author

Allright, let's flip the switch.

@erikgrinaker
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jun 28, 2023

Build failed:

@erikgrinaker
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jun 28, 2023

Build succeeded:

@craig craig bot merged commit 2689002 into cockroachdb:master Jun 28, 2023
@erikgrinaker erikgrinaker deleted the raft-enable-checkquorum branch November 14, 2023 10:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kvserver: Raft prevote does not prevent election despite active leader
4 participants