Skip to content

Commit

Permalink
kvserver: maybe forget leader on (Pre)Vote requests
Browse files Browse the repository at this point in the history
This patch will forget the Raft leader when receiving a (Pre)Vote
request and finding the leader to be dead (according to liveness) or
removed. This allows a quorum of replicas to campaign despite the
PreVote+CheckQuorum condition if they independently conider the leader
to be dead.

A more targeted alternative was explored, where we forget the leader
only when unquiescing to a dead leader. This would require changing the
case where we steal leadership away from a Raft leader who can't
heartbeat liveness during lease acquisitions to instead use
TransferLeader, but this could cause unavailability due to races where
multiple replicas request leader transfers and some are not up-to-date
on the log. Furthermore, we would have to add logic to handle the mixed
23.1/23.2 state, since 23.1 nodes could otherwise be unable to obtain
necessary prevotes. We instead choose to be lenient for now, and we can
consider tightening the conditions later when we're more confident in
the PreVote+CheckQuorum handling and may no longer need epoch leases.

Epic: none
Release note: None
  • Loading branch information
erikgrinaker committed Jun 26, 2023
1 parent fefe74e commit 9d0c7d8
Show file tree
Hide file tree
Showing 3 changed files with 49 additions and 25 deletions.
8 changes: 6 additions & 2 deletions pkg/kv/kvserver/client_raft_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -1338,8 +1338,12 @@ func TestRequestsOnFollowerWithNonLiveLeaseholder(t *testing.T) {
ReplicationMode: base.ReplicationManual,
ServerArgs: base.TestServerArgs{
Settings: st,
// Reduce the election timeout some to speed up the test.
RaftConfig: base.RaftConfig{RaftElectionTimeoutTicks: 10},
RaftConfig: base.RaftConfig{
// Reduce the election timeout some to speed up the test.
RaftElectionTimeoutTicks: 10,
// This should work with PreVote+CheckQuorum too.
RaftEnableCheckQuorum: true,
},
Knobs: base.TestingKnobs{
NodeLiveness: kvserver.NodeLivenessTestingKnobs{
// This test waits for an epoch-based lease to expire, so we're
Expand Down
53 changes: 37 additions & 16 deletions pkg/kv/kvserver/replica_raft.go
Original file line number Diff line number Diff line change
Expand Up @@ -584,10 +584,6 @@ func (r *Replica) stepRaftGroup(req *kvserverpb.RaftMessageRequest) error {
// up-to-date on the log), then we'll immediately transfer leadership back
// to the leaseholder, i.e. the old leader, and the cycle repeats.
//
// Even though we don't campaign, if we find our leader dead according
// to liveness we'll forget it and become a leaderless follower, allowing
// us to grant (pre)votes. See forgetLeaderLocked().
//
// Note that such partial partitions will typically result in persistent
// mass unquiescence due to the continuous prevotes.
if r.mu.quiescent {
Expand All @@ -598,7 +594,12 @@ func (r *Replica) stepRaftGroup(req *kvserverpb.RaftMessageRequest) error {
r.maybeUnquiesceLocked(wakeLeader, mayCampaign)
}
r.mu.lastUpdateTimes.update(req.FromReplica.ReplicaID, timeutil.Now())
if req.Message.Type == raftpb.MsgSnap {
switch req.Message.Type {
case raftpb.MsgPreVote, raftpb.MsgVote:
// If we receive a (pre)vote request, and we find our leader to be dead or
// removed, forget it so we can grant the (pre)votes.
r.maybeForgetLeaderOnVoteRequestLocked()
case raftpb.MsgSnap:
// Occasionally a snapshot message may arrive under an outdated term,
// which would lead to Raft discarding the snapshot. This should be
// really rare in practice, but it does happen in tests and in particular
Expand Down Expand Up @@ -1097,7 +1098,8 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
// If the raft leader got removed, campaign on the leaseholder. Uses
// forceCampaignLocked() to bypass PreVote+CheckQuorum, since we otherwise
// wouldn't get prevotes from other followers who recently heard from the
// old leader. We know the leader isn't around anymore anyway.
// old leader and haven't applied the conf change. We know the leader
// isn't around anymore anyway.
leaseStatus := r.leaseStatusAtRLocked(ctx, r.store.Clock().NowAsClockTimestamp())
raftStatus := raftGroup.BasicStatus()
if shouldCampaignAfterConfChange(ctx, r.store.ClusterSettings(), r.store.StoreID(),
Expand Down Expand Up @@ -2147,21 +2149,40 @@ func (r *Replica) maybeCampaignOnWakeLocked(ctx context.Context) {
}
}

// maybeForgetLeaderOnWakeLocked is called when the replica wakes from being
// quiescent and should not campaign for leadership (to avoid election ties).
// If it is a follower of a now-dead leader (according to liveness) it will
// forget the leader to allow granting (pre)votes to a campaigner, such that if
// a quorum see the leader dead they can elect a new leader immediately. Not
// relevant when initializing the raft group, since it doesn't have a leader.
func (r *Replica) maybeForgetLeaderOnWakeLocked(ctx context.Context) {
// maybeForgetLeaderOnVoteRequestLocked is called when receiving a (Pre)Vote
// request. If the current leader is not live (according to liveness) or not in
// our range descriptor, we forget it and become a leaderless follower.
//
// Normally, with PreVote+CheckQuorum, we won't grant a (pre)vote if we've heard
// from a leader in the past election timeout interval. However, sometimes we
// want to call an election despite a recent leader. Forgetting the leader
// allows us to grant the (pre)vote, and if a quorum of replicas agree that the
// leader is dead they can win an election. Used specifically when:
//
// - Unquiescing to a dead leader. The first replica to wake will campaign,
// the others won't to avoid ties but should grant the vote. See
// maybeUnquiesceLocked().
//
// - Stealing leadership from a leader who can't heartbeat liveness.
// Otherwise, noone will be able to acquire an epoch lease, and the range is
// unavailable. See shouldCampaignOnLeaseRequestRedirect().
//
// - For backwards compatibility in mixed 23.1/23.2 clusters, where 23.2 first
// enabled CheckQuorum. The above two cases hold with 23.1. Additionally, when
// the leader is removed from the range in 23.1, the first replica in the
// range will campaign using (pre)vote, so 23.2 nodes must grant it.
//
// TODO(erikgrinaker): The above cases are only relevant with epoch leases and
// 23.1 compatibility. Consider removing this when no longer needed.
func (r *Replica) maybeForgetLeaderOnVoteRequestLocked() {
raftStatus := r.mu.internalRaftGroup.BasicStatus()
livenessMap, _ := r.store.livenessMap.Load().(livenesspb.IsLiveMap)
if shouldForgetLeaderOnWake(raftStatus, livenessMap, r.descRLocked()) {
r.forgetLeaderLocked(ctx)
if shouldForgetLeaderOnVoteRequest(raftStatus, livenessMap, r.descRLocked()) {
r.forgetLeaderLocked(r.AnnotateCtx(context.TODO()))
}
}

func shouldForgetLeaderOnWake(
func shouldForgetLeaderOnVoteRequest(
raftStatus raft.BasicStatus,
livenessMap livenesspb.IsLiveMap,
desc *roachpb.RangeDescriptor,
Expand Down
13 changes: 6 additions & 7 deletions pkg/kv/kvserver/replica_raft_quiesce.go
Original file line number Diff line number Diff line change
Expand Up @@ -71,11 +71,12 @@ func (r *Replica) maybeUnquiesce(wakeLeader, mayCampaign bool) bool {
//
// If mayCampaign is true, the replica may campaign if it thinks the leader has
// died in the meanwhile. This will respect PreVote and CheckQuorum, and thus
// won't disrupt a current leader. Otherwise, if the leader is dead it will
// forget about it and become a leaderless follower. Thus, if a quorum of
// replicas independently consider the leader to be dead when unquiescing, they
// can hold an election immediately despite PreVote+CheckQuorum. Should
// typically be true, unless the caller wants to avoid election ties.
// won't disrupt a current leader. Followers that also consider the leader dead
// will forget about it and become a leaderless follower when receiving the
// (pre)votes. Thus, if a quorum of replicas independently consider the leader
// to be dead when unquiescing, they can hold an election immediately despite
// PreVote+CheckQuorum. Should typically be true, unless the caller wants to
// avoid election ties.
func (r *Replica) maybeUnquiesceLocked(wakeLeader, mayCampaign bool) bool {
if !r.canUnquiesceRLocked() {
return false
Expand Down Expand Up @@ -110,8 +111,6 @@ func (r *Replica) maybeUnquiesceLocked(wakeLeader, mayCampaign bool) bool {
// we're wrong about it being dead.
if mayCampaign {
r.maybeCampaignOnWakeLocked(ctx)
} else {
r.maybeForgetLeaderOnWakeLocked(ctx)
}

return true
Expand Down

0 comments on commit 9d0c7d8

Please sign in to comment.