kvserver: improve suspect replica heuristics in replica GC queue #62075

Closed
erikgrinaker opened this issue Mar 16, 2021 · 0 comments · Fixed by #65062
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

Contributor

erikgrinaker commented Mar 16, 2021

The kvserver.replicaGCQueue normally checks each replica against the main replica descriptor every 10 days to see whether it should be GCed. However, some replicas may be considered suspect (e.g. if they are a Raft candidate), in which case they are checked every second. These heuristics are currently fairly limited and probably need to cover additional scenarios. One example is a non-voting replica that has lost its leader, which currently isn't considered suspect.

	var isSuspect bool
	if raftStatus := repl.RaftStatus(); raftStatus == nil {
		// If a replica doesn't have an active raft group, we should check
		// whether or not it is active. If not, we should process the replica
		// because it has probably already been removed from its raft group but
		// doesn't know it. Without this, node decommissioning can stall on such
		// dormant ranges. Make sure NodeLiveness isn't nil because it can be in
		// tests/benchmarks.
		if repl.store.cfg.NodeLiveness != nil {
			if liveness, ok := repl.store.cfg.NodeLiveness.Self(); ok && !liveness.Membership.Active() {
				return true, replicaGCPriorityDefault
			}
		}
	} else if t := replDesc.GetType(); t != roachpb.VOTER_FULL && t != roachpb.NON_VOTER {
		isSuspect = true
	} else {
		switch raftStatus.SoftState.RaftState {
		case raft.StateCandidate, raft.StatePreCandidate:
			isSuspect = true
		case raft.StateLeader:
			// If the replica is the leader, we check whether it has a quorum.
			// Otherwise, it's possible that e.g. Node.ResetQuorum will be used
			// to recover the range elsewhere, and we should relinquish our
			// lease and GC the range.
			if repl.store.cfg.NodeLiveness != nil {
				livenessMap := repl.store.cfg.NodeLiveness.GetIsLiveMap()
				isSuspect = !repl.Desc().Replicas().CanMakeProgress(func(d roachpb.ReplicaDescriptor) bool {
					return livenessMap[d.NodeID].IsLive
				})
			}
		}
	}

Related to #61977.
