Skip to content

Commit

Permalink
prevent post-removal activity on range
Browse files Browse the repository at this point in the history
as observed in cockroachdb#2593, the following could happen:

* client proposes command to raft, waits
* command commits, but hasn't reached client yet
* group gets removed, triggering the waiting client
* client decides the request is done, leaves
* committed command executes
* r.pendingCmds still holds the command, so the contained trace
  panics (use-after-finalize).

The issue here is that `Replica`, `Store` and `Raft` are
moving parts, that can't fully move in lockstep with the removal.
(the store provides the Raft storage, so you can't remove the
group under the lock).

Instead, this change first "quiesces" the Replica, after which
it can be dropped from Raft, and, finally, from `Store`.
  • Loading branch information
tbg committed Oct 30, 2015
1 parent a217686 commit 05677a6
Show file tree
Hide file tree
Showing 2 changed files with 27 additions and 0 deletions.
20 changes: 20 additions & 0 deletions storage/replica.go
Original file line number Diff line number Diff line change
Expand Up @@ -1557,3 +1557,23 @@ func (r *Replica) maybeAddToSplitQueue() {
r.store.splitQueue.MaybeAdd(r, r.store.Clock().Now())
}
}

// Quiesce drains the replica and prepares it for removal. First, proposal
// of new commands is disabled. Secondly, all pending commands are aborted.
// In both cases, a RangeNotFoundError results.
func (r *Replica) Quiesce() {
r.Lock()
r.proposeRaftCommandFn = func(_ cmdIDKey, _ roachpb.RaftCommand) <-chan error {
ch := make(chan error, 1)
ch <- multiraft.ErrGroupDeleted
return ch
}
for id, cmd := range r.pendingCmds {
cmd.done <- roachpb.ResponseWithError{Err: multiraft.ErrGroupDeleted}
// Deleting while iterating is ok in Go. We don't just want to `nil`
// this since there could be commands being queued up for execution on
// the replica.
delete(r.pendingCmds, id)
}
r.Unlock()
}
7 changes: 7 additions & 0 deletions storage/store.go
Original file line number Diff line number Diff line change
Expand Up @@ -1030,6 +1030,13 @@ func (s *Store) RemoveReplica(rep *Replica) error {
func (s *Store) removeReplicaImpl(rep *Replica) error {
rangeID := rep.Desc().RangeID

// Silence the Replica. This clears all outstanding commands and makes
// sure that whatever else slips in winds up with a RangeNotFoundError.
// Before this change, clients would be signaled that their command had
// been aborted when in fact it could commit just after that, which led
// to #2593.
rep.Quiesce()

// RemoveGroup needs to access the storage, which in turn needs the
// lock. Some care is needed to avoid deadlocks. We remove the group
// from multiraft outside the scope of s.mu; this is effectively
Expand Down

0 comments on commit 05677a6

Please sign in to comment.