storage: clobbering between uninitialized replica and RHS of a split #7600

Closed
tbg opened this issue Jul 2, 2016 · 15 comments
Labels: C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.


tbg commented Jul 2, 2016

As discussed in #6991, but now also directly observed (on master) in

$ make stressrace PKG=./storage TESTS=TestStoreRangeDownReplicate STRESSFLAGS='-p 16 -maxfails 1 -stderr':

E160701 20:01:54.022089 raft/raft.go:925  [group 2] 3 state.commit 0 is out of range [10, 10]
panic: [group 2] 3 state.commit 0 is out of range [10, 10]

goroutine 320 [running]:
panic(0x5611dc0, 0xc820994420)
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/panic.go:481 +0x3ff
github.com/cockroachdb/cockroach/storage.(*raftLogger).Panicf(0xc820994330, 0x5bc7f20, 0x2b, 0xc82099c5c0, 0x4, 0x4)
    /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/storage/raft.go:117 +0x1cb
github.com/coreos/etcd/raft.(*raft).loadState(0xc82095c340, 0x6, 0x1, 0x0, 0x0, 0x0, 0x0)
    /Users/tschottdorf/go/src/github.com/coreos/etcd/raft/raft.go:925 +0x3f6
github.com/coreos/etcd/raft.newRaft(0xc820945b38, 0x30303c820945860)
    /Users/tschottdorf/go/src/github.com/coreos/etcd/raft/raft.go:225 +0xcd9
github.com/coreos/etcd/raft.NewRawNode(0xc820945b38, 0x0, 0x0, 0x0, 0xc8209fa190, 0x0, 0x0)

#7598 should fix this.

tbg self-assigned this Jul 2, 2016
tbg added the C-bug label Jul 2, 2016

tbg commented Jul 2, 2016

Whatever I thought was going on here isn't. I'm still investigating.


tbg commented Jul 2, 2016

Ok, this rabbit hole is deep. Here's what I think is happening:

(tl;dr: #7531 allows both the msg-induced uninitialized version and the split-generated version of a split RHS to mutate things concurrently)

  • the first range splits on a multi-node cluster.
  • one of the nodes is a bit behind, so other nodes talk to it before it has created the raft group through the split trigger
  • consequently, we create an uninitialized replica - it starts with a trivial HardState.
  • the split trigger executes and "seeds" what it believes is going to be the new right-hand side of the split. It's oblivious to the fact that there's already a replica out there operating on that Raft state. HardState and TruncatedState are overwritten, and that batch is committed.
  • our uninitialized replica is now in the danger zone. Its Raft instance has stepped based on a trivial HardState and has, for example, responded to a VoteRequest (giving it the new HardState Term:6, Commit:0, Vote:<someone>).
  • it persists that HardState, clobbering the on-disk one (Term:5, Commit:10, Vote:0).
  • the split trigger now inserts the "new" replica into the store, kicking out the uninitialized one. The new replica inits its Raft group and crashes: the truncated state says our first entry is 11, but the HardState has Commit:0.

Unfortunately I don't see the "obvious" solution. I think we need to refactor the way Store.SplitRange works - it needs to take a batch and, under an appropriate lock, dismantle any existing replica for that range, insert the desired replica, commit the batch, unlock itself, and only then let the new replica loose - and even that isn't strictly enough. I think we've avoided these races for a while because we synthesized more of the Raft state and blocked snapshots from applying to the uninitialized range.
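
For illustration, a rough sketch of that refactored flow might look as follows. This is not the actual Store API - names like splitRangeWithBatch and removeReplicaLocked are made up - it only shows the intended ordering of the dangerous steps.

```go
// Sketch only: a batch-aware SplitRange that dismantles any pre-existing
// (uninitialized) replica for the RHS before the seeded Raft state becomes
// visible. splitRangeWithBatch and removeReplicaLocked are illustrative
// names, not the real Store methods.
func (s *Store) splitRangeWithBatch(batch engine.Batch, newRng *Replica) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	// 1. Dismantle any existing replica for the RHS so it can no longer
	//    persist a HardState of its own.
	if old, ok := s.mu.replicas[newRng.RangeID]; ok {
		if err := s.removeReplicaLocked(old); err != nil {
			return err
		}
	}
	delete(s.mu.uninitReplicas, newRng.RangeID)

	// 2. Insert the desired replica while still holding the lock, so an
	//    incoming Raft message cannot re-create an uninitialized one.
	s.mu.replicas[newRng.RangeID] = newRng

	// 3. Commit the batch carrying the seeded HardState/TruncatedState.
	//    Only after this succeeds is the new replica allowed to process Raft.
	return batch.Commit()
}
```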

I suspected that it wasn't possible for this anomaly to occur in practice until #7531 landed, but even with var storeReplicaRaftReadyConcurrency = 1 the test failure remains. Bisecting now.

tbg changed the title from "storage: applying snapshot without HardState leads to Raft panic" to "storage: clobbering between uninitialized replica and RHS of a split" Jul 2, 2016

tbg commented Jul 2, 2016

cc @bdarnell


tbg commented Jul 2, 2016

Bisect blames #7468 (fixing preemptive snapshots). Given the above explanation (and my debugging efforts) I don't see it having a direct influence, though it could theoretically cause similar issues (#7598); in any case, this repro'ed reliably with my patch set there (in a version which would actually write the HardState). I never saw a preemptive snapshot applied for range 2 (which is always the one that causes the issue).

My current theory is that working preemptive snapshots allows the first range to make it to more nodes quicker, which allows the split race to happen.


tbg commented Jul 2, 2016

The bug still repros with the following diff applied:

diff --git a/storage/store.go b/storage/store.go
index 85ced80..2a9c2dc 100644
--- a/storage/store.go
+++ b/storage/store.go
@@ -2096,7 +2096,8 @@ func (s *Store) handleRaftMessage(req *RaftMessageRequest) error {
        }

        if req.ToReplica.ReplicaID == 0 {
-               if req.Message.Type == raftpb.MsgSnap {
+               if req.RangeID != 1 && req.Message.Type == raftpb.MsgSnap {
+                       log.Warningf("SNAP %d", req.RangeID)
                        // Allow snapshots to be applied to replicas before they are members of
                        // the raft group (i.e. replicas with an ID of 0). This is the only
                        // operation that can be performed on a replica before it is part of the

However, it takes longer (~300-400 iters) and there is no SNAP output (meaning that no preemptive snapshot was ever applied). This is in line with my expectation that the bug wasn't introduced in that PR, just made more likely by recent changes.


tbg commented Jul 2, 2016

The behavior was likely exposed by the relatively recent changes during my Raft state refactor:

diff --git a/storage/replica_command.go b/storage/replica_command.go
index 564978a..51f6937 100644
--- a/storage/replica_command.go
+++ b/storage/replica_command.go
@@ -1553,7 +1553,7 @@ func (r *Replica) LeaderLease(
    args.Lease.Start = effectiveStart

    // Store the lease to disk & in-memory.
-   if err := engine.MVCCPutProto(ctx, batch, ms, keys.RangeLeaderLeaseKey(r.RangeID), hlc.ZeroTimestamp, nil, &args.Lease); err != nil {
+   if err := setLease(batch, ms, r.RangeID, &args.Lease); err != nil {
        return reply, err
    }
    r.mu.state.Lease = &args.Lease
@@ -2250,10 +2250,13 @@ func (r *Replica) splitTrigger(
    }
    log.Trace(ctx, fmt.Sprintf("copied abort cache (%d entries)", seqCount))

-   // Create the new Replica representing the right side of the split.
-   newRng, err := NewReplica(&split.NewDesc, r.store, 0)
+   // Write the initial state for the new Raft group of the right-hand side.
+   // Node that this needs to go into deltaMS (which is the total difference
+   // in bytes reported to the store in the end). We compute the RHS' stats
+   // from it below.
+   deltaMS, err = writeInitialState(batch, deltaMS, split.NewDesc)
    if err != nil {
-       return err
+       return util.Errorf("unable to write initial state: %s", err)
    }

    rightMS := deltaMS
@@ -2261,23 +2264,32 @@ func (r *Replica) splitTrigger(
    rightMS.Add(origStats)
    // Remove stats from the left side of the split.
    rightMS.Subtract(leftMS)
-   if err := setMVCCStats(batch, newRng.RangeID, rightMS); err != nil {
+
+   if err := setMVCCStats(batch, split.NewDesc.RangeID, rightMS); err != nil {
        return util.Errorf("unable to write MVCC stats: %s", err)
    }
    log.Trace(ctx, "computed stats for new range")

-   // Copy the timestamp cache into the new range.
-   r.mu.Lock()
-   newRng.mu.Lock()
-   newRng.mu.state.Stats = rightMS
-   r.mu.tsCache.MergeInto(newRng.mu.tsCache, true /* clear */)
-   newRng.mu.Unlock()
-   r.mu.Unlock()
-   log.Trace(ctx, "copied timestamp cache")
-
    // Note: you must not use the trace inside of this defer since it may
    // run after the trace has already completed.
    batch.Defer(func() {
+       // Create the new Replica representing the right side of the split.
+       // Our error handling options at this point are very limited, but
+       // we need to do this after our batch has committed.
+       newRng, err := NewReplica(&split.NewDesc, r.store, 0)
+       if err != nil {
+           panic(err)
+       }
+
+       // Copy the timestamp cache into the new range.
+       r.mu.Lock()
+       newRng.mu.Lock()
+       newRng.mu.state.Stats = rightMS
+       r.mu.tsCache.MergeInto(newRng.mu.tsCache, true /* clear */)
+       newRng.mu.Unlock()
+       r.mu.Unlock()
+       log.Trace(ctx, "copied timestamp cache")
+
        // Add the new split replica to the store. This step atomically
        // updates the EndKey of the updated replica and also adds the
        // new replica to the store's replica map.

Previously, we were muddling through with NewReplica which started from an uninitialized state (but now we call writeInitialState).

We're going to need/want a less ad-hoc system which ties together a Replica with its RangeID and a corresponding subspace of the Engine.


bdarnell commented Jul 5, 2016

We're going to need/want a less ad-hoc system which ties together a Replica with its RangeID and a corresponding subspace of the Engine.

Segmented storage (#1866) would have eliminated a lot of these corner cases.

I think we need to refactor the way Store.SplitRange works - it needs to take a batch and, under some appropriate lock, dismantle any existing replica for that range, insert the desired range, commit the batch, unlock itself, and only then let the new replica loose - and that's not strictly enough.

SplitRange is called via batch.Defer, so it really shouldn't be touching disk. Any on-disk changes it needs to make should happen earlier in splitTrigger so it can be atomic with the command batch.

Maybe SplitRange should drop and recreate the raft group, just like setReplicaID does.
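
In the spirit of that suggestion, a minimal sketch (the field name internalRaftGroup is assumed from the Replica struct; error handling elided): setting the cached group to nil forces the next Raft operation to rebuild it from the on-disk state the split trigger just wrote.

```go
// Sketch: drop the cached raft group so it is lazily re-created from disk,
// rather than continuing from stale in-memory state. internalRaftGroup is
// an assumed field name.
func (r *Replica) resetRaftGroup() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.mu.internalRaftGroup = nil
}
```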

Writing HardState in writeInitialState may have been a mistake: we should let raft be the source of truth for this variable and only write the HardState when raft tells us to.

petermattis commented

@bdarnell Given the subtleties involved here and @tschottdorf's busyness, can you take on this issue?


tbg commented Jul 5, 2016

Writing HardState in writeInitialState may have been a mistake: we should let raft be the source of truth for this variable and only write the HardState when raft tells us to.

But Raft can't know, right? It's our job here to set a commit index to mark that this range contains data (which we just put there, disregarding what was there before). We also can't simply create the RHS of the split, since the "uninitialized" RHS cannot accept snapshots. So we must effectively absorb whatever already exists. It's this "absorbing" that we don't handle at all (it probably worked before my changes, since the synthesis of Raft state effectively did some of that absorption).
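
To make the "absorbing" concrete, here is roughly what I mean (a sketch only; synthesizeHardState is a made-up name - this is essentially what the updateHardState change in the next comment does):

```go
// Sketch: merge an existing on-disk HardState into the one we synthesize for
// the split's RHS, so that we never regress anything the uninitialized
// replica has already persisted. synthesizeHardState is an illustrative name.
func synthesizeHardState(oldHS raftpb.HardState, appliedIndex, term uint64) raftpb.HardState {
	newHS := raftpb.HardState{Term: term, Commit: appliedIndex}
	if oldHS.Commit > newHS.Commit {
		newHS.Commit = oldHS.Commit
	}
	if oldHS.Term > newHS.Term {
		newHS.Term = oldHS.Term
	}
	// Keep any vote the existing replica may already have cast.
	if newHS.Vote == 0 {
		newHS.Vote = oldHS.Vote
	}
	return newHS
}
```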


tbg commented Jul 5, 2016

FWIW, this diff fixes it (the crucial part is the calls to Destroy(), which ensure that the other replica can't clobber the updated HardState before it gets removed by the call to store.SplitRange, which happens too late).

diff --git a/storage/replica_command.go b/storage/replica_command.go
index 041538d..63adc74 100644
--- a/storage/replica_command.go
+++ b/storage/replica_command.go
@@ -2262,7 +2262,7 @@ func (r *Replica) splitTrigger(
    // from it below.
    deltaMS, err = writeInitialState(batch, deltaMS, split.NewDesc)
    if err != nil {
-       return errors.Errorf("unable to write initial state: %s", err)
+       return errors.Wrap(err, "unable to write initial state")
    }

    rightMS := deltaMS
@@ -2276,6 +2276,18 @@ func (r *Replica) splitTrigger(
    }
    log.Trace(ctx, "computed stats for new range")

+   r.store.mu.Lock()
+   if r := r.store.mu.uninitReplicas[split.NewDesc.RangeID]; r != nil {
+       r.Destroy(split.NewDesc)
+   }
+   if r := r.store.mu.replicas[split.NewDesc.RangeID]; r != nil {
+       r.Destroy(split.NewDesc)
+   }
+
+   delete(r.store.mu.uninitReplicas, split.NewDesc.RangeID)
+   delete(r.store.mu.replicas, split.NewDesc.RangeID)
+   r.store.mu.Unlock()
+
    // Note: you must not use the trace inside of this defer since it may
    // run after the trace has already completed.
    batch.Defer(func() {
diff --git a/storage/replica_state.go b/storage/replica_state.go
index 85f640c..3de3856 100644
--- a/storage/replica_state.go
+++ b/storage/replica_state.go
@@ -27,7 +27,6 @@ import (
    "github.com/cockroachdb/cockroach/util/hlc"
    "github.com/cockroachdb/cockroach/util/log"
    "github.com/cockroachdb/cockroach/util/protoutil"
-   "github.com/coreos/etcd/raft"
    "github.com/coreos/etcd/raft/raftpb"
    "golang.org/x/net/context"
 )
@@ -350,13 +349,13 @@ func updateHardState(eng engine.ReadWriter, s storagebase.ReplicaState) error {
        Commit: s.RaftAppliedIndex,
    }

-   if !raft.IsEmptyHardState(oldHS) {
-       if oldHS.Commit > newHS.Commit {
-           newHS.Commit = oldHS.Commit
-       }
-       if oldHS.Term > newHS.Term {
-           newHS.Term = oldHS.Term
-       }
+   if oldHS.Commit > newHS.Commit {
+       newHS.Commit = oldHS.Commit
+   }
+   if oldHS.Term > newHS.Term {
+       newHS.Term = oldHS.Term
+   }
+   if newHS.Vote == 0 {
        newHS.Vote = oldHS.Vote
    }


bdarnell commented Jul 6, 2016

I'm scared of calling Destroy here - that method has some really subtle locking requirements. More importantly, we can only delete a HardState when the replica ID changes. If the replica ID is stable (which I think it is in this case), we must preserve the HardState that was written before the splitTrigger.


bdarnell commented Jul 6, 2016

The problem here is that preemptive snapshots bypass some of the raft machinery. When we pass MsgSnap into the raftGroup, raft does some things internally (specifically increasing HardState.Commit) and then returns the snapshot to us through raft.Ready and we call applySnapshot. In the preemptive path, we just call applySnapshot.

When a raftGroup exists, we should use it to make sure the snapshot is handled the same way as a non-preemptive one (also relevant for #7619). When no raftGroup exists (and we cannot create one because our replica ID has not yet been assigned), we must set up our on-disk state the same way that Raft would have. This is subtle. I think HardState.Commit is the only piece we're missing, but it's difficult to guarantee that.
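
To sketch what the preemptive path would have to do when no raftGroup exists (loadHardState/setHardState are stand-in names, not the actual functions), the key point is mimicking raft's internal bump of HardState.Commit when it accepts a snapshot:

```go
// Sketch: persist the effect raft would have had when accepting this
// snapshot via raft.Ready, namely raising HardState.Commit to the
// snapshot's index. loadHardState/setHardState are stand-in helpers.
func syncCommitWithSnapshot(eng engine.ReadWriter, rangeID roachpb.RangeID, snap raftpb.Snapshot) error {
	hs, err := loadHardState(eng, rangeID) // stand-in loader
	if err != nil {
		return err
	}
	if snap.Metadata.Index > hs.Commit {
		hs.Commit = snap.Metadata.Index
	}
	return setHardState(eng, rangeID, hs) // stand-in writer
}
```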

Stepping back, we started down this path of using a replica ID of zero because it was easy and seemed to just work. Now that we've determined that it's more complicated than that, maybe we should reconsider preallocating replica IDs. We could split ChangeReplicas into two transactions: one that just increments NextReplicaID and one for the rest. In between the two transactions we would send the preemptive snapshot with its real replica ID so the raftGroup could always be created and nothing in handleRaftMessage needs to change.

cc @tamird


tamird commented Jul 6, 2016

Stepping back, we started down this path of using a replica ID of zero because it was easy and seemed to just work. Now that we've determined that it's more complicated than that, maybe we should reconsider preallocating replica IDs. We could split ChangeReplicas into two transactions: one that just increments NextReplicaID and one for the rest. In between the two transactions we would send the preemptive snapshot with its real replica ID so the raftGroup could always be created and nothing in handleRaftMessage needs to change.

Unfortunately, this approach is also problematic (we tried it originally). What you're proposing is:

  1. nextReplicaID := RangeDescriptor.NextReplicaID
  2. txn 1: RangeDescriptor.NextReplicaID++
  3. repDesc := ReplicaDescriptor{..., ReplicaID: nextReplicaID}
  4. send snapshot with repDesc
  5. txn 2: RangeDescriptor.Replicas = append(RangeDescriptor.Replicas, repDesc)

The problem is a race between the final transaction and the replica GC queue: the preemptive snapshot will cause the recipient store to create a replica with its proper ID, but that replica will not yet be present in the range descriptor. If the replica GC queue stumbles upon that replica before the final transaction arrives, it will GC the replica and create a tombstone, and then it's game over: that replica can never be re-created.
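
To spell out why the tombstone is fatal (illustrative names, not the actual replica GC code): the tombstone records the next allowed replica ID, and anything below it can never be instantiated again.

```go
// Sketch: the tombstone check that makes the GC race unrecoverable. Once the
// replica GC queue has written a tombstone with NextReplicaID above our
// preallocated ID, the replica created from the preemptive snapshot can never
// come back. loadTombstone is a stand-in helper.
func canCreateReplica(eng engine.Reader, rangeID roachpb.RangeID, replicaID roachpb.ReplicaID) (bool, error) {
	tombstone, found, err := loadTombstone(eng, rangeID) // stand-in loader
	if err != nil {
		return false, err
	}
	if found && replicaID < tombstone.NextReplicaID {
		return false, nil // game over for this replica ID
	}
	return true, nil
}
```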

As discussed in person, our options are:

  1. continue to massage so-called "handwritten" raft state until we find the magic combination
  2. allow raft groups with group ID zero to be created and funnel preemptive snapshots through them
  3. add a "pending" flag and timeout replicas created via preemptive snapshots so the replica GC queue doesn't interfere
  4. avoid sending preemptive snapshots altogether. Instead, only generate them during ChangeReplicas and let raft do the work of sending them over the wire. Note that preemptive snapshots are already asynchronous with respect to ChangeReplicas because of the queues in RaftTransport
  5. move discussion of preemptive snapshots upstream to etcd/raft


tbg commented Jul 6, 2016

I'm scared of calling Destroy here - that method has some really subtle locking requirements.

I'm not suggesting we do that; I simply wanted to point out that it does fix "it".

I have some WIP on HS handling which I hope to massage and post prior to our chat today. Updating the HardState doesn't seem so difficult and I do think that it is manageable (I see the complexity in making sure we never have two replicas active on the same keyspace during splits, something which is fixed above by the brute-force destroy before commit) but I agree that we need to be very careful here.

I agree that we should be passing snapshots through Raft whenever possible. We do that for, surprise, actual Raft snapshots. What other situations do we have? Pre-emptive snapshots can't have a RaftGroup as you point out, so no dice there. That only leaves splits, and conceptually we could try to frame a Split as the creation of a new Raft group and then an initialization from a snapshot. But that's of course not how things are set up in that case, and we certainly want to minimize artificial data juggling (making a snapshot, wiping our state, reapplying it all in the same batch) in the split trigger, so I expect us to massage data there as well. Am I missing another situation in which your comment would apply?

tbg mentioned this issue Jul 6, 2016

tbg commented Jul 6, 2016

Going to look at fixing this up for beta tomorrow, as (I think) we discussed today. The goal is a non-invasive change in the spirit of #7600 (comment).

tbg self-assigned this Jul 6, 2016
tbg added a commit to tbg/cockroach that referenced this issue Jul 7, 2016
Before this fix,

```
make stressrace PKG=./storage TESTS=TestStoreRangeDownReplicate \
  STRESSFLAGS='-p 16 -maxfails 1 -stderr'
```

failed in a few hundred iterations (and in <20 with some assertions and
shortcuts added in my WIP which I am going to publish as well).

See cockroachdb#7600.
tbg added a commit to tbg/cockroach that referenced this issue Jul 7, 2016
We discovered that this exacerbates (and makes harder to fix) problems
discovered in cockroachdb#7600.
bdarnell added a commit to bdarnell/cockroach that referenced this issue Jul 7, 2016
tbg added a commit to tbg/cockroach that referenced this issue Jul 8, 2016
This took quite a bit of fiddling, but that version fails both test
and testrace almost instantly (<30iters) when the code added in cockroachdb#7672
is disabled (and even faster with more aggressive Raft tick intervals).
It does not catch the clobbering that could happen if
storeReplicaRaftReadyConcurrency were increased, at least not within 500
iterations.
tbg closed this as completed in #7700 Jul 9, 2016
tbg added a commit that referenced this issue Jul 9, 2016
storage: add test to prevent regression of #7600