storage: clobbering between uninitialized replica and RHS of a split #7600

Closed
tbg opened this issue Jul 2, 2016 · 15 comments
Labels: C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.


tbg commented Jul 2, 2016

As discussed in #6991, but now also directly observed (on master) in

$ make stressrace PKG=./storage TESTS=TestStoreRangeDownReplicate STRESSFLAGS='-p 16 -maxfails 1 -stderr':

E160701 20:01:54.022089 raft/raft.go:925  [group 2] 3 state.commit 0 is out of range [10, 10]
panic: [group 2] 3 state.commit 0 is out of range [10, 10]

goroutine 320 [running]:
panic(0x5611dc0, 0xc820994420)
    /usr/local/Cellar/go/1.6.2/libexec/src/runtime/panic.go:481 +0x3ff
github.com/cockroachdb/cockroach/storage.(*raftLogger).Panicf(0xc820994330, 0x5bc7f20, 0x2b, 0xc82099c5c0, 0x4, 0x4)
    /Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/storage/raft.go:117 +0x1cb
github.com/coreos/etcd/raft.(*raft).loadState(0xc82095c340, 0x6, 0x1, 0x0, 0x0, 0x0, 0x0)
    /Users/tschottdorf/go/src/github.com/coreos/etcd/raft/raft.go:925 +0x3f6
github.com/coreos/etcd/raft.newRaft(0xc820945b38, 0x30303c820945860)
    /Users/tschottdorf/go/src/github.com/coreos/etcd/raft/raft.go:225 +0xcd9
github.com/coreos/etcd/raft.NewRawNode(0xc820945b38, 0x0, 0x0, 0x0, 0xc8209fa190, 0x0, 0x0)

#7598 should fix this.

tbg self-assigned this Jul 2, 2016
tbg added the C-bug label Jul 2, 2016

tbg commented Jul 2, 2016

Whatever I thought was going on here isn't. I'm still investigating.


tbg commented Jul 2, 2016

Ok, this rabbit hole is deep. Here's what I think is happening:

(tl;dr: #7531 allows both the msg-induced uninitialized version and the split-generated version of a split RHS to mutate things concurrently)

  • the first range splits on a multi-node cluster.
  • one of the nodes is a bit behind, so other nodes talk to it before it has created the raft group through the split trigger
  • consequently, we create an uninitialized replica - it starts with a trivial HardState.
  • the split trigger executes and "seeds" what it believes is going to be the new right-hand side of the split. It's oblivious to the fact that there's already a replica out there operating on that Raft state. HardState and TruncatedState are overwritten, and that batch is committed.
  • our uninitialized replica is now in the danger zone. Its Raft instance has stepped based on a trivial HardState and has, for example, responded to a VoteRequest (giving it the new HardState Term:6, Commit:0, Vote:<someone>).
  • it persists that HardState, clobbering the on-disk one (Term:5, Commit:10, Vote:0).
  • the split trigger now inserts the "new" replica into the store, kicking out the uninitialized one. The new replica inits its Raft group and crashes: the truncated state says our first entry is 11, but the HardState has Commit:0.

Unfortunately I don't see the "obvious" solution. I think we need to refactor the way Store.SplitRange works - it needs to take a batch and, under an appropriate lock, dismantle any existing replica for that range, insert the desired replica, commit the batch, unlock itself, and only then let the new replica loose - and even that isn't strictly enough. I think we've avoided these races for a while because we synthesized more of the Raft state and blocked snapshots from applying to the uninitialized range.
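
For illustration, a rough sketch of that refactored flow might look as follows. This is not the actual Store API - names like splitRangeWithBatch and removeReplicaLocked are made up - it only shows the intended ordering of the dangerous steps.

```go
// Sketch only: a batch-aware SplitRange that dismantles any pre-existing
// (uninitialized) replica for the RHS before the seeded Raft state becomes
// visible. splitRangeWithBatch and removeReplicaLocked are illustrative
// names, not the real Store methods.
func (s *Store) splitRangeWithBatch(batch engine.Batch, newRng *Replica) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	// 1. Dismantle any existing replica for the RHS so it can no longer
	//    persist a HardState of its own.
	if old, ok := s.mu.replicas[newRng.RangeID]; ok {
		if err := s.removeReplicaLocked(old); err != nil {
			return err
		}
	}
	delete(s.mu.uninitReplicas, newRng.RangeID)

	// 2. Insert the desired replica while still holding the lock, so an
	//    incoming Raft message cannot re-create an uninitialized one.
	s.mu.replicas[newRng.RangeID] = newRng

	// 3. Commit the batch carrying the seeded HardState/TruncatedState.
	//    Only after this succeeds is the new replica allowed to process Raft.
	return batch.Commit()
}
```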

I suspected that it wasn't possible for this anomaly to occur in practice until #7531 landed, but even with var storeReplicaRaftReadyConcurrency = 1 the test failure remains. Bisecting now.

tbg changed the title from "storage: applying snapshot without HardState leads to Raft panic" to "storage: clobbering between uninitialized replica and RHS of a split" Jul 2, 2016

tbg commented Jul 2, 2016

cc @bdarnell


tbg commented Jul 2, 2016

Bisect blames #7468 (fixing preemptive snapshots). Given the above explanation (and my debugging efforts) I don't see it having a direct influence, though it could theoretically cause similar issues (#7598); in any case, this repro'ed reliably with my patch set there (in a version which would actually write the HardState). I never saw a preemptive snapshot applied for range 2 (which is always the one that causes the issue).

My current theory is that working preemptive snapshots allows the first range to make it to more nodes quicker, which allows the split race to happen.


tbg commented Jul 2, 2016

The bug still repros with the following diff applied:

diff --git a/storage/store.go b/storage/store.go
index 85ced80..2a9c2dc 100644
--- a/storage/store.go
+++ b/storage/store.go
@@ -2096,7 +2096,8 @@ func (s *Store) handleRaftMessage(req *RaftMessageRequest) error {
        }

        if req.ToReplica.ReplicaID == 0 {
-               if req.Message.Type == raftpb.MsgSnap {
+               if req.RangeID != 1 && req.Message.Type == raftpb.MsgSnap {
+                       log.Warningf("SNAP %d", req.RangeID)
                        // Allow snapshots to be applied to replicas before they are members of
                        // the raft group (i.e. replicas with an ID of 0). This is the only
                        // operation that can be performed on a replica before it is part of the

However, it takes longer (~300-400 iters) and there is no SNAP output (meaning that no preemptive snapshot was ever applied). This is in line with my expectation that the bug wasn't introduced in that PR, just made more likely by recent changes.


tbg commented Jul 2, 2016

The behavior was likely exposed by the relatively recent changes during my Raft state refactor:

diff --git a/storage/replica_command.go b/storage/replica_command.go
index 564978a..51f6937 100644
--- a/storage/replica_command.go
+++ b/storage/replica_command.go
@@ -1553,7 +1553,7 @@ func (r *Replica) LeaderLease(
    args.Lease.Start = effectiveStart

    // Store the lease to disk & in-memory.
-   if err := engine.MVCCPutProto(ctx, batch, ms, keys.RangeLeaderLeaseKey(r.RangeID), hlc.ZeroTimestamp, nil, &args.Lease); err != nil {
+   if err := setLease(batch, ms, r.RangeID, &args.Lease); err != nil {
        return reply, err
    }
    r.mu.state.Lease = &args.Lease
@@ -2250,10 +2250,13 @@ func (r *Replica) splitTrigger(
    }
    log.Trace(ctx, fmt.Sprintf("copied abort cache (%d entries)", seqCount))

-   // Create the new Replica representing the right side of the split.
-   newRng, err := NewReplica(&split.NewDesc, r.store, 0)
+   // Write the initial state for the new Raft group of the right-hand side.
+   // Node that this needs to go into deltaMS (which is the total difference
+   // in bytes reported to the store in the end). We compute the RHS' stats
+   // from it below.
+   deltaMS, err = writeInitialState(batch, deltaMS, split.NewDesc)
    if err != nil {
-       return err
+       return util.Errorf("unable to write initial state: %s", err)
    }

    rightMS := deltaMS
@@ -2261,23 +2264,32 @@ func (r *Replica) splitTrigger(
    rightMS.Add(origStats)
    // Remove stats from the left side of the split.
    rightMS.Subtract(leftMS)
-   if err := setMVCCStats(batch, newRng.RangeID, rightMS); err != nil {
+
+   if err := setMVCCStats(batch, split.NewDesc.RangeID, rightMS); err != nil {
        return util.Errorf("unable to write MVCC stats: %s", err)
    }
    log.Trace(ctx, "computed stats for new range")

-   // Copy the timestamp cache into the new range.
-   r.mu.Lock()
-   newRng.mu.Lock()
-   newRng.mu.state.Stats = rightMS
-   r.mu.tsCache.MergeInto(newRng.mu.tsCache, true /* clear */)
-   newRng.mu.Unlock()
-   r.mu.Unlock()
-   log.Trace(ctx, "copied timestamp cache")
-
    // Note: you must not use the trace inside of this defer since it may
    // run after the trace has already completed.
    batch.Defer(func() {
+       // Create the new Replica representing the right side of the split.
+       // Our error handling options at this point are very limited, but
+       // we need to do this after our batch has committed.
+       newRng, err := NewReplica(&split.NewDesc, r.store, 0)
+       if err != nil {
+           panic(err)
+       }
+
+       // Copy the timestamp cache into the new range.
+       r.mu.Lock()
+       newRng.mu.Lock()
+       newRng.mu.state.Stats = rightMS
+       r.mu.tsCache.MergeInto(newRng.mu.tsCache, true /* clear */)
+       newRng.mu.Unlock()
+       r.mu.Unlock()
+       log.Trace(ctx, "copied timestamp cache")
+
        // Add the new split replica to the store. This step atomically
        // updates the EndKey of the updated replica and also adds the
        // new replica to the store's replica map.

Previously, we were muddling through with NewReplica which started from an uninitialized state (but now we call writeInitialState).

We're going to need/want a less ad-hoc system which ties together a Replica with its RangeID and a corresponding subspace of the Engine.


bdarnell commented Jul 5, 2016

We're going to need/want a less ad-hoc system which ties together a Replica with its RangeID and a corresponding subspace of the Engine.

Segmented storage (#1866) would have eliminated a lot of these corner cases.

I think we need to refactor the way Store.SplitRange works - it needs to take a batch and, under some appropriate lock, dismantle any existing replica for that range, insert the desired range, commit the batch, unlock itself, and only then let the new replica loose - and that's not strictly enough.

SplitRange is called via batch.Defer, so it really shouldn't be touching disk. Any on-disk changes it needs to make should happen earlier in splitTrigger so it can be atomic with the command batch.

Maybe SplitRange should drop and recreate the raft group, just like setReplicaID does.
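
In the spirit of that suggestion, a minimal sketch (the field name internalRaftGroup is assumed from the Replica struct; error handling elided): setting the cached group to nil forces the next Raft operation to rebuild it from the on-disk state the split trigger just wrote.

```go
// Sketch: drop the cached raft group so it is lazily re-created from disk,
// rather than continuing from stale in-memory state. internalRaftGroup is
// an assumed field name.
func (r *Replica) resetRaftGroup() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.mu.internalRaftGroup = nil
}
```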

Writing HardState in writeInitialState may have been a mistake: we should let raft be the source of truth for this variable and only write the HardState when raft tells us to.

petermattis commented

@bdarnell Given the subtleties involved here and @tschottdorf's busyness, can you take on this issue?


tbg commented Jul 5, 2016

Writing HardState in writeInitialState may have been a mistake: we should let raft be the source of truth for this variable and only write the HardState when raft tells us to.

But Raft can't know, right? It's our job here to set a commit index to mark that this range contains data (which we just put there, disregarding what was there before). We also can't simply create the RHS of the split, since the "uninitialized" RHS cannot accept snapshots. So we must effectively absorb whatever already exists. It's this "absorbing" that we don't handle at all (it probably worked before my changes, since the synthesis of Raft state effectively did some of that absorption).
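
To make the "absorbing" concrete, here is roughly what I mean (a sketch only; synthesizeHardState is a made-up name - this is essentially what the updateHardState change in the next comment does):

```go
// Sketch: merge an existing on-disk HardState into the one we synthesize for
// the split's RHS, so that we never regress anything the uninitialized
// replica has already persisted. synthesizeHardState is an illustrative name.
func synthesizeHardState(oldHS raftpb.HardState, appliedIndex, term uint64) raftpb.HardState {
	newHS := raftpb.HardState{Term: term, Commit: appliedIndex}
	if oldHS.Commit > newHS.Commit {
		newHS.Commit = oldHS.Commit
	}
	if oldHS.Term > newHS.Term {
		newHS.Term = oldHS.Term
	}
	// Keep any vote the existing replica may already have cast.
	if newHS.Vote == 0 {
		newHS.Vote = oldHS.Vote
	}
	return newHS
}
```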


tbg commented Jul 5, 2016

FWIW, this diff fixes it (the crucial part is the calls to Destroy(), which ensure that the other replica can't clobber the updated HardState before it gets removed by the call to store.SplitRange, which happens too late).

diff --git a/storage/replica_command.go b/storage/replica_command.go
index 041538d..63adc74 100644
--- a/storage/replica_command.go
+++ b/storage/replica_command.go
@@ -2262,7 +2262,7 @@ func (r *Replica) splitTrigger(
    // from it below.
    deltaMS, err = writeInitialState(batch, deltaMS, split.NewDesc)
    if err != nil {
-       return errors.Errorf("unable to write initial state: %s", err)
+       return errors.Wrap(err, "unable to write initial state")
    }

    rightMS := deltaMS
@@ -2276,6 +2276,18 @@ func (r *Replica) splitTrigger(
    }
    log.Trace(ctx, "computed stats for new range")

+   r.store.mu.Lock()
+   if r := r.store.mu.uninitReplicas[split.NewDesc.RangeID]; r != nil {
+       r.Destroy(split.NewDesc)
+   }
+   if r := r.store.mu.replicas[split.NewDesc.RangeID]; r != nil {
+       r.Destroy(split.NewDesc)
+   }
+
+   delete(r.store.mu.uninitReplicas, split.NewDesc.RangeID)
+   delete(r.store.mu.replicas, split.NewDesc.RangeID)
+   r.store.mu.Unlock()
+
    // Note: you must not use the trace inside of this defer since it may
    // run after the trace has already completed.
    batch.Defer(func() {
diff --git a/storage/replica_state.go b/storage/replica_state.go
index 85f640c..3de3856 100644
--- a/storage/replica_state.go
+++ b/storage/replica_state.go
@@ -27,7 +27,6 @@ import (
    "github.com/cockroachdb/cockroach/util/hlc"
    "github.com/cockroachdb/cockroach/util/log"
    "github.com/cockroachdb/cockroach/util/protoutil"
-   "github.com/coreos/etcd/raft"
    "github.com/coreos/etcd/raft/raftpb"
    "golang.org/x/net/context"
 )
@@ -350,13 +349,13 @@ func updateHardState(eng engine.ReadWriter, s storagebase.ReplicaState) error {
        Commit: s.RaftAppliedIndex,
    }

-   if !raft.IsEmptyHardState(oldHS) {
-       if oldHS.Commit > newHS.Commit {
-           newHS.Commit = oldHS.Commit
-       }
-       if oldHS.Term > newHS.Term {
-           newHS.Term = oldHS.Term
-       }
+   if oldHS.Commit > newHS.Commit {
+       newHS.Commit = oldHS.Commit
+   }
+   if oldHS.Term > newHS.Term {
+       newHS.Term = oldHS.Term
+   }
+   if newHS.Vote == 0 {
        newHS.Vote = oldHS.Vote
    }


bdarnell commented Jul 6, 2016

I'm scared of calling Destroy here - that method has some really subtle locking requirements. More importantly, we can only delete a HardState when the replica ID changes. If the replica ID is stable (which I think it is in this case), we must preserve the HardState that was written before the splitTrigger.


bdarnell commented Jul 6, 2016

The problem here is that preemptive snapshots bypass some of the raft machinery. When we pass MsgSnap into the raftGroup, raft does some things internally (specifically increasing HardState.Commit) and then returns the snapshot to us through raft.Ready and we call applySnapshot. In the preemptive path, we just call applySnapshot.

When a raftGroup exists, we should use it to make sure the snapshot is handled the same way as a non-preemptive one (also relevant for #7619). When no raftGroup exists (and we cannot create one because our replica ID has not yet been assigned), we must set up our on-disk state the same way that Raft would have. This is subtle. I think HardState.Commit is the only piece we're missing, but it's difficult to guarantee that.
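
To sketch what the preemptive path would have to do when no raftGroup exists (loadHardState/setHardState are stand-in names, not the actual functions), the key point is mimicking raft's internal bump of HardState.Commit when it accepts a snapshot:

```go
// Sketch: persist the effect raft would have had when accepting this
// snapshot via raft.Ready, namely raising HardState.Commit to the
// snapshot's index. loadHardState/setHardState are stand-in helpers.
func syncCommitWithSnapshot(eng engine.ReadWriter, rangeID roachpb.RangeID, snap raftpb.Snapshot) error {
	hs, err := loadHardState(eng, rangeID) // stand-in loader
	if err != nil {
		return err
	}
	if snap.Metadata.Index > hs.Commit {
		hs.Commit = snap.Metadata.Index
	}
	return setHardState(eng, rangeID, hs) // stand-in writer
}
```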

Stepping back, we started down this path of using a replica ID of zero because it was easy and seemed to just work. Now that we've determined that it's more complicated than that, maybe we should reconsider preallocating replica IDs. We could split ChangeReplicas into two transactions: one that just increments NextReplicaID and one for the rest. In between the two transactions we would send the preemptive snapshot with its real replica ID so the raftGroup could always be created and nothing in handleRaftMessage needs to change.

cc @tamird


tamird commented Jul 6, 2016

Stepping back, we started down this path of using a replica ID of zero because it was easy and seemed to just work. Now that we've determined that it's more complicated than that, maybe we should reconsider preallocating replica IDs. We could split ChangeReplicas into two transactions: one that just increments NextReplicaID and one for the rest. In between the two transactions we would send the preemptive snapshot with its real replica ID so the raftGroup could always be created and nothing in handleRaftMessage needs to change.

Unfortunately, this approach is also problematic (we tried it originally). What you're proposing is:

  1. nextReplicaID := RangeDescriptor.NextReplicaID
  2. txn 1: RangeDescriptor.NextReplicaID++
  3. repDesc := ReplicaDescriptor{..., ReplicaID: nextReplicaID}
  4. send snapshot with repDesc
  5. txn 2: RangeDescriptor.Replicas = append(RangeDescriptor.Replicas, repDesc)

The problem is a race between the final transaction and the replica GC queue: the preemptive snapshot will cause the recipient store to create a replica with its proper ID, but that replica will not yet be present in the range descriptor. If the replica GC queue stumbles upon that replica before the final transaction arrives, it will GC the replica and create a tombstone, and then it's game over: that replica can never be re-created.
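
To spell out why the tombstone is fatal (illustrative names, not the actual replica GC code): the tombstone records the next allowed replica ID, and anything below it can never be instantiated again.

```go
// Sketch: the tombstone check that makes the GC race unrecoverable. Once the
// replica GC queue has written a tombstone with NextReplicaID above our
// preallocated ID, the replica created from the preemptive snapshot can never
// come back. loadTombstone is a stand-in helper.
func canCreateReplica(eng engine.Reader, rangeID roachpb.RangeID, replicaID roachpb.ReplicaID) (bool, error) {
	tombstone, found, err := loadTombstone(eng, rangeID) // stand-in loader
	if err != nil {
		return false, err
	}
	if found && replicaID < tombstone.NextReplicaID {
		return false, nil // game over for this replica ID
	}
	return true, nil
}
```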

As discussed in person, our options are:

  1. continue to massage so-called "handwritten" raft state until we find the magic combination
  2. allow raft groups with group ID zero to be created and funnel preemptive snapshots through them
  3. add a "pending" flag and timeout replicas created via preemptive snapshots so the replica GC queue doesn't interfere
  4. avoid sending preemptive snapshots altogether. Instead, only generate them during ChangeReplicas and let raft do the work of sending them over the wire. Note that preemptive snapshots are already asynchronous with respect to ChangeReplicas because of the queues in RaftTransport
  5. move discussion of preemptive snapshots upstream to etcd/raft


tbg commented Jul 6, 2016

I'm scared of calling Destroy here - that method has some really subtle locking requirements.

I'm not suggesting we do that; I simply wanted to point out that it does fix "it".

I have some WIP on HS handling which I hope to massage and post prior to our chat today. Updating the HardState doesn't seem so difficult and I do think that it is manageable (I see the complexity in making sure we never have two replicas active on the same keyspace during splits, something which is fixed above by the brute-force destroy before commit) but I agree that we need to be very careful here.

I agree that we should be passing snapshots through Raft whenever possible. We do that for, surprise, actual Raft snapshots. What other situations do we have? Pre-emptive snapshots can't have a RaftGroup as you point out, so no dice there. That only leaves splits, and conceptually we could try to frame a Split as the creation of a new Raft group and then an initialization from a snapshot. But that's of course not how things are set up in that case, and we certainly want to minimize artificial data juggling (making a snapshot, wiping our state, reapplying it all in the same batch) in the split trigger, so I expect us to massage data there as well. Am I missing another situation in which your comment would apply?

tbg mentioned this issue Jul 6, 2016

tbg commented Jul 6, 2016

Going to look at fixing this up for beta tomorrow, as (I think) we discussed today. The goal is a non-invasive change in the spirit of #7600 (comment).

tbg self-assigned this Jul 6, 2016
tbg added a commit to tbg/cockroach that referenced this issue Jul 7, 2016
Before this fix,

```
make stressrace PKG=./storage TESTS=TestStoreRangeDownReplicate \
  STRESSFLAGS='-p 16 -maxfails 1 -stderr'
```

failed in a few hundred iterations (and in <20 with some assertions and
shortcuts added in my WIP which I am going to publish as well).

See cockroachdb#7600.
tbg added a commit to tbg/cockroach that referenced this issue Jul 7, 2016
We discovered that this exacerbates (and makes harder to fix) problems
discovered in cockroachdb#7600.
bdarnell added a commit to bdarnell/cockroach that referenced this issue Jul 7, 2016
tbg added a commit to tbg/cockroach that referenced this issue Jul 8, 2016
This took quite a bit of fiddling, but that version fails both test
and testrace almost instantly (<30iters) when the code added in cockroachdb#7672
is disabled (and even faster with more aggressive Raft tick intervals).
It does not catch the clobbering that could happen if
storeReplicaRaftReadyConcurrency were increased, at least not within 500
iterations.
tbg closed this as completed in #7700 Jul 9, 2016
tbg added a commit that referenced this issue Jul 9, 2016
storage: add test to prevent regression of #7600