storage: clobbering between uninitialized replica and RHS of a split #7600
Whatever I thought was going on here isn't. I'm still investigating.
Ok, this rabbit hole is deep. Here's what I think is happening: (tl;dr: #7531 allows both the msg-induced uninitialized version and the split-generated version of a split RHS to mutate things concurrently)
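To make the suspected interleaving concrete, here is a toy, self-contained Go sketch (not CockroachDB code; all of the names are made up) of how the message-created uninitialized replica and the split trigger can clobber each other's range-local state:

```go
package main

import "fmt"

// hardState mirrors the shape of raftpb.HardState for illustration only.
type hardState struct{ Term, Vote, Commit uint64 }

// engine stands in for the store-wide storage engine both actors write to.
type engine struct{ hs map[int64]hardState }

func main() {
	eng := &engine{hs: map[int64]hardState{}}

	// 1. A Raft message for the split's RHS (range 2) arrives before the local
	//    split trigger has run; an uninitialized replica is created and casts a
	//    vote, persisting a HardState with a nonzero Term/Vote.
	eng.hs[2] = hardState{Term: 5, Vote: 3}

	// 2. The split trigger now writes the RHS' initial state, overwriting
	//    whatever is already there: the vote from step 1 is silently lost.
	eng.hs[2] = hardState{Commit: 10}

	fmt.Printf("%+v\n", eng.hs[2]) // {Term:0 Vote:0 Commit:10}
}
```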
Unfortunately I don't see the "obvious" solution. I think we need to refactor the way […]. I suspected that it wasn't possible for this anomaly to occur in practice until #7531 landed, but even with […]
cc @bdarnell
bisect blames #7468 (fixing preemptive snapshots). With the above explanation (and my debugging efforts) I don't see it having a direct influence, though it could theoretically cause similar issues (#7598); this repro'ed reliably with my patch set there (in a version which would actually write the HardState). I never saw a preemptive snapshot applied for range 2 (which is always the one which causes the issue). My current theory is that working preemptive snapshots allow the first range to make it to more nodes quicker, which allows the split race to happen.
The bug still repros with the following diff applied:

```diff
diff --git a/storage/store.go b/storage/store.go
index 85ced80..2a9c2dc 100644
--- a/storage/store.go
+++ b/storage/store.go
@@ -2096,7 +2096,8 @@ func (s *Store) handleRaftMessage(req *RaftMessageRequest) error {
 	}
 
 	if req.ToReplica.ReplicaID == 0 {
-		if req.Message.Type == raftpb.MsgSnap {
+		if req.RangeID != 1 && req.Message.Type == raftpb.MsgSnap {
+			log.Warningf("SNAP %d", req.RangeID)
 			// Allow snapshots to be applied to replicas before they are members of
 			// the raft group (i.e. replicas with an ID of 0). This is the only
 			// operation that can be performed on a replica before it is part of the
```

However, it takes longer (~300-400 iters) and there is no […]
The behavior was likely exposed by the relatively recent changes during my Raft state refactor:

```diff
diff --git a/storage/replica_command.go b/storage/replica_command.go
index 564978a..51f6937 100644
--- a/storage/replica_command.go
+++ b/storage/replica_command.go
@@ -1553,7 +1553,7 @@ func (r *Replica) LeaderLease(
 	args.Lease.Start = effectiveStart
 
 	// Store the lease to disk & in-memory.
-	if err := engine.MVCCPutProto(ctx, batch, ms, keys.RangeLeaderLeaseKey(r.RangeID), hlc.ZeroTimestamp, nil, &args.Lease); err != nil {
+	if err := setLease(batch, ms, r.RangeID, &args.Lease); err != nil {
 		return reply, err
 	}
 	r.mu.state.Lease = &args.Lease
@@ -2250,10 +2250,13 @@ func (r *Replica) splitTrigger(
 	}
 	log.Trace(ctx, fmt.Sprintf("copied abort cache (%d entries)", seqCount))
 
-	// Create the new Replica representing the right side of the split.
-	newRng, err := NewReplica(&split.NewDesc, r.store, 0)
+	// Write the initial state for the new Raft group of the right-hand side.
+	// Note that this needs to go into deltaMS (which is the total difference
+	// in bytes reported to the store in the end). We compute the RHS' stats
+	// from it below.
+	deltaMS, err = writeInitialState(batch, deltaMS, split.NewDesc)
 	if err != nil {
-		return err
+		return util.Errorf("unable to write initial state: %s", err)
 	}
 
 	rightMS := deltaMS
@@ -2261,23 +2264,32 @@ func (r *Replica) splitTrigger(
 	rightMS.Add(origStats)
 	// Remove stats from the left side of the split.
 	rightMS.Subtract(leftMS)
-	if err := setMVCCStats(batch, newRng.RangeID, rightMS); err != nil {
+
+	if err := setMVCCStats(batch, split.NewDesc.RangeID, rightMS); err != nil {
 		return util.Errorf("unable to write MVCC stats: %s", err)
 	}
 	log.Trace(ctx, "computed stats for new range")
 
-	// Copy the timestamp cache into the new range.
-	r.mu.Lock()
-	newRng.mu.Lock()
-	newRng.mu.state.Stats = rightMS
-	r.mu.tsCache.MergeInto(newRng.mu.tsCache, true /* clear */)
-	newRng.mu.Unlock()
-	r.mu.Unlock()
-	log.Trace(ctx, "copied timestamp cache")
-
 	// Note: you must not use the trace inside of this defer since it may
 	// run after the trace has already completed.
 	batch.Defer(func() {
+		// Create the new Replica representing the right side of the split.
+		// Our error handling options at this point are very limited, but
+		// we need to do this after our batch has committed.
+		newRng, err := NewReplica(&split.NewDesc, r.store, 0)
+		if err != nil {
+			panic(err)
+		}
+
+		// Copy the timestamp cache into the new range.
+		r.mu.Lock()
+		newRng.mu.Lock()
+		newRng.mu.state.Stats = rightMS
+		r.mu.tsCache.MergeInto(newRng.mu.tsCache, true /* clear */)
+		newRng.mu.Unlock()
+		r.mu.Unlock()
+		log.Trace(ctx, "copied timestamp cache")
+
 		// Add the new split replica to the store. This step atomically
 		// updates the EndKey of the updated replica and also adds the
 		// new replica to the store's replica map.
```

Previously, we were muddling through with `NewReplica`, which started from an uninitialized state (but now we call `writeInitialState`). We're going to need/want a less ad-hoc system which ties together a Replica with its RangeID and a corresponding subspace of the Engine.
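As a purely hypothetical illustration of what such a system could look like (none of these names exist in the codebase), a sketch of an interface that owns exactly one range's subspace:

```go
package storage

import "github.com/coreos/etcd/raft/raftpb"

// RangeID identifies a range (the real type lives in roachpb).
type RangeID int64

// RangeStateStorage would own the range-local keys (HardState, applied index,
// lease, MVCC stats, ...) of exactly one range, so that nothing can write into
// that subspace of the Engine without going through the owning object.
type RangeStateStorage interface {
	RangeID() RangeID
	LoadHardState() (raftpb.HardState, error)
	SetHardState(raftpb.HardState) error
	// Destroy clears the whole subspace and leaves a tombstone behind.
	Destroy() error
}
```

The point of such an interface would be that "two writers for the same RangeID" becomes impossible by construction rather than by convention.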
Segmented storage (#1866) would have eliminated a lot of these corner cases.
Maybe […]. Writing […]
@bdarnell Given the subtleties involved here and @tschottdorf's busyness, can you take on this issue?
But Raft can't know, right? It's our job here to set a commit index to mark that this range contains data (which we just put there, disregarding what was there before). We also can't simply create the RHS of the split since the "uninitialized" RHS cannot accept snapshots. So we must effectively absorb whatever already exists. It's this "absorbing" that we don't handle at all (it probably worked before my changes since the synthesis of Raft state effectively did some of that absorption).
FWIW, this diff fixes it (the crucial part is the calls to `Destroy`):

```diff
diff --git a/storage/replica_command.go b/storage/replica_command.go
index 041538d..63adc74 100644
--- a/storage/replica_command.go
+++ b/storage/replica_command.go
@@ -2262,7 +2262,7 @@ func (r *Replica) splitTrigger(
 	// from it below.
 	deltaMS, err = writeInitialState(batch, deltaMS, split.NewDesc)
 	if err != nil {
-		return errors.Errorf("unable to write initial state: %s", err)
+		return errors.Wrap(err, "unable to write initial state")
 	}
 
 	rightMS := deltaMS
@@ -2276,6 +2276,18 @@ func (r *Replica) splitTrigger(
 	}
 	log.Trace(ctx, "computed stats for new range")
 
+	r.store.mu.Lock()
+	if r := r.store.mu.uninitReplicas[split.NewDesc.RangeID]; r != nil {
+		r.Destroy(split.NewDesc)
+	}
+	if r := r.store.mu.replicas[split.NewDesc.RangeID]; r != nil {
+		r.Destroy(split.NewDesc)
+	}
+
+	delete(r.store.mu.uninitReplicas, split.NewDesc.RangeID)
+	delete(r.store.mu.replicas, split.NewDesc.RangeID)
+	r.store.mu.Unlock()
+
 	// Note: you must not use the trace inside of this defer since it may
 	// run after the trace has already completed.
 	batch.Defer(func() {
diff --git a/storage/replica_state.go b/storage/replica_state.go
index 85f640c..3de3856 100644
--- a/storage/replica_state.go
+++ b/storage/replica_state.go
@@ -27,7 +27,6 @@ import (
 	"github.com/cockroachdb/cockroach/util/hlc"
 	"github.com/cockroachdb/cockroach/util/log"
 	"github.com/cockroachdb/cockroach/util/protoutil"
-	"github.com/coreos/etcd/raft"
 	"github.com/coreos/etcd/raft/raftpb"
 	"golang.org/x/net/context"
 )
@@ -350,13 +349,13 @@ func updateHardState(eng engine.ReadWriter, s storagebase.ReplicaState) error {
 		Commit: s.RaftAppliedIndex,
 	}
 
-	if !raft.IsEmptyHardState(oldHS) {
-		if oldHS.Commit > newHS.Commit {
-			newHS.Commit = oldHS.Commit
-		}
-		if oldHS.Term > newHS.Term {
-			newHS.Term = oldHS.Term
-		}
+	if oldHS.Commit > newHS.Commit {
+		newHS.Commit = oldHS.Commit
+	}
+	if oldHS.Term > newHS.Term {
+		newHS.Term = oldHS.Term
+	}
+	if newHS.Vote == 0 {
 		newHS.Vote = oldHS.Vote
 	}
```
I'm scared of calling `Destroy` […]
The problem here is that preemptive snapshots bypass some of the raft machinery. When we pass MsgSnap into the raftGroup, raft does some things internally (specifically increasing HardState.Commit) and then returns the snapshot to us through raft.Ready and we call applySnapshot. In the preemptive path, we just call applySnapshot. When a […]

Stepping back, we started down this path of using a replica ID of zero because it was easy and seemed to just work. Now that we've determined that it's more complicated than that, maybe we should reconsider preallocating replica IDs. We could split ChangeReplicas into two transactions: one that just increments NextReplicaID and one for the rest. In between the two transactions we would send the preemptive snapshot with its real replica ID, so the raftGroup could always be created and nothing in handleRaftMessage needs to change.

cc @tamird
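To make the proposed sequencing concrete, a rough sketch with a hypothetical interface standing in for the real KV and descriptor machinery (none of these names exist in the codebase; only the ordering of the steps is the point):

```go
package storage

import "context"

// replicaAdder is a hypothetical stand-in for the KV/descriptor machinery.
type replicaAdder interface {
	// AllocateReplicaID runs a small transaction that does nothing except
	// increment the descriptor's NextReplicaID, returning the reserved value.
	AllocateReplicaID(ctx context.Context, rangeID int64) (int32, error)
	// SendPreemptiveSnapshot ships the snapshot tagged with its real replica
	// ID, so the recipient can always instantiate its raft group.
	SendPreemptiveSnapshot(ctx context.Context, rangeID int64, target, replicaID int32) error
	// FinishChangeReplicas runs the remainder of ChangeReplicas, adding the
	// (store, replicaID) pair to the range descriptor.
	FinishChangeReplicas(ctx context.Context, rangeID int64, target, replicaID int32) error
}

func addReplicaWithPreallocatedID(ctx context.Context, a replicaAdder, rangeID int64, target int32) error {
	replicaID, err := a.AllocateReplicaID(ctx, rangeID)
	if err != nil {
		return err
	}
	if err := a.SendPreemptiveSnapshot(ctx, rangeID, target, replicaID); err != nil {
		return err
	}
	// Between the snapshot and this final transaction, the recipient holds a
	// replica that is not yet in the descriptor -- the window exploited by the
	// replica GC race described in the next comment.
	return a.FinishChangeReplicas(ctx, rangeID, target, replicaID)
}
```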
Unfortunately, this approach is also problematic (we tried it originally). What you're proposing is:

1. Run a first transaction that does nothing but increment NextReplicaID.
2. Send the preemptive snapshot carrying the allocated (real) replica ID.
3. Run a second transaction for the rest of ChangeReplicas, adding the new replica to the range descriptor.
The problem is a race between the final transaction and the replica GC queue: the preemptive snapshot will cause the recipient store to create a replica with its proper ID, but that replica will not yet be present in the range descriptor. If the replica GC queue stumbles upon that replica before the final transaction arrives, it will GC the replica and create a tombstone, and then it's game over; that replica can never be re-created. As discussed in person, our options are: […]
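To illustrate the tombstone failure mode, a toy sketch (hypothetical types, not the real replica GC code):

```go
package main

import "fmt"

// tombstone stands in for the marker the replica GC queue leaves behind; the
// real mechanism is keyed differently, this only shows the consequence.
type tombstone struct{ nextReplicaID int32 }

// shouldAccept is the kind of check a store performs when a message or
// snapshot arrives for a range it previously garbage collected.
func shouldAccept(ts *tombstone, incomingReplicaID int32) bool {
	return ts == nil || incomingReplicaID >= ts.nextReplicaID
}

func main() {
	// The replica GC queue removed replica 3 and left nextReplicaID = 4.
	ts := &tombstone{nextReplicaID: 4}

	// The final ChangeReplicas transaction still refers to replica ID 3, so
	// everything addressed to it is rejected from now on: "game over".
	fmt.Println(shouldAccept(ts, 3)) // false
}
```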
Not suggesting we do that, simply wanted to point out that it does fix "it". I have some WIP on HS handling which I hope to massage and post prior to our chat today. Updating the HardState doesn't seem so difficult and I do think that it is manageable (I see the complexity in making sure we never have two replicas active on the same keyspace during splits, something which is fixed above by the brute-force destroy before commit), but I agree that we need to be very careful here.

I agree that we should be passing snapshots through Raft whenever possible. We do that for, surprise, actual Raft snapshots. What other situations do we have? Preemptive snapshots can't have a RaftGroup, as you point out, so no dice there. That only leaves splits, and conceptually we could try to frame a Split as the creation of a new Raft group and then an initialization from a snapshot. But that's of course not how things are set up in that case, and we certainly want to minimize artificial data juggling (making a snapshot, wiping our state, reapplying it all in the same batch) in the split trigger, so I expect us to massage data there as well. Am I missing another situation in which your comment would apply?
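As a minimal sketch of the "never two replicas active on the same keyspace" invariant mentioned above (made-up types; the real store tracks this differently), the kind of overlap check involved:

```go
package main

import (
	"bytes"
	"fmt"
)

// span is a half-open key interval [start, end).
type span struct{ start, end []byte }

func overlaps(a, b span) bool {
	return bytes.Compare(a.start, b.end) < 0 && bytes.Compare(b.start, a.end) < 0
}

// store tracks the key spans of the replicas it currently has active.
type store struct{ active []span }

// activate refuses to bring up a replica whose span overlaps an existing one,
// which is the invariant the brute-force destroy-before-commit restores by hand.
func (s *store) activate(sp span) error {
	for _, other := range s.active {
		if overlaps(sp, other) {
			return fmt.Errorf("span [%q, %q) overlaps an active replica", sp.start, sp.end)
		}
	}
	s.active = append(s.active, sp)
	return nil
}

func main() {
	s := &store{}
	fmt.Println(s.activate(span{[]byte("a"), []byte("m")})) // <nil>: LHS after the split
	fmt.Println(s.activate(span{[]byte("g"), []byte("z")})) // error: overlapping twin rejected
}
```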
Going to look at fixing this up for beta tomorrow as (I think) we discussed today. Goal is a non-invasive change in the spirit of #7600 (comment).
Before this fix,

```
make stressrace PKG=./storage TESTS=TestStoreRangeDownReplicate \
  STRESSFLAGS='-p 16 -maxfails 1 -stderr'
```

failed in a few hundred iterations (and in <20 with some assertions and shortcuts added in my WIP which I am going to publish as well). See cockroachdb#7600.
We discovered that this exacerbates (and makes harder to fix) problems discovered in cockroachdb#7600.
Fixes cockroachdb#7659
Updates cockroachdb#7600
Updates cockroachdb#7619
This took quite a bit of fiddling, but that version fails both test and testrace almost instantly (<30 iters) when the code added in cockroachdb#7672 is disabled (and even faster with more aggressive Raft tick intervals). It does not catch the clobbering that could happen if storeReplicaRaftReadyConcurrency were increased, at least not within 500 iterations.
storage: add test to prevent regression of #7600
As discussed in #6991, but now also directly observed (on master) in:

```
$ make stressrace PKG=./storage TESTS=TestStoreRangeDownReplicate STRESSFLAGS='-p 16 -maxfails 1 -stderr'
```

#7598 should fix this.