Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: RHS HardState clobbering during splits #16749

Closed
irfansharif opened this issue Jun 27, 2017 · 12 comments
Closed

storage: RHS HardState clobbering during splits #16749

irfansharif opened this issue Jun 27, 2017 · 12 comments
Assignees

Comments

@irfansharif
Copy link
Contributor

observed this curious behavior recently where we seem to clobber on-disk HardState; naively this should not happen. See #7600 for a strikingly similar occurrence.
The following diff adding some extra instrumentation, coupled with this test demonstrates the clobbering behavior.
They are both available on this branch, TestMinimal is just a simplified version of the existing TestStoreRangeSplitRaceUninitializedRHS test but with only one split operation for simplified, albeit hacky, logging.

commit fe0203a96d35d4ffd8c3ae6ba55da8b98f8b1651
Author: irfan sharif <[email protected]>
Date:   Tue Jun 27 00:38:53 2017 -0400

    storage: debug hardstate clobbering

diff --git a/pkg/storage/replica.go b/pkg/storage/replica.go
index f8a6dd9b0..e04324999 100644
--- a/pkg/storage/replica.go
+++ b/pkg/storage/replica.go
@@ -3112,6 +3112,9 @@ func (r *Replica) handleRaftReadyRaftMuLocked(
 	if err := batch.Commit(syncRaftLog.Get() && rd.MustSync); err != nil {
 		return stats, err
 	}
+	if !raft.IsEmptyHardState(rd.HardState) {
+		log.Infof(ctx, "set hs:  %+v", rd.HardState)
+	}
 	elapsed := timeutil.Since(start)
 	r.store.metrics.RaftLogCommitLatency.RecordValue(elapsed.Nanoseconds())
 
@@ -4195,10 +4198,27 @@ func (r *Replica) applyRaftCommand(
 	writer.Close()
 
 	start := timeutil.Now()
+
+	oldHS, err := loadHardState(ctx, r.store.Engine(), roachpb.RangeID(2))
+	if err != nil {
+		panic(err)
+	}
 	if err := batch.Commit(false); err != nil {
 		return enginepb.MVCCStats{}, roachpb.NewError(NewReplicaCorruptionError(
 			errors.Wrap(err, "could not commit batch")))
 	}
+
+	// This is the hard state received through raft.
+	newHS, err := loadHardState(ctx, r.store.Engine(), roachpb.RangeID(2))
+	if err != nil {
+		panic(err)
+	}
+
+	if newHS.Term < oldHS.Term || (newHS.Term == oldHS.Term && newHS.Commit < oldHS.Commit) {
+		log.Warningf(ctx, "previously: %+v", oldHS)
+		log.Warningf(ctx, "over-written with: %+v", newHS)
+		log.Warningf(ctx, "clobbered hard state: %s", pretty.Diff(newHS, oldHS))
+	}
 	elapsed := timeutil.Since(start)
 	r.store.metrics.RaftCommandCommitLatency.RecordValue(elapsed.Nanoseconds())
 	return rResult.Delta, nil
diff --git a/pkg/storage/replica_command.go b/pkg/storage/replica_command.go
index 9b173619a..ef33f4be2 100644
--- a/pkg/storage/replica_command.go
+++ b/pkg/storage/replica_command.go
@@ -3033,6 +3033,9 @@ func splitTrigger(
 		if err != nil {
 			return enginepb.MVCCStats{}, EvalResult{}, errors.Wrap(err, "unable to load hard state")
 		}
+
+		log.Warningf(ctx, "current hard state, deliver through raft: %+v", oldHS)
+
 		// Initialize the right-hand lease to be the same as the left-hand lease.
 		// Various pieces of code rely on a replica's lease never being unitialized,
 		// but it's more than that - it ensures that we properly initialize the

Observing the logs for make test PKG=./pkg/storage TESTFLAGS="-v" TESTS=TestMinimal:

I170627 16:43:16.470806 228 storage/replica.go:3116  [s1,r2/1:{-}] set hs:  {Term:9 Vote:2 Commit:0 XXX_unrecognized:[]}
I170627 16:43:16.470983 228 storage/replica.go:3116  [s1,r2/1:{-}] set hs:  {Term:10 Vote:2 Commit:0 XXX_unrecognized:[]}
W170627 16:43:16.471264 218 storage/replica_command.go:3037  [s1,r1/1:/M{in-ax}] current hard state, deliver through raft: {Term:10 Vote:2 Commit:0 XXX_unrecognized:[]}
I170627 16:43:16.471365 228 storage/replica.go:3116  [s1,r2/1:{-}] set hs:  {Term:11 Vote:2 Commit:0 XXX_unrecognized:[]}
I170627 16:43:16.471677 228 storage/replica.go:3116  [s1,r2/1:{-}] set hs:  {Term:12 Vote:2 Commit:0 XXX_unrecognized:[]}
I170627 16:43:16.472065 228 storage/replica.go:3116  [s1,r2/1:{-}] set hs:  {Term:13 Vote:2 Commit:0 XXX_unrecognized:[]}
W170627 16:43:16.472636 74 storage/replica.go:4218  [s1,r1/1:/M{in-ax}] previously: {Term:13 Vote:2 Commit:0 XXX_unrecognized:[]}
W170627 16:43:16.472677 74 storage/replica.go:4219  [s1,r1/1:/M{in-ax}] over-written with: {Term:10 Vote:2 Commit:10 XXX_unrecognized:[]}
W170627 16:43:16.472716 74 storage/replica.go:4220  [s1,r1/1:/M{in-ax}] clobbered hard state: [Term: 10 != 13 Commit: 10 != 0]
I170627 16:43:16.473938 228 storage/replica.go:3116  [s1,r2/1:{a-/Max}] set hs:  {Term:14 Vote:0 Commit:10 XXX_unrecognized:[]}
I170627 16:43:16.474799 155 storage/replica.go:3116  [s2,r2/2:{a-/Max}] set hs:  {Term:14 Vote:0 Commit:10 XXX_unrecognized:[]}

Important bit: clobbered hard state: [Term: 10 != 13 Commit: 10 != 0]
Also interesting to note is that following this we have set hs: {Term:14 Vote:0 Commit:10 XXX_unrecognized:[]}, not seeing the effect of the HardState reset.

@irfansharif
Copy link
Contributor Author

irfansharif commented Jun 27, 2017

From internal discussions, copied here:

I can reproduce this scenario under aggressive raft conditions (frequent elections, short timers) and split triggers similar to #7600 where you can have the following sequence of events:

  • we have a range split in a multi node cluster
  • before the split comes out of raft for all the nodes, one of the nodes that hasn’t received the split trigger receives a request for that range and thus creates an uninitialized replica (correct me here if I’m wrong)
  • before the split trigger comes out of raft, it advances it’s HardState
    when the split trigger comes out of raft, it clobbers the HardState on disk that was being written to earlier.

is this last step correct? if so, is it because the earlier writes were by an uninitialized replica?

-- @irfansharif

I think we screwed this up when we switched to proposer-evaluated KV. See writeInitialState which calls synthesizeRaftState — it does that call precisely to avoid that scenario (before proposer-evaluated KV). Now I unfortunately don’t see anything that prevents this since the writeInitialState call has moved pre-Raft.

-- @tschottdorf

@irfansharif
Copy link
Contributor Author

cc @bdarnell.

@tbg
Copy link
Member

tbg commented Jun 27, 2017

An obvious solution is to call sythesizeRaftState downstream of Raft, before committing the batch. Cc. @danhhz who needs something similar for sstable ingestion.

@cockroachdb cockroachdb deleted a comment from tbg Jun 27, 2017
@tbg
Copy link
Member

tbg commented Jun 28, 2017

(in the longer run, we probably should look into going back to not writing the initial HardState ourselves as we did at one point in the past, though doing so has highlighted the concurrency subtleties of splits)

@tbg tbg self-assigned this Jul 3, 2017
@tbg
Copy link
Member

tbg commented Jul 3, 2017

How frequently do you get this? I haven't gotten a failure in 4 minutes of make stress, trying stressrace now, but also already 4min in.

@irfansharif
Copy link
Contributor Author

this behavior doesn't currently trigger any failures if that's what you're looking for, just curious observation when instrumenting as above. one way to panic when this happens is in replica_state.go: replicaStateLoader.setHardState. batch here is an engine.ReadWriter, so read the old value (if any), compare to the new and panic if new value has a lower term (or the same term and lower commit idx).

@tbg
Copy link
Member

tbg commented Jul 3, 2017

oh, right, forgot you had only warnings in the above diff. Thanks for the heads up!

@tbg
Copy link
Member

tbg commented Jul 3, 2017

Am I still messing it up? Haven't gotten a failure in ~5min. Maybe I'll just need to overload the machine more.

diff --git a/pkg/storage/replica_state.go b/pkg/storage/replica_state.go
index 708de1ea4..dcbc96337 100644
--- a/pkg/storage/replica_state.go
+++ b/pkg/storage/replica_state.go
@@ -455,6 +455,16 @@ func (rsl replicaStateLoader) loadHardState(
 func (rsl replicaStateLoader) setHardState(
        ctx context.Context, batch engine.ReadWriter, st raftpb.HardState,
 ) error {
+       var oldHS raftpb.HardState
+       if _, err := engine.MVCCGetProto(
+               ctx, batch, rsl.RaftHardStateKey(),
+               hlc.Timestamp{}, true, nil, &oldHS,
+       ); err != nil {
+               panic(err)
+       }
+       if oldHS.Term > st.Term || oldHS.Commit > st.Commit {
+               log.Fatalf(ctx, "from %+v to %+v", oldHS, st)
+       }
        return engine.MVCCPutProto(ctx, batch, nil,
                rsl.RaftHardStateKey(), hlc.Timestamp{}, nil, &st)
 }

@tbg
Copy link
Member

tbg commented Jul 3, 2017

Ah, the clobbering doesn't happen in setHardState (which is called during evaluation). It happens where you warn, and in fact clobbering happens almost every time.

@irfansharif
Copy link
Contributor Author

whoops, right, during batch.ApplyRepr. and yes, very frequent.

@tbg
Copy link
Member

tbg commented Jul 3, 2017

Thanks for the prep work, that'll make fixing much more pleasant.

@tbg
Copy link
Member

tbg commented Jul 3, 2017

Ok, I have a working fix but it's a bit too hacky to merge right now. Essentially we now avoid writing the HardState in the split batch completely, but we synthesize it downstream of Raft, the key diff being this:

diff --git a/pkg/storage/store.go b/pkg/storage/store.go
index a183dd977..2413d9c1b 100644
--- a/pkg/storage/store.go
+++ b/pkg/storage/store.go
@@ -1847,6 +1847,24 @@ func splitPostApply(
                log.Fatal(ctx, err)
        }

+       {
+               var s storagebase.ReplicaState
+               s.TruncatedState = &roachpb.RaftTruncatedState{
+                       Term:  raftInitialLogTerm,
+                       Index: raftInitialLogIndex,
+               }
+               s.RaftAppliedIndex = s.TruncatedState.Index
+               oldHS, err := rightRng.raftMu.stateLoader.loadHardState(ctx, r.store.Engine())
+               if err != nil {
+                       log.Fatal(ctx, err)
+               }
+               if err := rightRng.raftMu.stateLoader.synthesizeHardState(
+                       ctx, r.store.Engine(), s, &oldHS,
+               ); err != nil {
+                       log.Fatal(ctx, err)
+               }
+       }
+
        // Finish initialization of the RHS.

        r.mu.Lock()

The full branch is here: https://github.com/tschottdorf/cockroach/tree/split-clobbering

I should be able to clean this up for 1.0.4.

tbg added a commit to tbg/cockroach that referenced this issue Jul 17, 2017
Motivated by cockroachdb#16749. Added an assertion that catches HardState clobbering. Now

```
make stressrace PKG=./pkg/storage/ TESTS=TestStoreRangeSplitRaceUninitializedRHS
```

fails immediately with

```
clobbered hard state: [Term: 8 != 9 Commit: 10 != 0]

previously: raftpb.HardState{
    Term:             0x9,
    Vote:             0x2,
    Commit:           0x0,
    XXX_unrecognized: nil,
}
overwritten with: raftpb.HardState{
    Term:             0x8,
    Vote:             0x2,
    Commit:           0xa,
    XXX_unrecognized: nil,
}
```

which is fixed in the next commit in this PR.
tbg added a commit to tbg/cockroach that referenced this issue Jul 17, 2017
Since the move to proposer-evaluated KV, we were potentially clobbering the
HardState on splits as we accidentally moved HardState synthesis upstream of
Raft as well. This change moves it downstream again.

Though not strictly necessary, writing lastIndex was moved as well. This is
cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to
the Raft engine. After this PR, neither HardState nor last index keys are
added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining
command that does so (and it, too, won't keep doing that for long).

Note that there is no migration concern.

Fixes cockroachdb#16749.
tbg added a commit to tbg/cockroach that referenced this issue Jul 17, 2017
Motivated by cockroachdb#16749. Added an assertion that catches HardState clobbering. Now

```
make stressrace PKG=./pkg/storage/ TESTS=TestStoreRangeSplitRaceUninitializedRHS
```

fails immediately with

```
clobbered hard state: [Term: 8 != 9 Commit: 10 != 0]

previously: raftpb.HardState{
    Term:             0x9,
    Vote:             0x2,
    Commit:           0x0,
    XXX_unrecognized: nil,
}
overwritten with: raftpb.HardState{
    Term:             0x8,
    Vote:             0x2,
    Commit:           0xa,
    XXX_unrecognized: nil,
}
```

which is fixed in the next commit in this PR.
tbg added a commit to tbg/cockroach that referenced this issue Jul 17, 2017
Since the move to proposer-evaluated KV, we were potentially clobbering the
HardState on splits as we accidentally moved HardState synthesis upstream of
Raft as well. This change moves it downstream again.

Though not strictly necessary, writing lastIndex was moved as well. This is
cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to
the Raft engine. After this PR, neither HardState nor last index keys are
added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining
command that does so (and it, too, won't keep doing that for long).

Note that there is no migration concern.

Fixes cockroachdb#16749.
tbg added a commit to tbg/cockroach that referenced this issue Jul 17, 2017
Sending log truncations through Raft is inefficient: the Raft log is not itself
part of the replicated state. Instead, we only replicate the TruncatedState and,
as a side effect, ClearRange() the affected key range.

This is an individual performance optimization whose impact we should measure;
anecdotally it always looked like we were doing a lot of work for truncations
during a write-heavy workload; this should alleviate this somewhat). As
explained above, the change isn't made for performance at this point, though.

It also removes one migration concern for cockroachdb#16809, see
cockroachdb#16809 (comment).

We'll need to migrate this. It's straightforward with the in-flight PR cockroachdb#16977.

- we're moving logic downstream of Raft. However, we can easily migrate it
  upstream again, without a real migration, though I don't think that's going
  to happen.
- the big upshot is hopefully a large reduction in complexity for
  @irfansharif's PR: log truncation is one of the odd cases that requires
  a RaftWriteBatch. cockroachdb#16749 is the only other one, and there the (correct)
  solution also involves going downstream of Raft for a Raft-related write. So,
  after solving both of those, I think RaftWriteBatch can go? cc @irfansharif
- as @petermattis pointed out, after @irfansharif's change, we should be able
  to not sync the base engine on truncation changes but do it only as we
  actually clear the log entries (which can be delayed as we see fit). So for
  1000 log truncations across many ranges, we'll only have to sync once if
  that's how we set it up.
tbg added a commit to tbg/cockroach that referenced this issue Jul 17, 2017
Motivated by cockroachdb#16749. Added an assertion that catches HardState clobbering. Now

```
make stressrace PKG=./pkg/storage/ TESTS=TestStoreRangeSplitRaceUninitializedRHS
```

fails immediately with

```
clobbered hard state: [Term: 8 != 9 Commit: 10 != 0]

previously: raftpb.HardState{
    Term:             0x9,
    Vote:             0x2,
    Commit:           0x0,
    XXX_unrecognized: nil,
}
overwritten with: raftpb.HardState{
    Term:             0x8,
    Vote:             0x2,
    Commit:           0xa,
    XXX_unrecognized: nil,
}
```

which is fixed in the next commit in this PR.
tbg added a commit to tbg/cockroach that referenced this issue Jul 17, 2017
Since the move to proposer-evaluated KV, we were potentially clobbering the
HardState on splits as we accidentally moved HardState synthesis upstream of
Raft as well. This change moves it downstream again.

Though not strictly necessary, writing lastIndex was moved as well. This is
cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to
the Raft engine. After this PR, neither HardState nor last index keys are
added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining
command that does so (and it, too, won't keep doing that for long).

Migration concerns: a lease holder running the new version will propose splits
that don't propose the HardState to Raft. A follower running the old version
will not write the HardState downstream of Raft. In combination, the HardState
would never get written, and would thus be incompatible with the
TruncatedState. Thus, while 1.0 might be around, we're still sending the
potentially dangerous HardState.

Fixes cockroachdb#16749.
tbg added a commit to tbg/cockroach that referenced this issue Jul 17, 2017
Since the move to proposer-evaluated KV, we were potentially clobbering the
HardState on splits as we accidentally moved HardState synthesis upstream of
Raft as well. This change moves it downstream again.

Though not strictly necessary, writing lastIndex was moved as well. This is
cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to
the Raft engine. After this PR, neither HardState nor last index keys are
added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining
command that does so (and it, too, won't keep doing that for long).

Migration concerns: a lease holder running the new version will propose splits
that don't propose the HardState to Raft. A follower running the old version
will not write the HardState downstream of Raft. In combination, the HardState
would never get written, and would thus be incompatible with the
TruncatedState. Thus, while 1.0 might be around, we're still sending the
potentially dangerous HardState.

Fixes cockroachdb#16749.
tbg added a commit to tbg/cockroach that referenced this issue Jul 18, 2017
Motivated by cockroachdb#16749. Added an assertion that catches HardState clobbering. Now

```
make stressrace PKG=./pkg/storage/ TESTS=TestStoreRangeSplitRaceUninitializedRHS
```

fails immediately with

```
clobbered hard state: [Term: 8 != 9 Commit: 10 != 0]

previously: raftpb.HardState{
    Term:             0x9,
    Vote:             0x2,
    Commit:           0x0,
    XXX_unrecognized: nil,
}
overwritten with: raftpb.HardState{
    Term:             0x8,
    Vote:             0x2,
    Commit:           0xa,
    XXX_unrecognized: nil,
}
```

which is fixed in the next commit in this PR.
tbg added a commit to tbg/cockroach that referenced this issue Jul 18, 2017
Since the move to proposer-evaluated KV, we were potentially clobbering the
HardState on splits as we accidentally moved HardState synthesis upstream of
Raft as well. This change moves it downstream again.

Though not strictly necessary, writing lastIndex was moved as well. This is
cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to
the Raft engine. After this PR, neither HardState nor last index keys are
added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining
command that does so (and it, too, won't keep doing that for long).

Migration concerns: a lease holder running the new version will propose splits
that don't propose the HardState to Raft. A follower running the old version
will not write the HardState downstream of Raft. In combination, the HardState
would never get written, and would thus be incompatible with the
TruncatedState. Thus, while 1.0 might be around, we're still sending the
potentially dangerous HardState.

Fixes cockroachdb#16749.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants