storage: RHS HardState clobbering during splits #16749
Comments
From internal discussions, copied here:
-- @irfansharif
-- @tschottdorf |
cc @bdarnell. |
An obvious solution is to call synthesizeRaftState downstream of Raft, before committing the batch. Cc @danhhz, who needs something similar for sstable ingestion. |
(in the longer run, we probably should look into going back to not writing the initial HardState ourselves as we did at one point in the past, though doing so has highlighted the concurrency subtleties of splits) |
How frequently do you get this? I haven't gotten a failure in 4 minutes of |
this behavior doesn't currently trigger any failures, if that's what you're looking for; it's just a curious observation when instrumenting as above. one way to panic when this happens is in |
oh, right, forgot you had only warnings in the above diff. Thanks for the heads up! |
Am I still messing it up? Haven't gotten a failure in ~5min. Maybe I'll just need to overload the machine more.
diff --git a/pkg/storage/replica_state.go b/pkg/storage/replica_state.go
index 708de1ea4..dcbc96337 100644
--- a/pkg/storage/replica_state.go
+++ b/pkg/storage/replica_state.go
@@ -455,6 +455,16 @@ func (rsl replicaStateLoader) loadHardState(
func (rsl replicaStateLoader) setHardState(
ctx context.Context, batch engine.ReadWriter, st raftpb.HardState,
) error {
+ var oldHS raftpb.HardState
+ if _, err := engine.MVCCGetProto(
+ ctx, batch, rsl.RaftHardStateKey(),
+ hlc.Timestamp{}, true, nil, &oldHS,
+ ); err != nil {
+ panic(err)
+ }
+ if oldHS.Term > st.Term || oldHS.Commit > st.Commit {
+ log.Fatalf(ctx, "from %+v to %+v", oldHS, st)
+ }
return engine.MVCCPutProto(ctx, batch, nil,
rsl.RaftHardStateKey(), hlc.Timestamp{}, nil, &st)
} |
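For readers who want to experiment with the check outside the CockroachDB tree, here is a minimal standalone sketch of the same regression test the diff above adds to `setHardState`. The `HardState` struct is a local stand-in for `raftpb.HardState`, and `checkNoRegression` is a hypothetical helper name, not a function in the repository; the values in `main` are illustrative.

```go
package main

import "fmt"

// HardState is a local stand-in for raftpb.HardState so that this sketch
// compiles without the etcd/raft dependency.
type HardState struct {
	Term, Vote, Commit uint64
}

// checkNoRegression (hypothetical helper) returns an error if writing next
// over prev would move Term or Commit backwards, which is the condition the
// instrumentation above treats as fatal.
func checkNoRegression(prev, next HardState) error {
	if prev.Term > next.Term || prev.Commit > next.Commit {
		return fmt.Errorf("clobbered hard state: from %+v to %+v", prev, next)
	}
	return nil
}

func main() {
	// Illustrative values: the on-disk state has already moved to a higher
	// term than the state about to be written.
	prev := HardState{Term: 9, Vote: 2, Commit: 0}
	next := HardState{Term: 8, Vote: 2, Commit: 10}
	if err := checkNoRegression(prev, next); err != nil {
		fmt.Println(err)
	}
}
```

Returning an error instead of calling `log.Fatalf` keeps the sketch testable; the instrumentation in the diff crashes the process on purpose so that a stress run stops at the first occurrence.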
Ah, the clobbering doesn't happen in |
whoops, right, during |
Thanks for the prep work, that'll make fixing much more pleasant. |
Ok, I have a working fix but it's a bit too hacky to merge right now. Essentially we now avoid writing the HardState in the split batch completely, but we synthesize it downstream of Raft, the key diff being this:
diff --git a/pkg/storage/store.go b/pkg/storage/store.go
index a183dd977..2413d9c1b 100644
--- a/pkg/storage/store.go
+++ b/pkg/storage/store.go
@@ -1847,6 +1847,24 @@ func splitPostApply(
log.Fatal(ctx, err)
}
+ {
+ var s storagebase.ReplicaState
+ s.TruncatedState = &roachpb.RaftTruncatedState{
+ Term: raftInitialLogTerm,
+ Index: raftInitialLogIndex,
+ }
+ s.RaftAppliedIndex = s.TruncatedState.Index
+ oldHS, err := rightRng.raftMu.stateLoader.loadHardState(ctx, r.store.Engine())
+ if err != nil {
+ log.Fatal(ctx, err)
+ }
+ if err := rightRng.raftMu.stateLoader.synthesizeHardState(
+ ctx, r.store.Engine(), s, &oldHS,
+ ); err != nil {
+ log.Fatal(ctx, err)
+ }
+ }
+
// Finish initialization of the RHS.
r.mu.Lock()
The full branch is here: https://github.com/tschottdorf/cockroach/tree/split-clobbering
I should be able to clean this up for 1.0.4. |
Motivated by cockroachdb#16749. Added an assertion that catches HardState clobbering. Now
```
make stressrace PKG=./pkg/storage/ TESTS=TestStoreRangeSplitRaceUninitializedRHS
```
fails immediately with
```
clobbered hard state: [Term: 8 != 9 Commit: 10 != 0]
previously: raftpb.HardState{
    Term: 0x9,
    Vote: 0x2,
    Commit: 0x0,
    XXX_unrecognized: nil,
}
overwritten with: raftpb.HardState{
    Term: 0x8,
    Vote: 0x2,
    Commit: 0xa,
    XXX_unrecognized: nil,
}
```
which is fixed in the next commit in this PR.
Since the move to proposer-evaluated KV, we were potentially clobbering the HardState on splits as we accidentally moved HardState synthesis upstream of Raft as well. This change moves it downstream again. Though not strictly necessary, writing lastIndex was moved as well. This is cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to the Raft engine. After this PR, neither HardState nor last index keys are added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining command that does so (and it, too, won't keep doing that for long). Note that there is no migration concern. Fixes cockroachdb#16749.
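To illustrate why doing the synthesis downstream of Raft avoids the clobbering, here is a rough sketch of merging the freshly synthesized state with whatever HardState is already on disk at apply time. It is an assumption about the shape of the logic, not the code from the PR: `HardState` is a local stand-in for `raftpb.HardState`, `synthesize` is a hypothetical name, and the term and index passed in `main` are illustrative values.

```go
package main

import "fmt"

// HardState is a local stand-in for raftpb.HardState.
type HardState struct {
	Term, Vote, Commit uint64
}

// synthesize (hypothetical) builds the HardState for a freshly split RHS from
// its truncated term and applied index, then merges in the state that is
// already on disk so that nothing moves backwards.
func synthesize(truncatedTerm, appliedIndex uint64, old HardState) (HardState, error) {
	hs := HardState{Term: truncatedTerm, Commit: appliedIndex}
	if old.Commit > hs.Commit {
		return HardState{}, fmt.Errorf(
			"can't decrease Commit from %d to %d", old.Commit, hs.Commit)
	}
	// Never regress the term: the uninitialized RHS may already have taken
	// part in elections at a higher term.
	if old.Term > hs.Term {
		hs.Term = old.Term
	}
	// A vote is only meaningful within its term, so carry it over only when
	// the terms match after the adjustment above.
	if old.Term == hs.Term {
		hs.Vote = old.Vote
	}
	return hs, nil
}

func main() {
	old := HardState{Term: 9, Vote: 2, Commit: 0}
	hs, err := synthesize(5, 10, old)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", hs)
}
```

The point of the sketch is that the merge reads the old HardState at the moment the split applies on each replica. A proposer that evaluates the same logic upstream of Raft can only see its own, possibly stale, copy.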
Sending log truncations through Raft is inefficient: the Raft log is not itself part of the replicated state. Instead, we only replicate the TruncatedState and, as a side effect, ClearRange() the affected key range.
This is an individual performance optimization whose impact we should measure; anecdotally, it always looked like we were doing a lot of work for truncations during a write-heavy workload, and this should alleviate that somewhat. As explained above, the change isn't made for performance at this point, though. It also removes one migration concern for cockroachdb#16809, see cockroachdb#16809 (comment).
We'll need to migrate this. It's straightforward with the in-flight PR cockroachdb#16977.
- we're moving logic downstream of Raft. However, we can easily migrate it upstream again, without a real migration, though I don't think that's going to happen.
- the big upshot is hopefully a large reduction in complexity for @irfansharif's PR: log truncation is one of the odd cases that requires a RaftWriteBatch. cockroachdb#16749 is the only other one, and there the (correct) solution also involves going downstream of Raft for a Raft-related write. So, after solving both of those, I think RaftWriteBatch can go? cc @irfansharif
- as @petermattis pointed out, after @irfansharif's change, we should be able to not sync the base engine on truncation changes but do it only as we actually clear the log entries (which can be delayed as we see fit). So for 1000 log truncations across many ranges, we'll only have to sync once if that's how we set it up.
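The last bullet can be pictured with a toy model: applying a truncation only records the new TruncatedState, while physically clearing the entries (and paying for a sync) is deferred and batched. Everything here is hypothetical and in-memory: `logStore`, `applyTruncation`, and `clearTruncated` are illustrative names, and the sketch covers a single log rather than many ranges.

```go
package main

import "fmt"

// logStore is a toy in-memory Raft log keyed by index.
type logStore struct {
	entries        map[uint64]string
	truncatedIndex uint64 // replicated: highest index that has been truncated
	clearedThrough uint64 // local: entries up to here are physically gone
	syncs          int    // number of simulated engine syncs
}

// newLogStore creates a log holding entries 1..n.
func newLogStore(n uint64) *logStore {
	s := &logStore{entries: make(map[uint64]string)}
	for i := uint64(1); i <= n; i++ {
		s.entries[i] = fmt.Sprintf("entry-%d", i)
	}
	return s
}

// applyTruncation records the replicated TruncatedState only. It deletes no
// entries and does not sync.
func (s *logStore) applyTruncation(index uint64) {
	if index > s.truncatedIndex {
		s.truncatedIndex = index
	}
}

// clearTruncated is the deferred local side effect: it removes every entry
// covered by the truncations applied so far and syncs once for the batch.
func (s *logStore) clearTruncated() {
	if s.clearedThrough == s.truncatedIndex {
		return // nothing new to clear, so no sync either
	}
	for i := s.clearedThrough + 1; i <= s.truncatedIndex; i++ {
		delete(s.entries, i)
	}
	s.clearedThrough = s.truncatedIndex
	s.syncs++
}

func main() {
	s := newLogStore(20)
	for _, idx := range []uint64{5, 10, 15} {
		s.applyTruncation(idx)
	}
	s.clearTruncated()
	fmt.Printf("remaining=%d syncs=%d\n", len(s.entries), s.syncs)
}
```

Three truncations result in a single simulated sync here. Whether the real engine can defer the work this way depends on the changes discussed in the bullets above.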
Since the move to proposer-evaluated KV, we were potentially clobbering the HardState on splits as we accidentally moved HardState synthesis upstream of Raft as well. This change moves it downstream again. Though not strictly necessary, writing lastIndex was moved as well. This is cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to the Raft engine. After this PR, neither HardState nor last index keys are added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining command that does so (and it, too, won't keep doing that for long). Migration concerns: a lease holder running the new version will propose splits that don't propose the HardState to Raft. A follower running the old version will not write the HardState downstream of Raft. In combination, the HardState would never get written, and would thus be incompatible with the TruncatedState. Thus, while 1.0 might be around, we're still sending the potentially dangerous HardState. Fixes cockroachdb#16749.
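The migration concern boils down to a small truth table, sketched below. `hardStateWritten` is a hypothetical helper used only to enumerate the cases; "new" means a node that omits the HardState from the proposal (as proposer) or synthesizes it downstream of Raft (as follower), per the description above.

```go
package main

import "fmt"

// hardStateWritten (hypothetical) reports whether a replica ends up with a
// HardState for the RHS: either the proposer shipped one in the split batch,
// or the replica synthesizes one itself when applying the split.
func hardStateWritten(proposerSendsHS, followerSynthesizes bool) bool {
	return proposerSendsHS || followerSynthesizes
}

func main() {
	for _, proposerNew := range []bool{false, true} {
		for _, followerNew := range []bool{false, true} {
			// A new proposer would omit the HardState from the proposal;
			// a new follower synthesizes it downstream of Raft.
			ok := hardStateWritten(!proposerNew, followerNew)
			fmt.Printf("proposerNew=%t followerNew=%t written=%t\n",
				proposerNew, followerNew, ok)
		}
	}
}
```

Only the combination of a new proposer with an old follower leaves the HardState unwritten, which is why the proposer keeps sending it while 1.0 nodes might still be around.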
Observed this curious behavior recently where we seem to clobber on-disk HardState; naively, this should not happen. See #7600 for a strikingly similar occurrence.
The following diff, which adds some extra instrumentation, coupled with this test, demonstrates the clobbering behavior. They are both available on this branch. `TestMinimal` is just a simplified version of the existing `TestStoreRangeSplitRaceUninitializedRHS` test, but with only one split operation for simplified, albeit hacky, logging.
Observing the logs for `make test PKG=./pkg/storage TESTFLAGS="-v" TESTS=TestMinimal`:
Important bit:
clobbered hard state: [Term: 10 != 13 Commit: 10 != 0]
Also interesting to note is that following this we have `set hs: {Term:14 Vote:0 Commit:10 XXX_unrecognized:[]}`, not seeing the effect of the `HardState` reset.