release-21.2: snapshot can sneak into RHS of split #73721

Closed
cockroach-teamcity opened this issue Dec 12, 2021 · 14 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Dec 12, 2021

kv/kvserver.TestChangeReplicasLeaveAtomicRacesWithMerge failed with artifacts on master @ 22df0a6658a927e7b65dafbdfc9d790500791993:

F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/auth.go:97 +0x1c3
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !google.golang.org/grpc.chainStreamInterceptors.func1.1({0x4055b40, 0xc007a01500}, {0x566e5b0, 0xc0035eab40})
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1419 +0x87
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !github.com/cockroachdb/cockroach/pkg/rpc.NewServer.func2.1({0x0, 0x8})
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:189 +0x2d
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc005eae1b0, {0x563aeb8, 0xc007fe67e0}, {0x427f54, 0x203001}, 0xc00880fb58)
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:347 +0xdd
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !github.com/cockroachdb/cockroach/pkg/rpc.NewServer.func2({0x4055b40, 0xc007a01500}, {0x566e5b0, 0xc0035eab40}, 0xc014177170, 0xc006466380)
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:188 +0xf3
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !google.golang.org/grpc.chainStreamInterceptors.func1.1({0x4055b40, 0xc007a01500}, {0x566e5b0, 0xc0035eab40})
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1419 +0x87
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !google.golang.org/grpc.chainStreamInterceptors.func1({0x4055b40, 0xc007a01500}, {0x566e5b0, 0xc0035eab40}, 0xc014177170, 0x4792910)
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1421 +0x159
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !google.golang.org/grpc.(*Server).processStreamingRPC(0xc00e47a380, {0x56b4988, 0xc00274e900}, 0xc008658c60, 0xc0096ae660, 0x79c8d00, 0x0)
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1557 +0xe9a
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !google.golang.org/grpc.(*Server).handleStream(0xc00e47a380, {0x56b4988, 0xc00274e900}, 0xc008658c60, 0x0)
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:1630 +0x9e5
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !google.golang.org/grpc.(*Server).serveStreams.func1.2()
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:941 +0x98
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !created by google.golang.org/grpc.(*Server).serveStreams.func1
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:939 +0x294
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !For more context, check log files in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestChangeReplicasLeaveAtomicRacesWithMerge2349777449
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !****************************************************************************
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !This node experienced a fatal error (printed above), and as a result the
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !process is terminating.
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !Fatal errors can occur due to faulty hardware (disks, memory, clocks) or a
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !problem in CockroachDB. With your help, the support team at Cockroach Labs
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !will try to determine the root cause, recommend next steps, and we can
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !improve CockroachDB based on your report.
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !Please submit a crash report by following the instructions here:
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !    https://github.com/cockroachdb/cockroach/issues/new/choose
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !If you would rather not post publicly, please contact us directly at:
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !    [email protected]
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !
F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1 !The Cockroach Labs team appreciates your feedback.


ERROR: exit status 7

1 runs completed, 1 failures, over 6m7s
context canceled
Help

See also: [How To Investigate a Go Test Failure (internal)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

Parameters in this failure:

  • GOFLAGS=-parallel=4

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-11707

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Dec 12, 2021
@tbg
Member

tbg commented Dec 13, 2021

F211212 08:12:51.323075 627789 kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1  unable to remove placeholder: corrupted replicasByKey map: <nil> and &{AmbientContext:{Tracer:0xc00b39baa0 eventLog:<nil> tags:0xc0077c8500 backgroundCtx:0xc007fe65a0} RangeID:44 startKey:[250] store:0xc00831d000 abortSpan:0xc00ef224d8 leaseholderStats:0xc0072c1dd0 writeStats:0xc0072c1e60 creatingReplica:0xc009b75b78 readOnlyCmdMu:{RWMutex:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0}} rangeStr:{strPtr:0xc00f34bfa0} connectionClass:0 raftCtx:0xc007fe65d0 raftMu:{Mutex:{Mutex:{state:1 sema:0}} stateLoader:{RangeIDPrefixBuf:[1 105 180]} sideloaded:0xc0064660c0 stateMachine:{r:0xc0034eca80 batch:{r:0xc0034eca80 sm:0xc0034ecb58 batch:0xc004197400 state:{RaftAppliedIndex:35 LeaseAppliedIndex:23 Desc:0xc00b47e2a0 Lease:0xc014b41740 TruncatedState:0xc00f80d3e0 GCThreshold:0xc00f80d370 Stats:0xc0034ecbf8 DeprecatedUsingAppliedStateKey:false Version:0xc00f80d420 RaftClosedTimestamp:{WallTime:1639296768312812135 Logical:0 Synthetic:false}} closedTimestampSetter:{lease:0xc014b41740 leaseIdx:23 leaseReq:<nil> split:false merge:false} stats:{ContainsEstimates:0 LastUpdateNanos:1639296771212240367 IntentAge:0 GCBytesAge:0 LiveBytes:0 LiveCount:0 KeyBytes:0 KeyCount:0 ValBytes:0 ValCount:0 IntentBytes:0 IntentCount:0 SeparatedIntentCount:0 SysBytes:5566 SysCount:20 AbortSpanBytes:0} maxTS:{WallTime:0 Logical:0 Synthetic:false} changeRemovesReplica:false entries:0 emptyEntries:0 mutations:0 start:{wall:320480044 ext:63774893571 loc:<nil>}} ephemeralBatch:{r:<nil> state:{RaftAppliedIndex:0 LeaseAppliedIndex:0 Desc:<nil> Lease:<nil> TruncatedState:<nil> GCThreshold:<nil> Stats:<nil> DeprecatedUsingAppliedStateKey:false Version:<nil> RaftClosedTimestamp:{WallTime:0 Logical:0 Synthetic:false}}} stats:{batchesProcessed:1 entriesProcessed:2 stateAssertions:0 numEmptyEntries:0 numConfChangeEntries:0}} decoder:{r:0xc0034eca80 cmdBuf:{len:3 head:0xc00345d000 tail:0xc00345d000 free:{iters:[{head:{idx:3 buf:0xc0034ecd50 node:0xc00345d000} tail:{idx:3 buf:0xc0034ecd50 node:0xc00345d000}} {head:{idx:0 buf:0xc0034ecd50 node:0xc00345d000} tail:{idx:1 buf:0xc0034ecd50 node:0xc00345d000}} {head:{idx:0 buf:0xc0034ecd50 node:0xc00345d000} tail:{idx:0 buf:0xc0034ecd50 node:0xc00345d000}}] inUse:[true true true]}}}} leaseHistory:0xc007fe6510 concMgr:0xc00a9d5180 tenantLimiter:<nil> tenantMetricsRef:0xc016dfd950 sideTransportClosedTimestamp:{rangeID:44 receiver:0xc008cdb880 mu:{RWMutex:{RWMutex:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0}} cur:{ts:{WallTime:0 Logical:0 Synthetic:false} lai:0} next:{ts:{WallTime:0 Logical:0 Synthetic:false} lai:0}}} mu:{RWMutex:{RWMutex:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0}} destroyStatus:{reason:0 err:<nil>} quiescent:false laggingFollowersOnQuiesce:[] mergeComplete:<nil> mergeTxnID:[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] freezeStart:{WallTime:0 Logical:0 Synthetic:false} state:{RaftAppliedIndex:35 LeaseAppliedIndex:23 Desc:0xc00b47e2a0 Lease:0xc014b41740 TruncatedState:0xc00f80d3e0 GCThreshold:0xc00f80d370 Stats:0xc003d57600 DeprecatedUsingAppliedStateKey:false Version:0xc00f80d420 RaftClosedTimestamp:{WallTime:1639296768312812135 Logical:0 Synthetic:false}} lastIndex:36 lastTerm:6 snapshotLogTruncationConstraints:map[] raftLogSize:5571 raftLogSizeTrusted:false raftLogLastCheckSize:0 pendingLeaseRequest:{repl:0xc0034eca80 llHandles:map[] cancelLocked:<nil> nextLease:{Start:{WallTime:0 Logical:0 Synthetic:false} 
Expiration:<nil> Replica:{NodeID:0 StoreID:0 ReplicaID:0 Type:<nil>} DeprecatedStartStasis:<nil> ProposedTS:<nil> Epoch:0 Sequence:0 AcquisitionType:0}} minLeaseProposedTS:{WallTime:0 Logical:0 Synthetic:false} conf:{RangeMinBytes:134217728 RangeMaxBytes:536870912 GCPolicy:{TTLSeconds:90000} GlobalReads:false NumReplicas:3 NumVoters:0 Constraints:[] VoterConstraints:[] LeasePreferences:[]} proposalBuf:{p:0xc0034eca80 clock:0xc002d182d0 settings:0xc0009fe000 evalTracker:0xc007fe64e0 full:{noCopy:{} L:0xc0034ece90 notify:{wait:0 notify:0 lock:0 head:<nil> tail:<nil>} checker:0} arr:{small:[<nil> <nil> <nil> <nil>] large:[] shrink:0} allocatedIdx:0 assignedLAI:0 assignedClosedTimestamp:{WallTime:0 Logical:0 Synthetic:false} scratchFooter:{MaxLeaseIndex:0 ClosedTimestamp:{WallTime:0 Logical:0 Synthetic:false}} testing:{leaseIndexFilter:<nil> insertFilter:<nil> submitProposalFilter:<nil> allowLeaseProposalWhenNotLeader:false dontCloseTimestamps:false}} proposals:map[] applyingEntries:true internalRaftGroup:0xc007fe6c90 replicaID:4 tombstoneMinReplicaID:0 leaderID:1 lastReplicaAdded:4 lastReplicaAddedTime:{wall:245571740 ext:63774893571 loc:<nil>} lastUpdateTimes:map[] lastToReplica:{NodeID:4 StoreID:4 ReplicaID:4 Type:<nil>} lastFromReplica:{NodeID:1 StoreID:1 ReplicaID:1 Type:<nil>} checksums:map[] proposalQuota:<nil> proposalQuotaBaseIndex:0 quotaReleaseQueue:[] ticks:0 droppedMessages:0 stateLoader:{RangeIDPrefixBuf:[1 105 180]} cachedProtectedTS:{readAt:{WallTime:0 Logical:0 Synthetic:false} earliestRecord:<nil>} largestPreviousMaxRangeSizeBytes:0 spanConfigExplicitlySet:true failureToGossipSystemConfig:false tenantID:{id:1} closedTimestampSetter:{lease:0xc014b41740 leaseIdx:23 leaseReq:<nil> split:false merge:false}} rangefeedMu:{RWMutex:{RWMutex:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0}} proc:<nil> opFilter:<nil>} splitQueueThrottle:{N:5000000000 Mutex:{Mutex:{state:0 sema:0}} lastProcessed:{wall:0 ext:0 loc:<nil>}} mergeQueueThrottle:{N:5000000000 Mutex:{Mutex:{state:0 sema:0}} lastProcessed:{wall:247508751 ext:63774893571 loc:<nil>}} loadBasedSplitter:{intn:0x520bc0 qpsThreshold:0x19bddc0 qpsRetention:0x19bdd60 mu:{Mutex:{Mutex:{state:0 sema:0}} lastQPSRollover:{wall:0 ext:0 loc:<nil>} lastQPS:0 count:0 maxQPS:{windows:[0 0 0 0 0 0] curIdx:0 curStart:{wall:0 ext:0 loc:<nil>} lastReset:{wall:0 ext:0 loc:<nil>} minRetention:0} splitFinder:<nil> lastSplitSuggestion:{wall:0 ext:0 loc:<nil>}}} unreachablesMu:{Mutex:{Mutex:{state:0 sema:0}} remotes:map[]} protectedTimestampMu:{Mutex:{Mutex:{state:0 sema:0}} minStateReadTimestamp:{WallTime:0 Logical:0 Synthetic:false} pendingGCThreshold:{WallTime:0 Logical:0 Synthetic:false}}} overlapped

@tbg
Member

tbg commented Dec 13, 2021

This assertion unfortunately isn't logging the right thing:

if it := s.getOverlappingKeyRangeLocked(&placeholder.rangeDesc); it.item != nil {
	return false, errors.AssertionFailedf("corrupted replicasByKey map: %+v and %+v overlapped", it.ph, it.item)
}

It is meant to log placeholder. I'll see if I can get a repro.
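For reference, a minimal sketch of what the corrected message might look like (assuming `placeholder` is in scope at this call site, as in the snippet above); this is just the intended logging, not the actual fix:

```go
// Log the placeholder being removed rather than it.ph, which is nil here.
if it := s.getOverlappingKeyRangeLocked(&placeholder.rangeDesc); it.item != nil {
	return false, errors.AssertionFailedf(
		"corrupted replicasByKey map: %+v and %+v overlapped", placeholder, it.item)
}
```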

@tbg
Member

tbg commented Dec 13, 2021

Just going from the log messages, I think what is happening is an overlap between two different ranges:

  • r44 merges r45 (\xfaa-/Max)
  • r46 is created via a re-split at \xfaa of r44
  • on n4, we see a snapshot arrive for r46 (the splitTriggerHelper is supposed to suppress that; the left neighbor is there, but maybe there is a bug in the delay heuristic because r44 hasn't applied the preceding merge yet?)
  • we make an uninitialized replica r46/3 in response to the snapshot request (see the [n4,s4,r46/3:{-}] log tag on the fatal msg)
  • while applying the snapshot, when the placeholder r46:\xfaa-/Max gets removed, we find that it overlaps the replica r44 (see sideTransportClosedTimestamp:{rangeID:44 in the assertion)

I am confused as to why the snapshot is even getting received. The way it should work is that we check for overlap before accepting the snapshot:

if placeholder != nil {
	if err := s.addPlaceholderLocked(placeholder); err != nil {
		return roachpb.NewError(err)
	}
}

I can see in the logs that n4 removes the subsumed replica r45/4 just a hair before it receives the snapshot, which comes from this place in MergeRange:

// Note that we were called (indirectly) from raft processing so we must
// call removeInitializedReplicaRaftMuLocked directly to avoid deadlocking
// on the right-hand replica's raftMu.
if err := s.removeInitializedReplicaRaftMuLocked(ctx, rightRepl, rightDesc.NextReplicaID, RemoveOptions{
	// The replica was destroyed by the tombstones added to the batch in
	// runPreApplyTriggersAfterStagingWriteBatch.
	DestroyData: false,
}); err != nil {
	return errors.Errorf("cannot remove range: %s", err)
}

Interestingly, we update (widen) the left hand side's descriptor only at the very end of MergeRange:

	// Update the subsuming range's descriptor.
	leftRepl.setDescRaftMuLocked(ctx, &newLeftDesc)
	return nil
}

So the conjecture goes like this, from the point of view of n4:

  • a snapshot arrives for r46 on \xfaa-/Max
  • the merge trigger on r44 starts executing and removes the replica r45 (\xfaa-/Max)
  • the snapshot checks for overlap, but r44 at the time is something-\xfaa, so it doesn't overlap, and the replica for r46 is still uninitialized so it doesn't conflict either; we accept the snapshot
  • the merge trigger finishes, updating r44 to something-/Max
  • when the snapshot is applied, it wants to remove its placeholder r46: \xfaa-/Max but in the process finds that this placeholder overlaps something-/Max (r44).

If this checks out, the bug is in MergeRange: it leaves the keyspace belonging to the right-hand side unprotected for a brief period of time. This isn't benign: the snapshot is basically clobbering the right-hand side.
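To make this conjecture concrete, here is a self-contained toy model of the interleaving (not CockroachDB code; the key bounds and range IDs just mirror the log above), showing how the window between removing the RHS and widening the LHS lets the snapshot's overlap check pass:

```go
package main

import "fmt"

// span is a half-open key interval [start, end); a stand-in for a range
// descriptor's key bounds. "faa" and "max" stand in for \xfaa and /Max.
type span struct{ start, end string }

func (a span) overlaps(b span) bool { return a.start < b.end && b.start < a.end }

// toyStore stands in for the store's replicasByKey map: the key spans
// currently covered by initialized replicas and snapshot placeholders.
type toyStore struct{ byID map[string]span }

// overlapping returns the ID of any tracked span that overlaps sp.
func (s *toyStore) overlapping(sp span) (string, bool) {
	for id, existing := range s.byID {
		if existing.overlaps(sp) {
			return id, true
		}
	}
	return "", false
}

func main() {
	// On n4, before the merge trigger applies: r44 covers [a, faa), r45 covers [faa, max).
	s := &toyStore{byID: map[string]span{
		"r44": {"a", "faa"},
		"r45": {"faa", "max"},
	}}
	snap := span{"faa", "max"} // bounds of the incoming snapshot for r46

	// Step 1: MergeRange removes the subsumed RHS replica r45 ...
	delete(s.byID, "r45")

	// Step 2: ... but only widens the LHS descriptor at the very end. In the
	// window in between, the snapshot's overlap check finds nothing, so a
	// placeholder for r46 is accepted.
	if _, busy := s.overlapping(snap); !busy {
		s.byID["placeholder-r46"] = snap
		fmt.Println("snapshot accepted: keyspace momentarily uncovered")
	}

	// Step 3: MergeRange finishes and widens r44 across the merged keyspace.
	s.byID["r44"] = span{"a", "max"}

	// Step 4: applying the snapshot removes its placeholder and asserts that
	// nothing else covers those bounds; r44 now does, which is the
	// "corrupted replicasByKey map" condition.
	delete(s.byID, "placeholder-r46")
	if id, busy := s.overlapping(snap); busy {
		fmt.Printf("placeholder bounds now overlap %s: assertion fires\n", id)
	}
}
```

Running this prints both messages: the snapshot is accepted during the window, and the later check over the placeholder's bounds trips.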

I managed to reproduce the bug on my gceworker with vanilla

make stress PKG=./pkg/kv/kvserver TESTS=TestChangeReplicasLeaveAtomicRacesWithMerge

1297 runs completed, 1 failures, over 3m49s

@erikgrinaker
Contributor

I don't believe we've seen this failure before? Any thoughts on why it's cropping up now?

@tbg
Member

tbg commented Dec 13, 2021

I'll bisect it now. Without having checked, my guess would be that the new thing is the snapshot occurring. Maybe we accidentally disabled the snapshot delay heuristic. One of @nvanbenschoten's recent quiescing cleanups might have done it. Pure pure pure speculation. We'll find out. I don't think the bug is new.

@tbg
Member

tbg commented Dec 13, 2021

Still bisecting, but with this diff it fails pretty much every time (not surprising, but still):

diff --git a/pkg/kv/kvserver/store_merge.go b/pkg/kv/kvserver/store_merge.go
index e96737888f..c962e708ae 100644
--- a/pkg/kv/kvserver/store_merge.go
+++ b/pkg/kv/kvserver/store_merge.go
@@ -16,6 +16,8 @@ import (
 	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/readsummary/rspb"
 	"github.com/cockroachdb/cockroach/pkg/roachpb"
 	"github.com/cockroachdb/cockroach/pkg/util/hlc"
+	"github.com/cockroachdb/cockroach/pkg/util/iterutil"
+	"github.com/cockroachdb/cockroach/pkg/util/log"
 	"github.com/cockroachdb/errors"
 )
 
@@ -38,6 +40,33 @@ func (s *Store) MergeRange(
 			newLeftDesc.EndKey, oldLeftDesc.EndKey)
 	}
 
+	// Check that there's never a gap to the right of the pre-merge LHS in replicasByKey.
+	assertCtx, stopAsserting := context.WithCancel(ctx)
+	assertStart := leftRepl.Desc().EndKey
+	assertEnd := assertStart.Next()
+	defer stopAsserting()
+	_ = s.stopper.RunAsyncTask(assertCtx, "force-assertion", func(ctx context.Context) {
+		for {
+			func() {
+				s.mu.Lock()
+				defer s.mu.Unlock()
+				var it replicaOrPlaceholder
+				if err := s.mu.replicasByKey.VisitKeyRange(
+					context.Background(), assertStart, assertEnd, AscendingKeyOrder,
+					func(ctx context.Context, iit replicaOrPlaceholder) error {
+						it = iit
+						return iterutil.StopIteration()
+					}); err != nil {
+					log.Fatalf(ctx, "%v", err)
+				}
+				if it.item != nil {
+					return
+				}
+				log.Fatalf(ctx, "found hole in keyspace despite pending merge")
+			}()
+		}
+	})
+
 	rightRepl, err := s.GetReplica(rightDesc.RangeID)
 	if err != nil {
 		return err

@tbg
Member

tbg commented Dec 13, 2021

It's hard to really bisect this down since the test manages to hit various other fatal errors, for example

F211213 15:08:54.680980 5224 kv/kvserver/store_raft.go:511 [n4,s4,r44/4:/{Table/Max-Max}] 1 unable to acquire split lock: %!v(PANIC=SafeFormatter method: runtime error: invalid memory address or nil pointer dereference)

As far as I can tell, the problem (leaving the keyspace unguarded) has existed for some time and definitely predates #72471, which I initially thought had introduced this. I'm going to focus my energy on fixing things up so that this test does not hit a fatal error on a 20-node cluster within hours.

@erikgrinaker
Contributor

Interesting, thanks for digging into this. Have you tried a repro on a release branch, i.e. 21.2?

@tbg
Member

tbg commented Dec 13, 2021

Good idea, will try that tomorrow.

@tbg
Member

tbg commented Dec 14, 2021

release-21.2:

make stress PKG=./pkg/kv/kvserver TESTS=TestChangeReplicasLeaveAtomicRacesWithMerge STRESSFLAGS='-timeout 8m' 2>&1 | tee stress.log
[...]
F211214 09:54:38.686702 4857 kv/kvserver/store_split.go:154  [n2,s2,r44/2:{/Table/Max-\xfaa}] 1  [n2,s2,r44/2:{/Table/Max-\xfaa}]: failed to update Store after split: unable to add replica [n2,s2,r46/2:{\xfaa-/Max}]: [n2,s2]: cannot addReplicaInternalLocked; range [n2,s2,r46/2:{\xfaa-/Max}] has overlapping range r46:{\xfaa-/Max} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=12, sticky=9223372036.854775807,0]

1443 runs completed, 1 failures, over 4m21s

tbg added a commit to tbg/cockroach that referenced this issue Dec 17, 2021
In cockroachdb#73721 we saw the following assertion fire:

> kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1  unable to
> remove placeholder: corrupted replicasByKey map: <nil> and [...]

This is because `MergeRange` removes from the Store's in-memory map the
right-hand side Replica before extending the left-hand side, leaving a
gap for a snapshot to sneak in. A similar problem exists when a snapshot
widens the existing range (i.e. the snapshot reflects the results of a
merge). This commit closes both gaps.

I verified the fix by inserting this code & calling it at the top of
`(*Store).MergeRange` as well as `applySnapshot`:

```go
func (s *Store) assertNoHole(ctx context.Context, from, to roachpb.RKey) func() {
	caller := string(debug.Stack())
	if from.Equal(roachpb.RKeyMax) {
		// There will be a hole to the right of RKeyMax but it's just the end of
		// the addressable keyspace.
		return func() {}
	}
	// Check that there's never a gap to the right of the pre-merge LHS in replicasByKey.
	ctx, stopAsserting := context.WithCancel(ctx)
	_ = s.stopper.RunAsyncTask(ctx, "force-assertion", func(ctx context.Context) {
		for ctx.Err() == nil {
			func() {
				s.mu.Lock()
				defer s.mu.Unlock()
				var it replicaOrPlaceholder
				if err := s.mu.replicasByKey.VisitKeyRange(
					context.Background(), from, to, AscendingKeyOrder,
					func(ctx context.Context, iit replicaOrPlaceholder) error {
						it = iit
						return iterutil.StopIteration()
					}); err != nil {
					log.Fatalf(ctx, "%v", err)
				}
				if it.item != nil {
					return
				}
				log.Fatalf(ctx, "found hole in keyspace [%s,%s), during:\n%s", from, to, caller)
			}()
		}
	})
	return stopAsserting
}
```

```go
// (*Store).applySnapshot
	{
		var from, to roachpb.RKey
		if isInitialSnap {
			// For uninitialized replicas, there must be a placeholder that covers
			// the snapshot's bounds, so basically check that. A synchronous check
			// here would be simpler but this works well enough.
			d := inSnap.placeholder.Desc()
			from, to = d.StartKey, d.EndKey
		} else {
			// For snapshots to existing replicas, from and to usually match (i.e.
			// nothing is asserted) but if the snapshot spans a merge then we're
			// going to assert that we're transferring the keyspace from the subsumed
			// replicas to this replica seamlessly.
			d := r.Desc()
			from, to = d.EndKey, inSnap.Desc.EndKey
		}

		defer r.store.assertNoHole(ctx, from, to)()
	}

// (*Store).MergeRange
	defer s.assertNoHole(ctx, leftRepl.Desc().EndKey, newLeftDesc.EndKey)()
```

The bug reproduced, before this fix, in
`TestStoreRangeMergeTimestampCache` and
`TestChangeReplicasLeaveAtomicRacesWithMerge`, covering both the
snapshot and merge trigger cases. I'm not too happy to merge this
without the same kind of active test coverage, but the above has a
chance of false positives (if Replica gets removed while assertion loop
still running) and it's unclear when exactly we would enable it (behind
the `crdbtest` tag perhaps)?

I am dissatisfied with a few things I realized (or rather, rediscovered)
while working on this, but since this PR needs to be backported possibly
to all past versions, I am refraining from any refactors. Nevertheless,
here's what annoyed me:

- There is no unified API for managing the store's tracked replicas. As
  a result, there are lots of callers meddling with the `replicasByKey`,
  `uninitializedRanges`, etc, maps, adding complexity.
- the `replicasByKey` btree contains initialized `Replicas` and uses
  their key bounds. This makes for a fairly complex locking story, and
  in particular it's easy to deadlock when holding any replica's lock and
  accessing the btree. It could be easier, in conjunction with the above
  point, to make the btree not hold the `Replica` directly, and to mutate
  the btree in a critical section with calling `(*Replica).setDescLocked`.

Release note (bug fix): A bug was fixed that, in very rare cases, could
result in a node terminating with fatal error "unable to remove
placeholder: corrupted replicasByKey map". To avoid potential data
corruption, users affected by this crash should not restart the node,
but instead decommission it in absentia and have it rejoin the cluster
under a new NodeID.
tbg added a commit to tbg/cockroach that referenced this issue Jan 3, 2022
In cockroachdb#73721 we saw the following assertion fire:

> kv/kvserver/replica_raftstorage.go:932  [n4,s4,r46/3:{-}] 1  unable to
> remove placeholder: corrupted replicasByKey map: <nil> and [...]

This is because `MergeRange` removes from the Store's in-memory map the
right-hand side Replica before extending the left-hand side, leaving a
gap for a snapshot to sneak in. A similar problem exists when a snapshot
widens the existing range (i.e. the snapshot reflects the results of a
merge). This commit closes both gaps.

I verified the fix by calling the newly-introduced (but promptly
disabled, see comment within) `(*Store).maybeAssertNoHole` from
both `(*Store).MergeRange` and `applySnapshot`.

The bug reproduced via this assertion, before this fix, in
`TestStoreRangeMergeTimestampCache` and
`TestChangeReplicasLeaveAtomicRacesWithMerge`, covering both the
snapshot and merge trigger cases. I'm not too happy to merge this
without the same kind of active test coverage, but since this will
require a backport, it's better not to rattle the cage too much.
I filed cockroachdb#74384 to clean this up fully.

Release note (bug fix): A bug was fixed that, in very rare cases, could
result in a node terminating with fatal error "unable to remove
placeholder: corrupted replicasByKey map". To avoid potential data
corruption, users affected by this crash should not restart the node,
but instead decommission it in absentia and have it rejoin the cluster
under a new NodeID.
tbg added a commit to tbg/cockroach that referenced this issue Jan 4, 2022
craig bot pushed a commit that referenced this issue Jan 4, 2022
73734: kvserver: avoid unprotected keyspace during MergeRange r=erikgrinaker a=tbg

@tbg tbg changed the title from "kv/kvserver: TestChangeReplicasLeaveAtomicRacesWithMerge failed" to "release-21.2: snapshot can sneak into RHS of split" Jan 4, 2022
@tbg tbg removed the branch-master Failures and bugs on the master branch. label Jan 4, 2022
@tbg tbg added A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 4, 2022
@tbg
Member

tbg commented Jan 4, 2022

We need to backport #73734 to release-21.2. That PR is currently baking on master. 21.2.3 is just getting released. The plan is that, once 21.2.4's sha has been chosen, we backport to release-21.2 to maximize bake time for 21.2.5.

@tbg tbg self-assigned this Jan 4, 2022
@tbg
Member

tbg commented Jan 4, 2022

The backport skews with #72471, btw. Just a note for future self.

gustasva pushed a commit to gustasva/cockroach that referenced this issue Jan 4, 2022
@tbg tbg removed their assignment May 31, 2022
@tbg
Member

tbg commented Jun 21, 2022

To follow up on the above, we never backported to 21.2, and at this point it is unlikely that we will.

@erikgrinaker
Contributor

21.2 is no longer supported, closing this out.
