release-21.2: snapshot can sneak into RHS of split #73721
This assertion unfortunately isn't logging the right thing: cockroach/pkg/kv/kvserver/store_remove_replica.go Lines 361 to 363 in 84e3fdf
It is meant to log [...]
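As a self-contained illustration of why such an assertion can end up printing `<nil>` rather than the entry that actually overlapped, here is a toy program (plain Go, not CockroachDB code; the `replica` and `replicaOrPlaceholder` types below are simplified stand-ins for the real ones):

```go
package main

import "fmt"

type replica struct{ id int }

// replicaOrPlaceholder stands in for the wrapper held by the store's btree;
// only one of its fields is populated at a time.
type replicaOrPlaceholder struct {
	repl *replica
	// A real wrapper would also carry a placeholder field.
}

// String formats the wrapper the way a naive assertion might: via the replica
// pointer, which is nil whenever the overlapping entry is a placeholder.
func (r replicaOrPlaceholder) String() string {
	return fmt.Sprint(r.repl)
}

func main() {
	var it replicaOrPlaceholder // the overlapping entry was a placeholder, so repl is nil
	// Reproduces the shape of the unhelpful message quoted later in this issue:
	// "corrupted replicasByKey map: <nil> and [...]".
	fmt.Printf("unable to remove placeholder: corrupted replicasByKey map: %v and [...]\n", it)
}
```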
Just going from the log messages, I think what is happening is an overlap between two different ranges:
I am confused as to why the snapshot is even getting received. The way it should work is that we check for overlap before accepting the snapshot: cockroach/pkg/kv/kvserver/store_snapshot.go Lines 606 to 610 in 541adf9
I can see in the logs that n4 removes the subsumed replica r45/4 just a hair before it receives the snapshot, which comes from this place in cockroach/pkg/kv/kvserver/store_merge.go Lines 49 to 58 in 84e3fdf
Interestingly, we update (widen) the left hand side's descriptor only at the very end of cockroach/pkg/kv/kvserver/store_merge.go Lines 119 to 122 in 84e3fdf
So the conjecture goes like this, from the point of view of n4:
If this checks out, the bug is in [...]
I managed to reproduce the bug on my gceworker with vanilla [...]
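To make the conjectured race concrete, here is a small, self-contained toy model (plain Go, not CockroachDB code; all names are invented) of a store that, like `MergeRange`, removes the subsumed right-hand side from its by-key map before widening the left-hand side. During that window, an overlap check of the kind quoted above no longer protects the RHS keyspace, so a snapshot covering those keys is accepted:

```go
package main

import (
	"fmt"
	"sync"
)

// span is a half-open key interval [start, end).
type span struct{ start, end string }

func (s span) overlaps(o span) bool { return s.start < o.end && o.start < s.end }

// toyStore models replicasByKey: the key spans currently claimed by replicas
// or placeholders on a store, protected by a single mutex.
type toyStore struct {
	mu    sync.Mutex
	spans map[string]span // keyed by a fake range ID
}

// canAcceptSnapshot mimics the overlap check run before accepting a snapshot:
// it succeeds only if no tracked span overlaps the snapshot's bounds.
func (s *toyStore) canAcceptSnapshot(snap span) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, sp := range s.spans {
		if sp.overlaps(snap) {
			return false
		}
	}
	return true
}

// mergeRange models the problematic ordering: the RHS entry is removed first,
// and only later is the LHS widened to cover the RHS's keys. The callback
// stands in for whatever happens concurrently during that window.
func (s *toyStore) mergeRange(lhsID, rhsID string, duringGap func()) {
	s.mu.Lock()
	rhs := s.spans[rhsID]
	delete(s.spans, rhsID) // the RHS keyspace is now unclaimed
	s.mu.Unlock()

	duringGap() // nothing guards [rhs.start, rhs.end) here

	s.mu.Lock()
	lhs := s.spans[lhsID]
	lhs.end = rhs.end // finally widen the LHS over the merged keyspace
	s.spans[lhsID] = lhs
	s.mu.Unlock()
}

func main() {
	s := &toyStore{spans: map[string]span{
		"r44": {"a", "k"}, // LHS
		"r45": {"k", "z"}, // RHS, about to be merged into r44
	}}
	snap := span{"k", "z"} // a snapshot covering keys of the subsumed RHS
	s.mergeRange("r44", "r45", func() {
		fmt.Println("snapshot accepted during merge:", s.canAcceptSnapshot(snap))
	})
	fmt.Println("snapshot accepted after merge:", s.canAcceptSnapshot(snap))
}
```

Running this prints `snapshot accepted during merge: true` followed by `snapshot accepted after merge: false`, mirroring the conjecture that the snapshot can only sneak in during the merge's unprotected window.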
I don't believe we've seen this failure before? Any thoughts on why it's cropping up now?
I'll bisect it now. Without having checked, my guess would be that the new thing is the snapshot occurring. Maybe we accidentally disabled the snapshot delay heuristic. One of @nvanbenschoten's recent quiescing cleanups might have done it. Pure pure pure speculation. We'll find out. Don't think the bug is new.
Still bisecting, but with this diff it fails pretty much every time (not surprising, but still):

```diff
diff --git a/pkg/kv/kvserver/store_merge.go b/pkg/kv/kvserver/store_merge.go
index e96737888f..c962e708ae 100644
--- a/pkg/kv/kvserver/store_merge.go
+++ b/pkg/kv/kvserver/store_merge.go
@@ -16,6 +16,8 @@ import (
 	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/readsummary/rspb"
 	"github.com/cockroachdb/cockroach/pkg/roachpb"
 	"github.com/cockroachdb/cockroach/pkg/util/hlc"
+	"github.com/cockroachdb/cockroach/pkg/util/iterutil"
+	"github.com/cockroachdb/cockroach/pkg/util/log"
 	"github.com/cockroachdb/errors"
 )
 
@@ -38,6 +40,33 @@ func (s *Store) MergeRange(
 			newLeftDesc.EndKey, oldLeftDesc.EndKey)
 	}
 
+	// Check that there's never a gap to the right of the pre-merge LHS in replicasByKey.
+	assertCtx, stopAsserting := context.WithCancel(ctx)
+	assertStart := leftRepl.Desc().EndKey
+	assertEnd := assertStart.Next()
+	defer stopAsserting()
+	_ = s.stopper.RunAsyncTask(assertCtx, "force-assertion", func(ctx context.Context) {
+		for {
+			func() {
+				s.mu.Lock()
+				defer s.mu.Unlock()
+				var it replicaOrPlaceholder
+				if err := s.mu.replicasByKey.VisitKeyRange(
+					context.Background(), assertStart, assertEnd, AscendingKeyOrder,
+					func(ctx context.Context, iit replicaOrPlaceholder) error {
+						it = iit
+						return iterutil.StopIteration()
+					}); err != nil {
+					log.Fatalf(ctx, "%v", err)
+				}
+				if it.item != nil {
+					return
+				}
+				log.Fatalf(ctx, "found hole in keyspace despite pending merge")
+			}()
+		}
+	})
+
 	rightRepl, err := s.GetReplica(rightDesc.RangeID)
 	if err != nil {
 		return err
```
It's hard to really bisect this down since the test manages to hit various other fatal errors, for example
As far as I can tell the problem (leaving the keyspace unguarded) has existed for some time and definitely predates #72471, which I initially thought introduced this. I'm going to focus my energy on fixing things up so that this test does not fatal on a 20 node cluster in hours.
Interesting, thanks for digging into this. Have you tried a repro on a release branch, i.e. 21.2?
Good idea, will try that tomorrow. |
release-21.2:
In cockroachdb#73721 we saw the following assertion fire:

> kv/kvserver/replica_raftstorage.go:932 [n4,s4,r46/3:{-}] 1 unable to
> remove placeholder: corrupted replicasByKey map: <nil> and [...]

This is because `MergeRange` removes from the Store's in-memory map the right-hand side Replica before extending the left-hand side, leaving a gap for a snapshot to sneak in. A similar problem exists when a snapshot widens the existing range (i.e. the snapshot reflects the results of a merge). This commit closes both gaps.

I verified the fix by inserting this code & calling it at the top of `(*Store).MergeRange` as well as `applySnapshot`:

```go
func (s *Store) assertNoHole(ctx context.Context, from, to roachpb.RKey) func() {
	caller := string(debug.Stack())
	if from.Equal(roachpb.RKeyMax) {
		// There will be a hole to the right of RKeyMax but it's just the end of
		// the addressable keyspace.
		return func() {}
	}
	// Check that there's never a gap to the right of the pre-merge LHS in replicasByKey.
	ctx, stopAsserting := context.WithCancel(ctx)
	_ = s.stopper.RunAsyncTask(ctx, "force-assertion", func(ctx context.Context) {
		for ctx.Err() == nil {
			func() {
				s.mu.Lock()
				defer s.mu.Unlock()
				var it replicaOrPlaceholder
				if err := s.mu.replicasByKey.VisitKeyRange(
					context.Background(), from, to, AscendingKeyOrder,
					func(ctx context.Context, iit replicaOrPlaceholder) error {
						it = iit
						return iterutil.StopIteration()
					}); err != nil {
					log.Fatalf(ctx, "%v", err)
				}
				if it.item != nil {
					return
				}
				log.Fatalf(ctx, "found hole in keyspace [%s,%s), during:\n%s", from, to, caller)
			}()
		}
	})
	return stopAsserting
}
```

```go
// (*Store).applySnapshot
	{
		var from, to roachpb.RKey
		if isInitialSnap {
			// For uninitialized replicas, there must be a placeholder that covers
			// the snapshot's bounds, so basically check that. A synchronous check
			// here would be simpler but this works well enough.
			d := inSnap.placeholder.Desc()
			from, to = d.StartKey, d.EndKey
		} else {
			// For snapshots to existing replicas, from and to usually match (i.e.
			// nothing is asserted) but if the snapshot spans a merge then we're
			// going to assert that we're transferring the keyspace from the subsumed
			// replicas to this replica seamlessly.
			d := r.Desc()
			from, to = d.EndKey, inSnap.Desc.EndKey
		}
		defer r.store.assertNoHole(ctx, from, to)()
	}

// (*Store).MergeRange
	defer s.assertNoHole(ctx, leftRepl.Desc().EndKey, newLeftDesc.EndKey)()
```

The bug reproduced, before this fix, in `TestStoreRangeMergeTimestampCache` and `TestChangeReplicasLeaveAtomicRacesWithMerge`, covering both the snapshot and merge trigger cases. I'm not too happy to merge this without the same kind of active test coverage, but the above has a chance of false positives (if Replica gets removed while assertion loop still running) and it's unclear when exactly we would enable it (behind the `crdbtest` tag perhaps)?

I am dissatisfied with a few things I realized (or rather, rediscovered) while working on this, but since this PR needs to be backported possibly to all past versions, I am refraining from any refactors. Nevertheless, here's what annoyed me:

- There is no unified API for managing the store's tracked replicas. As a result, there are lots of callers meddling with the `replicasByKey`, `uninitializedRanges`, etc, maps, adding complexity.
- The `replicasByKey` btree contains initialized `Replicas` and uses their key bounds. This makes for a fairly complex locking story, and in particular it's easy to deadlock when holding any replica's lock and accessing the btree. It could be easier, in conjunction with the above point, to make the btree not hold the `Replica` directly, and to mutate the btree in a critical section with calling `(*Replica).setDescLocked`.

Release note (bug fix): A bug was fixed that, in very rare cases, could result in a node terminating with fatal error "unable to remove placeholder: corrupted replicasByKey map". To avoid potential data corruption, users affected by this crash should not restart the node, but instead decommission it in absentia and have it rejoin the cluster under a new NodeID.
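On the `crdbtest` idea floated above: one common Go pattern for gating an expensive background assertion like this behind a build tag is a pair of tag-guarded constants, sketched below. The file names and the `holeAssertionsEnabled` constant are hypothetical and not part of the actual PR.

```go
// File: assertions_on.go (hypothetical)
//go:build crdbtest

package kvserver

// holeAssertionsEnabled turns on the replicasByKey hole assertion in builds
// compiled with `-tags crdbtest`.
const holeAssertionsEnabled = true
```

```go
// File: assertions_off.go (hypothetical)
//go:build !crdbtest

package kvserver

// In regular builds the constant is false, so the assertion (and its
// background goroutine) can be skipped at the call site.
const holeAssertionsEnabled = false
```

The call sites in `MergeRange` and `applySnapshot` would then only start the assertion task when `holeAssertionsEnabled` is true.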
73734: kvserver: avoid unprotected keyspace during MergeRange r=erikgrinaker a=tbg

In #73721 we saw the following assertion fire:

> kv/kvserver/replica_raftstorage.go:932 [n4,s4,r46/3:{-}] 1 unable to
> remove placeholder: corrupted replicasByKey map: <nil> and [...]

This is because `MergeRange` removes from the Store's in-memory map the right-hand side Replica before extending the left-hand side, leaving a gap for a snapshot to sneak in. A similar problem exists when a snapshot widens the existing range (i.e. the snapshot reflects the results of a merge). This commit closes both gaps.

I verified the fix by calling the newly-introduced (but promptly disabled, see comment within) `(*Store).maybeAssertNoHole` from both `(*Store).MergeRange` and `applySnapshot`. The bug reproduced via this assertion, before this fix, in `TestStoreRangeMergeTimestampCache` and `TestChangeReplicasLeaveAtomicRacesWithMerge`, covering both the snapshot and merge trigger cases. I'm not too happy to merge this without the same kind of active test coverage, but since this will require a backport, it's better not to rattle the cage too much. I filed #74384 to clean this up fully.

Release note (bug fix): A bug was fixed that, in very rare cases, could result in a node terminating with fatal error "unable to remove placeholder: corrupted replicasByKey map". To avoid potential data corruption, users affected by this crash should not restart the node, but instead decommission it in absentia and have it rejoin the cluster under a new NodeID.

Co-authored-by: Tobias Grieger <[email protected]>
We need to backport #73734 to release-21.2. That PR is currently baking on master. 21.2.3 is just getting released. The plan is that, once 21.2.4's sha has been chosen, we backport to release-21.2 to maximize bake time for 21.2.5.
The backport skews with #72471, btw. Just a note for future self.
To follow up on the above, we never backported to 21.2, and at this point it is unlikely that we will.
21.2 is no longer supported, closing this out. |
kv/kvserver.TestChangeReplicasLeaveAtomicRacesWithMerge failed with artifacts on master @ 22df0a6658a927e7b65dafbdfc9d790500791993:
See also: [How To Investigate a Go Test Failure \(internal\)](https://cockroachlabs.atlassian.net/l/c/HgfXfJgM)

Parameters in this failure:
Jira issue: CRDB-11707