kvserver: purging outdated replicas during migrations run into nil pointers #58378
Comments
Hi @ajwerner, please add a C-ategory label to your issue. Check out the label system docs. While you're here, please consider adding an A- label to help keep our repository tidy. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
I think this assertion is a bit racy. I'm a bit surprised by the synchronization we're seeing here, but it's evidently possible for a replica to be evaluating a proposal while concurrently applying a snapshot? In the failure above, r85 is being freshly created by the snapshot, which is still being applied. Part of that application process entails setting the right replica state from the incoming snapshot, which happens here: cockroach/pkg/kv/kvserver/replica_raftstorage.go, lines 1001 to 1005 in bdd0b93.
Given this test starts off at v21.1, the relevant assertion is at cockroach/pkg/kv/kvserver/replica_proposal.go, lines 772 to 774 in bdd0b93.
But because the evaluation seems to be happening concurrently with the snapshot application, it's possible for us to read from an uninitialized ReplicaState, which would then trip up this assertion.
I'm not sure if I'm missing some obvious synchronization that would disallow command evaluation from happening concurrently with the replica being instantiated by means of a snapshot. I think we might have to drop the assertion here.
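For illustration only, here is a minimal, self-contained Go sketch (not CockroachDB code) of the kind of race being described: an "initialized" marker and the state it guards are set in two steps, so a concurrent reader can observe the marker without the state. All names here (`toyReplica`, `applySnapshot`, `evalCheck`) are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// toyReplica is a hypothetical stand-in for a replica whose in-memory
// state is populated while a snapshot is still being applied.
type toyReplica struct {
	mu struct {
		sync.Mutex
		initialized     bool
		usingAppliedKey bool // stands in for a field of the replica state
	}
}

// applySnapshot mimics two-step initialization: the replica is marked
// initialized before its state is actually populated.
func (r *toyReplica) applySnapshot() {
	r.mu.Lock()
	r.mu.initialized = true
	r.mu.Unlock()

	time.Sleep(time.Millisecond) // window in which a reader can race

	r.mu.Lock()
	r.mu.usingAppliedKey = true
	r.mu.Unlock()
}

// evalCheck mimics the tripped assertion: it trusts "initialized" and
// then reads state that may not have been set yet.
func (r *toyReplica) evalCheck() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.mu.initialized && !r.mu.usingAppliedKey {
		return fmt.Errorf("initialized replica with empty state")
	}
	return nil
}

func main() {
	r := &toyReplica{}
	go r.applySnapshot()
	for i := 0; i < 1000; i++ {
		if err := r.evalCheck(); err != nil {
			fmt.Println("assertion tripped:", err)
			return
		}
	}
	fmt.Println("race not observed this run")
}
```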
Touches cockroachdb#58378; a stop-gap while we investigate. Release note: None
Wow, that's... weird, right? First of all (and I'm sure you're aware, just writing this out anyway) there is nothing mixed-version going on. Everyone here ought to be using the applied state right off the bat. And it pretty much looks as though what you're seeing is true: we are evaluating a lease request on a replica that was just created in response to a snapshot, and the snapshot has not been applied yet, meaning that the lease will evaluate against a completely blank disk state, while the in-memory state already has the descriptor (see the log tags on the fatal error). But looking into this further, I'm not sure how this would work. We are setting the replica state here: cockroach/pkg/kv/kvserver/replica_raftstorage.go, lines 987 to 1010 in ce4b23d
but note that ingesting the actual data happens before that. This would mean that there shouldn't be any way for a lease to make it to evaluation (wouldn't it bounce on the key check since it addresses to the start key?!) unless it can also observe the on-disk state. So I'm still not sure what's going on here - the synchronization of req eval and snapshots seems lacking, but I can't use that to explain this particular problem. Can you? I don't think we'd ever send out a snapshot that has …
58387: kvserver: downgrade a potentially racey assertion r=tbg a=irfansharif Touches #58378; a stop-gap while we investigate. Release note: None Co-authored-by: irfan sharif <[email protected]>
We set the entry in the store before we set the raft state:
I believe that circumvents the key check. We should move some of this logic around so that requests can't find the replica until after it has been initialized.
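A sketch of the ordering being suggested, using hypothetical `toyStore`/`toyReplica` types rather than the real ones: the replica's in-memory state is populated before the replica is published in the store's map, so a request routed through the map can never observe an empty state.

```go
package main

import (
	"fmt"
	"sync"
)

// toyReplica and toyStore are hypothetical stand-ins, not CockroachDB
// types; they only illustrate the ordering being proposed.
type toyReplica struct {
	mu struct {
		sync.Mutex
		state string // stands in for the replica's in-memory state
	}
}

type toyStore struct {
	mu struct {
		sync.Mutex
		replicas map[int64]*toyReplica // stands in for Store.mu.replicas
	}
}

// addReplicaFromSnapshot populates the replica's in-memory state first
// and only then publishes it in the store's map, so a request routed
// through the map can never observe an empty state.
func (s *toyStore) addReplicaFromSnapshot(rangeID int64, state string) *toyReplica {
	r := &toyReplica{}
	r.mu.Lock()
	r.mu.state = state // step 1: initialize the state
	r.mu.Unlock()

	s.mu.Lock()
	defer s.mu.Unlock()
	if s.mu.replicas == nil {
		s.mu.replicas = make(map[int64]*toyReplica)
	}
	s.mu.replicas[rangeID] = r // step 2: make the replica findable
	return r
}

func main() {
	s := &toyStore{}
	r := s.addReplicaFromSnapshot(85, "state from snapshot")
	fmt.Println("replica published with state:", r.mu.state)
}
```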
Oh - good call. I'll take a look.
Let's say we fix this - are we still on thin ice? I believe fixing this is good enough to address the case of the very first snapshot, as an uninitialized replica (probably) won't ever try to evaluate anything. When a snapshot arrives on an existing replica, and that replica is evaluating some requests at the time (probably follower reads, then), it does so under a timestamp that is covered by the lease (including the uncertainty interval), and the MVCC history observable by those commands is immutable, so it won't matter per se whether the commands apply against the old or new state. But can there still be weird interleavings regarding the in-memory state? A command may read something from …
I poked at this for a while and it's not super easy to untangle. We'd really want to update the in-memory state of the replica before switching out the placeholder at the store, at least for the first snapshot, but in general we do want to update the descriptor atomically with updating the store's key lookup index. Nothing that can't be fixed here, but it will take a little bit of elbow grease, and since it's very tedious with 5pm brain, I will look at it again tomorrow.
I think latching is a little too high-level for this kind of synchronization. Snapshots can block writes that are holding latches, so without some form of re-entrance, we'd run into all kinds of deadlocks if we tried to use them. This sounds more like the kind of issue that readOnlyCmdMu is meant to address.

The interactions between below-Raft state changes and above-Raft state observations are pretty subtle. In part, that's because we aren't explicit about when we're observing state above Raft (see #55461), so we end up needing to be very defensive about how we change things below Raft. For instance, snapshot ingestion goes through all kinds of hoops to make sure that the entire snapshot is ingested atomically to disk, even if that means that 4 of the 5 SSTs are tiny. If we had a cleaner policy about snapshotting an entire range's state during evaluation, then we could add some fine-grained synchronization around this snapshot and things would be a lot easier to think through.

Regarding this specific issue, it is surprising how late we set …
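To make the "snapshot the range's state for evaluation" idea concrete, here is a hedged Go sketch with hypothetical types (`rangeState`, `toyReplica`, `evalSnapshot`): evaluation captures one consistent copy of the in-memory state up front and reads only from that copy, so below-Raft changes applied afterwards can't be half-observed.

```go
package main

import (
	"fmt"
	"sync"
)

// rangeState is a hypothetical, plain-data view of the in-memory state
// a command evaluates against; none of these types are CockroachDB's.
type rangeState struct {
	LeaseHolder     string
	UsingAppliedKey bool
}

// toyReplica guards its mutable state with a mutex. evalSnapshot hands
// evaluation an immutable copy taken under that mutex, so below-Raft
// changes (e.g. applying a snapshot) can't be half-observed mid-way
// through evaluation.
type toyReplica struct {
	mu struct {
		sync.Mutex
		state rangeState
	}
}

func (r *toyReplica) evalSnapshot() rangeState {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.mu.state // struct copy: one consistent view per evaluation
}

// evaluate reads only from the captured snapshot, never from the
// replica's live state.
func evaluate(st rangeState) string {
	return fmt.Sprintf("evaluated against lease holder %s (applied state key: %t)",
		st.LeaseHolder, st.UsingAppliedKey)
}

func main() {
	r := &toyReplica{}
	r.mu.Lock()
	r.mu.state = rangeState{LeaseHolder: "n1", UsingAppliedKey: true}
	r.mu.Unlock()

	fmt.Println(evaluate(r.evalSnapshot()))
}
```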
Good point about latching being at the wrong level. I remember readOnlyCmdMu and its flaws. I like the idea that the range state should be snapshotted as well. When we make a ReplicaEvalContext, it basically just passes through to … I like how that gets us a small step closer to a world in which commands evaluate in a true sandbox.

Either way, with this issue on our plate now, I think the best I can do is to move the state update before the store update. That will already be difficult enough and requires some thinking through as well. For example, if we "just" move the state update before the store update and the snapshot spans a split, we will shrink the replica before notifying the store of that fact, which can't be good; similar issues can come up on a merge (we'll destroy the RHS before the LHS expands? Not sure how this all works today). Ideally I can manage to do them both at the same time (i.e. hold replica.mu across store.mu).
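A sketch of the "hold replica.mu across store.mu" approach, again with hypothetical types standing in for the real ones: the descriptor and the store's key lookup index are updated in one critical section, so the split/merge hazards mentioned above can't expose a half-updated view.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins, not CockroachDB types: they only illustrate
// holding the replica's mutex across the store's mutex so the
// descriptor and the store's key lookup index change together.
type toyDescriptor struct {
	StartKey, EndKey string
}

type toyReplica struct {
	mu struct {
		sync.Mutex
		desc toyDescriptor
	}
}

type toyStore struct {
	mu struct {
		sync.Mutex
		byStartKey map[string]*toyReplica // stands in for the key lookup index
	}
}

// applyStateAndReindex updates the replica's descriptor and the store's
// index in one critical section, so no request can observe one without
// the other (e.g. a snapshot that shrinks the replica across a split).
func (s *toyStore) applyStateAndReindex(r *toyReplica, newDesc toyDescriptor) {
	r.mu.Lock()
	defer r.mu.Unlock()
	s.mu.Lock()
	defer s.mu.Unlock()

	delete(s.mu.byStartKey, r.mu.desc.StartKey)
	r.mu.desc = newDesc
	s.mu.byStartKey[newDesc.StartKey] = r
}

func main() {
	s := &toyStore{}
	s.mu.byStartKey = map[string]*toyReplica{}

	r := &toyReplica{}
	r.mu.desc = toyDescriptor{StartKey: "a", EndKey: "z"}
	s.mu.byStartKey["a"] = r

	// A snapshot spanning a split shrinks the replica to [a, m).
	s.applyStateAndReindex(r, toyDescriptor{StartKey: "a", EndKey: "m"})
	fmt.Println("new descriptor:", r.mu.desc)
}
```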
There's a scary lack of synchronization around how we set the ReplicaState for a given replica, and how we mark a replica as "initialized". What this means is that, very temporarily, it's possible for the entry in Store.mu.replicas to be both "initialized" and have an empty ReplicaState. This was an existing problem, but is now more likely to bite us given the migrations infrastructure attempts to purge outdated replicas at startup time (when replicas are being initialized, and we're iterating through extant replicas in the Store.mu.replicas map). This issue has caused a bit of recent instability: cockroachdb#59180, cockroachdb#58489, cockroachdb#58523, and cockroachdb#58378. While we work on a more considered fix to the problem (tracked in cockroachdb#58489), we can stop the bleeding in the interim (and unskip some tests). Release note: None
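Based on the descriptions in this thread (and the follow-up in #62838 further down), the stop-gap's effect appears to amount to something like the following hypothetical sketch: the startup purge skips any replica whose version is still unset rather than dereferencing it. None of these names are the actual CockroachDB identifiers.

```go
package main

import "fmt"

// replicaVersion and toyReplica are hypothetical, not CockroachDB
// types; purgeOutdated only sketches the shape of the stop-gap: while
// iterating replicas at startup, an unset version is treated as "not
// safe to touch yet" and skipped rather than dereferenced.
type replicaVersion struct{ Major, Minor int }

func (v replicaVersion) less(o replicaVersion) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	return v.Minor < o.Minor
}

type toyReplica struct {
	rangeID int64
	version *replicaVersion // nil while the in-memory state is still empty
}

func purgeOutdated(replicas []*toyReplica, min replicaVersion) {
	for _, r := range replicas {
		if r.version == nil {
			// Stop-gap: skip instead of crashing on a nil version. (The
			// eventual fix in #62838, below, reverses this and purges such
			// replicas as unmigrated.)
			continue
		}
		if r.version.less(min) {
			fmt.Printf("purging outdated replica r%d\n", r.rangeID)
		}
	}
}

func main() {
	purgeOutdated([]*toyReplica{
		{rangeID: 42, version: &replicaVersion{Major: 20, Minor: 2}},
		{rangeID: 85, version: nil}, // still being initialized
	}, replicaVersion{Major: 21, Minor: 1})
}
```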
59194: kv: introduce a stopgap for lack of ReplicaState synchronization r=irfansharif a=irfansharif

There's a scary lack of synchronization around how we set the ReplicaState for a given replica, and how we mark a replica as "initialized". What this means is that, very temporarily, it's possible for the entry in Store.mu.replicas to be both "initialized" and have an empty ReplicaState. This was an existing problem, but is now more likely to bite us given the migrations infrastructure attempts to purge outdated replicas at startup time (when replicas are being initialized, and we're iterating through extant replicas in the Store.mu.replicas map). This issue has caused a bit of recent instability: #59180, #58489, #58523, and #58378. While we work on a more considered fix to the problem (tracked in #58489), we can stop the bleeding in the interim (and unskip some tests). Release note: None

59201: sql: add telemetry for materialized views and set schema. r=otan a=RichardJCai

sql: add telemetry for materialized views and set schema. Release note: None Resolves #57299

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: richardjcai <[email protected]>
See cockroachdb#59194 and cockroachdb#58489 for more details. In cockroachdb#58489 we observed a scary lack of synchronization around how we set the ReplicaState for a given replica, and how we mark a replica as "initialized". What this meant is that it was possible for the entry in Store.mu.replicas to be both "initialized" and have an empty ReplicaState. This is now more likely to bite us given the migrations infrastructure attempts to purge outdated replicas at startup time (when replicas are being initialized, and we're iterating through extant replicas in the Store.mu.replicas map). We believed this was addressed as part of cockroachdb#58378, but that appears not to be the case. Let's re-introduce this stop-gap while we investigate. Release note: None
60429: kv: (re-)introduce a stopgap for lack of ReplicaState synchronization r=irfansharif a=irfansharif

See #59194 and #58489 for more details. In #58489 we observed a scary lack of synchronization around how we set the ReplicaState for a given replica, and how we mark a replica as "initialized". What this meant is that it was possible for the entry in Store.mu.replicas to be both "initialized" and have an empty ReplicaState. This is now more likely to bite us given the migrations infrastructure attempts to purge outdated replicas at startup time (when replicas are being initialized, and we're iterating through extant replicas in the Store.mu.replicas map). We believed this was addressed as part of #58378, but that appears not to be the case. Let's re-introduce this stop-gap while we investigate. Release note: None

Co-authored-by: irfan sharif <[email protected]>
60281: sql/pgwire: send placeholder BackendKeyData r=asubiotto,spaskob a=rafiss

Fixes #13191. Some tools expect this message to be returned at connection time and will not connect without it. CockroachDB does not support pgwire cancellation, but we can still send a placeholder value here, and continue ignoring cancellation requests like we already do. Added a small test to make sure nothing broke. Release note (sql change): When a connection is established, CockroachDB will now return a placeholder BackendKeyData message in the response. This is for compatibility with some tools, but using the BackendKeyData to cancel a query will still have no effect, just the same as before.

60429: kv: (re-)introduce a stopgap for lack of ReplicaState synchronization r=irfansharif a=irfansharif

See #59194 and #58489 for more details. In #58489 we observed a scary lack of synchronization around how we set the ReplicaState for a given replica, and how we mark a replica as "initialized". What this meant is that it was possible for the entry in Store.mu.replicas to be both "initialized" and have an empty ReplicaState. This is now more likely to bite us given the migrations infrastructure attempts to purge outdated replicas at startup time (when replicas are being initialized, and we're iterating through extant replicas in the Store.mu.replicas map). We believed this was addressed as part of #58378, but that appears not to be the case. Let's re-introduce this stop-gap while we investigate. Release note: None

60441: bazel: quash unnecessary dependency on `pkg/util/uuid` from protos r=rickystewart a=rickystewart

This dependency can be replaced with a few `# keep` deps in a few choice proto targets, which is what we should have done the whole time anyway. This fixes build failures elsewhere in tree -- for example, `pkg/util/uuid:uuid_test`, which doesn't play nicely with `rules_go` in the presence of this dependency. Fixes #59778. Release note: None

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Going to re-open this issue as a placeholder given we needed to revert the stop-gap in #60429.
Starting to look at this now.
I ran the tests with the following diff applied:

```diff
diff --git a/pkg/server/testserver.go b/pkg/server/testserver.go
index 082bf10558..b679fda253 100644
--- a/pkg/server/testserver.go
+++ b/pkg/server/testserver.go
@@ -456,7 +456,28 @@ func (ts *TestServer) NodeDialer() *nodedialer.Dialer {
 // completes.
 func (ts *TestServer) Start() error {
 	ctx := context.Background()
-	return ts.Server.Start(ctx)
+	err := ts.Server.Start(ctx)
+	if err != nil {
+		return err
+	}
+	return ts.stopper.RunAsyncTask(ctx, "foo", func(ctx context.Context) {
+		for {
+			select {
+			case <-ts.stopper.ShouldQuiesce():
+				return
+			default:
+			}
+			if err := ts.Server.node.stores.VisitStores(func(s *kvserver.Store) error {
+				s.VisitReplicas(func(repl *kvserver.Replica) (more bool) {
+					_ = repl.Version()
+					return true
+				})
+				return nil
+			}); err != nil {
+				panic(err)
+			}
+		}
+	})
 }
 
 type dummyProtectedTSProvider struct {
```

To my dismay, it doesn't seem that easy to reproduce this bug.
Fixes cockroachdb#58378. Fixes cockroachdb#62267.

Previously it was possible for us to have replicas in-memory, with pre-migrated state, even after a migration was finalized. This led to the kind of badness we were observing in cockroachdb#62267, where it appeared that a replica was not using the applied state key despite us having migrated into it (see TruncatedAndRangeAppliedState, introduced in cockroachdb#58088).

---

To see how, consider the following set of events:
- Say r42 starts off on n1, n2, and n3.
- n3 flaps and so we place a replica for r42 on n4.
- n3's replica, r42/3, is now GC-able, but still un-GC-ed.
- We run the applied state migration, first migrating all ranges into it and then purging outdated replicas.
- Well, we should want to purge r42/3, because it's un-migrated and evaluating anything on it (say a lease request) is unsound, because we've bumped version gates that tell the kvserver to always expect post-migration state.
- What happens when we try to purge r42/3? Previous to this PR, if it didn't have a replica version, we'd skip over it (!)
- Was it possible for r42/3 to not have a replica version? Shouldn't it have been accounted for when we migrated all ranges? No, and that's precisely why the migration infrastructure purges outdated replicas. The migrate request only returns once it's applied on all followers; in our example that wouldn't include r42/3 since it was no longer one.
- The stop-gap in cockroachdb#60429 made it so that we didn't GC r42/3, when we should've been doing the opposite. When iterating over a store's replicas for purging purposes, an empty replica version is fine and expected; we should interpret that as a signal that we're dealing with a replica that was obviously never migrated (to even start using replica versions in the first place). Because it didn't have a valid replica version installed, we can infer that it's soon to be GC-ed (else we wouldn't have been able to finalize the applied state + replica version migration).
- The conditions above made it possible for us to evaluate requests on replicas with migration state out-of-date relative to the store's version.
- Boom.

Release note: None
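As a rough, hypothetical sketch of the behavior change described above (not the actual CockroachDB code): when iterating a store's replicas to purge outdated ones, an unset replica version is now treated as "never migrated, GC-able" and the replica is purged rather than skipped.

```go
package main

import "fmt"

// Hypothetical stand-ins (not the actual CockroachDB code) sketching
// the behavior change: when purging outdated replicas after a
// migration, an unset replica version now means "never migrated, and
// therefore GC-able", so the replica is purged rather than skipped.
type replicaVersion struct{ Major, Minor int }

func (v replicaVersion) less(o replicaVersion) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	return v.Minor < o.Minor
}

type toyReplica struct {
	rangeID int64
	version *replicaVersion // nil: never migrated into using replica versions
}

func purgeOutdated(replicas []*toyReplica, min replicaVersion) (purged []int64) {
	for _, r := range replicas {
		if r.version == nil || r.version.less(min) {
			// Unset or stale version: this replica missed the migration
			// (like the un-GC-ed r42/3 in the example above), so remove it
			// instead of letting it evaluate requests with pre-migration
			// state.
			purged = append(purged, r.rangeID)
		}
	}
	return purged
}

func main() {
	purged := purgeOutdated([]*toyReplica{
		{rangeID: 42, version: nil}, // stale, unmigrated leftover
		{rangeID: 7, version: &replicaVersion{Major: 21, Minor: 1}},
	}, replicaVersion{Major: 21, Minor: 1})
	fmt.Println("purging replicas:", purged)
}
```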
60835: kv: bump timestamp cache to Pushee.MinTimestamp on PUSH_ABORT r=nvanbenschoten a=nvanbenschoten

Fixes #60779. Fixes #60580. We were only checking that the batch header timestamp was equal to or greater than this pushee's min timestamp, so this is as far as we can bump the timestamp cache.

62832: geo: minor performance improvement for looping over edges r=otan a=andyyang890

This patch slightly improves the performance of many spatial builtins by storing the number of edges used in the loop conditions of for loops into a variable. We discovered this was taking a lot of time when profiling the point-in-polygon optimization. Release note: None

62838: kvserver: purge gc-able, unmigrated replicas during migrations r=irfansharif a=irfansharif

Fixes #58378. Fixes #62267. Previously it was possible for us to have replicas in-memory, with pre-migrated state, even after a migration was finalized. This led to the kind of badness we were observing in #62267, where it appeared that a replica was not using the applied state key despite us having migrated into it (see TruncatedAndRangeAppliedState, introduced in #58088).

---

To see how, consider the following set of events:
- Say r42 starts off on n1, n2, and n3.
- n3 flaps and so we place a replica for r42 on n4.
- n3's replica, r42/3, is now GC-able, but still un-GC-ed.
- We run the applied state migration, first migrating all ranges into it and then purging outdated replicas.
- Well, we should want to purge r42/3, because it's un-migrated and evaluating anything on it (say a lease request) is unsound, because we've bumped version gates that tell the kvserver to always expect post-migration state.
- What happens when we try to purge r42/3? Previous to this PR, if it didn't have a replica version, we'd skip over it (!)
- Was it possible for r42/3 to not have a replica version? Shouldn't it have been accounted for when we migrated all ranges? No, and that's precisely why the migration infrastructure purges outdated replicas. The migrate request only returns once it's applied on all followers; in our example that wouldn't include r42/3 since it was no longer one.
- The stop-gap in #60429 made it so that we didn't GC r42/3, when we should've been doing the opposite. When iterating over a store's replicas for purging purposes, an empty replica version is fine and expected; we should interpret that as a signal that we're dealing with a replica that was obviously never migrated (to even start using replica versions in the first place). Because it didn't have a valid replica version installed, we can infer that it's soon to be GC-ed (else we wouldn't have been able to finalize the applied state + replica version migration).
- The conditions above made it possible for us to evaluate requests on replicas with migration state out-of-date relative to the store's version.
- Boom.

Release note: None

62839: zonepb: make subzone DiffWithZone more accurate r=ajstorm a=otan

* Subzones may be defined in a different order. We did not take this into account, which can cause bugs when e.g. ADD REGION adds a subzone at the end rather than in the old "expected" location in the subzones array. This has been fixed by comparing subzones using an unordered map.
* The ApplyZoneConfig we previously did overwrote subzone fields on the original subzone array element, meaning that if there was a mismatch it would not be reported through validation. This is now fixed by applying the expected zone config to *zonepb.NewZoneConfig() instead.
* Added logic to only check that the zone config matches subzones from active subzone IDs.
* Improved the error messaging when a subzone config is mismatching - namely, added index and partitioning information and differentiated between missing fields and missing / extraneous zone configs.

Resolves #62790. Release note (bug fix): Fixed validation bugs during ALTER TABLE ... SET LOCALITY / crdb_internal.validate_multi_region_zone_config where validation errors could occur when the database of a REGIONAL BY ROW table has a new region added. Also fixed a validation bug where partition zone config mismatches were not caught.

62872: build: use -json for RandomSyntax test r=otan a=rafiss

I'm hoping this will help out with an issue where the test failures seem to be missing helpful logs. Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Describe the problem
On this build I see the following crash:
This code was introduced in #58088. I suppose it's possible that my PR caused this issue, but I doubt it. Another thing to note: I don't think this cluster was upgraded or anything like that, so I have to assume there's some initialization invariant that this code is not expecting.
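For context on the "nil pointers" in the title, here is a tiny hypothetical Go sketch (none of these types are CockroachDB's) of how reading a version off a replica whose in-memory state was never populated turns into a nil-pointer dereference, which is the kind of initialization invariant violation described above.

```go
package main

import "fmt"

// toyVersion and toyReplica are hypothetical; they only illustrate how
// reading a version off a replica whose in-memory state was never
// populated turns into a nil-pointer dereference.
type toyVersion struct{ Major, Minor int }

type toyReplica struct {
	version *toyVersion // nil until the replica's state has been set
}

// Version mirrors the shape of an accessor that assumes the state has
// already been populated; on an uninitialized replica it dereferences nil.
func (r *toyReplica) Version() toyVersion {
	return *r.version
}

func main() {
	defer func() {
		if recovered := recover(); recovered != nil {
			fmt.Println("panic:", recovered)
		}
	}()
	r := &toyReplica{} // state never populated
	fmt.Println(r.Version())
}
```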