kv: (re-)introduce a stopgap for lack of ReplicaState synchronization #60429

Merged
merged 1 commit into cockroachdb:master from the 210210.stopgap-rep-version branch on Feb 10, 2021

Conversation

irfansharif
Contributor

See #59194 and #58489 for more details.

In #58489 we observed a scary lack of synchronization around how we set
the ReplicaState for a given replica, and how we mark a replica as
"initialized". What this meant is that it was possible for the entry in
Store.mu.replicas to be both "initialized" and have an empty
ReplicaState. This is now more likely to bite us given that the
migrations infrastructure attempts to purge outdated replicas at startup
(when replicas are being initialized and we're iterating through extant
replicas in the Store.mu.replicas map).

We believed this was addressed as part of #58378, but that appears not
to be the case. Let's re-introduce this stop-gap while we investigate.

Release note: None
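
As a rough illustration of the shape of such a stop-gap, here is a minimal Go sketch with simplified, hypothetical types (the real kvserver structures, locking, and version checks are more involved): while walking the replica map to purge outdated replicas, an entry whose state hasn't been fully populated is skipped rather than acted upon.

```go
package main

import "fmt"

// replicaVersion is a stand-in for a replica's migration version
// (hypothetical type); its zero value means "no version set yet".
type replicaVersion struct {
	Major, Minor int32
}

func (v replicaVersion) isEmpty() bool { return v == (replicaVersion{}) }

// replica is a stand-in for the in-memory replica entry (hypothetical type).
type replica struct {
	initialized bool
	version     replicaVersion
}

// purgeCandidates walks the replica map and returns the IDs it would purge.
// The stop-gap: an entry that is not fully initialized, or that still has an
// empty version, may be racing with initialization, so it is skipped instead
// of being treated as purgeable.
func purgeCandidates(replicas map[int64]*replica, atLeast replicaVersion) []int64 {
	var toPurge []int64
	for id, r := range replicas {
		if !r.initialized || r.version.isEmpty() {
			continue // stop-gap: don't act on incomplete state
		}
		if r.version.Major < atLeast.Major ||
			(r.version.Major == atLeast.Major && r.version.Minor < atLeast.Minor) {
			toPurge = append(toPurge, id)
		}
	}
	return toPurge
}

func main() {
	replicas := map[int64]*replica{
		1: {initialized: true, version: replicaVersion{20, 2}},
		2: {initialized: true}, // "initialized" but empty state: skipped
	}
	fmt.Println(purgeCandidates(replicas, replicaVersion{21, 1})) // [1]
}
```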

@cockroach-teamcity
Member

This change is Reviewable

Contributor

@ajwerner ajwerner left a comment

😞

@irfansharif
Contributor Author

😔

bors r+

craig bot pushed a commit that referenced this pull request Feb 10, 2021
60429: kv: (re-)introduce a stopgap for lack of ReplicaState synchronization r=irfansharif a=irfansharif

See #59194 and #58489 for more details.

In #58489 we observed a scary lack of synchronization around how we set
the ReplicaState for a given replica, and how we mark a replica as
"initialized". What this meant is that it was possible for the entry in
Store.mu.replicas to be both "initialized" and have an empty
ReplicaState. This is now more likely to bite us given that the
migrations infrastructure attempts to purge outdated replicas at startup
(when replicas are being initialized and we're iterating through extant
replicas in the Store.mu.replicas map).

We believed this was addressed as part of #58378, but that appears not
to be the case. Let's re-introduce this stop-gap while we investigate.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
@craig
Contributor

craig bot commented Feb 10, 2021

Build failed:

@irfansharif
Contributor Author

bors r+

@craig
Contributor

craig bot commented Feb 10, 2021

Build succeeded:

@craig craig bot merged commit 5095a3c into cockroachdb:master Feb 10, 2021
@irfansharif irfansharif deleted the 210210.stopgap-rep-version branch February 10, 2021 23:46
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Mar 31, 2021
Fixes cockroachdb#58378.
Fixes cockroachdb#62267.

Previously it was possible for us to have replicas in-memory, with
pre-migrated state, even after a migration was finalized. This led to
the kind of badness we were observing in cockroachdb#62267, where it appeared that
a replica was not using the applied state key despite us having migrated
into it (see TruncatedAndRangeAppliedState, introduced in cockroachdb#58088).

---

To see how, consider the following set of events:

- Say r42 starts off on n1, n2, and n3
- n3 flaps and so we place a replica for r42 on n4
- n3's replica, r42/3, is now GC-able, but still un-GC-ed
- We run the applied state migration, first migrating all ranges into it
  and then purging outdated replicas
- Well, we should want to purge r42/3, because it's un-migrated and
  evaluating anything on it (say a lease request) is unsound because
  we've bumped version gates that tell the kvserver to always expect
  post-migration state
- What happens when we try to purge r42/3? Prior to this PR, if it
  didn't have a replica version, we'd skip over it (!)
- Was it possible for r42/3 to not have a replica version? Shouldn't it
  have been accounted for when we migrated all ranges? No, that's precisely
  why the migration infrastructure purges outdated replicas. The migrate
  request only returns once it's applied on all followers; in our example
  that wouldn't include r42/3 since it was no longer one
- The stop-gap in cockroachdb#60429 made it so that we didn't GC r42/3, when we
  should've been doing the opposite. When iterating over a store's
  replicas for purging purposes, an empty replica version is fine and
  expected; we should interpret that as a signal that we're dealing with a
  replica that was obviously never migrated (to even start using replica
  versions in the first place). Because it didn't have a valid replica
  version installed, we can infer that it's soon to be GC-ed (else we
  wouldn't have been able to finalize the applied state + replica
  version migration)
- The conditions above made it possible for us to evaluate requests on
  replicas with migration state out-of-date relative to the store's
  version
- Boom

Release note: None
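
To make the fix concrete, here is a minimal sketch of the corrected purge decision (reusing the toy replica/replicaVersion types from the earlier sketch; a hypothetical helper, not the actual kvserver code): an empty replica version is now read as "never migrated into replica versions at all", which marks the replica as purgeable rather than exempting it.

```go
// shouldPurge is a hypothetical helper illustrating the inverted check.
func shouldPurge(r *replica, atLeast replicaVersion) bool {
	if r.version.isEmpty() {
		// No replica version at all: this replica predates the applied state +
		// replica version migration, so (like r42/3 in the walkthrough above)
		// it must be a stale, soon-to-be-GC-ed replica. Purge it.
		return true
	}
	// Otherwise purge only if the replica's version lags the required one.
	return r.version.Major < atLeast.Major ||
		(r.version.Major == atLeast.Major && r.version.Minor < atLeast.Minor)
}
```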
craig bot pushed a commit that referenced this pull request Mar 31, 2021
60835: kv: bump timestamp cache to Pushee.MinTimestamp on PUSH_ABORT r=nvanbenschoten a=nvanbenschoten

Fixes #60779.
Fixes #60580.

We were only checking that the batch header timestamp was equal to or
greater than this pushee's min timestamp, so this is as far as we can
bump the timestamp cache.
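
As a rough sketch of that idea (hypothetical, simplified types; not the real kvserver timestamp-cache API): after a successful PUSH_ABORT, the pusher records the pushee's MinTimestamp in the timestamp cache, since the push only verified the batch header timestamp against that bound.

```go
package pushsketch

// hlcTimestamp stands in for an HLC timestamp (hypothetical type).
type hlcTimestamp struct{ wallNanos int64 }

// tsCache stands in for the timestamp cache (hypothetical interface).
type tsCache interface {
	Add(key []byte, ts hlcTimestamp)
}

// onPushAbort bumps the cache over the pushee's txn record key to
// Pushee.MinTimestamp -- as far as the check described above lets us go.
func onPushAbort(c tsCache, pusheeRecordKey []byte, pusheeMinTS hlcTimestamp) {
	c.Add(pusheeRecordKey, pusheeMinTS)
}
```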

62832: geo: minor performance improvement for looping over edges r=otan a=andyyang890

This patch slightly improves the performance of many
spatial builtins by storing the number of edges in a variable
rather than recomputing it in the loop conditions of for loops.
We discovered this was taking a lot of time when
profiling the point-in-polygon optimization.

Release note: None
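
The pattern itself is a simple loop-bound hoist; a minimal, self-contained sketch (hypothetical edge/region types, not the actual geo package code) looks like this:

```go
package main

import "fmt"

// edge and region stand in for the geo package's edge-bearing shapes
// (hypothetical types for illustration only).
type edge struct{ from, to int }

type region struct{ edges []edge }

func (r *region) numEdges() int   { return len(r.edges) }
func (r *region) edge(i int) edge { return r.edges[i] }

func sumEdgeSpans(r *region) int {
	total := 0
	// Store the edge count in a variable once, instead of re-evaluating
	// r.numEdges() in the loop condition on every iteration.
	numEdges := r.numEdges()
	for i := 0; i < numEdges; i++ {
		e := r.edge(i)
		total += e.to - e.from
	}
	return total
}

func main() {
	r := &region{edges: []edge{{0, 2}, {2, 5}}}
	fmt.Println(sumEdgeSpans(r)) // 5
}
```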

62838: kvserver: purge gc-able, unmigrated replicas during migrations r=irfansharif a=irfansharif

Fixes #58378.
Fixes #62267.

Previously it was possible for us to have replicas in-memory, with
pre-migrated state, even after a migration was finalized. This led to
the kind of badness we were observing in #62267, where it appeared that
a replica was not using the applied state key despite us having migrated
into it (see TruncatedAndRangeAppliedState, introduced in #58088).

---

To see how, consider the following set of events:

- Say r42 starts off on n1, n2, and n3
- n3 flaps and so we place a replica for r42 on n4
- n3's replica, r42/3, is now GC-able, but still un-GC-ed
- We run the applied state migration, first migrating all ranges into it
  and then purging outdated replicas
- Well, we should want to purge r42/3, because it's un-migrated and
  evaluating anything on it (say a lease request) is unsound because
  we've bumped version gates that tell the kvserver to always expect
  post-migration state
- What happens when we try to purge r42/3? Prior to this PR, if it
  didn't have a replica version, we'd skip over it (!)
- Was it possible for r42/3 to not have a replica version? Shouldn't it
  have been accounted for when we migrated all ranges? No, that's precisely
  why the migration infrastructure purges outdated replicas. The migrate
  request only returns once it's applied on all followers; in our example
  that wouldn't include r42/3 since it was no longer one
- The stop-gap in #60429 made it so that we didn't GC r42/3, when we
  should've been doing the opposite. When iterating over a store's
  replicas for purging purposes, an empty replica version is fine and
  expected; we should interpret that as a signal that we're dealing with a
  replica that was obviously never migrated (to even start using replica
  versions in the first place). Because it didn't have a valid replica
  version installed, we can infer that it's soon to be GC-ed (else we
  wouldn't have been able to finalize the applied state + replica
  version migration)
- The conditions above made it possible for us to evaluate requests on
  replicas with migration state out-of-date relative to the store's
  version
- Boom

Release note: None


62839: zonepb: make subzone DiffWithZone more accurate r=ajstorm a=otan

* Subzones may be defined in a different order. We did not take this
  into account, which could cause bugs when e.g. ADD REGION adds a subzone
  at the end rather than in the old "expected" location in the subzones
  array. This has been fixed by comparing subzones using an unordered
  map.
* The ApplyZoneConfig call we previously made overwrote subzone fields on the
  original subzone array element, meaning that if there was a mismatch
  it would not be reported through validation. This is now fixed by
  applying the expected zone config to *zonepb.NewZoneConfig() instead.
* Added logic to only check zone config matches for subzones from
  active subzone IDs.
* Improved the error messaging when a subzone config is mismatching:
  namely, added index and partitioning information and differentiated
  between missing fields and missing / extraneous zone configs.

Resolves #62790

Release note (bug fix): Fixed validation bugs during ALTER TABLE ... SET
LOCALITY / crdb_internal.validate_multi_region_zone_config where
validation errors could occur when the database of a REGIONAL BY ROW
table has a new region added. Also fixed a validation bug where
partition zone config mismatches were not caught.
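
A minimal sketch of the order-insensitive subzone comparison from the first bullet above (hypothetical, heavily simplified types; the real zonepb.Subzone carries a full zone config proto):

```go
package subzonecmp

// subzone is a simplified stand-in for zonepb.Subzone.
type subzone struct {
	IndexID       uint32
	PartitionName string
	Config        string // stand-in for the zone config contents
}

type subzoneKey struct {
	indexID   uint32
	partition string
}

// subzonesMatch compares two subzone slices by (index, partition) key, so a
// subzone appended at the end still matches its counterpart regardless of
// its position in the array.
func subzonesMatch(expected, actual []subzone) bool {
	if len(actual) != len(expected) {
		return false
	}
	byKey := make(map[subzoneKey]subzone, len(expected))
	for _, s := range expected {
		byKey[subzoneKey{s.IndexID, s.PartitionName}] = s
	}
	for _, s := range actual {
		want, ok := byKey[subzoneKey{s.IndexID, s.PartitionName}]
		if !ok || want.Config != s.Config {
			return false // missing or mismatching subzone config
		}
	}
	return true
}
```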

62872: build: use -json for RandomSyntax test r=otan a=rafiss

I'm hoping this will help out with an issue where the test failures seem
to be missing helpful logs.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>