kv: (re-)introduce a stopgap for lack of ReplicaState synchronization #60429

Merged
merged 1 commit into cockroachdb:master from the 210210.stopgap-rep-version branch on Feb 10, 2021

Conversation

irfansharif
Contributor

See #59194 and #58489 for more details.

In #58489 we observed a scary lack of synchronization around how we set
the ReplicaState for a given replica, and how we mark a replica as
"initialized". What this meant is that it was possible for the entry in
Store.mu.replicas to be both "initialized" and have an empty
ReplicaState. This is now more likely to bite us given that the
migrations infrastructure attempts to purge outdated replicas at startup
(when replicas are being initialized and we're iterating through extant
replicas in the Store.mu.replicas map).

We believed this was addressed as part of #58378, but that appears not
to be the case. Let's re-introduce this stop-gap while we investigate.

Release note: None
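
As a rough illustration of the shape of such a stop-gap, here is a minimal Go sketch with simplified, hypothetical types (the real kvserver structures, locking, and version checks are more involved): while walking the replica map to purge outdated replicas, an entry whose state hasn't been fully populated is skipped rather than acted upon.

```go
package main

import "fmt"

// replicaVersion is a stand-in for a replica's migration version
// (hypothetical type); its zero value means "no version set yet".
type replicaVersion struct {
	Major, Minor int32
}

func (v replicaVersion) isEmpty() bool { return v == (replicaVersion{}) }

// replica is a stand-in for the in-memory replica entry (hypothetical type).
type replica struct {
	initialized bool
	version     replicaVersion
}

// purgeCandidates walks the replica map and returns the IDs it would purge.
// The stop-gap: an entry that is not fully initialized, or that still has an
// empty version, may be racing with initialization, so it is skipped instead
// of being treated as purgeable.
func purgeCandidates(replicas map[int64]*replica, atLeast replicaVersion) []int64 {
	var toPurge []int64
	for id, r := range replicas {
		if !r.initialized || r.version.isEmpty() {
			continue // stop-gap: don't act on incomplete state
		}
		if r.version.Major < atLeast.Major ||
			(r.version.Major == atLeast.Major && r.version.Minor < atLeast.Minor) {
			toPurge = append(toPurge, id)
		}
	}
	return toPurge
}

func main() {
	replicas := map[int64]*replica{
		1: {initialized: true, version: replicaVersion{20, 2}},
		2: {initialized: true}, // "initialized" but empty state: skipped
	}
	fmt.Println(purgeCandidates(replicas, replicaVersion{21, 1})) // [1]
}
```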

@cockroach-teamcity
Member

This change is Reviewable

Contributor

@ajwerner ajwerner left a comment

😞

@irfansharif
Contributor Author

😔

bors r+

craig bot pushed a commit that referenced this pull request Feb 10, 2021
60429: kv: (re-)introduce a stopgap for lack of ReplicaState synchronization r=irfansharif a=irfansharif

See #59194 and #58489 for more details.

In #58489 we observed a scary lack of synchronization around how we set
the ReplicaState for a given replica, and how we mark a replica as
"initialized". What this meant is that it was possible for the entry in
Store.mu.replicas to be both "initialized" and have an empty
ReplicaState. This is now more likely to bite us given that the
migrations infrastructure attempts to purge outdated replicas at startup
(when replicas are being initialized and we're iterating through extant
replicas in the Store.mu.replicas map).

We believed this was addressed as part of #58378, but that appears not
to be the case. Let's re-introduce this stop-gap while we investigate.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
@craig
Contributor

craig bot commented Feb 10, 2021

Build failed:

@irfansharif
Contributor Author

bors r+

@craig
Contributor

craig bot commented Feb 10, 2021

Build succeeded:

@craig craig bot merged commit 5095a3c into cockroachdb:master Feb 10, 2021
@irfansharif irfansharif deleted the 210210.stopgap-rep-version branch February 10, 2021 23:46
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Mar 31, 2021
Fixes cockroachdb#58378.
Fixes cockroachdb#62267.

Previously it was possible for us to have replicas in-memory, with
pre-migrated state, even after a migration was finalized. This led to
the kind of badness we were observing in cockroachdb#62267, where it appeared that
a replica was not using the applied state key despite us having migrated
into it (see TruncatedAndRangeAppliedState, introduced in cockroachdb#58088).

---

To see how, consider the following set of events:

- Say r42 starts off on n1, n2, and n3
- n3 flaps and so we place a replica for r42 on n4
- n3's replica, r42/3, is now GC-able, but still un-GC-ed
- We run the applied state migration, first migrating all ranges into it
  and then purging outdated replicas
- Well, we should want to purge r42/3, because it's un-migrated and
  evaluating anything on it (say a lease request) is unsound because
  we've bumped version gates that tell the kvserver to always expect
  post-migration state
- What happens when we try to purge r42/3? Prior to this PR, if it
  didn't have a replica version, we'd skip over it (!)
- Was it possible for r42/3 to not have a replica version? Shouldn't it
  have been accounted for when we migrated all ranges? No, that's precisely
  why the migration infrastructure purges outdated replicas. The migrate
  request only returns once it's applied on all followers; in our example
  that wouldn't include r42/3 since it was no longer one
- The stop-gap in cockroachdb#60429 made it so that we didn't GC r42/3, when we
  should've been doing the opposite. When iterating over a store's
  replicas for purging purposes, an empty replica version is fine and
  expected; we should interpret that as a signal that we're dealing with a
  replica that was obviously never migrated (to even start using replica
  versions in the first place). Because it didn't have a valid replica
  version installed, we can infer that it's soon to be GC-ed (else we
  wouldn't have been able to finalize the applied state + replica
  version migration)
- The conditions above made it possible for us to evaluate requests on
  replicas with migration state out-of-date relative to the store's
  version
- Boom

Release note: None
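
To make the fix concrete, here is a minimal sketch of the corrected purge decision (reusing the toy replica/replicaVersion types from the earlier sketch; a hypothetical helper, not the actual kvserver code): an empty replica version is now read as "never migrated into replica versions at all", which marks the replica as purgeable rather than exempting it.

```go
// shouldPurge is a hypothetical helper illustrating the inverted check.
func shouldPurge(r *replica, atLeast replicaVersion) bool {
	if r.version.isEmpty() {
		// No replica version at all: this replica predates the applied state +
		// replica version migration, so (like r42/3 in the walkthrough above)
		// it must be a stale, soon-to-be-GC-ed replica. Purge it.
		return true
	}
	// Otherwise purge only if the replica's version lags the required one.
	return r.version.Major < atLeast.Major ||
		(r.version.Major == atLeast.Major && r.version.Minor < atLeast.Minor)
}
```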
craig bot pushed a commit that referenced this pull request Mar 31, 2021
60835: kv: bump timestamp cache to Pushee.MinTimestamp on PUSH_ABORT r=nvanbenschoten a=nvanbenschoten

Fixes #60779.
Fixes #60580.

We were only checking that the batch header timestamp was equal to or
greater than this pushee's min timestamp, so this is as far as we can
bump the timestamp cache.
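
As a rough sketch of that idea (hypothetical, simplified types; not the real kvserver timestamp-cache API): after a successful PUSH_ABORT, the pusher records the pushee's MinTimestamp in the timestamp cache, since the push only verified the batch header timestamp against that bound.

```go
package pushsketch

// hlcTimestamp stands in for an HLC timestamp (hypothetical type).
type hlcTimestamp struct{ wallNanos int64 }

// tsCache stands in for the timestamp cache (hypothetical interface).
type tsCache interface {
	Add(key []byte, ts hlcTimestamp)
}

// onPushAbort bumps the cache over the pushee's txn record key to
// Pushee.MinTimestamp -- as far as the check described above lets us go.
func onPushAbort(c tsCache, pusheeRecordKey []byte, pusheeMinTS hlcTimestamp) {
	c.Add(pusheeRecordKey, pusheeMinTS)
}
```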

62832: geo: minor performance improvement for looping over edges r=otan a=andyyang890

This patch slightly improves the performance of many
spatial builtins by storing the number of edges in a variable
rather than recomputing it in the loop conditions of for loops.
We discovered this was taking a lot of time when
profiling the point-in-polygon optimization.

Release note: None
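
The pattern itself is a simple loop-bound hoist; a minimal, self-contained sketch (hypothetical edge/region types, not the actual geo package code) looks like this:

```go
package main

import "fmt"

// edge and region stand in for the geo package's edge-bearing shapes
// (hypothetical types for illustration only).
type edge struct{ from, to int }

type region struct{ edges []edge }

func (r *region) numEdges() int   { return len(r.edges) }
func (r *region) edge(i int) edge { return r.edges[i] }

func sumEdgeSpans(r *region) int {
	total := 0
	// Store the edge count in a variable once, instead of re-evaluating
	// r.numEdges() in the loop condition on every iteration.
	numEdges := r.numEdges()
	for i := 0; i < numEdges; i++ {
		e := r.edge(i)
		total += e.to - e.from
	}
	return total
}

func main() {
	r := &region{edges: []edge{{0, 2}, {2, 5}}}
	fmt.Println(sumEdgeSpans(r)) // 5
}
```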

62838: kvserver: purge gc-able, unmigrated replicas during migrations r=irfansharif a=irfansharif

Fixes #58378.
Fixes #62267.

Previously it was possible for us to have replicas in-memory, with
pre-migrated state, even after a migration was finalized. This led to
the kind of badness we were observing in #62267, where it appeared that
a replica was not using the applied state key despite us having migrated
into it (see TruncatedAndRangeAppliedState, introduced in #58088).

---

To see how, consider the following set of events:

- Say r42 starts off on n1, n2, and n3
- n3 flaps and so we place a replica for r42 on n4
- n3's replica, r42/3, is now GC-able, but still un-GC-ed
- We run the applied state migration, first migrating all ranges into it
  and then purging outdated replicas
- Well, we should want to purge r42/3, because it's un-migrated and
  evaluating anything on it (say a lease request) is unsound because
  we've bumped version gates that tell the kvserver to always expect
  post-migration state
- What happens when we try to purge r42/3? Prior to this PR, if it
  didn't have a replica version, we'd skip over it (!)
- Was it possible for r42/3 to not have a replica version? Shouldn't it
  have been accounted for when we migrated all ranges? No, that's precisely
  why the migration infrastructure purges outdated replicas. The migrate
  request only returns once it's applied on all followers; in our example
  that wouldn't include r42/3 since it was no longer one
- The stop-gap in #60429 made it so that we didn't GC r42/3, when we
  should've been doing the opposite. When iterating over a store's
  replicas for purging purposes, an empty replica version is fine and
  expected; we should interpret that as a signal that we're dealing with a
  replica that was obviously never migrated (to even start using replica
  versions in the first place). Because it didn't have a valid replica
  version installed, we can infer that it's soon to be GC-ed (else we
  wouldn't have been able to finalize the applied state + replica
  version migration)
- The conditions above made it possible for us to evaluate requests on
  replicas with migration state out-of-date relative to the store's
  version
- Boom

Release note: None


62839: zonepb: make subzone DiffWithZone more accurate r=ajstorm a=otan

* Subzones may be defined in a different order. We did not take this
  into account, which could cause bugs when e.g. ADD REGION adds a subzone
  at the end rather than in the old "expected" location in the subzones
  array. This has been fixed by comparing subzones using an unordered
  map.
* The ApplyZoneConfig call we previously made overwrote subzone fields on the
  original subzone array element, meaning that if there was a mismatch
  it would not be reported through validation. This is now fixed by
  applying the expected zone config to *zonepb.NewZoneConfig() instead.
* Added logic to only check zone config matches for subzones from
  active subzone IDs.
* Improved the error messaging when a subzone config is mismatching:
  namely, added index and partitioning information and differentiated
  between missing fields and missing / extraneous zone configs.

Resolves #62790

Release note (bug fix): Fixed validation bugs during ALTER TABLE ... SET
LOCALITY / crdb_internal.validate_multi_region_zone_config where
validation errors could occur when the database of a REGIONAL BY ROW
table has a new region added. Also fixed a validation bug where
partition zone config mismatches were not caught.
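
A minimal sketch of the order-insensitive subzone comparison from the first bullet above (hypothetical, heavily simplified types; the real zonepb.Subzone carries a full zone config proto):

```go
package subzonecmp

// subzone is a simplified stand-in for zonepb.Subzone.
type subzone struct {
	IndexID       uint32
	PartitionName string
	Config        string // stand-in for the zone config contents
}

type subzoneKey struct {
	indexID   uint32
	partition string
}

// subzonesMatch compares two subzone slices by (index, partition) key, so a
// subzone appended at the end still matches its counterpart regardless of
// its position in the array.
func subzonesMatch(expected, actual []subzone) bool {
	if len(actual) != len(expected) {
		return false
	}
	byKey := make(map[subzoneKey]subzone, len(expected))
	for _, s := range expected {
		byKey[subzoneKey{s.IndexID, s.PartitionName}] = s
	}
	for _, s := range actual {
		want, ok := byKey[subzoneKey{s.IndexID, s.PartitionName}]
		if !ok || want.Config != s.Config {
			return false // missing or mismatching subzone config
		}
	}
	return true
}
```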

62872: build: use -json for RandomSyntax test r=otan a=rafiss

I'm hoping this will help out with an issue where the test failures seem
to be missing helpful logs.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>