kv: (re-)introduce a stopgap for lack of ReplicaState synchronization #60429
Merged
craig
merged 1 commit into
cockroachdb:master
from
irfansharif:210210.stopgap-rep-version
Feb 10, 2021
Conversation
See cockroachdb#59194 and cockroachdb#58489 for more details.

In cockroachdb#58489 we observed a scary lack of synchronization around how we set the ReplicaState for a given replica, and how we mark a replica as "initialized". This meant it was possible for an entry in Store.mu.replicas to be marked "initialized" while still having an empty ReplicaState. This is now more likely to bite us given that the migrations infrastructure attempts to purge outdated replicas at startup time (when replicas are being initialized, and we're iterating through extant replicas in the Store.mu.replicas map).

We believed this was addressed as part of cockroachdb#58378, but that appears not to be the case. Let's re-introduce this stop-gap while we investigate.

Release note: None
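For orientation, here is a minimal, self-contained Go sketch of what the stopgap amounts to. The type and field names below are hypothetical simplifications, not the actual kvserver API; the point is only that the purge loop skips replicas whose version (part of ReplicaState) has not been installed yet, rather than treating them as outdated.

```go
package main

import "sync"

// Hypothetical, heavily simplified stand-ins for the real kvserver types.
type replicaVersion struct{ major, minor int32 }

type replica struct {
	mu struct {
		sync.RWMutex
		version *replicaVersion // nil until the ReplicaState is installed
	}
}

type store struct {
	mu struct {
		sync.Mutex
		replicas map[int64]*replica
	}
}

// purgeOutdatedReplicas sketches the stopgap: a replica can appear in
// the replicas map marked "initialized" before its ReplicaState (and
// hence its version) is set, so a replica with an unset version is
// skipped instead of being treated as outdated.
func (s *store) purgeOutdatedReplicas(cutoff replicaVersion) []*replica {
	s.mu.Lock()
	defer s.mu.Unlock()
	var toPurge []*replica
	for _, r := range s.mu.replicas {
		r.mu.RLock()
		v := r.mu.version
		r.mu.RUnlock()
		if v == nil {
			continue // stopgap: state not yet synchronized; don't purge
		}
		if v.major < cutoff.major ||
			(v.major == cutoff.major && v.minor < cutoff.minor) {
			toPurge = append(toPurge, r)
		}
	}
	return toPurge
}

func main() {}
```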
ajwerner
approved these changes
Feb 10, 2021
😞
😔 bors r+
craig bot
pushed a commit
that referenced
this pull request
Feb 10, 2021
60429: kv: (re-)introduce a stopgap for lack of ReplicaState synchronization r=irfansharif a=irfansharif

See #59194 and #58489 for more details. In #58489 we observed a scary lack of synchronization around how we set the ReplicaState for a given replica, and how we mark a replica as "initialized". This meant it was possible for an entry in Store.mu.replicas to be marked "initialized" while still having an empty ReplicaState. This is now more likely to bite us given that the migrations infrastructure attempts to purge outdated replicas at startup time (when replicas are being initialized, and we're iterating through extant replicas in the Store.mu.replicas map).

We believed this was addressed as part of #58378, but that appears not to be the case. Let's re-introduce this stop-gap while we investigate.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
Build failed.
bors r+
Build succeeded.
irfansharif
added a commit
to irfansharif/cockroach
that referenced
this pull request
Mar 31, 2021
Fixes cockroachdb#58378. Fixes cockroachdb#62267.

Previously it was possible for us to have replicas in-memory, with pre-migrated state, even after a migration was finalized. This led to the kind of badness we were observing in cockroachdb#62267, where it appeared that a replica was not using the applied state key despite us having migrated into it (see TruncatedAndRangeAppliedState, introduced in cockroachdb#58088).

To see how, consider the following set of events (a sketch of the corrected purge logic follows this list):
- Say r42 starts off on n1, n2, and n3
- n3 flaps and so we place a replica for r42 on n4
- n3's replica, r42/3, is now GC-able, but still un-GC-ed
- We run the applied state migration, first migrating all ranges into it and then purging outdated replicas
- We should want to purge r42/3, because it's un-migrated and evaluating anything on it (say a lease request) is unsound: we've bumped version gates that tell the kvserver to always expect post-migration state
- What happens when we try to purge r42/3? Prior to this PR, if it didn't have a replica version, we'd skip over it (!)
- Was it possible for r42/3 to not have a replica version? Shouldn't it have been accounted for when we migrated all ranges? No, and that's precisely why the migration infrastructure purges outdated replicas. The migrate request only returns once it's applied on all followers; in our example that wouldn't include r42/3 since it was no longer one
- The stop-gap in cockroachdb#60429 made it so that we didn't GC r42/3, when we should've been doing the opposite. When iterating over a store's replicas for purging purposes, an empty replica version is fine and expected; we should interpret it as a signal that we're dealing with a replica that was never migrated (to even start using replica versions in the first place). Because it didn't have a valid replica version installed, we can infer that it's soon to be GC-ed (else we wouldn't have been able to finalize the applied state + replica version migration)
- The conditions above made it possible for us to evaluate requests on replicas with migration state out-of-date relative to the store's version
- Boom

Release note: None
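To make the inversion concrete, here is a sketch of the corrected purge behavior, continuing the hypothetical toy types from the earlier sketch (again, not the actual kvserver code): an unset replica version is now taken as proof the replica was never migrated, so it gets purged rather than skipped.

```go
// purgeOutdatedReplicasFixed sketches the corrected logic: an empty
// replica version means the replica predates versioned migrations
// entirely. Had it still been a live member of its range, the applied
// state migration could not have finalized, so the replica must be
// GC-able and is purged rather than skipped.
func (s *store) purgeOutdatedReplicasFixed(cutoff replicaVersion) []*replica {
	s.mu.Lock()
	defer s.mu.Unlock()
	var toPurge []*replica
	for _, r := range s.mu.replicas {
		r.mu.RLock()
		v := r.mu.version
		r.mu.RUnlock()
		if v == nil {
			toPurge = append(toPurge, r) // never migrated: purge, don't skip
			continue
		}
		if v.major < cutoff.major ||
			(v.major == cutoff.major && v.minor < cutoff.minor) {
			toPurge = append(toPurge, r)
		}
	}
	return toPurge
}
```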
craig bot
pushed a commit
that referenced
this pull request
Mar 31, 2021
60835: kv: bump timestamp cache to Pushee.MinTimestamp on PUSH_ABORT r=nvanbenschoten a=nvanbenschoten

Fixes #60779. Fixes #60580.

We were only checking that the batch header timestamp was equal to or greater than the pushee's min timestamp, so this is as far as we can bump the timestamp cache.

62832: geo: minor performance improvement for looping over edges r=otan a=andyyang890

This patch slightly improves the performance of many spatial builtins by storing the number of edges used in the loop conditions of for loops into a variable. We discovered this was taking a lot of time when profiling the point-in-polygon optimization. Release note: None

62838: kvserver: purge gc-able, unmigrated replicas during migrations r=irfansharif a=irfansharif

Fixes #58378. Fixes #62267. (Same commit message as the Mar 31 commit above.) Release note: None

62839: zonepb: make subzone DiffWithZone more accurate r=ajstorm a=otan

* Subzones may be defined in a different order. We did not take this into account, which can cause bugs when e.g. ADD REGION adds a subzone at the end rather than in the old "expected" location in the subzones array. This has been fixed by comparing subzones using an unordered map.
* The ApplyZoneConfig we previously did overwrote subzone fields on the original subzone array element, meaning that if there was a mismatch it would not be reported through validation. This is now fixed by applying the expected zone config to *zonepb.NewZoneConfig() instead.
* Added logic to only check that the zone config matches subzones from active subzone IDs.
* Improved the error messaging when a subzone config is mismatching: namely, added index and partitioning information and differentiated between missing fields and missing / extraneous zone configs.

Resolves #62790

Release note (bug fix): Fixed validation bugs during ALTER TABLE ... SET LOCALITY / crdb_internal.validate_multi_region_zone_config where validation errors could occur when the database of a REGIONAL BY ROW table has a new region added. Also fixed a validation bug where partition zone config mismatches were not caught.

62872: build: use -json for RandomSyntax test r=otan a=rafiss

I'm hoping this will help out with an issue where the test failures seem to be missing helpful logs. Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
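As a small aside on the 62832 change in the batch above, the pattern is simply hoisting a method call out of a loop condition. A toy Go illustration with hypothetical types (the real code iterates S2 geometry edges):

```go
package main

import "fmt"

// polygonLoop is a hypothetical stand-in for an S2-style loop type.
type polygonLoop struct{ edgeWeights []int }

// NumEdges mimics a method call that the compiler won't necessarily
// hoist out of a loop condition on our behalf.
func (l *polygonLoop) NumEdges() int  { return len(l.edgeWeights) }
func (l *polygonLoop) Edge(i int) int { return l.edgeWeights[i] }

func sumEdges(l *polygonLoop) int {
	sum := 0
	// Before: `for i := 0; i < l.NumEdges(); i++` re-evaluates the
	// method on every iteration. After (as in the patch): store the
	// count once and use the variable in the loop condition.
	numEdges := l.NumEdges()
	for i := 0; i < numEdges; i++ {
		sum += l.Edge(i)
	}
	return sum
}

func main() {
	fmt.Println(sumEdges(&polygonLoop{edgeWeights: []int{1, 2, 3}})) // 6
}
```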