-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: avoid clobbering replica conf #75233
kvserver: avoid clobbering replica conf #75233
Conversation
PS: Looking at the CI failures here, merging this bug fix will probably surface more CI failures/improper test assumptions that were masked until now. Take a look at the following for e.g., definitely does not hold true anymore. cockroach/pkg/kv/kvserver/replicate_queue_test.go Lines 264 to 277 in 5359e54
|
1a702c7
to
61e1bc8
Compare
Outstanding CI failure is #75212. |
Fixes cockroachdb#75109. There are two ways to read the configuration applied over a given replica: (a) the copy embedded in each replica struct (b) spanconfig.StoreReader, hanging off the store struct The interface in (b) is implemented by both the span configs infrastructure and the legacy system config span it's designed to replace; it's typically used by KV queues (see cockroachdb#69172). When switching between subsystems in KV (controlled by spanconfig.store.enabled), for we transparently switch the source for (b). We also use then use the right source to refresh (a) for every replica. Incremental updates thereafter mutate (a) directly. As you'd expect, we need to take special care that only one subsystem is writing to (a) at a given point in time, a guarantee that was not upheld before this commit. This bug manifested after we enabled span configs by default (see cockroachdb#73876), and likely affected many tests. To avoid the system config span from clobbering (a) when span configs are enabled, we just need to check what spanconfig.store.enabled holds at the right spot. To repro: # Reliably fails with 1-2m on my GCE worker before this patch, # doesn't after. dev test pkg/kv/kvserver \ -f TestBackpressureNotAppliedWhenReducingRangeSize \ -v --show-logs --timeout 15m --stress Release note: None
09b8f9f
to
2e2894b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a couple questions for the second commit but otherwise everything else looks good.
Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 3 of 3 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)
-- commits, line 38 at r2:
It's worth adding more words around what those assumptions were here.
pkg/kv/kvserver/testing_knobs.go, line 378 at r4 (raw file):
// is used only for (old) tests written with the system config span in mind. // // TODO(irfansharif): Get rid of this knob, maybe by first move
"maybe by first move"
pkg/kv/kvserver/client_store_test.go, line 30 at r3 (raw file):
// TestStoreSetRangesMaxBytes creates a set of ranges via splitting and then // sets the config zone to a custom max bytes value to verify the ranges' max // bytes are updated appropriately. // XXX:
// XXX: ?
pkg/kv/kvserver/replicate_queue_test.go, line 275 at r2 (raw file):
// config asks for 3x replication, and the replication queue will // down-replicate it. require.Len(t, desc.Replicas().Descriptors(), 5)
Do we know why this worked before and doesn't work now? Is that something we want to address or are we happy changing this test?
This test made assumptions about where the scratch range was carved out (the system table ranges) from and what config was originally applied to it (database system's zone config), assumptions that no longer hold with the span configs infrastructure. This commit partially rewrites the test to be compatible with span configs (cockroachdb#73876), though unfortunately making it a bit slower (~4.5s to ~8.5s). The failure mode was masked until now due to the bug fixed in the preceding commit. Release note: None
It's a pretty old test and did not age well in lieu of the span configs infrastructure. We re-write it such that it's no longer so intimately coupled to the gossiped system config span internals. Release note: None
..span configs. Release note: None
2e2894b
to
dedcec7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)
Previously, arulajmani (Arul Ajmani) wrote…
It's worth adding more words around what those assumptions were here.
Done.
pkg/kv/kvserver/testing_knobs.go, line 378 at r4 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
"maybe by first move"
Done.
pkg/kv/kvserver/client_store_test.go, line 30 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
// XXX: ?
Woops, forgot to remove. Done.
pkg/kv/kvserver/replicate_queue_test.go, line 275 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Do we know why this worked before and doesn't work now? Is that something we want to address or are we happy changing this test?
Yes, added to the commit message. Basically this test relied on the the internal behavior of the system config span where scratch range was split off from the system table ranges, and it "originally" applying the system database zone/span config. With "ConfigureScratchRange", we don't have any such lineage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r5, 3 of 3 files at r6, 2 of 2 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)
Build failed (retrying...): |
Build failed (retrying...): |
Build succeeded: |
This test relies on a second node's span config rangefeed to have caught up sufficiently with the global state such that a snapshot, when applied, would construct a replica with the config this test attempts to declare over the originating replica. Though very unlikely to occur (repro-ed only once in hours GCE worker runs), it's possible. This patch is a speculative fix for a failure observed only with manual stressing on a SHA that included cockroachdb#75233. (Also make some prominent span config logging a bit more consistent with one another.) Release note: None
75429: parser: implement ALTER TABLE ... RESET (...) r=rafiss a=otan Only last two commits for review (rest is #75262). See individual commits for details. 75464: *: use bootstrap.BootstrappedSystemIDChecker in tests r=postamar a=postamar Instead of relying on hard-coded constants, tests now rely on the bootstrapped system schema to determine which IDs belong to system tables or not. Release note: None 75561: sql/tests: skip st_buffer in randomized test r=otan a=rafiss fixes #74078 Release note: None 75598: kvserver: deflake TestBackpressureNotAppliedWhenReducingRangeSize r=irfansharif a=irfansharif This test relies on a second node's span config rangefeed to have caught up sufficiently with the global state such that a snapshot, when applied, would construct a replica with the config this test attempts to declare over the originating replica. Though very unlikely to occur (repro-ed only once in hours GCE worker runs), it's possible. This patch is a speculative fix for a failure observed only with manual stressing on a SHA that included #75233. (Also make some prominent span config logging a bit more consistent with one another.) Release note: None Co-authored-by: Oliver Tan <[email protected]> Co-authored-by: Marius Posta <[email protected]> Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: irfan sharif <[email protected]>
Fixes #75109. There are two ways to read the configuration applied over
a given replica:
(a) the copy embedded in each replica struct
(b) spanconfig.StoreReader, hanging off the store struct
The interface in (b) is implemented by both the span configs
infrastructure and the legacy system config span it's designed to
replace; it's typically used by KV queues (see #69172). When switching
between subsystems in KV (controlled by spanconfig.store.enabled), for
we transparently switch the source for (b). We also use then use the
right source to refresh (a) for every replica. Incremental updates
thereafter mutate (a) directly. As you'd expect, we need to take special
care that only one subsystem is writing to (a) at a given point in time,
a guarantee that was not upheld before this commit. This bug manifested
after we enabled span configs by default (see #73876), and likely
affected many tests.
To avoid the system config span from clobbering (a) when span configs
are enabled, we just need to check what spanconfig.store.enabled
holds at the right spot. To repro:
Release note: None