kvserver: disable {split,replicate,mvccGC} queues until... #98422
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
This only happened for a short window after nodes with existing replicas were restarted
It seems as though it could possibly occur if the range feed couldn't be established as well? Is that unlikely to occur in practice?
Overall :lgtm:. Some minor nits. Looks like the patch needs crlfmt, updated setting names, and an implementation of *mockSpanConfigSubscriber.LastUpdated to pass CI.
Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani and @irfansharif)
-- commits
line 16 at r1:
Could you explain this further? If there are no split keys being computed how could there be a burst of splitting post node restart?
Code quote:
- For the split queue, we wouldn't actually compute any split keys if
unsubscribed, so the missing check was somewhat benign. But we would
be obtusely applying the default range sizes [128MiB,512MiB], so for
clusters configured with larger range sizes, this could lead to a
burst of splitting post node-restart.
-- commits
line 27 at r1:
nit: Generally it affected constraints, not just voter constraints.
cockroach/pkg/kv/kvserver/store_rebalancer.go
Lines 1097 to 1109 in 1a58583
```go
for i := 0; i < len(finalNonVoterTargets); i++ {
	add, remove, _, shouldRebalance := sr.allocator.RebalanceTarget(
		ctx,
		sr.storePool,
		rbCtx.conf,
		rbCtx.candidateReplica.RaftStatus(),
		finalVoterTargets,
		finalNonVoterTargets,
		rbCtx.candidateReplica.RangeUsageInfo(),
		storepool.StoreFilterSuspect,
		allocatorimpl.NonVoterTarget,
		options,
	)
```
pkg/kv/kvserver/store_rebalancer.go
line 158 at r1 (raw file):
```go
	storePool = rq.store.cfg.StorePool
}
```
nit: empty line added.
pkg/kv/kvserver/kvserverbase/base.go
line 43 at r1 (raw file):
```go
var ReplicateQueueEnabled = settings.RegisterBoolSetting(
	settings.SystemOnly,
	"kv.replicate.queue_enabled",
```
nit: change these to be kv.xxxx.queue.enabled
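For reference, a minimal sketch of what the registration could look like under the naming the commit message ends up listing (kv.replicate_queue.enabled); the description string and default value here are assumptions, not the actual patch:

```go
// Sketch only: mirrors the settings.RegisterBoolSetting call quoted above,
// renamed per the commit message. Description and default are assumptions.
var ReplicateQueueEnabled = settings.RegisterBoolSetting(
	settings.SystemOnly,
	"kv.replicate_queue.enabled",
	"whether the replicate queue is enabled",
	true, // assumed default: queue enabled
)
```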
Force-pushed from 3e341fd to aa64776.
...subscribed to span configs. Do the same for the store rebalancer. We applied this treatment for the merge queue back in cockroachdb#78122 since the fallback behavior, if not subscribed, is to use the statically defined span config for every operation.

- For the replicate queue this meant obtusely applying a replication factor of 3, regardless of configuration. This was possible typically post node restart before subscription was initially established. We saw this in cockroachdb#98385. It was possible then for us to ignore configured voter/non-voter/lease constraints.
- For the split queue, we wouldn't actually compute any split keys if unsubscribed, so the missing check was somewhat benign. But we would be obtusely applying the default range sizes [128MiB,512MiB], so for clusters configured with larger range sizes, this could lead to a burst of splitting post node-restart.
- For the MVCC GC queue, it would mean applying the statically configured default GC TTL and ignoring any set protected timestamps. The latter is best-effort protection, but this could result in internal operations relying on protection (like backups, changefeeds) failing informatively. For clusters configured with GC TTL greater than the default, post node-restart it could lead to a burst of MVCC GC activity and AOST queries failing to find expected data.
- For the store rebalancer, ignoring span configs could result in violating lease preferences and voter/non-voter constraints.

Fixes cockroachdb#98421.
Fixes cockroachdb#98385.

While here, we also introduce the following non-public cluster settings to selectively enable/disable KV queues:
- kv.mvcc_gc_queue.enabled
- kv.split_queue.enabled
- kv.replicate_queue.enabled

Release note (bug fix): It was previously possible for CockroachDB to not respect non-default zone configs. This could only happen for a short window after nodes with existing replicas were restarted (measured in seconds), and self-rectified (also within seconds). This manifested in a few ways:
- If num_replicas was set to something other than 3, we would still add or remove replicas to get to 3x replication.
- If num_voters was set explicitly to get a mix of voting and non-voting replicas, it would be ignored. CockroachDB could possibly remove non-voting replicas.
- If range_min_bytes or range_max_bytes were changed from 128 MiB and 512 MiB respectively, we would instead try to size ranges to be within [128 MiB, 512 MiB]. This could appear as an excess amount of range splits or merges, as visible in the Replication Dashboard under "Range Operations".
- If gc.ttlseconds was set to something other than 90000 seconds, we would still GC data only older than 90000s/25h. If the GC TTL was set to something larger than 25h, AOST queries going further back may now start failing. For GC TTLs less than the 25h default, clusters would observe increased disk usage due to more retained garbage.
- If constraints, lease_preferences or voter_constraints were set, they would be ignored. Range data and leases would possibly be moved outside where prescribed.

This issue lasted a few seconds post node-restart, and any zone config violations were rectified shortly after.
Force-pushed from aa64776 to 234bcd0.
I'm only mentioning the node restart point since it's a release note -- rangefeeds are an internal mechanism. But yes, it can also happen if rangefeeds reliably fail to get established, say when there's loss of quorum. But these are centralized system database ranges, and if you've lost quorum there, things are pretty bad. I don't think it's worth mentioning, and I think it's rare given our routine 5x replication.
Repurposed queueConfig.needsSystemConfig to capture this needs-subscription logic; will wait for CI to go green.
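A rough sketch of the gating idea described here; the field and helper names are assumptions, not the actual change:

```go
// Illustrative sketch only: a per-queue flag that makes the queue skip
// processing until the store has an initial span config snapshot.
type queueConfig struct {
	// needsSpanConfigs gates processing on the span config subscription
	// having received its first update.
	needsSpanConfigs bool
}

// canProcess reports whether a queue with the given config may process
// replicas, given whether the store is subscribed to span configs.
func canProcess(cfg queueConfig, subscribed bool) bool {
	return !cfg.needsSpanConfigs || subscribed
}
```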
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani and @kvoli)
Previously, kvoli (Austen) wrote…
Could you explain this further? If there are no split keys being computed how could there be a burst of splitting post node restart?
The burst of splitting would be due to the larger configured range sizes, not due to computed split keys (which, BTW, are only used to figure out schema boundaries, or, under coalesced ranges, the points where adjacent schemas have different span configs). So if the restarted store has 4 GiB ranges, post-restart we might try to slice them back down to 512 MiB ranges.
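As a toy illustration (not the actual split queue code) of why the fallback matters: the size-based split trigger roughly boils down to comparing a range's size against the span config's RangeMaxBytes, so substituting the static default makes configured-large ranges look oversized.

```go
package main

import "fmt"

// needsSplitBySize is a toy stand-in for the size-based split check: a range
// is a split candidate roughly when its size exceeds RangeMaxBytes.
func needsSplitBySize(rangeSizeBytes, rangeMaxBytes int64) bool {
	return rangeSizeBytes > rangeMaxBytes
}

func main() {
	const rangeSize = int64(4 << 30)     // a 4 GiB range on the restarted store
	const configuredMax = int64(4 << 30) // operator-configured range_max_bytes
	const fallbackMax = int64(512 << 20) // static default applied when unsubscribed

	fmt.Println(needsSplitBySize(rangeSize, configuredMax)) // false: no split needed
	fmt.Println(needsSplitBySize(rangeSize, fallbackMax))   // true: burst of splits
}
```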
bors r+
Build failed (retrying...)
Build failed (retrying...)
Nice find! I see this is already in the bors queue, so no need to stop the train for me, but please do address comments in a followup. Specifically, the one around expanding our testing.
Reviewed 19 of 19 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @kvoli)
pkg/kv/kvserver/mvcc_gc_queue.go
line 686 at r2 (raw file):
```go
) (processed bool, err error) {
	if !mgcq.enabled() {
		log.VEventf(ctx, 2, "skipping mvcc gc: queue has been disabled")
```
nit: Consider logging louder, but doing so with log.Every(10 seconds).
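A minimal sketch of what that could look like, assuming the util/log package's Every helper; the variable name, message, and log level are illustrative only:

```go
package kvserver

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// queueDisabledLogEvery rate-limits the "queue disabled" message so a
// disabled queue doesn't spam the logs. Sketch only; names are assumptions.
var queueDisabledLogEvery = log.Every(10 * time.Second)

// logQueueDisabled logs at a more visible level than VEventf, at most once
// per rate-limit interval.
func logQueueDisabled(ctx context.Context, queueName string) {
	if queueDisabledLogEvery.ShouldLog() {
		log.Warningf(ctx, "skipping %s: queue has been disabled", queueName)
	}
}
```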
pkg/kv/kvserver/store.go
line 2066 at r2 (raw file):
```go
}

// GetConfReader exposes access to a configuration reader.
```
Please add a sentence about returning errSpanConfigUnavailable up here as well.
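Perhaps something along these lines (a suggested wording only, using the error name as spelled above):

```go
// GetConfReader exposes access to a configuration reader. It returns
// errSpanConfigUnavailable if the store is not yet subscribed to the span
// configs state.
```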
pkg/kv/kvserver/store.go
line 2090 at r2 (raw file):
```go
// surface explicitly that we don't have any span configs instead of
// falling back to the statically configured one.
// - enabling range merges would be extremely dangerous -- we could
```
Instead of having these words here, they're better placed around where you're setting needsSpanConfigs for the respective queues.
pkg/spanconfig/spanconfigkvsubscriber/kvsubscriber_client_test.go
line 32 at r2 (raw file):
```go
// TestBlockedKVSubscriberDisablesMerges ensures that the merge queue is
// disabled until the KVSubscriber has some snapshot.
func TestBlockedKVSubscriberDisablesMerges(t *testing.T) {
```
Let's expand this test to cover the other queues we've identified in this commit as well.
Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:

You might need to create your backport manually using the backport tool.

error creating merge commit from 234bcd0 to blathers/backport-release-22.2-98422: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []; you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
please do address comments in a followup. Specifically, the one around expanding our testing.
See #98807.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
Regression test for cockroachdb#98422 which disables {split,replicate,mvccGC} queues until subscribed to span configs. We're only slightly modifying the existing merge-queue specific TestBlockedKVSubscriberDisablesMerges. Release note: None
…99123

98712: storage: expose separate WriteBatch interface r=sumeerbhola a=jbowens

Adapt NewUnindexedBatch to no longer take a writeOnly parameter. Instead, a new NewWriteBatch method is exposed that returns a WriteBatch type that does not provide Reader facilities. Future work may remove UnindexedBatch altogether, updating callers to explicitly maintain separate Readers and WriteBatches.

Epic: None
Release note: None

98807: spanconfig: add TestBlockedKVSubscriberDisablesQueues r=irfansharif a=irfansharif

Regression test for #98422 which disables {split,replicate,mvccGC} queues until subscribed to span configs. We're only slightly modifying the existing merge-queue specific TestBlockedKVSubscriberDisablesMerges.

Release note: None

98823: sql: address a couple of old TODOs r=yuzefovich a=yuzefovich

This commit addresses a couple of old TODOs in the `connExecutor.dispatchToExecutionEngine` method.

First, it simply removes the now-stale TODO about needing to `Step()` the txn before executing the query. The TODO mentioned things about the old way FK checks were performed, but we now always run checks in a deferred fashion, and perform the `Step`s correctly there for cascades and checks, thus the TODO can be removed. (The TODO itself explicitly mentions that it can be removed once we do checks and cascades in the deferred fashion.)

Second, it addresses a TODO about how some plan information is saved. Up until 961e66f the cleanup of the `planNode` tree and `flow` infra was intertwined, so in 6ae4881, in order to "sample" the plan with correct info (some of which is only available _after_ the execution is done), we delegated that sampling to `planTop.close`. However, with the `planNode` tree and `flow` cleanups refactored and de-coupled, we no longer have to close the plan early on. As a result, we now close the plan in a defer right after we create it (now it's the only place we do so), and this commit makes it so that we record the plan info explicitly right after returning from the execution engine, thus addressing the second TODO.

Epic: None
Release note: None

98896: tracing: fix WithEventListeners option with verbose parent r=yuzefovich a=yuzefovich

Previously, when using the `WithEventListeners` span option and not explicitly specifying the `WithRecording` option, we could incorrectly use the structured recording. In particular, this was the case when the parent has verbose recording, and this is now fixed. At the moment, this span option is used only by the backup and restore processors, so the impact would be that their verbose recording wouldn't be included into the session recording (perhaps we don't actually expose that anyway).

Epic: None
Release note: None

99079: sql: fix hint for unknown udfs in views and funcs r=rharding6373 a=rharding6373

Fixes a hint for some instances of an "unknown function" error that was not properly applied previously.

Informs: #99002
Epic: None
Release note: None

99095: ui: fix inflight request replacement, cleanup database details api r=THardy98 a=THardy98

Changes to allow in-flight request replacement in PR #98331 caused breaking changes to `KeyedCachedDataReducer`s. Despite the added `allowReplacementRequests` flag being `false`, in-flight requests would be replaced (specifically, when fetching for initial state). This would prevent all requests but the last one from being stored in the reducer state and leave the remaining requests in a permanent `inFlight` status. Consequently, our pages would break, as the permanent `inFlight` status on these requests would prevent us from ever fetching data, putting these pages in a permanent loading state.

This PR adds checks to ensure `allowReplacementRequests` is enabled when we receive a response, before deciding whether to omit it from the reducer state.

It's important to note that this is still not an ideal solution for `KeyedCachedDataReducer`s, should we ever want to replace inflight requests for one. The checks to replace an in-flight request still occur at the reducer-level, whereas in a `KeyedCachedDataReducer` we'd like them to occur at the state-level (i.e. instead of replacement checks for all requests across the reducer, we should check for requests that impact a specific key's state). This way we scope replacements to each key in the reducer, allowing us to fetch data for each key independently (which is the intended functionality). Enabling this for a `KeyedCachedDataReducer` without making this change, I believe, will cause the breaking behaviour above.

This PR also includes some cleanup to the database details API, namely:
- camel casing response fields
- encapsulating the database span stats response into its own object (instead of across multiple objects); this helps scope any errors deriving from the query to the single stats object

BEFORE: https://www.loom.com/share/57c9f278376a4551977398316d6ed7b6
AFTER: https://www.loom.com/share/b2d4e074afca4b42a4f711171de3fd72

Release note (bug fix): Fixed replacement of in-flight requests for `KeyedCachedDataReducer`s to prevent permanent loading on requests stuck on an `inFlight` status.

99097: upgrades: fix backfilling of special roles in user ID migrations r=rafiss a=andyyang890

upgrades: fix system.privileges user ID migration

This patch adds a missing case for the public role in the backfilling portion of the system.privileges user ID migration.

Release note: None

---

upgrades: fix system.database_role_settings user ID migration

This patch adds a missing case for the empty role in the backfilling portion of the system.database_role_settings user ID migration.

Release note: None

---

Part of #87079

99113: server: use no-op cost controller for shared-process tenants r=knz a=stevendanna

In the long run, all rate limiting of tenants should be controlled by a tenant capability. However, at the moment, we do not have the infrastructure to read tenant capabilities from the tenant process. Since the main user of shared-process tenants is the new, experimental unified architecture, where we do not anticipate needing the cost controller for most users, this PR injects a no-op cost controller for shared-process tenants.

Epic: CRDB-23559
Release note: None

99120: tracing: allow collection of stack history for Structured spans as well r=dt a=adityamaru

Previously, we mandated that the span must be Verbose to be able to collect and emit a Structured event containing stack history. This change permits Structured spans to also collect stack history.

Release note: None
Epic: none

99123: sql: skip flaky TestSQLStatsDataDriven r=ericharmeling a=ericharmeling

This commit skips TestSQLStatsDataDriven. This test is flaky.

Part of #89861.
Epic: none
Release note: None

Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Thomas Hardy <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
...subscribed to span configs. Do the same for the store rebalancer. We applied this treatment for the merge queue back in #78122 since the fallback behavior, if not subscribed, is to use the statically defined span config for every operation.
Fixes #98421.
Fixes #98385.
Release note (bug fix): It was previously possible for CockroachDB to not respect non-default zone configs. This only happened for a short window after nodes with existing replicas were restarted, and self-rectified within seconds. This manifested in a few ways: