kvserver: mitigate unavailability due to backpressure when reducing range size #46863
Conversation
Force-pushed from f7581e6 to 8f5b0d1.
I like this change. The mitigations aren't super elegant in that multiple are needed to cover different cases, but you did a nice job isolating them and testing them. I feel ok about this making it into the release.
Reviewed 3 of 3 files at r1, 6 of 6 files at r2, 4 of 4 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @irfansharif)
pkg/kv/kvserver/client_replica_backpressure_test.go, line 36 at r1 (raw file):
) // Test that mitigations to backpressure when reducing the range size works.
s/works/work/
pkg/kv/kvserver/client_replica_backpressure_test.go, line 62 at r2 (raw file):
waitForBlockedRange func(id roachpb.RangeID), ) { // Add a testing knob to block split traGet(repl.RangeID)nsactions which we'll enable before
Something got messed up here.
pkg/kv/kvserver/client_replica_backpressure_test.go, line 179 at r2 (raw file):
}) t.Run("no backpressure when much larger on new node", func(t *testing.T) {
Could you add a comment here about why this case differs from the previous one? It's clear once you understand the implementation of these mitigations, but not if you're just reading the test.
pkg/kv/kvserver/replica_backpressure.go, line 45 at r1 (raw file):
// range size by anything larger than the backpressureRangeSizeMultiplier would // immediately mean that all ranges require backpressure. To mitigate this // unwanted backpressure we say that any range which is large than the
larger
pkg/kv/kvserver/replica_backpressure.go, line 55 at r2 (raw file):
// We additionally mitigate this situation further by: // // 1) We store in-memory on each replica the largest zone configuration range
We should mention somewhere why this is insufficient to be the sole mitigation.
pkg/kv/kvserver/split_queue_test.go, line 60 at r3 (raw file):
// No intersection, max bytes+1, no load. {roachpb.RKeyMin, roachpb.RKey(keys.MetaMax), 64<<20 + 1, 64 << 20, true, 1}, // No intersection, max bytes * 2 + 2, no load, should backpressure
nit: missing punctuation.
Reviewed 3 of 3 files at r1, 6 of 6 files at r2, 4 of 4 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/kv/kvserver/client_replica_backpressure_test.go, line 44 at r1 (raw file):
const ( rowSize = 32 << 10
// 32 Kib
// 32 Mib
pkg/kv/kvserver/client_replica_backpressure_test.go, line 144 at r1 (raw file):
tdb.Exec(t, "UPSERT INTO foo VALUES ($1, $2)", rRand.Intn(10000000), randutil.RandBytes(rRand, rowSize)) // Now we'll change the range_max_bytes to be 2x the range size.
Looking at just this commit, I don't quite follow how the newMax math below gives you 2x the range size. Isn't it half the range size - 32kb?
pkg/kv/kvserver/client_replica_backpressure_test.go, line 113 at r2 (raw file):
unblockSplit = func() { closeOnce.Do(func() { allowSplits.Store(true)
Is the previous commit buggy by not having this there?
pkg/kv/kvserver/helpers_test.go, line 385 at r2 (raw file):
// LargestPreviousMaxRangeSizeBytes returns the in-memory value used to mitigate // backpressure when the zone.RangeMaxSize is decreased. func (r *Replica) LargestPreviousMaxRangeSizeBytes() int64 {
Do we want this to be suffixed with MuLocked?
pkg/kv/kvserver/replica.go, line 478 at r2 (raw file):
// size. It is set when the zone config changes to a smaller value and the // current range size exceeds the new value. It is cleared after the range's // size drops below its current zone.MaxRangeBytes or the zone.MaxRangeBytes
s/the/if
pkg/kv/kvserver/replica_application_state_machine.go, line 839 at r2 (raw file):
// If the range is now less than its RangeMaxBytes, clear the history of its // larger previous max bytes.
s/larger/largest
pkg/kv/kvserver/split_queue.go, line 113 at r3 (raw file):
// additionalPriorityDueToBackpressure is a mechanism to prioritize splitting // ranges which will actively backpressuring writes.
s/backpressuring/backpressure?
pkg/kv/kvserver/split_queue.go, line 123 at r3 (raw file):
// the limit, we will backpressure. We strongly prefer to split over // backpressure. const addiitonalPriorityDueToBackpressure = 50
Typo here.
This commit adds a setting to control how far above the configured max range size, multiplied by the backpressure multiplier, we should still enforce backpressure. This setting may seem counterintuitive but it is a critical part of the strategy to avoid temporary unavailability in cases where the range size is decreased.

Release justification: fixes for high-priority or high-severity bugs in existing functionality. This is made more pressing by the new, larger default range size in this release. Customers may be more inclined to lower the range size if we uncover issues with the larger default.

Release note: None
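To make the tolerance concrete, here is a minimal Go sketch of the kind of check this setting enables. All names and the surrounding plumbing are illustrative assumptions rather than the actual kvserver code; only the cluster setting kv.range.backpressure_byte_tolerance is taken from the PR.

```go
package kvbackpressure

// shouldBackpressure reports whether writes to a range of the given size
// should be backpressured. maxBytes is the configured max range size,
// multiplier the backpressure multiplier, and byteTolerance the setting
// added by this commit (kv.range.backpressure_byte_tolerance).
func shouldBackpressure(rangeBytes, maxBytes int64, multiplier float64, byteTolerance int64) bool {
	threshold := int64(float64(maxBytes) * multiplier)
	if rangeBytes <= threshold {
		// Within the allowed multiple of the max size: no backpressure.
		return false
	}
	if rangeBytes-threshold > byteTolerance {
		// Far over the threshold: the max size was likely just lowered, so
		// skip backpressure and let the split queue catch up instead.
		return false
	}
	return true
}
```

For example, with a 64 MiB max size, a 2x multiplier, and a 32 MiB tolerance, a 130 MiB range is backpressured, while a 200 MiB range is exempted until splits bring it back down.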
Force-pushed from 8f5b0d1 to 0a9660c.
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @nvanbenschoten)
pkg/kv/kvserver/client_replica_backpressure_test.go, line 36 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/works/work/
Done.
pkg/kv/kvserver/client_replica_backpressure_test.go, line 44 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
// 32 Kib
// 32 Mib
Done.
pkg/kv/kvserver/client_replica_backpressure_test.go, line 144 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Looking at just this commit, I don't quite follow how the newMax math below gives you 2x the range size. Isn't it half the range size - 32kb?
The comment is wrong. Updated
pkg/kv/kvserver/client_replica_backpressure_test.go, line 62 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Something got messed up here.
Done.
pkg/kv/kvserver/client_replica_backpressure_test.go, line 113 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Is the previous commit buggy by not having this there?
I don't think so. Without this, the splits would be unblocked but would fail. This is needed to make the splits actually succeed, which is used to check that the LargestPreviousMaxRangeSizeBytes is cleared.
pkg/kv/kvserver/client_replica_backpressure_test.go, line 179 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you add a comment here about why this case differs from the previous one? It's clear once you understand the implementation of these mitigations, but not if you're just reading the test.
Done.
pkg/kv/kvserver/helpers_test.go, line 385 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Do we want this to be suffixed with MuLocked?
erm I meant to grab the lock, fixed. Thanks
pkg/kv/kvserver/replica.go, line 478 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
s/the/if
I feel like the "after" reasonably distributes over the "or", but 🤷
pkg/kv/kvserver/replica_application_state_machine.go, line 839 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
s/larger/largest
Done.
pkg/kv/kvserver/replica_backpressure.go, line 55 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We should mention somewhere why this is insufficient to be the sole mitigation.
Done.
pkg/kv/kvserver/split_queue.go, line 113 at r3 (raw file):
Previously, irfansharif (irfan sharif) wrote…
s/backpressuring/backpressure?
Yes, done.
pkg/kv/kvserver/split_queue.go, line 123 at r3 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Typo here.
Done.
pkg/kv/kvserver/split_queue_test.go, line 60 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: missing punctuation.
Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @irfansharif, and @nvanbenschoten)
pkg/kv/kvserver/client_replica_backpressure_test.go, line 213 at r6 (raw file):
tdb.Exec(t, "SET CLUSTER SETTING kv.range.backpressure_byte_tolerance = '1 KiB'") // Now we'll change the range_max_bytes to be 2x the range size and within the
This comment looks wrong as well. Also, "slop"?
This commit adds logic to track, in memory, the previous maximum range size when the current range size exceeds the current maximum range size. This is a further mitigation for cases where the max range size is decreased: it avoids unavailability until ranges are split down to the proper size.

Release note: None
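A minimal sketch of the in-memory tracking described above; the struct, fields, and methods here are illustrative assumptions rather than the actual replica code, but they show when the remembered value is set and when it is cleared.

```go
package kvbackpressure

// replicaSizeState is a stand-in for the relevant fields on a replica.
type replicaSizeState struct {
	rangeBytes                       int64 // current size of the range's data
	maxBytes                         int64 // max range size currently in effect
	largestPreviousMaxRangeSizeBytes int64 // 0 when nothing is remembered
}

// setMaxBytes applies a new max range size. If the range is already larger
// than the new limit and the old limit was bigger, remember the old limit so
// that backpressure decisions keep using it until the range splits down.
func (r *replicaSizeState) setMaxBytes(newMax int64) {
	if r.rangeBytes > newMax && r.maxBytes > newMax &&
		r.maxBytes > r.largestPreviousMaxRangeSizeBytes {
		r.largestPreviousMaxRangeSizeBytes = r.maxBytes
	}
	r.maxBytes = newMax
}

// updateSize is called as the range's size changes (e.g. after a split).
// Once the range fits under its current max again, the memory is cleared.
func (r *replicaSizeState) updateSize(rangeBytes int64) {
	r.rangeBytes = rangeBytes
	if r.largestPreviousMaxRangeSizeBytes != 0 && rangeBytes < r.maxBytes {
		r.largestPreviousMaxRangeSizeBytes = 0
	}
}
```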
This commit is a further mitigation for decreases to the range size. It is important to split ranges which are too big, and especially important to split ranges which would otherwise block writes.

Release note: None
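A rough sketch of the prioritization described above, assuming a simplified priority formula; the constant name echoes the one discussed in review, but its value and the surrounding code are illustrative.

```go
package kvbackpressure

// additionalPriorityDueToBackpressure mirrors the constant discussed in the
// review; the value here is illustrative.
const additionalPriorityDueToBackpressure = 50.0

// splitPriority returns a split-queue priority for a range of the given
// size: ranges over their max size get a priority proportional to how far
// over they are, and ranges so large that they would backpressure writes get
// a large extra bump so they are split first.
func splitPriority(rangeBytes, maxBytes int64, backpressureMultiplier float64) float64 {
	if rangeBytes <= maxBytes {
		return 0
	}
	priority := float64(rangeBytes) / float64(maxBytes)
	if float64(rangeBytes) > float64(maxBytes)*backpressureMultiplier {
		priority += additionalPriorityDueToBackpressure
	}
	return priority
}
```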
Force-pushed from 0a9660c to 0cd7810.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @nvanbenschoten)
pkg/kv/kvserver/client_replica_backpressure_test.go, line 213 at r6 (raw file):
Previously, irfansharif (irfan sharif) wrote…
This comment looks wrong as well. Also, "slop"?
Done.
Reviewed 4 of 6 files at r5, 1 of 5 files at r7, 4 of 4 files at r8.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif)
TFTR! bors r=irfansharif,nvanbenschoten
Build succeeded
75304: execbuilder: deflake TestExecBuild r=irfansharif a=irfansharif

Fixes #74933. This test asserts on physical plans of statements after moving ranges around, i.e. whether distsql execution is "local" or "distributed". This is determined by:
- where the distsql planner places processor cores,
- which in turn is a function of the span resolver,
- which delegates to the embedded distsender,
- which operates off of a range cache.

The range cache, like the name suggests, can hold stale data. This test moved replicas of indexes around to induce distributed execution, but was not accounting for the caches holding onto stale data. In my test runs every so often I was seeing stale range descriptors from before the relocate trip up the planner to generate a local execution, flaking the test. Fortunately for us, all we need to do is slap on a retry. To repro:

    # Took under a minute on my gceworker. Helped to trim down the test
    # file slightly.
    dev test pkg/sql/opt/exec/execbuilder \
      -f 'TestExecBuild/5node' -v --show-logs \
      --stress --stress-args '-p 4'

Release note: None

75436: kvserver: de-flake TestProtectedTimestamps r=irfansharif a=irfansharif

Fixed #75020. This test makes use of an "upsert-until-backpressure" primitive, where it continually writes blobs to a scratch table configured with a lower max range size (1<<18 from the default of 512<<20) until it experiences replica backpressure. Before #73876 (when using the system config span), the splitting off of the scratch table's ranges happened near instantly after the creation of the table itself. This changed slightly with the span configs infrastructure, where there's more asynchrony built in and ranges may appear a while later. The test previously started upserting as soon as it created the table, implicitly relying on the tight synchrony of already having the table's ranges carved out.

This comes into play when deciding to record a replica's largest previous max range size -- something we only do if the replica's current size is larger than the new limit (see (*Replica).SetSpanConfig). When we weren't upserting until the range had applied the latest config, this recording was not possible. With span configs and their additional asynchrony, this assumption no longer held. It was possible for the range to accept writes larger than the newest max range size, which unintentionally (for this test at least) triggers an escape hatch where we avoid backpressure when lowering a range's max size below its current size (#46863).

The failure is easy to repro. This test runs in ~8s, and with the following we were reliably seeing timeouts where the test was waiting for backpressure to kick in (but it never did because of the escape hatch):

    dev test pkg/kv/kvserver \
      -f TestProtectedTimestamps -v --show-logs \
      --timeout 300s --ignore-cache --count 10

De-flaking this test is simple -- we just wait for the table's replicas to apply the latest config before issuing writes to it.

Release note: None

75443: ui: removed formatting to statements on the details pages r=THardy98 a=THardy98

Previously, statements displayed on the statement/transaction/index details pages were formatted (formatting was added to allow for better readability of statements on these detail pages). However, statement queries from the frontend were noticeably slower due to this implementation. This change reverts the changes to statement formatting (updating the fixtures to show the non-formatted statements), but keeps the change that uses the statement ID as an identifier in the URL instead of the raw statement.

Release note (ui change): removed formatting of statements on the statement, transaction, and index details pages; the change replacing the raw statement with the statement ID in the URL remains.

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Thomas Hardy <[email protected]>
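The de-flaking approach above ("wait for the table's replicas to apply the latest config before issuing writes") boils down to a bounded retry loop. A self-contained Go sketch, where the predicate is a placeholder for the real check against a replica's applied span config:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitFor polls check until it returns nil or the timeout expires. It stands
// in for CockroachDB's testing helpers; the predicate passed in below is a
// placeholder for "every replica of the scratch table has applied the
// lowered max range size".
func waitFor(timeout time.Duration, check func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := check()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition never met: %w", err)
		}
		time.Sleep(50 * time.Millisecond)
	}
}

func main() {
	attempts := 0
	err := waitFor(5*time.Second, func() error {
		attempts++
		if attempts < 3 { // pretend the config lands on the third poll
			return errors.New("replica still has the old RangeMaxBytes")
		}
		return nil
	})
	fmt.Println("config applied:", err == nil)
}
```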
This PR comes in three commits which, taken together, should effectively mitigate the problems caused by reducing the range size (a combined sketch of how they interact follows the list):

1. Detect when a range is much larger than the maximum size allowed by the backpressure policy and assume that the range got that big because the maximum range size used to be larger. This "much larger" threshold is controlled by a private cluster setting and defaults to 2MB.
2. Remember, in memory, previous large range sizes and use that previous range size to decide whether to backpressure.
3. Prioritize splitting ranges which are currently backpressuring writes.
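A simplified sketch of how these mitigations might combine when deciding whether to backpressure a write; the names and structure are illustrative, not the actual replica implementation (the real code also consults cluster settings and replica state under lock).

```go
package kvbackpressure

// shouldBackpressureWrites sketches how the three mitigations could combine.
func shouldBackpressureWrites(
	rangeBytes, maxBytes, largestPreviousMaxBytes int64,
	multiplier float64, byteTolerance int64,
) bool {
	// Mitigation 2: if a larger previous max size is remembered in memory,
	// keep using it until the range has been split back down.
	effectiveMax := maxBytes
	if largestPreviousMaxBytes > effectiveMax {
		effectiveMax = largestPreviousMaxBytes
	}
	threshold := int64(float64(effectiveMax) * multiplier)
	if rangeBytes <= threshold {
		return false
	}
	// Mitigation 1: a range far over the threshold probably predates a
	// lowered max size; exempt it rather than blocking writes.
	// (Mitigation 3, not shown: such ranges are also queued for splitting
	// with elevated priority, so the exemption is short-lived.)
	if rangeBytes-threshold > byteTolerance {
		return false
	}
	return true
}
```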
Fixes #46271
Release note (bug fix): When reducing the range_max_bytes zone configuration, queries will no longer experience backpressure based on the new, smaller value until the range size has first been reduced below that value.