
kvserver: mitigate unavailability due to backpressure when reducing range size #46863

Merged
3 commits merged into cockroachdb:master on Apr 3, 2020

Conversation

ajwerner
Contributor

@ajwerner ajwerner commented Apr 1, 2020

This PR comes in three commits which, taken together, should effectively mitigate the problems caused by reducing the range size.

  1. Detect when a range is much larger than the maximum size allowed with the backpressure policy and assume that the range got that big because the maximum range size used to be larger. This "much larger" threshold is controlled by a private cluster setting and defaults to 2 MB. (A code sketch of mitigations 1 and 2 follows this list.)

  2. Remember, in-memory, previous large range sizes and use that previous range size to decide whether to backpressure.

  3. Prioritize splitting ranges which are currently backpressuring writes.
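
To make mitigations 1 and 2 concrete, here is a small, self-contained Go sketch of the decision they imply. The function and constant names are illustrative, not the actual replica code, and the thresholds are placeholders:

```go
package main

import "fmt"

// Illustrative constant; the real multiplier and byte tolerance live in the
// replica backpressure code and a private cluster setting, respectively.
const backpressureRangeSizeMultiplier = 2.0

// shouldBackpressureWrites sketches the mitigated decision: a range far above
// the multiplied limit (beyond the byte tolerance) is assumed to have grown
// under an older, larger limit and is exempted (mitigation 1), as is a range
// still within the remembered previous limit (mitigation 2).
func shouldBackpressureWrites(
	rangeSize, maxRangeBytes, byteTolerance, largestPreviousMaxRangeBytes int64,
) bool {
	limit := int64(backpressureRangeSizeMultiplier * float64(maxRangeBytes))
	if rangeSize <= limit {
		return false // within the normal backpressure threshold
	}
	if rangeSize > limit+byteTolerance {
		return false // far over the limit: assume the limit was just lowered
	}
	if largestPreviousMaxRangeBytes > 0 && rangeSize < largestPreviousMaxRangeBytes {
		return false // still within the previously remembered, larger limit
	}
	return true
}

func main() {
	// A 600 MiB range after lowering the limit to 64 MiB: exempt via mitigation 1.
	fmt.Println(shouldBackpressureWrites(600<<20, 64<<20, 32<<20, 0)) // false
	// Slightly over the limit, but the old 512 MiB limit is remembered: exempt via mitigation 2.
	fmt.Println(shouldBackpressureWrites(130<<20, 64<<20, 32<<20, 512<<20)) // false
	// Slightly over the limit with no history: backpressure applies.
	fmt.Println(shouldBackpressureWrites(130<<20, 64<<20, 32<<20, 0)) // true
}
```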

Fixes #46271

Release note (bug fix): When reducing the range_max_bytes zone configuration, queries will no longer experience backpressure based on the new, smaller value until the range size has first been reduced to below that value.

@cockroach-teamcity
Member

This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner/fix-lowering-range-size branch from f7581e6 to 8f5b0d1 on April 2, 2020 02:44
@ajwerner ajwerner requested a review from nvanbenschoten April 2, 2020 03:50
@ajwerner ajwerner marked this pull request as ready for review April 2, 2020 03:50
@irfansharif irfansharif self-requested a review April 2, 2020 15:50
Member

@nvanbenschoten nvanbenschoten left a comment


I like this change. The mitigations aren't super elegant in that multiple are needed to cover different cases, but you did a nice job isolating them and testing them. I feel ok about this making it into the release.

Reviewed 3 of 3 files at r1, 6 of 6 files at r2, 4 of 4 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @irfansharif)


pkg/kv/kvserver/client_replica_backpressure_test.go, line 36 at r1 (raw file):

)

// Test that mitigations to backpressure when reducing the range size works.

s/works/work/


pkg/kv/kvserver/client_replica_backpressure_test.go, line 62 at r2 (raw file):

		waitForBlockedRange func(id roachpb.RangeID),
	) {
		// Add a testing knob to block split traGet(repl.RangeID)nsactions which we'll enable before

Something got messed up here.


pkg/kv/kvserver/client_replica_backpressure_test.go, line 179 at r2 (raw file):

	})

	t.Run("no backpressure when much larger on new node", func(t *testing.T) {

Could you add a comment here about why this case differs from the previous one? It's clear once you understand the implementation of these mitigations, but not if you're just reading the test.


pkg/kv/kvserver/replica_backpressure.go, line 45 at r1 (raw file):

// range size by anything larger than the backpressureRangeSizeMultiplier would
// immediately mean that all ranges require backpressure. To mitigate this
// unwanted backpressure we say that any range which is large than the

larger


pkg/kv/kvserver/replica_backpressure.go, line 55 at r2 (raw file):

// We additionally mitigate this situation further by:
//
//  1) We store in-memory on each replica the largest zone configuration range

We should mention somewhere why this is insufficient to be the sole mitigation.


pkg/kv/kvserver/split_queue_test.go, line 60 at r3 (raw file):

		// No intersection, max bytes+1, no load.
		{roachpb.RKeyMin, roachpb.RKey(keys.MetaMax), 64<<20 + 1, 64 << 20, true, 1},
		// No intersection, max bytes * 2 + 2, no load, should backpressure

nit: missing punctuation.

Contributor

@irfansharif irfansharif left a comment


:lgtm: (this was pleasant to review)

Reviewed 3 of 3 files at r1, 6 of 6 files at r2, 4 of 4 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/kv/kvserver/client_replica_backpressure_test.go, line 44 at r1 (raw file):

	const (
		rowSize  = 32 << 10

// 32 Kib
// 32 Mib


pkg/kv/kvserver/client_replica_backpressure_test.go, line 144 at r1 (raw file):

		tdb.Exec(t, "UPSERT INTO foo VALUES ($1, $2)",
			rRand.Intn(10000000), randutil.RandBytes(rRand, rowSize))
		// Now we'll change the range_max_bytes to be 2x the range size.

Looking at just this commit, I don't quite follow how the newMax math below gives you 2x the range size. Isn't it half the range size - 32kb?


pkg/kv/kvserver/client_replica_backpressure_test.go, line 113 at r2 (raw file):

		unblockSplit = func() {
			closeOnce.Do(func() {
				allowSplits.Store(true)

Is the previous commit buggy by not having this there?


pkg/kv/kvserver/helpers_test.go, line 385 at r2 (raw file):

// LargestPreviousMaxRangeSizeBytes returns the in-memory value used to mitigate
// backpressure when the zone.RangeMaxSize is decreased.
func (r *Replica) LargestPreviousMaxRangeSizeBytes() int64 {

Do we want this to be suffixed with MuLocked?


pkg/kv/kvserver/replica.go, line 478 at r2 (raw file):

		// size. It is set when the zone config changes to a smaller value and the
		// current range size exceeds the new value. It is cleared after the range's
		// size drops below its current zone.MaxRangeBytes or the zone.MaxRangeBytes

s/the/if


pkg/kv/kvserver/replica_application_state_machine.go, line 839 at r2 (raw file):

	// If the range is now less than its RangeMaxBytes, clear the history of its
	// larger previous max bytes.

s/larger/largest


pkg/kv/kvserver/split_queue.go, line 113 at r3 (raw file):

	// additionalPriorityDueToBackpressure is a mechanism to prioritize splitting
	// ranges which will actively backpressuring writes.

s/backpressuring/backpressure?


pkg/kv/kvserver/split_queue.go, line 123 at r3 (raw file):

	// the limit, we will backpressure. We strongly prefer to split over
	// backpressure.
	const addiitonalPriorityDueToBackpressure = 50

Typo here.

This commit adds a setting to control how far beyond the configured max range
size, multiplied by the backpressure multiple, we ought to still enforce backpressure.

This setting may seem counterintuitive but is a critical part of the strategy
to avoid temporary unavailability in cases where the range size is decreased.

Release justification: fixes for high-priority or high-severity bugs in existing
functionality. This is made more pressing because of the new, larger default
range size in this release. Customers may be more inclined to lower the range
size if we uncover issues with the larger size.

Release note: None
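
For context, the setting described by this commit message is exercised later in the review as kv.range.backpressure_byte_tolerance. Below is a hedged sketch of what registering such a private byte-size setting could look like; the settings.RegisterByteSizeSetting signature and the 2 MB default (taken from the PR description) are assumptions for illustration, not the verbatim code from the commit:

```go
package kvserver

import "github.com/cockroachdb/cockroach/pkg/settings"

// backpressureByteTolerance sketches the private cluster setting this commit
// describes: how far past (max range size * backpressure multiplier) a range
// may grow before we assume the limit was recently lowered and skip
// backpressure. The default and helper signature are assumptions.
var backpressureByteTolerance = settings.RegisterByteSizeSetting(
	"kv.range.backpressure_byte_tolerance",
	"the tolerance beyond the multiplied max range size above which "+
		"backpressure is not applied, assuming the limit was recently lowered",
	2<<20, // 2 MB per the PR description
)
```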
@ajwerner ajwerner force-pushed the ajwerner/fix-lowering-range-size branch from 8f5b0d1 to 0a9660c on April 2, 2020 21:29
Contributor Author

@ajwerner ajwerner left a comment


TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @nvanbenschoten)


pkg/kv/kvserver/client_replica_backpressure_test.go, line 36 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/works/work/

Done.


pkg/kv/kvserver/client_replica_backpressure_test.go, line 44 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

// 32 Kib
// 32 Mib

Done.


pkg/kv/kvserver/client_replica_backpressure_test.go, line 144 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Looking at just this commit, I don't quite follow how the newMax math below gives you 2x the range size. Isn't it half the range size - 32kb?

The comment is wrong. Updated


pkg/kv/kvserver/client_replica_backpressure_test.go, line 62 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Something got messed up here.

Done.


pkg/kv/kvserver/client_replica_backpressure_test.go, line 113 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Is the previous commit buggy by not having this there?

I don't think so. Without this, the splits would be unblocked but would fail. This is needed to make the splits actually succeed, which is used to check that LargestPreviousMaxRangeSizeBytes is cleared.


pkg/kv/kvserver/client_replica_backpressure_test.go, line 179 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you add a comment here about why this case differs from the previous one? It's clear once you understand the implementation of these mitigations, but not if you're just reading the test.

Done.


pkg/kv/kvserver/helpers_test.go, line 385 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Do we want this to be suffixed with MuLocked?

erm I meant to grab the lock, fixed. Thanks


pkg/kv/kvserver/replica.go, line 478 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/the/if

I feel like the "after" reasonably distributes over the "or", but 🤷


pkg/kv/kvserver/replica_application_state_machine.go, line 839 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/larger/largest

Done.


pkg/kv/kvserver/replica_backpressure.go, line 55 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We should mention somewhere why this is insufficient to be the sole mitigation.

Done.


pkg/kv/kvserver/split_queue.go, line 113 at r3 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/backpressuring/backpressure?

Yes, done.


pkg/kv/kvserver/split_queue.go, line 123 at r3 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Typo here.

Done.


pkg/kv/kvserver/split_queue_test.go, line 60 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: missing punctuation.

Done.

Contributor

@irfansharif irfansharif left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @irfansharif, and @nvanbenschoten)


pkg/kv/kvserver/client_replica_backpressure_test.go, line 213 at r6 (raw file):

		tdb.Exec(t, "SET CLUSTER SETTING kv.range.backpressure_byte_tolerance = '1 KiB'")

		// Now we'll change the range_max_bytes to be 2x the range size and within the

This comment looks wrong as well. Also, "slop"?

ajwerner added 2 commits April 2, 2020 20:09
This commit adds logic to track, in RAM, the previous maximum range size when
the current range size exceeds the current maximum range size. This is a
further mitigation for cases where the range size is decreased, avoiding
unavailability until ranges are split down to the proper size.

Release note: None
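
A self-contained Go sketch of the in-memory tracking this commit describes follows; the struct and method names are illustrative and do not mirror the actual Replica fields:

```go
package main

import "fmt"

type replicaSizeState struct {
	sizeBytes                        int64
	maxRangeBytes                    int64
	largestPreviousMaxRangeSizeBytes int64 // 0 means "no history"
}

// setMaxRangeBytes mirrors the zone-config update path: if the new limit is
// smaller than the current range size, remember the old (larger) limit so
// backpressure decisions can still honor it.
func (r *replicaSizeState) setMaxRangeBytes(newMax int64) {
	if newMax < r.sizeBytes && r.maxRangeBytes > r.largestPreviousMaxRangeSizeBytes {
		r.largestPreviousMaxRangeSizeBytes = r.maxRangeBytes
	}
	r.maxRangeBytes = newMax
}

// updateSize mirrors the post-apply path: once the range has shrunk below its
// current limit, the history is no longer needed and is cleared.
func (r *replicaSizeState) updateSize(newSize int64) {
	r.sizeBytes = newSize
	if r.sizeBytes < r.maxRangeBytes {
		r.largestPreviousMaxRangeSizeBytes = 0
	}
}

func main() {
	r := &replicaSizeState{sizeBytes: 400 << 20, maxRangeBytes: 512 << 20}
	r.setMaxRangeBytes(64 << 20)                          // lower the limit below the current size
	fmt.Println(r.largestPreviousMaxRangeSizeBytes >> 20) // 512: old limit remembered (in MiB)
	r.updateSize(32 << 20)                                // the range has split down
	fmt.Println(r.largestPreviousMaxRangeSizeBytes)       // 0: history cleared
}
```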
This commit is a further mitigation for decreases to the range size. It is
important to split ranges which are too big, and especially important to
split ranges which would block writes.

Release note: None
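
A minimal sketch of the prioritization this commit describes; the priority formula and the constant (the review thread mentions a value of 50) are illustrative rather than the exact split queue code:

```go
package main

import "fmt"

// additionalPriorityDueToBackpressure strongly prefers splitting over
// backpressuring; the value 50 comes from the review discussion above.
const additionalPriorityDueToBackpressure = 50

// splitPriority boosts ranges that are actively backpressuring writes so they
// are split ahead of ranges that are merely oversized.
func splitPriority(sizeBytes, maxBytes int64, backpressuringWrites bool) float64 {
	priority := float64(sizeBytes) / float64(maxBytes)
	if backpressuringWrites {
		priority += additionalPriorityDueToBackpressure
	}
	return priority
}

func main() {
	fmt.Println(splitPriority(65<<20, 64<<20, false)) // ~1.02: split, normal priority
	fmt.Println(splitPriority(130<<20, 64<<20, true)) // ~52.03: split urgently
}
```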
@ajwerner ajwerner force-pushed the ajwerner/fix-lowering-range-size branch from 0a9660c to 0cd7810 on April 3, 2020 00:09
Contributor Author

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @nvanbenschoten)


pkg/kv/kvserver/client_replica_backpressure_test.go, line 213 at r6 (raw file):

Previously, irfansharif (irfan sharif) wrote…

This comment looks wrong as well. Also, "slop"?

Done.

Member

@nvanbenschoten nvanbenschoten left a comment


:lgtm:

Reviewed 4 of 6 files at r5, 1 of 5 files at r7, 4 of 4 files at r8.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif)

@ajwerner
Contributor Author

ajwerner commented Apr 3, 2020

TFTR!

bors r=irfansharif ,nvanbenschoten

@craig
Contributor

craig bot commented Apr 3, 2020

Build succeeded

@craig craig bot merged commit c36f7cb into cockroachdb:master Apr 3, 2020
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jan 24, 2022
Fixed cockroachdb#75020. This test makes use of an "upsert-until-backpressure"
primitive, where it continually writes blobs to a scratch table
configured with a lower max range size (1<<18 from the default of
512<<20) until it experiences replica backpressure. Before cockroachdb#73876 (when
using the system config span), the splitting off of the scratch table's
ranges happened near instantly after the creation of the table itself.
This changed slightly with the span configs infrastructure, where there's
more asynchrony built in and ranges may appear a while later.

The test previously started upserting as soon as it created the table,
being able to implicitly rely on the tight synchrony of already having
the table's ranges carved out. This comes into play when deciding to record a
replica's largest previous max range size -- something we only do if the
replica's current size is larger than the new limit (see
(*Replica).SetSpanConfig). When we weren't upserting until the range had
applied the latest config, this was not possible. With span configs and
its additional asynchrony, this assumption no longer held. It was
possible then for the range to accept writes larger than the newest max
range size, which unintentionally (for this test at least) triggers an
escape hatch where we avoid backpressure when lowering a range's max
size below its current size (cockroachdb#46863).

The failure is easy to repro. This test runs in ~8s, and with the
following we were reliably seeing timeouts where the test was waiting
for backpressure to kick in (but it never did because of the escape
hatch).

    dev test pkg/kv/kvserver \
      -f TestProtectedTimestamps -v --show-logs \
      --timeout 300s --ignore-cache --count 10

De-flaking this test is simple -- we just wait for the table's replicas
to apply the latest config before issuing writes to it.

Release note: None
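
A hedged sketch of the de-flaking step described above, i.e. waiting for the replica to reflect the lowered max range size before upserting; the helper name and the GetMaxBytes accessor are assumptions about the test's surroundings, not the verbatim fix:

```go
package kvserver_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/kv/kvserver"
	"github.com/cockroachdb/cockroach/pkg/testutils"
	"github.com/cockroachdb/errors"
)

// waitForAppliedMaxBytes blocks until the replica reports the expected max
// range size, i.e. until the new span config has been applied, so that the
// test's subsequent upserts cannot race ahead of the lowered limit.
func waitForAppliedMaxBytes(t *testing.T, repl *kvserver.Replica, want int64) {
	testutils.SucceedsSoon(t, func() error {
		if got := repl.GetMaxBytes(); got != want {
			return errors.Errorf("max bytes %d, want %d (span config not yet applied)", got, want)
		}
		return nil
	})
}
```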
craig bot pushed a commit that referenced this pull request Jan 25, 2022
75304: execbuilder: deflake TestExecBuild r=irfansharif a=irfansharif

Fixes #74933. This test asserts on physical plans of statements after
moving ranges around, i.e. whether distsql execution is "local" or
"distributed". This is determined by:
- where the distsql planner places processor cores,
- which in turn is a function of the span resolver,
- which delegates to the embedded distsender,
- which operates off of a range cache.

The range cache, like the name suggests, can hold stale data. This test
moved replicas of indexes around to induce distributed execution, but
was not accounting for the caches holding onto stale data. In my test
runs every so often I was seeing stale range descriptors from before the
relocate trip up the planner to generate a local execution, flaking the
test. Fortunately for us, all we need to do is slap on a retry. To
repro:

    # Took under a minute on my gceworker. Helped to trim down the test
    # file slightly.
    dev test pkg/sql/opt/exec/execbuilder \
      -f 'TestExecBuild/5node' -v --show-logs \
      --stress --stress-args '-p 4'

Release note: None

75436: kvserver: de-flake TestProtectedTimestamps r=irfansharif a=irfansharif

Fixed #75020. This test makes use of an "upsert-until-backpressure"
primitive, where it continually writes blobs to a scratch table
configured with a lower max range size (1<<18 from the default of
512<<20) until it experiences replica backpressure. Before #73876 (when
using the system config span), the splitting off of the scratch table's
ranges happened near instantly after the creation of the table itself.
This changed slightly with the span configs infrastructure, where there's
more asynchrony built in and ranges may appear a while later.

The test previously started upserting as soon as it created the table,
being able to implicitly rely on the tight synchrony of already having
the table's ranges carved out. This comes into play when deciding to record a
replica's largest previous max range size -- something we only do if the
replica's current size is larger than the new limit (see
(*Replica).SetSpanConfig). When we weren't upserting until the range had
applied the latest config, this was not possible. With span configs and
its additional asynchrony, this assumption no longer held. It was
possible then for the range to accept writes larger than the newest max
range size, which unintentionally (for this test at least) triggers an
escape hatch where we avoid backpressure when lowering a range's max
size below its current size (#46863).

The failure is easy to repro. This test runs in ~8s, and with the
following we were reliably seeing timeouts where the test was waiting
for backpressure to kick in (but it never did because of the escape
hatch).

    dev test pkg/kv/kvserver \
      -f TestProtectedTimestamps -v --show-logs \
      --timeout 300s --ignore-cache --count 10

De-flaking this test is simple -- we just wait for the table's replicas
to apply the latest config before issuing writes to it.

Release note: None

75443: ui: removed formatting to statements on the details pages r=THardy98 a=THardy98

**ui: removed formatting to statements on the details pages**

Previously, statements displayed on the statement/transaction/index
details pages were formatted (formatting was added to allow for better
readability of statements on these detail pages). However, statement
queries from the frontend were noticeably slower due to this
implementation. This change reverts the changes to statement formatting
(updates the fixtures to show the non-formatted statements), but keeps
the change that uses statement ID as an identifier in the URL instead of
the raw statement.

**Release note (ui change)**: removed formatting from statements on the
statement, transaction, and index details pages; the change replacing the raw
statement with the statement ID in the URL remains.

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Thomas Hardy <[email protected]>
Successfully merging this pull request may close these issues.

kv: lowering the range size can cause unavailability