kvserver: mitigate unavailability due to backpressure when reducing range size #46863
Conversation
Force-pushed from f7581e6 to 8f5b0d1.
I like this change. The mitigations aren't super elegant in that multiple are needed to cover different cases, but you did a nice job isolating them and testing them. I feel ok about this making it into the release.
Reviewed 3 of 3 files at r1, 6 of 6 files at r2, 4 of 4 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @irfansharif)
pkg/kv/kvserver/client_replica_backpressure_test.go, line 36 at r1 (raw file):
) // Test that mitigations to backpressure when reducing the range size works.
s/works/work/
pkg/kv/kvserver/client_replica_backpressure_test.go, line 62 at r2 (raw file):
waitForBlockedRange func(id roachpb.RangeID), ) { // Add a testing knob to block split traGet(repl.RangeID)nsactions which we'll enable before
Something got messed up here.
pkg/kv/kvserver/client_replica_backpressure_test.go, line 179 at r2 (raw file):
}) t.Run("no backpressure when much larger on new node", func(t *testing.T) {
Could you add a comment here about why this case differs from the previous one? It's clear once you understand the implementation of these mitigations, but not if you're just reading the test.
pkg/kv/kvserver/replica_backpressure.go, line 45 at r1 (raw file):
// range size by anything larger than the backpressureRangeSizeMultiplier would // immediately mean that all ranges require backpressure. To mitigate this // unwanted backpressure we say that any range which is large than the
larger
pkg/kv/kvserver/replica_backpressure.go, line 55 at r2 (raw file):
// We additionally mitigate this situation further by: // // 1) We store in-memory on each replica the largest zone configuration range
We should mention somewhere why this is insufficient to be the sole mitigation.
pkg/kv/kvserver/split_queue_test.go, line 60 at r3 (raw file):
// No intersection, max bytes+1, no load. {roachpb.RKeyMin, roachpb.RKey(keys.MetaMax), 64<<20 + 1, 64 << 20, true, 1}, // No intersection, max bytes * 2 + 2, no load, should backpressure
nit: missing punctuation.
Reviewed 3 of 3 files at r1, 6 of 6 files at r2, 4 of 4 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/kv/kvserver/client_replica_backpressure_test.go, line 44 at r1 (raw file):
const ( rowSize = 32 << 10
// 32 Kib
// 32 Mib
pkg/kv/kvserver/client_replica_backpressure_test.go, line 144 at r1 (raw file):
tdb.Exec(t, "UPSERT INTO foo VALUES ($1, $2)", rRand.Intn(10000000), randutil.RandBytes(rRand, rowSize)) // Now we'll change the range_max_bytes to be 2x the range size.
Looking at just this commit, I don't quite follow how the newMax math below gives you 2x the range size. Isn't it half the range size - 32kb?
pkg/kv/kvserver/client_replica_backpressure_test.go, line 113 at r2 (raw file):
unblockSplit = func() { closeOnce.Do(func() { allowSplits.Store(true)
Is the previous commit buggy by not having this there?
pkg/kv/kvserver/helpers_test.go, line 385 at r2 (raw file):
// LargestPreviousMaxRangeSizeBytes returns the in-memory value used to mitigate // backpressure when the zone.RangeMaxSize is decreased. func (r *Replica) LargestPreviousMaxRangeSizeBytes() int64 {
Do we want this to be suffixed with MuLocked?
pkg/kv/kvserver/replica.go, line 478 at r2 (raw file):
// size. It is set when the zone config changes to a smaller value and the // current range size exceeds the new value. It is cleared after the range's // size drops below its current zone.MaxRangeBytes or the zone.MaxRangeBytes
s/the/if
pkg/kv/kvserver/replica_application_state_machine.go, line 839 at r2 (raw file):
// If the range is now less than its RangeMaxBytes, clear the history of its // larger previous max bytes.
s/larger/largest
pkg/kv/kvserver/split_queue.go, line 113 at r3 (raw file):
// additionalPriorityDueToBackpressure is a mechanism to prioritize splitting // ranges which will actively backpressuring writes.
s/backpressuring/backpressure?
pkg/kv/kvserver/split_queue.go, line 123 at r3 (raw file):
// the limit, we will backpressure. We strongly prefer to split over // backpressure. const addiitonalPriorityDueToBackpressure = 50
Typo here.
This commit adds a setting to control how far above the configured max range size, multiplied by the backpressure multiplier, we should still enforce backpressure. This setting may seem counterintuitive but it is a critical part of the strategy to avoid temporary unavailability in cases where the range size is decreased.

Release justification: fixes for high-priority or high-severity bugs in existing functionality. This is made more pressing by the new, larger default range size in this release. Customers may be more inclined to lower the range size if we uncover issues with the larger default.

Release note: None
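To make the tolerance concrete, here is a minimal Go sketch of the kind of check this setting enables. All names and the surrounding plumbing are illustrative assumptions rather than the actual kvserver code; only the cluster setting kv.range.backpressure_byte_tolerance is taken from the PR.

```go
package kvbackpressure

// shouldBackpressure reports whether writes to a range of the given size
// should be backpressured. maxBytes is the configured max range size,
// multiplier the backpressure multiplier, and byteTolerance the setting
// added by this commit (kv.range.backpressure_byte_tolerance).
func shouldBackpressure(rangeBytes, maxBytes int64, multiplier float64, byteTolerance int64) bool {
	threshold := int64(float64(maxBytes) * multiplier)
	if rangeBytes <= threshold {
		// Within the allowed multiple of the max size: no backpressure.
		return false
	}
	if rangeBytes-threshold > byteTolerance {
		// Far over the threshold: the max size was likely just lowered, so
		// skip backpressure and let the split queue catch up instead.
		return false
	}
	return true
}
```

For example, with a 64 MiB max size, a 2x multiplier, and a 32 MiB tolerance, a 130 MiB range is backpressured, while a 200 MiB range is exempted until splits bring it back down.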
Force-pushed from 8f5b0d1 to 0a9660c.
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @nvanbenschoten)
pkg/kv/kvserver/client_replica_backpressure_test.go, line 36 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/works/work/
Done.
pkg/kv/kvserver/client_replica_backpressure_test.go, line 44 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
// 32 Kib
// 32 Mib
Done.
pkg/kv/kvserver/client_replica_backpressure_test.go, line 144 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Looking at just this commit, I don't quite follow how the newMax math below gives you 2x the range size. Isn't it half the range size - 32kb?
The comment is wrong. Updated
pkg/kv/kvserver/client_replica_backpressure_test.go, line 62 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Something got messed up here.
Done.
pkg/kv/kvserver/client_replica_backpressure_test.go, line 113 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Is the previous commit buggy by not having this there?
I don't think so. Without this, the splits would be unblocked but would fail. This is needed to make the splits actually succeed, which is used to check that the LargestPreviousMaxRangeSizeBytes is cleared.
pkg/kv/kvserver/client_replica_backpressure_test.go, line 179 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you add a comment here about why this case differs from the previous one? It's clear once you understand the implementation of these mitigations, but not if you're just reading the test.
Done.
pkg/kv/kvserver/helpers_test.go, line 385 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Do we want this to be suffixed with MuLocked?
erm I meant to grab the lock, fixed. Thanks
pkg/kv/kvserver/replica.go, line 478 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
s/the/if
I feel like the "after" reasonably distributes over the "or", but 🤷
pkg/kv/kvserver/replica_application_state_machine.go, line 839 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
s/larger/largest
Done.
pkg/kv/kvserver/replica_backpressure.go, line 55 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We should mention somewhere why this is insufficient to be the sole mitigation.
Done.
pkg/kv/kvserver/split_queue.go, line 113 at r3 (raw file):
Previously, irfansharif (irfan sharif) wrote…
s/backpressuring/backpressure?
Yes, done.
pkg/kv/kvserver/split_queue.go, line 123 at r3 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Typo here.
Done.
pkg/kv/kvserver/split_queue_test.go, line 60 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: missing punctuation.
Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @irfansharif, and @nvanbenschoten)
pkg/kv/kvserver/client_replica_backpressure_test.go, line 213 at r6 (raw file):
tdb.Exec(t, "SET CLUSTER SETTING kv.range.backpressure_byte_tolerance = '1 KiB'") // Now we'll change the range_max_bytes to be 2x the range size and within the
This comment looks wrong as well. Also, "slop"?
This commit adds logic to track, in memory, the previous maximum range size when the current range size exceeds the current maximum range size. This is a further mitigation for cases where the max range size is decreased: it avoids unavailability until ranges are split down to the proper size.

Release note: None
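A minimal sketch of the in-memory tracking described above; the struct, fields, and methods here are illustrative assumptions rather than the actual replica code, but they show when the remembered value is set and when it is cleared.

```go
package kvbackpressure

// replicaSizeState is a stand-in for the relevant fields on a replica.
type replicaSizeState struct {
	rangeBytes                       int64 // current size of the range's data
	maxBytes                         int64 // max range size currently in effect
	largestPreviousMaxRangeSizeBytes int64 // 0 when nothing is remembered
}

// setMaxBytes applies a new max range size. If the range is already larger
// than the new limit and the old limit was bigger, remember the old limit so
// that backpressure decisions keep using it until the range splits down.
func (r *replicaSizeState) setMaxBytes(newMax int64) {
	if r.rangeBytes > newMax && r.maxBytes > newMax &&
		r.maxBytes > r.largestPreviousMaxRangeSizeBytes {
		r.largestPreviousMaxRangeSizeBytes = r.maxBytes
	}
	r.maxBytes = newMax
}

// updateSize is called as the range's size changes (e.g. after a split).
// Once the range fits under its current max again, the memory is cleared.
func (r *replicaSizeState) updateSize(rangeBytes int64) {
	r.rangeBytes = rangeBytes
	if r.largestPreviousMaxRangeSizeBytes != 0 && rangeBytes < r.maxBytes {
		r.largestPreviousMaxRangeSizeBytes = 0
	}
}
```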
This commit is a further mitigation for decreases to the range size. It is important to split ranges which are too big, and especially important to split ranges which would otherwise block writes.

Release note: None
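A rough sketch of the prioritization described above, assuming a simplified priority formula; the constant name echoes the one discussed in review, but its value and the surrounding code are illustrative.

```go
package kvbackpressure

// additionalPriorityDueToBackpressure mirrors the constant discussed in the
// review; the value here is illustrative.
const additionalPriorityDueToBackpressure = 50.0

// splitPriority returns a split-queue priority for a range of the given
// size: ranges over their max size get a priority proportional to how far
// over they are, and ranges so large that they would backpressure writes get
// a large extra bump so they are split first.
func splitPriority(rangeBytes, maxBytes int64, backpressureMultiplier float64) float64 {
	if rangeBytes <= maxBytes {
		return 0
	}
	priority := float64(rangeBytes) / float64(maxBytes)
	if float64(rangeBytes) > float64(maxBytes)*backpressureMultiplier {
		priority += additionalPriorityDueToBackpressure
	}
	return priority
}
```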
Force-pushed from 0a9660c to 0cd7810.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @nvanbenschoten)
pkg/kv/kvserver/client_replica_backpressure_test.go, line 213 at r6 (raw file):
Previously, irfansharif (irfan sharif) wrote…
This comment looks wrong as well. Also, "slop"?
Done.
Reviewed 4 of 6 files at r5, 1 of 5 files at r7, 4 of 4 files at r8.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif)
TFTR! bors r=irfansharif,nvanbenschoten
Build succeeded
75304: execbuilder: deflake TestExecBuild r=irfansharif a=irfansharif

Fixes #74933. This test asserts on physical plans of statements after moving ranges around, i.e. whether distsql execution is "local" or "distributed". This is determined by:
- where the distsql planner places processor cores,
- which in turn is a function of the span resolver,
- which delegates to the embedded distsender,
- which operates off of a range cache.

The range cache, like the name suggests, can hold stale data. This test moved replicas of indexes around to induce distributed execution, but was not accounting for the caches holding onto stale data. In my test runs every so often I was seeing stale range descriptors from before the relocate trip up the planner to generate a local execution, flaking the test. Fortunately for us, all we need to do is slap on a retry. To repro:

    # Took under a minute on my gceworker. Helped to trim down the test
    # file slightly.
    dev test pkg/sql/opt/exec/execbuilder \
      -f 'TestExecBuild/5node' -v --show-logs \
      --stress --stress-args '-p 4'

Release note: None

75436: kvserver: de-flake TestProtectedTimestamps r=irfansharif a=irfansharif

Fixed #75020. This test makes use of an "upsert-until-backpressure" primitive, where it continually writes blobs to a scratch table configured with a lower max range size (1<<18 from the default of 512<<20) until it experiences replica backpressure. Before #73876 (when using the system config span), the splitting off of the scratch table's ranges happened near instantly after the creation of the table itself. This changed slightly with the span configs infrastructure, where there's more asynchrony built in and ranges may appear a while later. The test previously started upserting as soon as it created the table, implicitly relying on the tight synchrony of already having the table's ranges carved out.

This comes into play when deciding to record a replica's largest previous max range size -- something we only do if the replica's current size is larger than the new limit (see (*Replica).SetSpanConfig). When we weren't upserting until the range had applied the latest config, this recording was not possible. With span configs and their additional asynchrony, this assumption no longer held. It was possible for the range to accept writes larger than the newest max range size, which unintentionally (for this test at least) triggers an escape hatch where we avoid backpressure when lowering a range's max size below its current size (#46863).

The failure is easy to repro. This test runs in ~8s, and with the following we were reliably seeing timeouts where the test was waiting for backpressure to kick in (but it never did because of the escape hatch):

    dev test pkg/kv/kvserver \
      -f TestProtectedTimestamps -v --show-logs \
      --timeout 300s --ignore-cache --count 10

De-flaking this test is simple -- we just wait for the table's replicas to apply the latest config before issuing writes to it.

Release note: None

75443: ui: removed formatting to statements on the details pages r=THardy98 a=THardy98

Previously, statements displayed on the statement/transaction/index details pages were formatted (formatting was added to allow for better readability of statements on these detail pages). However, statement queries from the frontend were noticeably slower due to this implementation. This change reverts the changes to statement formatting (updating the fixtures to show the non-formatted statements), but keeps the change that uses the statement ID as an identifier in the URL instead of the raw statement.

Release note (ui change): removed formatting of statements on the statement, transaction, and index details pages; the change replacing the raw statement with the statement ID in the URL remains.

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Thomas Hardy <[email protected]>
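The de-flaking approach above ("wait for the table's replicas to apply the latest config before issuing writes") boils down to a bounded retry loop. A self-contained Go sketch, where the predicate is a placeholder for the real check against a replica's applied span config:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitFor polls check until it returns nil or the timeout expires. It stands
// in for CockroachDB's testing helpers; the predicate passed in below is a
// placeholder for "every replica of the scratch table has applied the
// lowered max range size".
func waitFor(timeout time.Duration, check func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := check()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition never met: %w", err)
		}
		time.Sleep(50 * time.Millisecond)
	}
}

func main() {
	attempts := 0
	err := waitFor(5*time.Second, func() error {
		attempts++
		if attempts < 3 { // pretend the config lands on the third poll
			return errors.New("replica still has the old RangeMaxBytes")
		}
		return nil
	})
	fmt.Println("config applied:", err == nil)
}
```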
This PR comes in three commits which, taken together, should effectively mitigate the problems caused by reducing the range size (a combined sketch of how they interact follows the list):

1. Detect when a range is much larger than the maximum size allowed by the backpressure policy and assume that the range got that big because the maximum range size used to be larger. This "much larger" threshold is controlled by a private cluster setting and defaults to 2MB.
2. Remember, in memory, previous large range sizes and use that previous range size to decide whether to backpressure.
3. Prioritize splitting ranges which are currently backpressuring writes.
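A simplified sketch of how these mitigations might combine when deciding whether to backpressure a write; the names and structure are illustrative, not the actual replica implementation (the real code also consults cluster settings and replica state under lock).

```go
package kvbackpressure

// shouldBackpressureWrites sketches how the three mitigations could combine.
func shouldBackpressureWrites(
	rangeBytes, maxBytes, largestPreviousMaxBytes int64,
	multiplier float64, byteTolerance int64,
) bool {
	// Mitigation 2: if a larger previous max size is remembered in memory,
	// keep using it until the range has been split back down.
	effectiveMax := maxBytes
	if largestPreviousMaxBytes > effectiveMax {
		effectiveMax = largestPreviousMaxBytes
	}
	threshold := int64(float64(effectiveMax) * multiplier)
	if rangeBytes <= threshold {
		return false
	}
	// Mitigation 1: a range far over the threshold probably predates a
	// lowered max size; exempt it rather than blocking writes.
	// (Mitigation 3, not shown: such ranges are also queued for splitting
	// with elevated priority, so the exemption is short-lived.)
	if rangeBytes-threshold > byteTolerance {
		return false
	}
	return true
}
```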
Fixes #46271
Release note (bug fix): When reducing the range_max_bytes zone configuration, queries will no longer experience backpressure based on the new, smaller value until the range size has first been reduced below that value.