kvserver: throttle AddSSTable requests with IngestAsWrites #73904

Merged (1 commit, Dec 19, 2021)

Conversation

@erikgrinaker (Contributor) commented Dec 16, 2021

`Store.Send()` limits the number of concurrent `AddSSTable` requests
and delays them depending on LSM health via `Engine.PreIngestDelay`, to
prevent overwhelming Pebble. However, requests with `IngestAsWrites`
were not throttled, which has been seen to cause significant read
amplification.

This patch subjects `IngestAsWrites` requests to `Engine.PreIngestDelay`
as well, and adds a separate limit for `IngestAsWrites` requests
controlled via the cluster setting
`kv.bulk_io_write.concurrent_addsstable_as_writes_requests` (default
10). Since these requests are generally small, and will end up in the
Pebble memtable before being flushed to disk, we can tolerate a larger
limit for these requests than regular `AddSSTable` requests (1).

Release note (performance improvement): Bulk ingestion of small write
batches (e.g. index backfill into a large number of ranges) is now
throttled, to avoid buildup of read amplification and associated
performance degradation. Concurrency is controlled by the new cluster
setting `kv.bulk_io_write.concurrent_addsstable_as_writes_requests`.
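For illustration, here is a minimal, self-contained sketch of the scheme described above, assuming hypothetical names (`throttler`, `admit`, `preIngest`); it is not the actual `store_send.go` code, just a per-class semaphore sized like the cluster settings, with both request classes subject to the same LSM-health delay:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// throttler is a hypothetical, simplified stand-in for the limits described
// above: one semaphore per request class, plus a shared pre-ingest delay hook.
type throttler struct {
	addSSTSem   chan struct{}             // ~ kv.bulk_io_write.concurrent_addsstable_requests (default 1)
	asWritesSem chan struct{}             // ~ kv.bulk_io_write.concurrent_addsstable_as_writes_requests (default 10)
	preIngest   func(ctx context.Context) // stand-in for Engine.PreIngestDelay (LSM-health-based backoff)
}

func newThrottler(addSSTLimit, asWritesLimit int, preIngest func(context.Context)) *throttler {
	return &throttler{
		addSSTSem:   make(chan struct{}, addSSTLimit),
		asWritesSem: make(chan struct{}, asWritesLimit),
		preIngest:   preIngest,
	}
}

// admit blocks until the request may proceed and returns a release function.
// Both request classes are subject to the pre-ingest delay; only the semaphore
// (and hence the concurrency limit) differs.
func (t *throttler) admit(ctx context.Context, ingestAsWrites bool) (func(), error) {
	sem := t.addSSTSem
	if ingestAsWrites {
		sem = t.asWritesSem // small as-writes batches get a separate, larger limit
	}
	select {
	case sem <- struct{}{}:
	case <-ctx.Done():
		return nil, ctx.Err()
	}
	t.preIngest(ctx) // delay based on LSM health, regardless of request class
	return func() { <-sem }, nil
}

func main() {
	t := newThrottler(1, 10, func(ctx context.Context) {
		time.Sleep(10 * time.Millisecond) // pretend the LSM is mildly unhealthy
	})
	release, err := t.admit(context.Background(), true /* ingestAsWrites */)
	if err != nil {
		fmt.Println("admission failed:", err)
		return
	}
	defer release()
	fmt.Println("admitted; the SST would now be applied as a normal write batch")
}
```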

@cockroach-teamcity (Member):

This change is Reviewable

@sumeerbhola (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @erikgrinaker, and @nvanbenschoten)


pkg/kv/kvserver/store_send.go, line 290 at r1 (raw file):

		beforeEngineDelay := timeutil.Now()
		s.engine.PreIngestDelay(ctx)

isn't the other behavior change (not explicitly mentioned in the PR description) that IngestAsWrites are now also subject to PreIngestDelay?
Is there a justification for the delay to be the same as a larger sstable?

@dt (Member) commented Dec 16, 2021

This looks like a good idea to me.

Is there a justification for the delay to be the same as a larger sstable?

My understanding is that the investigation of tpce index builds that are currently able to reach high read amp suggests this is happening because these as-writes batches currently skip the PreIngestBackpressure and then overwhelm the LSM. Subjecting them to the same back pressure seems like a good idea, right? Like, if the store is healthy, there's no back pressure anyway, and they go right through. If the store is not healthy, then backpressuring "ssts", even if they'll be switched to memtable writes, seems like a good idea?

@nvanbenschoten (Member):

Subjecting them to the same back pressure seems like a good idea, right?

That was my thinking as well. This may actually be the more important part of this change.

@erikgrinaker force-pushed the addsstable-as-writes-limit branch from 237fa44 to 3b20ab2 on December 16, 2021 20:26
@erikgrinaker force-pushed the addsstable-as-writes-limit branch from 3b20ab2 to 3f125d1 on December 16, 2021 20:27
@erikgrinaker changed the title from "kvserver: limit AddSSTable requests with IngestAsWrites" to "kvserver: throttle AddSSTable requests with IngestAsWrites" on Dec 16, 2021
@erikgrinaker (Contributor, Author) commented Dec 16, 2021

isn't the other behavior change (not explicitly mentioned in the PR description) that IngestAsWrites are now also subject to PreIngestDelay?

You're right, that wasn't communicated properly here, and as others point out, it is likely to be the more significant improvement.

Is there a justification for the delay to be the same as a larger sstable?

This limits throughput to about 4 MB/s, concurrently with 16 MB/s of SST ingestion. Since the write batches likely have more overhead, and may be executed concurrently with SSTs, I think it makes sense to use the same delay for them.

@dt (Member) commented Dec 16, 2021

This limits throughput to about 4 MB/s, concurrently with 16 MB/s of SST ingestion.

Huh, just curious, how do we arrive at those mb/s numbers?

@erikgrinaker (Contributor, Author) commented Dec 16, 2021

This limits throughput to about 4 MB/s, concurrently with 16 MB/s of SST ingestion.

Huh, just curious, how do we arrive at those mb/s numbers?

16 MB/s for SST ingestion is from schemachanger.backfiller.max_sst_size=16MiB and kv.bulk_io_write.concurrent_addsstable_requests=1. 4 MB/s for IngestAsWrites is from kv.bulk_io_write.small_write_size=400KiB and kv.bulk_io_write.concurrent_addsstable_as_writes_requests=10.

Of course, I see now that the /s bit doesn't make sense (getting late here). But that's the concurrent volume we're ingesting, for lack of a better term.
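To spell out the arithmetic behind those figures (they are sizes of concurrently in-flight ingestion, not sustained rates), a quick back-of-the-envelope check; the constant names are illustrative only:

```go
package main

import "fmt"

func main() {
	const (
		maxSSTSize          = 16 << 20  // schemachanger.backfiller.max_sst_size (16 MiB)
		addSSTConcurrency   = 1         // kv.bulk_io_write.concurrent_addsstable_requests (default)
		smallWriteSize      = 400 << 10 // kv.bulk_io_write.small_write_size (400 KiB)
		asWritesConcurrency = 10        // kv.bulk_io_write.concurrent_addsstable_as_writes_requests (default)
	)
	// In-flight volume = per-request size × concurrency limit.
	fmt.Printf("SSTs in flight:      %.1f MiB\n", float64(maxSSTSize*addSSTConcurrency)/(1<<20))   // 16.0 MiB
	fmt.Printf("as-writes in flight: %.1f MiB\n", float64(smallWriteSize*asWritesConcurrency)/(1<<20)) // ~3.9 MiB, i.e. ~4 MB
}
```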

@erikgrinaker (Contributor, Author) commented Dec 17, 2021

I've run some quick tests with this patch, and the results are encouraging.

First off, I'll note that it's still possible to get read amp blowup. This can trivially be done by setting small_write_size=1GiB (so always ingested as writes) and keeping concurrent_addsstable_as_writes_requests=10 (or higher). This happens because PreIngestDelay hits the ceiling of 5s, which is not a sufficiently large delay to stabilize the LSM.

However, using more reasonable values such as small_write_size=1MiB and concurrent_addsstable_as_writes_requests=4 (so 4 MB concurrent write batch ingestion) allows the LSM to recover and read amp to drop back down to expected levels, even with some larger SSTs ingested normally too. This also holds for small_write_size=2MiB and concurrent_addsstable_as_writes_requests=2, where all SSTs are ingested as writes, and even up to concurrent_addsstable_as_writes_requests=10 in some periods (but 20 caused high read amp). For all of these different values, disk throughput remained stable and high.

Finally, I tried max_sst_size=400KiB, small_write_size=1MiB, and concurrent_addsstable_as_writes_requests=10. This emulates what would happen in the customer escalation we saw, where the SST sizes dropped to 400 KiB or below due to the large number of ranges, with the default throttling introduced here. This too showed stable read amp and high disk throughput.

This was not a comprehensive or systematic test, but I think sufficient to demonstrate that this throttling can prevent read amp blowup without significantly affecting throughput with default settings. @nvanbenschoten Did you find the same in your trials?

As for PreIngestDelay, it is unfortunate that the cap of 5s will allow read amp to spiral out. As a simple fix, we could consider letting PreIngestDelay spin until the calculated delay was less than the maximum delay. Or equivalently, setting some hard limit on read amp and waiting until it drops below that. However, that does carry a significant risk of starvation, and we'd likely want a more sophisticated policy here.
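As a rough illustration of that "wait until the LSM is healthy again" idea (purely a sketch with hypothetical names, not a proposal for PreIngestDelay's actual implementation), with an overall deadline to bound the starvation risk mentioned above:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForHealthyLSM polls a read-amp signal until it drops below maxReadAmp,
// giving up when ctx is done so a single request cannot be starved forever.
func waitForHealthyLSM(ctx context.Context, readAmp func() int, maxReadAmp int, pollEvery time.Duration) error {
	for readAmp() >= maxReadAmp {
		select {
		case <-time.After(pollEvery):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

func main() {
	// Fake read-amp signal that "recovers" after a few polls.
	remaining := 3
	readAmp := func() int {
		if remaining > 0 {
			remaining--
			return 40
		}
		return 8
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := waitForHealthyLSM(ctx, readAmp, 20, 10*time.Millisecond); err != nil {
		fmt.Println("gave up waiting:", err)
		return
	}
	fmt.Println("LSM healthy enough; proceeding with ingestion")
}
```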

@sumeerbhola (Collaborator):

Nice to see these results!

In addition to doing something like this, which is backportable, we should create an issue to (a) make this more robust, and (b) remove all these hard-to-fiddle-with cluster settings. It is much easier for an operator to have a mental model for a setting like admission.l0_sub_level_count_overload_threshold, and much harder to reason about kv.bulk_io_write.concurrent_addsstable* and rocksdb.ingest_backpressure.max_delay.

@erikgrinaker (Contributor, Author):

Yeah, I agree -- wrote up #73979.

@nvanbenschoten (Member):

This was not a comprehensive or systematic test, but I think sufficient to demonstrate that this throttling can prevent read amp blowup without significantly affecting throughput with default settings. @nvanbenschoten Did you find the same in your trials?

My last few trials were inconclusive. The intention was to show that with this change, we throttle sufficiently to not only stave off read amplification growth, but also enough to recover from it. The hope was that this would provide additional evidence of the stabilizing effect of this change. So my trials ran with this PR plus a cluster setting that let me dynamically toggle the throttling on IngestAsWrites (i.e. switch between the new and old behavior). Unfortunately, over the course of 4 trials with small_write_size=1GiB, I was never able to create unbounded read amplification growth. The best I can say is that enabling this throttling caused a cluster with a few nodes that had a read amplification of 20 to quickly drop down below 10 after a spike in delayed ingestion.

@erikgrinaker (Contributor, Author) commented Dec 17, 2021

Unfortunately, over the course of 4 trials with small_write_size=1GiB, I was never able to create unbounded read amplification growth. The best I can say is that enabling this throttling caused a cluster with a few nodes that had a read amplification of 20 to quickly drop down below 10 after a spike in delayed ingestion.

This is very surprising to me, as I have been able to consistently produce read amp growth up to 1000 -- even on the same cluster you have been testing with. Reach out to me on Slack if you'd like to give this another shot together.

The intention was to show that with this change, we throttle sufficiently to not only stave off read amplification growth, but also enough to recover from it.

I saw this in my tests, where I let read amp build up by setting large values for small_write_size and concurrent_addsstable_as_writes_requests, and then reduced them to reasonable values, which dropped read amp back down to expected levels (the 10-30 range), where it remained.

Do you feel like we need additional testing or tuning for this PR?

@nvanbenschoten (Member) left a comment

Do you feel like we need additional testing or tuning for this PR?

I'm satisfied with what I've seen on Slack and in the test clusters.

Reviewed 2 of 3 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

@erikgrinaker (Contributor, Author):

TFTRs!

bors r=dt,nvanbenschoten

@craig (bot) commented Dec 19, 2021

Build succeeded.

@blathers-crl (bot) commented Dec 19, 2021

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 3f125d1 to blathers/backport-release-21.1-73904: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@erikgrinaker deleted the addsstable-as-writes-limit branch on December 19, 2021 19:28
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jun 14, 2023
Fixes cockroachdb#102683. Part of cockroachdb#104154.

These were added way back in cockroachdb#36403 and cockroachdb#73904, pre-dating much of
IO admission control for leaseholder writes. With cockroachdb#95563, we now have IO
admission control for follower writes. Put together, we have ample
LSM read-amp protection through AC alone. These concurrency limiters are
now redundant and oblivious to more sophisticated AC measures. We
recently removed the below-raft equivalents of these limiters (cockroachdb#98762),
and as mentioned there, these limiters can exacerbate memory pressure.
Separately, we're looking to work on speedier restores, and these
limiters are starting to get in the way.

While here, we also disable the pre-ingest delay mechanism in pebble,
which too pre-dates AC, introduced way back in cockroachdb#34258 for RocksDB and in
cockroachdb#41839 for Pebble. IO AC is able to limit the number of L0 files, and
this pre-ingest delay with its maximum per-request delay time of 5s can
be less than effective. It's worth noting that the L0 file count
threshold at which this pre-ingest delay mechanism kicked in was 20,
while AC aims for 1000[^1].

This commit doesn't go as far as removing these limiters outright,
merely disabling them. This is just out of an overabundance of caution.
We can probably remove them once kvflowcontrol.enabled has had >1
release worth of baking time. Until then, it's nice to know we have
these old safety hatches. We have ample time in the release to assess
fallout from this commit, and also use this increased AddSST concurrency
to stress the kvflowcontrol machinery.

[^1]: The 1000 file limit exists to bound how long it takes to clear L0
      completely. Envelope math cribbed from elsewhere: With 2MiB files,
      1000 files is ~2GB, which at 40MB/s of compaction throughput (with
      a compaction slot consistently dedicated to L0) takes < 60s to
      clear the backlog. So the 'recovery' time is modest in that
      operators should not need to take manual action.

Release note: None
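For reference, the footnote's envelope math checks out; a trivial sketch using only the numbers quoted above:

```go
package main

import "fmt"

func main() {
	const (
		fileSizeMiB         = 2.0    // ~2 MiB per L0 file
		maxL0Files          = 1000.0 // AC's L0 file-count target
		compactionMiBPerSec = 40.0   // one compaction slot consistently dedicated to L0
	)
	backlogMiB := fileSizeMiB * maxL0Files // ~2000 MiB, i.e. ~2 GB
	// 2000 MiB / 40 MiB/s = 50s, comfortably under the quoted 60s bound.
	fmt.Printf("time to clear a full L0: %.0fs\n", backlogMiB/compactionMiBPerSec)
}
```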
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jul 4, 2023
craig bot pushed a commit that referenced this pull request Jul 4, 2023
104861: kvserver: disable pre-AC above-raft AddSST throttling r=irfansharif a=irfansharif


Co-authored-by: irfan sharif <[email protected]>