kvserver: throttle AddSSTable requests with IngestAsWrites #73904

Merged (1 commit, Dec 19, 2021)

Conversation

@erikgrinaker (Contributor) commented Dec 16, 2021

`Store.Send()` limits the number of concurrent `AddSSTable` requests
and delays them depending on LSM health via `Engine.PreIngestDelay`, to
prevent overwhelming Pebble. However, requests with `IngestAsWrites`
were not throttled, which has been seen to cause significant read
amplification.

This patch subjects `IngestAsWrites` requests to `Engine.PreIngestDelay`
as well, and adds a separate limit for `IngestAsWrites` requests
controlled via the cluster setting
`kv.bulk_io_write.concurrent_addsstable_as_writes_requests` (default
10). Since these requests are generally small, and will end up in the
Pebble memtable before being flushed to disk, we can tolerate a larger
limit for these requests than regular `AddSSTable` requests (1).

Release note (performance improvement): Bulk ingestion of small write
batches (e.g. index backfill into a large number of ranges) is now
throttled, to avoid buildup of read amplification and associated
performance degradation. Concurrency is controlled by the new cluster
setting `kv.bulk_io_write.concurrent_addsstable_as_writes_requests`.
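For illustration, here is a minimal, self-contained sketch of the scheme described above, assuming hypothetical names (`throttler`, `admit`, `preIngest`); it is not the actual `store_send.go` code, just a per-class semaphore sized like the cluster settings, with both request classes subject to the same LSM-health delay:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// throttler is a hypothetical, simplified stand-in for the limits described
// above: one semaphore per request class, plus a shared pre-ingest delay hook.
type throttler struct {
	addSSTSem   chan struct{}             // ~ kv.bulk_io_write.concurrent_addsstable_requests (default 1)
	asWritesSem chan struct{}             // ~ kv.bulk_io_write.concurrent_addsstable_as_writes_requests (default 10)
	preIngest   func(ctx context.Context) // stand-in for Engine.PreIngestDelay (LSM-health-based backoff)
}

func newThrottler(addSSTLimit, asWritesLimit int, preIngest func(context.Context)) *throttler {
	return &throttler{
		addSSTSem:   make(chan struct{}, addSSTLimit),
		asWritesSem: make(chan struct{}, asWritesLimit),
		preIngest:   preIngest,
	}
}

// admit blocks until the request may proceed and returns a release function.
// Both request classes are subject to the pre-ingest delay; only the semaphore
// (and hence the concurrency limit) differs.
func (t *throttler) admit(ctx context.Context, ingestAsWrites bool) (func(), error) {
	sem := t.addSSTSem
	if ingestAsWrites {
		sem = t.asWritesSem // small as-writes batches get a separate, larger limit
	}
	select {
	case sem <- struct{}{}:
	case <-ctx.Done():
		return nil, ctx.Err()
	}
	t.preIngest(ctx) // delay based on LSM health, regardless of request class
	return func() { <-sem }, nil
}

func main() {
	t := newThrottler(1, 10, func(ctx context.Context) {
		time.Sleep(10 * time.Millisecond) // pretend the LSM is mildly unhealthy
	})
	release, err := t.admit(context.Background(), true /* ingestAsWrites */)
	if err != nil {
		fmt.Println("admission failed:", err)
		return
	}
	defer release()
	fmt.Println("admitted; the SST would now be applied as a normal write batch")
}
```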

@cockroach-teamcity (Member):

This change is Reviewable

@sumeerbhola (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @erikgrinaker, and @nvanbenschoten)


pkg/kv/kvserver/store_send.go, line 290 at r1 (raw file):

		beforeEngineDelay := timeutil.Now()
		s.engine.PreIngestDelay(ctx)

isn't the other behavior change (not explicitly mentioned in the PR description) that IngestAsWrites are now also subject to PreIngestDelay?
Is there a justification for the delay to be the same as a larger sstable?

@dt (Member) commented Dec 16, 2021

This looks like a good idea to me.

Is there a justification for the delay to be the same as a larger sstable?

My understanding is that the investigation of tpce index builds that are currently able to reach high read amp suggests this is happening because these as-writes batches currently skip the PreIngestBackpressure and then overwhelm the LSM. Subjecting them to the same back pressure seems like a good idea, right? Like, if the store is healthy, there's no back pressure anyway, and they go right through. If the store is not healthy, then backpressuring "ssts", even if they'll be switched to memtable writes, seems like a good idea?

@nvanbenschoten (Member):

Subjecting them to the same back pressure seems like a good idea, right?

That was my thinking as well. This may actually be the more important part of this change.

@erikgrinaker force-pushed the addsstable-as-writes-limit branch from 237fa44 to 3b20ab2 on December 16, 2021 20:26
@erikgrinaker force-pushed the addsstable-as-writes-limit branch from 3b20ab2 to 3f125d1 on December 16, 2021 20:27
@erikgrinaker changed the title from "kvserver: limit AddSSTable requests with IngestAsWrites" to "kvserver: throttle AddSSTable requests with IngestAsWrites" on Dec 16, 2021
@erikgrinaker (Contributor, Author) commented Dec 16, 2021

isn't the other behavior change (not explicitly mentioned in the PR description) that IngestAsWrites are now also subject to PreIngestDelay?

You're right, that wasn't communicated properly here, and as others point out, it is likely to be the more significant improvement.

Is there a justification for the delay to be the same as a larger sstable?

This limits throughput to about 4 MB/s, concurrently with 16 MB/s of SST ingestion. Since the write batches likely have more overhead, and may be executed concurrently with SSTs, I think it makes sense to use the same delay for them.

@dt (Member) commented Dec 16, 2021

This limits throughput to about 4 MB/s, concurrently with 16 MB/s of SST ingestion.

Huh, just curious, how do we arrive at those mb/s numbers?

@erikgrinaker (Contributor, Author) commented Dec 16, 2021

This limits throughput to about 4 MB/s, concurrently with 16 MB/s of SST ingestion.

Huh, just curious, how do we arrive at those mb/s numbers?

16 MB/s for SST ingestion is from schemachanger.backfiller.max_sst_size=16MiB and kv.bulk_io_write.concurrent_addsstable_requests=1. 4 MB/s for IngestAsWrites is from kv.bulk_io_write.small_write_size=400KiB and kv.bulk_io_write.concurrent_addsstable_as_writes_requests=10.

Of course, I see now that the /s bit doesn't make sense (getting late here). But that's the concurrent volume we're ingesting, for lack of a better term.
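To spell out the arithmetic behind those figures (they are sizes of concurrently in-flight ingestion, not sustained rates), a quick back-of-the-envelope check; the constant names are illustrative only:

```go
package main

import "fmt"

func main() {
	const (
		maxSSTSize          = 16 << 20  // schemachanger.backfiller.max_sst_size (16 MiB)
		addSSTConcurrency   = 1         // kv.bulk_io_write.concurrent_addsstable_requests (default)
		smallWriteSize      = 400 << 10 // kv.bulk_io_write.small_write_size (400 KiB)
		asWritesConcurrency = 10        // kv.bulk_io_write.concurrent_addsstable_as_writes_requests (default)
	)
	// In-flight volume = per-request size × concurrency limit.
	fmt.Printf("SSTs in flight:      %.1f MiB\n", float64(maxSSTSize*addSSTConcurrency)/(1<<20))   // 16.0 MiB
	fmt.Printf("as-writes in flight: %.1f MiB\n", float64(smallWriteSize*asWritesConcurrency)/(1<<20)) // ~3.9 MiB, i.e. ~4 MB
}
```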

@erikgrinaker (Contributor, Author) commented Dec 17, 2021

I've run some quick tests with this patch, and the results are encouraging.

First off, I'll note that it's still possible to get read amp blowup. This can trivially be done by setting small_write_size=1GiB (so always ingested as writes) and keeping concurrent_addsstable_as_writes_requests=10 (or higher). This happens because PreIngestDelay hits the ceiling of 5s, which is not a sufficiently large delay to stabilize the LSM.

However, using more reasonable values such as small_write_size=1MiB and concurrent_addsstable_as_writes_requests=4 (so 4 MB concurrent write batch ingestion) allows the LSM to recover and read amp to drop back down to expected levels, even with some larger SSTs ingested normally too. This also holds for small_write_size=2MiB and concurrent_addsstable_as_writes_requests=2, where all SSTs are ingested as writes, and even up to concurrent_addsstable_as_writes_requests=10 in some periods (but 20 caused high read amp). For all of these different values, disk throughput remained stable and high.

Finally, I tried max_sst_size=400KiB, small_write_size=1MiB, and concurrent_addsstable_as_writes_requests=10. This emulates what would happen in the customer escalation we saw, where the SST sizes dropped to 400 KiB or below due to the large number of ranges, with the default throttling introduced here. This too showed stable read amp and high disk throughput.

This was not a comprehensive or systematic test, but I think sufficient to demonstrate that this throttling can prevent read amp blowup without significantly affecting throughput with default settings. @nvanbenschoten Did you find the same in your trials?

As for PreIngestDelay, it is unfortunate that the cap of 5s will allow read amp to spiral out. As a simple fix, we could consider letting PreIngestDelay spin until the calculated delay was less than the maximum delay. Or equivalently, setting some hard limit on read amp and waiting until it drops below that. However, that does carry a significant risk of starvation, and we'd likely want a more sophisticated policy here.
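As a rough illustration of that "wait until the LSM is healthy again" idea (purely a sketch with hypothetical names, not a proposal for PreIngestDelay's actual implementation), with an overall deadline to bound the starvation risk mentioned above:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForHealthyLSM polls a read-amp signal until it drops below maxReadAmp,
// giving up when ctx is done so a single request cannot be starved forever.
func waitForHealthyLSM(ctx context.Context, readAmp func() int, maxReadAmp int, pollEvery time.Duration) error {
	for readAmp() >= maxReadAmp {
		select {
		case <-time.After(pollEvery):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

func main() {
	// Fake read-amp signal that "recovers" after a few polls.
	remaining := 3
	readAmp := func() int {
		if remaining > 0 {
			remaining--
			return 40
		}
		return 8
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := waitForHealthyLSM(ctx, readAmp, 20, 10*time.Millisecond); err != nil {
		fmt.Println("gave up waiting:", err)
		return
	}
	fmt.Println("LSM healthy enough; proceeding with ingestion")
}
```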

@sumeerbhola (Collaborator):

Nice to see these results!

In addition to doing something like this, which is backportable, we should create an issue to (a) make this more robust, and (b) remove all these hard-to-fiddle-with cluster settings. It is much easier for an operator to have a mental model for a setting like admission.l0_sub_level_count_overload_threshold, and much harder to reason about kv.bulk_io_write.concurrent_addsstable* and rocksdb.ingest_backpressure.max_delay.

@erikgrinaker (Contributor, Author):

Yeah, I agree -- wrote up #73979.

@nvanbenschoten (Member):

This was not a comprehensive or systematic test, but I think sufficient to demonstrate that this throttling can prevent read amp blowup without significantly affecting throughput with default settings. @nvanbenschoten Did you find the same in your trials?

My last few trials were inconclusive. The intention was to show that with this change, we throttle sufficiently to not only stave off read amplification growth, but also enough to recover from it. The hope was that this would provide additional evidence of the stabilizing effect of this change. So my trials ran with this PR plus a cluster setting that let me dynamically toggle the throttling on IngestAsWrites (i.e. switch between the new and old behavior). Unfortunately, over the course of 4 trials with small_write_size=1GiB, I was never able to create unbounded read amplification growth. The best I can say is that enabling this throttling caused a cluster with a few nodes that had a read amplification of 20 to quickly drop down below 10 after a spike in delayed ingestion.

@erikgrinaker (Contributor, Author) commented Dec 17, 2021

Unfortunately, over the course of 4 trials with small_write_size=1GiB, I was never able to create unbounded read amplification growth. The best I can say is that enabling this throttling caused a cluster with a few nodes that had a read amplification of 20 to quickly drop down below 10 after a spike in delayed ingestion.

This is very surprising to me, as I have been able to consistently produce read amp growth up to 1000 -- even on the same cluster you have been testing with. Reach out to me on Slack if you'd like to give this another shot together.

The intention was to show that with this change, we throttle sufficiently to not only stave off read amplification growth, but also enough to recover from it.

I saw this in my tests, where I let read amp build up by setting large values for small_write_size and concurrent_addsstable_as_writes_requests, and then reduced them to reasonable values, which dropped read amp back down to expected levels (the 10-30 range), where it remained.

Do you feel like we need additional testing or tuning for this PR?

@nvanbenschoten (Member) left a comment

Do you feel like we need additional testing or tuning for this PR?

I'm satisfied with what I've seen on Slack and in the test clusters.

Reviewed 2 of 3 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

@erikgrinaker (Contributor, Author):

TFTRs!

bors r=dt,nvanbenschoten

@craig (bot) commented Dec 19, 2021

Build succeeded.

@blathers-crl (bot) commented Dec 19, 2021

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 3f125d1 to blathers/backport-release-21.1-73904: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@erikgrinaker deleted the addsstable-as-writes-limit branch on December 19, 2021 19:28
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jun 14, 2023
Fixes cockroachdb#102683. Part of cockroachdb#104154.

These were added way back in cockroachdb#36403 and cockroachdb#73904, pre-dating much of
IO admission control for leaseholder writes. With cockroachdb#95563, we now have IO
admission control for follower writes. Put together, we have ample
LSM read-amp protection through AC alone. These concurrency limiters are
now redundant and oblivious to more sophisticated AC measures. We
recently removed the below-raft equivalents of these limiters (cockroachdb#98762),
and as mentioned there, these limiters can exacerbate memory pressure.
Separately, we're looking to work on speedier restores, and these
limiters are starting to get in the way.

While here, we also disable the pre-ingest delay mechanism in pebble,
which too pre-dates AC, introduced way back in cockroachdb#34258 for RocksDB and in
cockroachdb#41839 for Pebble. IO AC is able to limit the number of L0 files, and
this pre-ingest delay with its maximum per-request delay time of 5s can
be less than effective. It's worth noting that the L0 file count
threshold at which this pre-ingest delay mechanism kicked in was 20,
while AC aims for 1000[^1].

This commit doesn't go as far as removing these limiters outright,
merely disabling them. This is just out of an overabundance of caution.
We can probably remove them once kvflowcontrol.enabled has had >1
release worth of baking time. Until then, it's nice to know we have
these old safety hatches. We have ample time in the release to assess
fallout from this commit, and also use this increased AddSST concurrency
to stress the kvflowcontrol machinery.

[^1]: The 1000 file limit exists to bound how long it takes to clear L0
      completely. Envelope math cribbed from elsewhere: With 2MiB files,
      1000 files is ~2GB, which at 40MB/s of compaction throughput (with
      a compaction slot consistently dedicated to L0) takes < 60s to
      clear the backlog. So the 'recovery' time is modest in that
      operators should not need to take manual action.

Release note: None
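For reference, the footnote's envelope math checks out; a trivial sketch using only the numbers quoted above:

```go
package main

import "fmt"

func main() {
	const (
		fileSizeMiB         = 2.0    // ~2 MiB per L0 file
		maxL0Files          = 1000.0 // AC's L0 file-count target
		compactionMiBPerSec = 40.0   // one compaction slot consistently dedicated to L0
	)
	backlogMiB := fileSizeMiB * maxL0Files // ~2000 MiB, i.e. ~2 GB
	// 2000 MiB / 40 MiB/s = 50s, comfortably under the quoted 60s bound.
	fmt.Printf("time to clear a full L0: %.0fs\n", backlogMiB/compactionMiBPerSec)
}
```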
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jul 4, 2023
craig bot pushed a commit that referenced this pull request Jul 4, 2023
104861: kvserver: disable pre-AC above-raft AddSST throttling r=irfansharif a=irfansharif


Co-authored-by: irfan sharif <[email protected]>