storage: limit RevertRange batches to 32mb #59716

Merged
merged 1 commit into cockroachdb:master from limit-revert-batch on Feb 18, 2021

Conversation

dt
Member

@dt dt commented Feb 2, 2021

The existing limit in key-count/span-count can produce batches in excess of 64mb
if, for example, they have very large keys. These batches are then rejected for
exceeding the raft command size limit.

This adds an additional hard-coded limit of 32mb on the write batch to which keys
or spans to clear are added (if the command is executed against a non-Batch, the
limit is ignored). The size of the batch is re-checked once every 32 keys.

Release note (bug fix): avoid creating batches that exceed the raft command limit (64mb) when reverting ranges that contain very large keys.
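A minimal, self-contained sketch of the mechanism described above. The constants mirror the ones the change introduces; the writeBatch type and the key sizes are illustrative stand-ins rather than the actual pkg/storage code, and in the real change the size check only applies when the writer is a Batch.

```go
package main

import "fmt"

// writeBatch is an illustrative stand-in for a storage write batch; only its
// accumulated encoded size matters for this sketch.
type writeBatch struct{ size int }

func (b *writeBatch) clearKey(k []byte) { b.size += len(k) }
func (b *writeBatch) len() int          { return b.size }

func main() {
	// Mirrors the constants introduced by the change: a 32mb cap on the
	// write batch, re-checked once every 32 keys.
	const maxBatchByteSize, recheckBatchSizeEvery = 32 << 20, 32

	// Pretend the range holds 10,000 keys of 1mb each, so the byte cap trips
	// long before a key-count limit would.
	keys := make([][]byte, 10000)
	for i := range keys {
		keys[i] = make([]byte, 1<<20)
	}

	b := &writeBatch{}
	resumeFrom := len(keys)
	for i, k := range keys {
		b.clearKey(k)
		if (i+1)%recheckBatchSizeEvery == 0 && b.len() >= maxBatchByteSize {
			// Stop here; the caller gets a resume span starting at the next
			// key and issues a follow-up request.
			resumeFrom = i + 1
			break
		}
	}
	fmt.Printf("cleared %d keys (%d bytes) before stopping to resume later\n",
		resumeFrom, b.len())
}
```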

@dt dt requested review from andreimatei and adityamaru February 2, 2021 19:48
@cockroach-teamcity
Member

This change is Reviewable

@dt dt requested a review from sumeerbhola February 2, 2021 20:23
Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @andreimatei, and @dt)


pkg/storage/mvcc.go, line 2126 at r1 (raw file):

	}

	const maxBatchByteSize, recheckBatchSizeEvery = 32 << 20, 32

It is confusing that we already have a maxBatchSize parameter and that is set to MaxSpanRequestKeys.
I thought MaxSpanRequestKeys was to bound the amount of data retrieved for a read. If that is correct, why is it being used here?
Was its purpose to limit the amount of read work or write work? Given that we count a ClearRange as 1, I guess it was to limit the amount of write work. Can we unify that with what is being introduced here?


pkg/storage/mvcc.go, line 2198 at r1 (raw file):

					break
				}
				recheckBatchSize = 0

We only write to the Batch in flushClearedKeys, so if there are 63 K-V pairs that together exceed the 64MB limit, and then we find a non-matching key and call flush, we will not notice that we have exceeded the batch size. I think we could approximate an upper bound on the size of the batch by just adding the key lengths. I think it would be good to push that approximation into clearMatchingKey since it knows when it is transitioning from individual clears to clear range.

Even better would be to lift that code out into a clearBatcher struct with functions flushClearedRun, clear, getCountAndApproxSize that can be used to cleanly separate the iteration from the write batching. And that would unify this if-block with the preceding one since both are doing the same thing wrt constructing a resume span.
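For illustration, a rough sketch of what such a clearBatcher could look like; only the method names come from the suggestion above, and the fields and signatures are assumptions rather than anything in the PR.

```go
// clearBatcher separates the key iteration from the write batching. This is a
// sketch only; fields and behavior are assumed, not taken from mvcc.go.
type clearBatcher struct {
	run        [][]byte // keys buffered in the current run of matching keys
	count      int      // total keys/spans cleared so far
	approxSize int64    // running sum of key lengths, an upper-bound estimate
}

// clear buffers a matching key and updates the count and size estimate, so an
// oversized batch can be noticed before anything is flushed.
func (b *clearBatcher) clear(key []byte) {
	k := append([]byte(nil), key...)
	b.run = append(b.run, k)
	b.count++
	b.approxSize += int64(len(k))
}

// flushClearedRun hands the buffered run back to the caller, which decides
// whether to write it as individual clears or a single ClearRange, and resets
// the buffer for the next run.
func (b *clearBatcher) flushClearedRun() [][]byte {
	run := b.run
	b.run = nil
	return run
}

// getCountAndApproxSize lets the iteration loop compare progress against both
// the key-count limit and the byte-size limit when deciding on a resume span.
func (b *clearBatcher) getCountAndApproxSize() (int, int64) {
	return b.count, b.approxSize
}
```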

Contributor

@andreimatei andreimatei left a comment


The change looks good to me, but I'm not the right person to review it. Seems like Sumeer's got you covered.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @dt)

@dt dt force-pushed the limit-revert-batch branch 2 times, most recently from 61afed7 to 2c78c3e on February 17, 2021 02:48
Member Author

@dt dt left a comment


Cleaned up a bit and added some tests. RFAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @sumeerbhola)


pkg/storage/mvcc.go, line 2126 at r1 (raw file):

Previously, sumeerbhola wrote…

It is confusing that we already have a maxBatchSize parameter and that is set to MaxSpanRequestKeys.
I thought MaxSpanRequestKeys was to bound the amount of data retrieved for a read. If that is correct, why is it being used here?
Was its purpose to limit the amount of read work or write work? Given that we count a ClearRange as 1, I guess it was to limit the amount of write work. Can we unify that with what is being introduced here?

The count is indeed limiting write work, and more generally just limiting work, so that a caller could paginate, persist progress checkpoints, etc.

But the count, while likely a good proxy for how long the request will run, isn't on its own enough to avoid running afoul of the max command size in the face of giant keys, so it seems we need both.


pkg/storage/mvcc.go, line 2198 at r1 (raw file):

Previously, sumeerbhola wrote…

We only write to the Batch in flushClearedKeys, so if there are 63 K-V pairs that together exceed the 64MB limit, and then we find a non-matching key and call flush, we will not notice that we have exceeded the batch size. I think we could approximate an upper bound on the size of the batch by just adding the key lengths. I think it would be good to push that approximation into clearMatchingKey since it knows when it is transitioning from individual clears to clear range.

Even better would be to lift that code out into a clearBatcher struct with functions flushClearedRun, clear, getCountAndApproxSize that can be used to cleanly separate the iteration from the write batching. And that would unify this if-block with the preceding one since both are doing the same thing wrt constructing a resume span.

Since we have an eye to backporting I didn't want to go too far into refactoring and pulling out a new helper just yet.

But I think I fixed the hypothetical 63-giant-keys case by opting to flush the buffered keys as a clear-range rather than individually, even if we didn't reach the 64-key mark that would usually motivate that, whenever their encoded size is too large to flush one-by-one.

While I was at it, I switched to tracking the size as the sum of bytes added, to avoid the type-sniff and intermittent batch-size checks, and made the size limit an argument. While I don't want to make it a request param in this change (again, for backport), making it a function arg at least makes it easier to test.
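A hedged sketch (hypothetical names, details elided) of the flush decision described here: if the buffered run's encoded size is already over the limit, write one range clear instead of many huge point clears, and report the bytes added either way.

```go
// flushRun is illustrative only: buf holds the buffered run of keys and
// encodedBufSize is the running sum of their encoded lengths.
func flushRun(
	buf [][]byte, encodedBufSize, maxRunByteSize int,
	clearKey func(k []byte), clearRange func(start, end []byte),
) (bytesAdded int) {
	if len(buf) == 0 {
		return 0
	}
	if encodedBufSize >= maxRunByteSize {
		// Too big to write key-by-key: emit a single range clear over the
		// run. (A real implementation needs an exclusive end key, e.g. the
		// last key's immediate successor; that detail is elided here.)
		clearRange(buf[0], buf[len(buf)-1])
		return len(buf[0]) + len(buf[len(buf)-1])
	}
	for _, k := range buf {
		clearKey(k)
		bytesAdded += len(k)
	}
	return bytesAdded
}
```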

@dt dt force-pushed the limit-revert-batch branch from 2c78c3e to dfe6752 on February 17, 2021 16:02
Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewed 3 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @sumeerbhola)


pkg/storage/mvcc.go, line 2077 at r2 (raw file):

// batch that is too large -- in number of bytes -- for raft to replicate if the
// keys are very large. So if the total length of the keys or key spans cleared
// exceeds maxBatchByteSize it will also stop and return a reusme span.

resume


pkg/storage/mvcc.go, line 2171 at r2 (raw file):

						}
					}
					batchByteSize += encodedBufSize

shouldn't this be outside the for loop since encodedBufSize already reflects all the keys? If yes, can we get some test coverage so we don't have an inadvertent performance regression.
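Purely as illustration of what the comment is asking for (names assumed): since encodedBufSize already reflects the whole buffered run, it should be added to the running batch size once, not on every iteration of the flush loop.

```go
// Before (sketch): counts the whole run's size once per buffered key.
for _, k := range buf {
	writeSingleClear(k)
	batchByteSize += encodedBufSize
}

// After (sketch): write the keys, then count the run's size once.
for _, k := range buf {
	writeSingleClear(k)
}
batchByteSize += encodedBufSize
```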

@dt dt force-pushed the limit-revert-batch branch 2 times, most recently from c52ac93 to e18cfeb on February 18, 2021 14:47
Member Author

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @sumeerbhola)


pkg/storage/mvcc.go, line 2077 at r2 (raw file):

Previously, sumeerbhola wrote…

resume

Done.


pkg/storage/mvcc.go, line 2171 at r2 (raw file):

Previously, sumeerbhola wrote…

shouldn't this be outside the for loop since encodedBufSize already reflects all the keys? If yes, can we get some test coverage so we don't have an inadvertent performance regression.

Yep, indeed. The random test actually already has enough data to catch it, so I just needed to add a count of how many resumes it did.

Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

Reviewed 2 of 2 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @sumeerbhola)


pkg/storage/mvcc_test.go, line 2403 at r3 (raw file):

			const keyLimit = 100
			keyLen := int64(len(roachpb.Key(fmt.Sprintf("%05d", 1)))) + MVCCVersionTimestampSize
			maxAttempts := (numKVs * keyLen) / byteLimit

It's a little hard to see which limit is the bottleneck. Ideally, we should have one test case each where one of the two is the bottleneck.
And how about a minAttempts to ensure that things are being broken into batches.
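A sketch of the bounds the comment asks for, assuming the variables from the quoted test (numKVs, keyLen, byteLimit, keyLimit) and an attempts counter; minAttempts is the hypothetical addition.

```go
// If the byte limit is the bottleneck, attempts cannot exceed the total byte
// size divided by the per-attempt byte budget (as in the quoted maxAttempts).
maxAttempts := (numKVs * keyLen) / byteLimit
// If the key limit is the bottleneck, at least numKVs/keyLimit attempts are
// needed, which also proves the work really was broken into batches.
minAttempts := numKVs / keyLimit

require.GreaterOrEqual(t, attempts, minAttempts)
require.LessOrEqual(t, attempts, maxAttempts)
```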

@dt dt force-pushed the limit-revert-batch branch from e18cfeb to 840282b on February 18, 2021 17:27
The existing limit in key-count/span-count can produce batches in excess of 64mb
if, for example, they have very large keys. These batches are then rejected for
exceeding the raft command size limit.

This adds an additional hard-coded limit of 32mb on the write batch to which keys
or spans to clear are added (if the command is executed against a non-Batch, the
limit is ignored). The size of the batch is re-checked once every 32 keys.

Release note (bug fix): avoid creating batches that exceed the raft command limit (64mb) when reverting ranges that contain very large keys.
@dt dt force-pushed the limit-revert-batch branch from 840282b to 89fcd2c on February 18, 2021 17:28
Member Author

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @adityamaru and @sumeerbhola)


pkg/storage/mvcc_test.go, line 2403 at r3 (raw file):

Previously, sumeerbhola wrote…

It's a little hard to see which limit is the bottleneck. Ideally, we should have one test case each where one of the two is the bottleneck.
And how about a minAttempts to ensure that things are being broken into batches.

It's a bit easier to exercise the specific limits in isolation in the small, static tests above, so I added cases there. I only used the random test for catching the over-counting because it actually had enough data to trip the buffer-size case.

Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

Reviewed 1 of 1 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru and @sumeerbhola)

@dt
Member Author

dt commented Feb 18, 2021

TFTR!

bors r+

@craig
Contributor

craig bot commented Feb 18, 2021

Build failed (retrying...):

@craig craig bot merged commit 40a35fe into cockroachdb:master Feb 18, 2021
@craig
Contributor

craig bot commented Feb 18, 2021

Build succeeded:

@dt dt deleted the limit-revert-batch branch February 28, 2021 22:49