storage: limit RevertRange batches to 32mb #59716

Merged
merged 1 commit into cockroachdb:master from limit-revert-batch on Feb 18, 2021

Conversation

dt
Member

@dt dt commented Feb 2, 2021

The existing limit in key-count/span-count can produce batches in excess of 64mb
if, for example, they have very large keys. These batches are then rejected for
exceeding the raft command size limit.

This adds an additional hard-coded limit of 32mb on the write batch to which keys
or spans to clear are added (if the command is executed against a non-Batch, the
limit is ignored). The size of the batch is re-checked once every 32 keys.

Release note (bug fix): avoid creating batches that exceed the raft command limit (64mb) when reverting ranges that contain very large keys.
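A minimal, self-contained sketch of the mechanism described above. The constants mirror the ones the change introduces; the writeBatch type and the key sizes are illustrative stand-ins rather than the actual pkg/storage code, and in the real change the size check only applies when the writer is a Batch.

```go
package main

import "fmt"

// writeBatch is an illustrative stand-in for a storage write batch; only its
// accumulated encoded size matters for this sketch.
type writeBatch struct{ size int }

func (b *writeBatch) clearKey(k []byte) { b.size += len(k) }
func (b *writeBatch) len() int          { return b.size }

func main() {
	// Mirrors the constants introduced by the change: a 32mb cap on the
	// write batch, re-checked once every 32 keys.
	const maxBatchByteSize, recheckBatchSizeEvery = 32 << 20, 32

	// Pretend the range holds 10,000 keys of 1mb each, so the byte cap trips
	// long before a key-count limit would.
	keys := make([][]byte, 10000)
	for i := range keys {
		keys[i] = make([]byte, 1<<20)
	}

	b := &writeBatch{}
	resumeFrom := len(keys)
	for i, k := range keys {
		b.clearKey(k)
		if (i+1)%recheckBatchSizeEvery == 0 && b.len() >= maxBatchByteSize {
			// Stop here; the caller gets a resume span starting at the next
			// key and issues a follow-up request.
			resumeFrom = i + 1
			break
		}
	}
	fmt.Printf("cleared %d keys (%d bytes) before stopping to resume later\n",
		resumeFrom, b.len())
}
```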

@dt dt requested review from andreimatei and adityamaru February 2, 2021 19:48
@cockroach-teamcity
Member

This change is Reviewable

@dt dt requested a review from sumeerbhola February 2, 2021 20:23
Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @andreimatei, and @dt)


pkg/storage/mvcc.go, line 2126 at r1 (raw file):

	}

	const maxBatchByteSize, recheckBatchSizeEvery = 32 << 20, 32

It is confusing that we already have a maxBatchSize parameter and that is set to MaxSpanRequestKeys.
I thought MaxSpanRequestKeys was to bound the amount of data retrieved for a read. If that is correct, why is it being used here?
Was its purpose to limit the amount of read work or write work? Given that we count a ClearRange as 1, I guess it was to limit the amount of write work. Can we unify that with what is being introduced here?


pkg/storage/mvcc.go, line 2198 at r1 (raw file):

					break
				}
				recheckBatchSize = 0

We only write to the Batch in flushClearedKeys, so if there are 63 K-V pairs that together exceed the 64MB limit, and then we find a non-matching key and call flush, we will not notice that we have exceeded the batch size. I think we could approximate an upper bound on the size of the batch by just adding the key lengths. I think it would be good to push that approximation into clearMatchingKey since it knows when it is transitioning from individual clears to clear range.

Even better would be to lift that code out into a clearBatcher struct with functions flushClearedRun, clear, getCountAndApproxSize that can be used to cleanly separate the iteration from the write batching. And that would unify this if-block with the preceding one since both are doing the same thing wrt constructing a resume span.
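For illustration, a rough sketch of what such a clearBatcher could look like; only the method names come from the suggestion above, and the fields and signatures are assumptions rather than anything in the PR.

```go
// clearBatcher separates the key iteration from the write batching. This is a
// sketch only; fields and behavior are assumed, not taken from mvcc.go.
type clearBatcher struct {
	run        [][]byte // keys buffered in the current run of matching keys
	count      int      // total keys/spans cleared so far
	approxSize int64    // running sum of key lengths, an upper-bound estimate
}

// clear buffers a matching key and updates the count and size estimate, so an
// oversized batch can be noticed before anything is flushed.
func (b *clearBatcher) clear(key []byte) {
	k := append([]byte(nil), key...)
	b.run = append(b.run, k)
	b.count++
	b.approxSize += int64(len(k))
}

// flushClearedRun hands the buffered run back to the caller, which decides
// whether to write it as individual clears or a single ClearRange, and resets
// the buffer for the next run.
func (b *clearBatcher) flushClearedRun() [][]byte {
	run := b.run
	b.run = nil
	return run
}

// getCountAndApproxSize lets the iteration loop compare progress against both
// the key-count limit and the byte-size limit when deciding on a resume span.
func (b *clearBatcher) getCountAndApproxSize() (int, int64) {
	return b.count, b.approxSize
}
```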

Contributor

@andreimatei andreimatei left a comment


The change looks good to me, but I'm not the right person to review it. Seems like Sumeer's got you covered.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @dt)

@dt dt force-pushed the limit-revert-batch branch 2 times, most recently from 61afed7 to 2c78c3e on February 17, 2021 02:48
Member Author

@dt dt left a comment


Cleaned up a bit and added some tests. RFAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @sumeerbhola)


pkg/storage/mvcc.go, line 2126 at r1 (raw file):

Previously, sumeerbhola wrote…

It is confusing that we already have a maxBatchSize parameter and that is set to MaxSpanRequestKeys.
I thought MaxSpanRequestKeys was to bound the amount of data retrieved for a read. If that is correct, why is it being used here?
Was its purpose to limit the amount of read work or write work? Given that we count a ClearRange as 1, I guess it was to limit the amount of write work. Can we unify that with what is being introduced here?

The count is indeed limiting write work, and more generally just limiting work, so that a caller could paginate, persist progress checkpoints, etc.

But the count, while likely a good proxy for how long the request will run, isn't on its own enough to avoid running afoul of the max command size in the face of giant keys, so it seems we need both.


pkg/storage/mvcc.go, line 2198 at r1 (raw file):

Previously, sumeerbhola wrote…

We only write to the Batch in flushClearedKeys, so if there are 63 K-V pairs that together exceed the 64MB limit, and then we find a non-matching key and call flush, we will not notice that we have exceeded the batch size. I think we could approximate an upper bound on the size of the batch by just adding the key lengths. I think it would be good to push that approximation into clearMatchingKey since it knows when it is transitioning from individual clears to clear range.

Even better would be to lift that code out into a clearBatcher struct with functions flushClearedRun, clear, getCountAndApproxSize that can be used to cleanly separate the iteration from the write batching. And that would unify this if-block with the preceding one since both are doing the same thing wrt constructing a resume span.

Since we have an eye to backporting I didn't want to go too far into refactoring and pulling out a new helper just yet.

But I think I fixed the hypothetical 63-giant-keys case by opting to flush the buffered keys as a clear-range rather than individually, even if we didn't reach the 64-key mark that would usually motivate that, whenever their encoded size is too large to flush one-by-one.

While I was at it, I switched to tracking the size as the sum of bytes added, to avoid the type-sniff and intermittent batch-size checks, and made the size limit an argument. While I don't want to make it a request param in this change (again, for backport), making it a function arg at least makes it easier to test.
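A hedged sketch (hypothetical names, details elided) of the flush decision described here: if the buffered run's encoded size is already over the limit, write one range clear instead of many huge point clears, and report the bytes added either way.

```go
// flushRun is illustrative only: buf holds the buffered run of keys and
// encodedBufSize is the running sum of their encoded lengths.
func flushRun(
	buf [][]byte, encodedBufSize, maxRunByteSize int,
	clearKey func(k []byte), clearRange func(start, end []byte),
) (bytesAdded int) {
	if len(buf) == 0 {
		return 0
	}
	if encodedBufSize >= maxRunByteSize {
		// Too big to write key-by-key: emit a single range clear over the
		// run. (A real implementation needs an exclusive end key, e.g. the
		// last key's immediate successor; that detail is elided here.)
		clearRange(buf[0], buf[len(buf)-1])
		return len(buf[0]) + len(buf[len(buf)-1])
	}
	for _, k := range buf {
		clearKey(k)
		bytesAdded += len(k)
	}
	return bytesAdded
}
```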

@dt dt force-pushed the limit-revert-batch branch from 2c78c3e to dfe6752 on February 17, 2021 16:02
Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewed 3 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @sumeerbhola)


pkg/storage/mvcc.go, line 2077 at r2 (raw file):

// batch that is too large -- in number of bytes -- for raft to replicate if the
// keys are very large. So if the total length of the keys or key spans cleared
// exceeds maxBatchByteSize it will also stop and return a reusme span.

resume


pkg/storage/mvcc.go, line 2171 at r2 (raw file):

						}
					}
					batchByteSize += encodedBufSize

shouldn't this be outside the for loop since encodedBufSize already reflects all the keys? If yes, can we get some test coverage so we don't have an inadvertent performance regression.
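Purely as illustration of what the comment is asking for (names assumed): since encodedBufSize already reflects the whole buffered run, it should be added to the running batch size once, not on every iteration of the flush loop.

```go
// Before (sketch): counts the whole run's size once per buffered key.
for _, k := range buf {
	writeSingleClear(k)
	batchByteSize += encodedBufSize
}

// After (sketch): write the keys, then count the run's size once.
for _, k := range buf {
	writeSingleClear(k)
}
batchByteSize += encodedBufSize
```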

@dt dt force-pushed the limit-revert-batch branch 2 times, most recently from c52ac93 to e18cfeb on February 18, 2021 14:47
Member Author

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @sumeerbhola)


pkg/storage/mvcc.go, line 2077 at r2 (raw file):

Previously, sumeerbhola wrote…

resume

Done.


pkg/storage/mvcc.go, line 2171 at r2 (raw file):

Previously, sumeerbhola wrote…

shouldn't this be outside the for loop since encodedBufSize already reflects all the keys? If yes, can we get some test coverage so we don't have an inadvertent performance regression.

Yep, indeed. The random test actually already has enough data to catch it, so I just needed to add a count of how many resumes it did.

Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

Reviewed 2 of 2 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @sumeerbhola)


pkg/storage/mvcc_test.go, line 2403 at r3 (raw file):

			const keyLimit = 100
			keyLen := int64(len(roachpb.Key(fmt.Sprintf("%05d", 1)))) + MVCCVersionTimestampSize
			maxAttempts := (numKVs * keyLen) / byteLimit

It's a little hard to see which limit is the bottleneck. Ideally, we should have one test case each where one of the two is the bottleneck.
And how about a minAttempts to ensure that things are being broken into batches.
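A sketch of the bounds the comment asks for, assuming the variables from the quoted test (numKVs, keyLen, byteLimit, keyLimit) and an attempts counter; minAttempts is the hypothetical addition.

```go
// If the byte limit is the bottleneck, attempts cannot exceed the total byte
// size divided by the per-attempt byte budget (as in the quoted maxAttempts).
maxAttempts := (numKVs * keyLen) / byteLimit
// If the key limit is the bottleneck, at least numKVs/keyLimit attempts are
// needed, which also proves the work really was broken into batches.
minAttempts := numKVs / keyLimit

require.GreaterOrEqual(t, attempts, minAttempts)
require.LessOrEqual(t, attempts, maxAttempts)
```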

@dt dt force-pushed the limit-revert-batch branch from e18cfeb to 840282b on February 18, 2021 17:27
The existing limit in key-count/span-count can produce batches in excess of 64mb
if, for example, they have very large keys. These batches are then rejected for
exceeding the raft command size limit.

This adds an additional hard-coded limit of 32mb on the write batch to which keys
or spans to clear are added (if the command is executed against a non-Batch, the
limit is ignored). The size of the batch is re-checked once every 32 keys.

Release note (bug fix): avoid creating batches that exceed the raft command limit (64mb) when reverting ranges that contain very large keys.
@dt dt force-pushed the limit-revert-batch branch from 840282b to 89fcd2c on February 18, 2021 17:28
Member Author

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @adityamaru and @sumeerbhola)


pkg/storage/mvcc_test.go, line 2403 at r3 (raw file):

Previously, sumeerbhola wrote…

It's a little hard to see which limit is the bottleneck. Ideally, we should have one test case each where one of the two is the bottleneck.
And how about a minAttempts to ensure that things are being broken into batches.

It's a bit easier to exercise the specific limits in isolation in the small, static tests above, so I added cases there. I only used the random test for catching the over-counting because it actually had enough data to trip the buffer-size case.

Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

Reviewed 1 of 1 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru and @sumeerbhola)

@dt
Member Author

dt commented Feb 18, 2021

TFTR!

bors r+

@craig
Contributor

craig bot commented Feb 18, 2021

Build failed (retrying...):

@craig craig bot merged commit 40a35fe into cockroachdb:master Feb 18, 2021
@craig
Contributor

craig bot commented Feb 18, 2021

Build succeeded:

@dt dt deleted the limit-revert-batch branch February 28, 2021 22:49