storage: limit RevertRange batches to 32mb #59716
Conversation
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @andreimatei, and @dt)
pkg/storage/mvcc.go, line 2126 at r1 (raw file):
const maxBatchByteSize, recheckBatchSizeEvery = 32 << 20, 32
It is confusing that we already have a maxBatchSize parameter and that is set to MaxSpanRequestKeys.
I thought MaxSpanRequestKeys was to bound the amount of data retrieved for a read. If that is correct, why is it being used here?
Was its purpose to limit the amount of read work or write work? Given that we count a ClearRange as 1, I guess it was to limit the amount of write work. Can we unify that with what is being introduced here?
pkg/storage/mvcc.go, line 2198 at r1 (raw file):
recheckBatchSize = 0
We only write to the Batch in flushClearedKeys, so if there are 63 K-V pairs that together exceed the 64MB limit, and then we find a non-matching key and call flush, we will not notice that we have exceeded the batch size. I think we could approximate an upper bound on the size of the batch by just adding the key lengths. I think it would be good to push that approximation into clearMatchingKey since it knows when it is transitioning from individual clears to clear range.
Even better would be to lift that code out into a clearBatcher struct with functions flushClearedRun, clear, getCountAndApproxSize that can be used to cleanly separate the iteration from the write batching. And that would unify this if-block with the preceding one since both are doing the same thing wrt constructing a resume span.
The thing looks good to me, but I'm not the right person to look. Seems like Sumeer's got you.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @dt)
Force-pushed 61afed7 to 2c78c3e
Cleaned up a bit and added some tests. RFAL!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @sumeerbhola)
pkg/storage/mvcc.go, line 2126 at r1 (raw file):
Previously, sumeerbhola wrote…
It is confusing that we already have a maxBatchSize parameter and that is set to MaxSpanRequestKeys.
I thought MaxSpanRequestKeys was to bound the amount of data retrieved for a read. If that is correct, why is it being used here?
Was its purpose to limit the amount of read work or write work? Given that we count a ClearRange as 1, I guess it was to limit the amount of write work. Can we unify that with what is being introduced here?
The count is indeed limiting write work, and more generally just limiting work, so that a caller could paginate, persist progress checkpoints, etc.
But the count, while likely a good proxy for how long the request will run, isn't on its own enough to avoid running afoul of the max command size in the face of giant keys, so we need both, it seems.
pkg/storage/mvcc.go, line 2198 at r1 (raw file):
Previously, sumeerbhola wrote…
We only write to the Batch in flushClearedKeys, so if there are 63 K-V pairs that together exceed the 64MB limit, and then we find a non-matching key and call flush, we will not notice that we have exceeded the batch size. I think we could approximate an upper bound on the size of the batch by just adding the key lengths. I think it would be good to push that approximation into clearMatchingKey since it knows when it is transitioning from individual clears to clear range.
Even better would be to lift that code out into a clearBatcher struct with functions flushClearedRun, clear, getCountAndApproxSize that can be used to cleanly separate the iteration from the write batching. And that would unify this if-block with the preceding one since both are doing the same thing wrt constructing a resume span.
Since we have an eye to backporting I didn't want to go too far into refactoring and pulling out a new helper just yet.
But I think I fixed the hypothetical 63-giant-keys case, by opting to flush the buffered keys as a clear-range rather than individually, even if we didn't reach the 64-key mark that would usually motivate that, but instead if, when we go to flush them one-by-one, their encoded size is too large.
While I was at it, I switched to tracking the size as the sum of bytes added, to avoid the type-sniff and intermittent batch size checks, and made the size limit an argument. While I don't want to make it a request param in this change (again, for backport), making it a function arg at least makes it easier to test.
Force-pushed 2c78c3e to dfe6752
Reviewed 3 of 4 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @sumeerbhola)
pkg/storage/mvcc.go, line 2077 at r2 (raw file):
// batch that is too large -- in number of bytes -- for raft to replicate if the
// keys are very large. So if the total length of the keys or key spans cleared
// exceeds maxBatchByteSize it will also stop and return a reusme span.
typo: s/reusme/resume/
pkg/storage/mvcc.go, line 2171 at r2 (raw file):
batchByteSize += encodedBufSize

Shouldn't this be outside the for loop, since encodedBufSize already reflects all the keys? If yes, can we get some test coverage so we don't have an inadvertent performance regression?
Force-pushed c52ac93 to e18cfeb
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @sumeerbhola)
pkg/storage/mvcc.go, line 2077 at r2 (raw file):
Previously, sumeerbhola wrote…
resume
Done.
pkg/storage/mvcc.go, line 2171 at r2 (raw file):
Previously, sumeerbhola wrote…
Shouldn't this be outside the for loop, since encodedBufSize already reflects all the keys? If yes, can we get some test coverage so we don't have an inadvertent performance regression?
yep, indeed. The random test actually already has enough data to catch it, so just needed to add counting on how many resumes it did.
Reviewed 2 of 2 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @sumeerbhola)
pkg/storage/mvcc_test.go, line 2403 at r3 (raw file):
const keyLimit = 100
keyLen := int64(len(roachpb.Key(fmt.Sprintf("%05d", 1)))) + MVCCVersionTimestampSize
maxAttempts := (numKVs * keyLen) / byteLimit
It's a little hard to see which limit is the bottleneck. Ideally, we should have one test case each where one of the two is the bottleneck.
And how about a minAttempts to ensure that things are being broken into batches?
Force-pushed e18cfeb to 840282b
Force-pushed 840282b to 89fcd2c
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @adityamaru and @sumeerbhola)
pkg/storage/mvcc_test.go, line 2403 at r3 (raw file):
Previously, sumeerbhola wrote…
It's a little hard to see which limit is the bottleneck. Ideally, we should have one test case each where one of the two is the bottleneck.
And how about a minAttempts to ensure that things are being broken into batches?
It's a bit easier to exercise the specific limits in isolation in the small, static tests above, so I added cases there. I only used the random test for catching the over-counting because it actually had enough data to exceed the buffer-size case.
Reviewed 1 of 1 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru and @sumeerbhola)
TFTR! bors r+
Build failed (retrying...)
Build succeeded
The existing limit in key-count/span-count can produce batches in excess of 64mb,
if, for example, they have very large keys. These batches then are rejected for
exceeding the raft command size limit.
This adds an additional hard-coded limit of 32mb on the write batch to which keys
or spans to clear are added (if the command is executed against a non-Batch the
limit is ignored). The size of the batch is re-checked once every 32 keys.
Release note (bug fix): avoid creating batches that exceed the raft command limit (64mb) when reverting ranges that contain very large keys.