-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: scan empty right-hand side of split for stats #78218
kv: scan empty right-hand side of split for stats #78218
Conversation
b5ccfdc
to
deb759c
Compare
I confirmed that IMPORT (e.g. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Hooking this up in a TPC-E 50K trade table IMPORT shaved about 900-1200s off the 7100s import time 🎉
One thing that would be pretty convenient to IMPORT, if I could tempt you into adding another commit to this, would be including these stats in the Even more nifty would be to additionally, if that RHS stats total is >0, grab iter and seekge to find the first key and send that in the response as well, e.g. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find. If it's often the case that one of the sides will be empty, maybe we could also just do a seek first to check?
I wonder if we should flip the default? like, manual splits are usually either in the middle of a range in which case, doesn't matter which we pick, or, very often, are sent to carve out a new table span, index span, partition, etc, in which case it is much more likely that the RHS is empty. |
deb759c
to
2db4c5d
Compare
There's since been lots of discussion about this in https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1647551964065369.
Done. PTAL. |
2db4c5d
to
aa43bb4
Compare
Nice! maybe in the future we re-introduce some form of caller hint as to which side is smaller for small-but-non-empty cases, but I suspect this version gets us >90% of the benefit without changing the signatures, so I'm very 👍 on it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this was a smaller change to the arithmetic than I expected. Do we have good coverage of both cases? I don't see any changes to tests. I assume we'd want to force the heuristic in some of the tests that today split in the middle & verify the stats (I know such tests exists but don't remember the name). LGTM pending that kind of test coverage.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 947 at r1 (raw file):
// isGlobalKeyspaceEmpty returns whether the global keyspace of the provided // range is entirely empty or whether it contains at least one key.
nit: the "or whether it contains at least one key" could be misconstrued as returning true
if there is a key, but this is not the intention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had explicit test coverage, but it was removed when I removed the caller hint and made this heuristic-based. It's now harder to control the choice of split directly from far away. We could add the plumbing back in as a testing knob, but that's pretty heavy-weight. What do you think of using a metamorphic constant to occasionally default to scanning the RHS in tests to improve coverage?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
I can't form the most coherent argument around this yet, but it seems smelly to use a randomized constant for this. It's a straightforward code path, not a large number of options. The hint is either left or right and both need to be tested directly. This would be different if we had some sort of "heuristic cutoff" that would take a float64 or something like that, then we could randomize the threshold since it shouldn't make a difference anywhere. But even then we'd want to have a direct test for some hard-coded thresholds. |
I think we might actually want the caller-supplied flag version back? I'm still poking around here, but one case that jumps out so far -- when a range has been "filled" from left to right, say from If we just split at That said, that above-split case should be the minority of split sent even in #78523, so I'd still expect to see significant benefit from auto-RHS, but in practice I'm seeing almost none, so something else must be up here. I'll keep digging. So far on this run with logging for which side we scan, after the 100 initial splits mostly did take the rhs stats path, most of the growth since has been in the rhs=false case:
I'm logging which key it was that existed, so maybe I can try to piece together why we're seeing this. It's a little tricky since I can't tell the above split from a pre-split at this point, so will need to sleuth a little to tell which of these are expected or not. |
Aha, I think I understand what's up. In a multi-column family table like TPC-E's trade where the first family is tiny, we almost never have a truly empty RHS, since we add an sst with keys like I'll poke at the flush logic and maybe just let it exceed the size as needed until it gets to a split key and see if that changes things. |
Previously an sstable might end due to size at /table/i/rowX/col/Y, if some, but not all, families for rowX fit in that file. This is OK as far as KV and SQL are concerned, since after we add the next file which will start with rowX/colZ, the row is complete from the point of view of any scan. However it does mean that if, after adding this file we determine that we need to split before adding the next file, that split, as it must be at a row boundary, will be at rowX, not rowX/colZ. This too is OK, but has the slight downside of meaning that when we scatter the new RHS, starting at rowX, we have to move the colY family KV we just added in the prior prior file. While it is typically a trivial amount of data, it does make the RHS non-empty and thus require _some_ cost to move. This changes the size-based limit that triggers a file flush to wait for the next row boundary after the size is exceeded, so that SST bounds now also fall on row, and thus any future range split, bounds. This is particularly relevant in conjunction with cockroachdb#78218. Release note: none.
Okay, so I did a run of master, this change, my split-above branch #78523, this change rebased on that one, and another of this change on both split-above change and another tiny fix to round SST bounds up to row bounds in #79020. As seen earlier this change on its own is basically same as master -- there's always a key in the RHS make it pick LHS. So this change looks great, once IMPORT makes a could tweaks to get it empty RHSs more often. That said, the manual hinting version I ran last week was even better, closer to just one minute. I suspect this is mostly the split-above splits picking the big LHS, but I'm not sure. I'm happy to just go with this as-is, since it is a big win already, or go back to the caller-provided hint version. |
Don't see why we couldn't do both, if it's beneficial: provide a parameter to prefer the RHS, but if not set and a seek finds the RHS to be empty, then scan the RHS anyway. |
Yeah, I think we could (should) do both, but I'm also happy to say we land this PR as-is with its smaller footprint of not having an API change, get it and the other 22.1-bound stuff all backported, and then come back and benchmark the marginal difference that adding a manual hint buys to decide if it is worth it. |
Previously an sstable might end due to size at /table/i/rowX/col/Y, if some, but not all, families for rowX fit in that file. This is OK as far as KV and SQL are concerned, since after we add the next file which will start with rowX/colZ, the row is complete from the point of view of any scan. However it does mean that if, after adding this file we determine that we need to split before adding the next file, that split, as it must be at a row boundary, will be at rowX, not rowX/colZ. This too is OK, but has the slight downside of meaning that when we scatter the new RHS, starting at rowX, we have to move the colY family KV we just added in the prior prior file. While it is typically a trivial amount of data, it does make the RHS non-empty and thus require _some_ cost to move. This changes the size-based limit that triggers a file flush to wait for the next row boundary after the size is exceeded, so that SST bounds now also fall on row, and thus any future range split, bounds. This is particularly relevant in conjunction with cockroachdb#78218. Release note: none.
Is this good to merge? Are we okay with merging it now and then maybe adding the explicit hint in a follow-up if it looks like that'd be an even bigger win? |
79020: kv/bulk: chunk SSTs to row boundaries r=dt a=dt Previously an sstable might end due to size at /table/i/rowX/col/Y, if some, but not all, families for rowX fit in that file. This is OK as far as KV and SQL are concerned, since after we add the next file which will start with rowX/colZ, the row is complete from the point of view of any scan. However it does mean that if, after adding this file we determine that we need to split before adding the next file, that split, as it must be at a row boundary, will be at rowX, not rowX/colZ. This too is OK, but has the slight downside of meaning that when we scatter the new RHS, starting at rowX, we have to move the colY family KV we just added in the prior prior file. While it is typically a trivial amount of data, it does make the RHS non-empty and thus require _some_ cost to move. This changes the size-based limit that triggers a file flush to wait for the next row boundary after the size is exceeded, so that SST bounds now also fall on row, and thus any future range split, bounds. This is particularly relevant in conjunction with #78218. Release note: none. Co-authored-by: David Taylor <[email protected]>
Went back and re-ran TPC-C 50k import with this, just since I'd mostly been playing with TPC-E lately, and there too it is looking good: this PR + #78523, we see numbers like |
@tbg I've run this patch for many many roachprod hours and it looks very good and I'd like to get it into 22.1 along with other ingest patches; would you be content with merging it as-is for now and revisiting additional testing later (e.g. when we add an explicit side hint to the api) ? |
I'm fine merging as is if Nathan is fine with it. |
See conversation in https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1647551964065369. Bulk ingestion operations like IMPORT and index backfills have a fast-path for in-order ingestion where they periodically manually split and scatter the empty head of the keyspace being ingested into. In tests, we've seen that this split-and-scatter step can be expensive. This appears to be due in part to the stats recomputation we perform during range splits. Currently, this stats computation always scans the left hand side of the split. This is unfortunate for bulk-issued manual splits, because those manual splits are intentionally performed on the right border of the range, meaning that their left hand side contains the entire ~500MB range and their right hand side is empty. This commit extends the range split logic by adding a heuristic that chooses to scan the right side of the split first computing stats in cases where the right side is entirely empty. The "scan first" part is subtle, because there are cases where a split needs to scan both sides when computing stats. Specifically, it needs to do so in cases where the range has estimates in its MVCCStats. For an explanation, see `split_stats_helper.go`. It's not clear to me whether this commit is sufficient to help bulk ingestion or whether we'll need to do something about these stats estimates as well.
aa43bb4
to
1dd8808
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay here. I've added in a metamorphic constant that dictates the default choice for which side of the split to scan. That gives us plenty of test coverage to verify that stats are consistent on both halves of the split regardless of which side is scanned first. I debated reviving the plumbing that was in an earlier version of this PR (deb759c), but something feels wrong about adding knobs for the express purpose of being able to test that those knobs are respected. Without them, there's nothing in AdminSplit's external API to test.
TFTRs! Since David is waiting on this for #77157, I'll go ahead and merge.
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 947 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: the "or whether it contains at least one key" could be misconstrued as returning
true
if there is a key, but this is not the intention.
Done.
Build succeeded: |
See conversation in https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1647551964065369.
Part of #77157.
Bulk ingestion operations like IMPORT and index backfills have a
fast-path for in-order ingestion where they periodically manually split
and scatter the empty head of the keyspace being ingested into. In
tests, we've seen that this split-and-scatter step can be expensive.
This appears to be due in part to the stats recomputation we perform
during range splits.
Currently, this stats computation always scans the left hand side of the
split. This is unfortunate for bulk-issued manual splits, because those
manual splits are intentionally performed on the right border of the
range, meaning that their left hand side contains the entire ~500MB
range and their right hand side is empty.
This commit extends the range split logic by adding a heuristic that
chooses to scan the right side of the split first computing stats in
cases where the right side is entirely empty.
The "scan first" part is subtle, because there are cases where a split
needs to scan both sides when computing stats. Specifically, it needs to
do so in cases where the range has estimates in its MVCCStats. For an
explanation, see
split_stats_helper.go
. It's not clear to me whetherthis commit is sufficient to help bulk ingestion or whether we'll need
to do something about these stats estimates as well.