kv: scan empty right-hand side of split for stats #78218

nvanbenschoten · 2022-03-22T03:49:48Z

See conversation in https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1647551964065369.
Part of #77157.

Bulk ingestion operations like IMPORT and index backfills have a
fast-path for in-order ingestion where they periodically manually split
and scatter the empty head of the keyspace being ingested into. In
tests, we've seen that this split-and-scatter step can be expensive.
This appears to be due in part to the stats recomputation we perform
during range splits.

Currently, this stats computation always scans the left hand side of the
split. This is unfortunate for bulk-issued manual splits, because those
manual splits are intentionally performed on the right border of the
range, meaning that their left hand side contains the entire ~500MB
range and their right hand side is empty.

This commit extends the range split logic by adding a heuristic that
chooses to scan the right side of the split first when computing stats
in cases where the right side is entirely empty.

The "scan first" part is subtle, because there are cases where a split
needs to scan both sides when computing stats. Specifically, it needs to
do so in cases where the range has estimates in its MVCCStats. For an
explanation, see split_stats_helper.go. It's not clear to me whether
this commit is sufficient to help bulk ingestion or whether we'll need
to do something about these stats estimates as well.
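To make the "scan first" behavior concrete, here is a minimal sketch in Go. It is a simplified model and not the code in split_stats_helper.go: the `Stats` struct, `subtract`, `splitStats`, and the `scan` callback are all hypothetical names, and the real MVCCStats has many more fields. It only shows that scanning an empty right side lets the left side be derived by subtraction, unless the pre-split stats contain estimates, in which case both sides are scanned.

```go
package main

import "fmt"

// Stats is a simplified, hypothetical stand-in for MVCCStats.
type Stats struct {
	KeyCount          int64
	LiveBytes         int64
	ContainsEstimates bool
}

// subtract returns a - b field by field (the result carries no estimates).
func subtract(a, b Stats) Stats {
	return Stats{KeyCount: a.KeyCount - b.KeyCount, LiveBytes: a.LiveBytes - b.LiveBytes}
}

// splitStats computes stats for both halves of a split. scan is a
// hypothetical callback that scans one side ("left" or "right") and returns
// exact stats for it. The returned scans value counts how many sides were
// actually scanned.
func splitStats(total Stats, rhsEmpty bool, scan func(side string) Stats) (lhs, rhs Stats, scans int) {
	// Heuristic: scan the right side first only when it is known to be empty.
	first, second := "left", "right"
	if rhsEmpty {
		first, second = "right", "left"
	}
	firstStats := scan(first)
	scans = 1

	var secondStats Stats
	if total.ContainsEstimates {
		// The pre-split total cannot be trusted, so the other side
		// cannot be derived by subtraction and must be scanned too.
		secondStats = scan(second)
		scans = 2
	} else {
		secondStats = subtract(total, firstStats)
	}
	if first == "left" {
		return firstStats, secondStats, scans
	}
	return secondStats, firstStats, scans
}

func main() {
	total := Stats{KeyCount: 1000, LiveBytes: 500 << 20}
	scan := func(side string) Stats {
		fmt.Println("scanning", side)
		if side == "right" {
			return Stats{} // empty right-hand side
		}
		return total
	}
	lhs, rhs, scans := splitStats(total, true, scan)
	fmt.Println(lhs.KeyCount, rhs.KeyCount, scans)
}
```

In this model the expensive left-hand scan is skipped entirely when the right side is empty and the stats are exact, which is the case the bulk-issued splits are meant to hit.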

cockroach-teamcity · 2022-03-22T03:49:55Z

This change is

nvanbenschoten · 2022-03-23T03:53:46Z

I confirmed that IMPORT (e.g. cockroach workload fixtures import tpcc) does not create ranges with estimates in their stats, so IMPORT will benefit from this change once it's plumbed up into SSTBatcher.

dt

Nice!

Hooking this up in a TPC-E 50K trade table IMPORT shaved about 900-1200s off the 7100s import time 🎉

dt · 2022-03-24T21:34:56Z

One thing that would be pretty convenient to IMPORT, if I could tempt you into adding another commit to this, would be including these stats in the AdminSplitResponse, at least for the RHS, which seems like it shouldn't cost us anything since we're computing them anyway (I poked at this briefly but wasn't sure how to thread that back from the split trigger).

Even more nifty would be to additionally, if that RHS stats total is >0, grab an iterator and SeekGE to find the first key and send that in the response as well, e.g. first_key or something, so that the splitter knows how much is in the new range they made and where it starts.

erikgrinaker

Nice find. If it's often the case that one of the sides will be empty, maybe we could also just do a seek first to check?
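The "seek first" idea can be illustrated with a small Go sketch over an in-memory sorted key slice. This is an assumption-laden stand-in, not the storage engine API: `isSpanEmpty` is a hypothetical name, and a binary search via `sort.SearchStrings` plays the role of a single iterator seek.

```go
package main

import (
	"fmt"
	"sort"
)

// isSpanEmpty reports whether sortedKeys has no key in [start, end).
// It performs one binary-search "seek" to the first key >= start instead of
// scanning the span.
func isSpanEmpty(sortedKeys []string, start, end string) bool {
	i := sort.SearchStrings(sortedKeys, start) // index of first key >= start
	return i == len(sortedKeys) || sortedKeys[i] >= end
}

func main() {
	keys := []string{"a", "c", "j"}
	fmt.Println(isSpanEmpty(keys, "k", "z")) // nothing at or above "k"
	fmt.Println(isSpanEmpty(keys, "b", "d")) // "c" falls inside the span
}
```

The check costs one seek regardless of how much data sits on the other side of the split, which is why it is cheap enough to run before deciding which side to scan.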

dt · 2022-03-25T13:11:09Z

I wonder if we should flip the default?

Manual splits are usually either in the middle of a range, in which case it doesn't matter which side we pick, or, very often, are sent to carve out a new table span, index span, partition, etc., in which case it is much more likely that the RHS is empty.

nvanbenschoten · 2022-03-25T18:58:06Z

> One thing that would be pretty convenient to IMPORT, if I could tempt you into adding another commit to this, would be including these stats in the AdminSplitResponse, at least for the RHS, which seems like it shouldn't cost us anything since we're computing them anyway (I poked at this briefly but wasn't sure how to thread that back from the split trigger).
>
> Even more nifty would be to additionally, if that RHS stats total is >0, grab an iterator and SeekGE to find the first key and send that in the response as well, e.g. first_key or something, so that the splitter knows how much is in the new range they made and where it starts.

There's since been lots of discussion about this in https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1647551964065369.

> If it's often the case that one of the sides will be empty, maybe we could also just do a seek first to check?

Done. PTAL.

dt · 2022-03-25T19:10:00Z

Nice!

maybe in the future we re-introduce some form of caller hint as to which side is smaller for small-but-non-empty cases, but I suspect this version gets us >90% of the benefit without changing the signatures, so I'm very 👍 on it.

tbg

Nice, this was a smaller change to the arithmetic than I expected. Do we have good coverage of both cases? I don't see any changes to tests. I assume we'd want to force the heuristic in some of the tests that today split in the middle & verify the stats (I know such tests exist but don't remember the name). LGTM pending that kind of test coverage.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 947 at r1 (raw file):

// isGlobalKeyspaceEmpty returns whether the global keyspace of the provided
// range is entirely empty or whether it contains at least one key.

nit: the "or whether it contains at least one key" could be misconstrued as returning true if there is a key, but this is not the intention.

nvanbenschoten

We had explicit test coverage, but it was removed when I removed the caller hint and made this heuristic-based. It's now harder to control the choice of split directly from far away. We could add the plumbing back in as a testing knob, but that's pretty heavy-weight. What do you think of using a metamorphic constant to occasionally default to scanning the RHS in tests to improve coverage?

tbg · 2022-03-28T16:04:32Z

> What do you think of using a metamorphic constant to occasionally default to scanning the RHS in tests to improve coverage?

I can't form the most coherent argument around this yet, but it seems smelly to use a randomized constant for this. It's a straightforward code path, not a large number of options. The hint is either left or right and both need to be tested directly. This would be different if we had some sort of "heuristic cutoff" that would take a float64 or something like that, then we could randomize the threshold since it shouldn't make a difference anywhere. But even then we'd want to have a direct test for some hard-coded thresholds.
Is it so bad to plumb the knob? And aren't we anticipating letting the caller influence the heuristic anyway, at which point we'll do most of that plumbing anyway?

dt · 2022-03-29T16:44:08Z

I think we might actually want the caller-supplied flag version back?

I'm still poking around here, but one case that jumps out so far -- when a range has been "filled" from left to right, say from a to k, and we're about to split at j before continuing to fill at j, if we happen to know there's some existing data in the RHS above the span we're currently filling, at say s:

If we just split at j, the existence of s would cause us to count the generally bigger (since we just filled it) LHS data in [a, k). If we send a split to s first, as #78523 does, then that split will count [a, k), since it too has a non-empty RHS (by definition). Typically we observe counting [s, z) is cheaper than [a, k) since we're filling left to right at least at the flush level, so our caller would prefer to just hint to count the RHS -- indeed, potentially we could do even better on the calling side and track how much we've added to the LHS and know in some cases that it is more expensive to count than whatever could be in the RHS above s.

That said, that above-split case should be the minority of splits sent even in #78523, so I'd still expect to see significant benefit from auto-RHS, but in practice I'm seeing almost none, so something else must be up here. I'll keep digging.

So far on this run with logging for which side we scan, after the 100 initial splits mostly did take the rhs stats path, most of the growth since has been in the rhs=false case:

➜  cockroach git:(both) ✗ roachprod run david-both "grep 'scan right false' logs/cockroach.log | wc -l"
david-both: grep 'scan right false' log... 10/10
   1: 55
   2: 65
   3: 65
   4: 65
   5: 78
   6: 64
   7: 77
   8: 62
   9: 79
  10: 76
➜  cockroach git:(both) ✗ roachprod run david-both "grep 'scan right true' logs/cockroach.log | wc -l"
david-both: grep 'scan right true' logs... 10/10
   1: 6
   2: 95
   3: 10
   4: 4
   5: 8
   6: 11
   7: 10
   8: 8
   9: 5
  10: 21

I'm logging which key it was that existed, so maybe I can try to piece together why we're seeing this. It's a little tricky since I can't tell the above split from a pre-split at this point, so will need to sleuth a little to tell which of these are expected or not.
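For a quick read of the per-node counts above, the following Go snippet tallies them. The numbers are copied from the two `grep | wc -l` outputs; the `sum` helper is just for illustration.

```go
package main

import "fmt"

// sum adds up per-node log line counts.
func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

func main() {
	// Per-node counts of "scan right false" and "scan right true" log lines.
	scanRightFalse := []int{55, 65, 65, 65, 78, 64, 77, 62, 79, 76}
	scanRightTrue := []int{6, 95, 10, 4, 8, 11, 10, 8, 5, 21}

	f, t := sum(scanRightFalse), sum(scanRightTrue)
	fmt.Printf("scan right false: %d, scan right true: %d, RHS share: %.1f%%\n",
		f, t, 100*float64(t)/float64(f+t))
}
```

Across the ten nodes that is 686 splits that scanned the LHS against 178 that scanned the RHS, so only about one in five splits took the cheap path in this run.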

dt · 2022-03-30T01:16:37Z

Aha, I think I understand what's up. In a multi-column family table like TPC-E's trade where the first family is tiny, we almost never have a truly empty RHS, since we add an sst with keys like row1/c0, row1/c1, row2/c0 ... rowX/c0, then we decide it is too big to fit rowX/c1 so we stop and split before we add that next key. We've added rowX/c0 at this point, however that split can't be at rowX/c1 since we don't allow mid-row splits; instead it has to be all the way back at just rowX, so rowX/c0 is in the RHS, and it isn't empty.

I'll poke at the flush logic and maybe just let it exceed the size as needed until it gets to a split key and see if that changes things.
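The mid-row case described above can be reproduced with a toy model in Go. The key format ("rowN/cM"), `rowPrefix`, and `splitBeforeNext` are hypothetical simplifications of the real key encoding; the sketch only shows why a split that must land on a row boundary leaves the already-ingested first family of that row in the RHS.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// rowPrefix strips the column-family suffix from a toy key like "row3/c1".
func rowPrefix(key string) string {
	if i := strings.IndexByte(key, '/'); i >= 0 {
		return key[:i]
	}
	return key
}

// splitBeforeNext returns the split key used before ingesting nextKey, along
// with the already-ingested keys that end up in the RHS. Splits are not
// allowed mid-row, so the split key is the row prefix of nextKey.
func splitBeforeNext(ingested []string, nextKey string) (splitKey string, rhs []string) {
	splitKey = rowPrefix(nextKey)
	i := sort.SearchStrings(ingested, splitKey) // first ingested key >= splitKey
	return splitKey, ingested[i:]
}

func main() {
	// The SST ended due to size after row3/c0; row3/c1 did not fit.
	ingested := []string{"row1/c0", "row1/c1", "row2/c0", "row2/c1", "row3/c0"}
	splitKey, rhs := splitBeforeNext(ingested, "row3/c1")
	fmt.Println(splitKey, rhs)
}
```

Here the split lands at row3 rather than row3/c1, so row3/c0 is already in the RHS and an emptiness check on the RHS fails.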

Previously an sstable might end due to size at /table/i/rowX/col/Y, if some, but not all, families for rowX fit in that file. This is OK as far as KV and SQL are concerned, since after we add the next file, which will start with rowX/colZ, the row is complete from the point of view of any scan.

However, it does mean that if, after adding this file, we determine that we need to split before adding the next file, that split, as it must be at a row boundary, will be at rowX, not rowX/colZ. This too is OK, but has the slight downside of meaning that when we scatter the new RHS, starting at rowX, we have to move the colY family KV we just added in the prior file. While it is typically a trivial amount of data, it does make the RHS non-empty and thus require _some_ cost to move.

This changes the size-based limit that triggers a file flush to wait for the next row boundary after the size is exceeded, so that SST bounds now also fall on row bounds, and thus on the bounds of any future range split. This is particularly relevant in conjunction with cockroachdb#78218.

Release note: none.
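A minimal sketch of the flush rule described in this commit message, in Go. It is not the SSTBatcher code: `chunkAtRowBoundaries` and `rowPrefix` are hypothetical names, keys use a toy "rowN/cM" format, and a key count stands in for the byte-size limit. Once the limit is reached, the current file stays open until the next key starts a new row.

```go
package main

import (
	"fmt"
	"strings"
)

// rowPrefix strips the column-family suffix from a toy key like "row3/c1".
func rowPrefix(key string) string {
	if i := strings.IndexByte(key, '/'); i >= 0 {
		return key[:i]
	}
	return key
}

// chunkAtRowBoundaries groups sorted keys into files. A file is flushed only
// when it has reached maxKeys AND the next key begins a new row, so a file
// may exceed maxKeys in order to finish the row it is in.
func chunkAtRowBoundaries(keys []string, maxKeys int) [][]string {
	var files [][]string
	var cur []string
	for _, k := range keys {
		if len(cur) >= maxKeys && rowPrefix(k) != rowPrefix(cur[len(cur)-1]) {
			files = append(files, cur)
			cur = nil
		}
		cur = append(cur, k)
	}
	if len(cur) > 0 {
		files = append(files, cur)
	}
	return files
}

func main() {
	keys := []string{"row1/c0", "row1/c1", "row2/c0", "row2/c1", "row3/c0", "row3/c1"}
	for i, f := range chunkAtRowBoundaries(keys, 3) {
		fmt.Println(i, f)
	}
}
```

With a limit of three keys, the first file grows to four keys so that it ends after row2/c1, and the next file starts cleanly at row3. A split issued between the two files then has an empty RHS.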

dt · 2022-03-30T04:21:07Z

Okay, so I did a run of master, this change, my split-above branch #78523, this change rebased on that one, and another of this change on both split-above change and another tiny fix to round SST bounds up to row bounds in #79020.

As seen earlier, this change on its own is basically the same as master: there's always a key in the RHS, making it pick the LHS.
Split-above has some big advantages over master on its own, in balance and in not getting into runaway linear splits after every file due to a full RHS, but still spends 20min in splits. Rebasing this on split-above didn't change the split time; however, rebasing this on split-above plus #79020 looks great: total split time is down from 20ish minutes to 3-4, shaving 15min+ off of the 1h40m import.

So this change looks great, once IMPORT makes a couple of tweaks to get empty RHSs more often.

That said, the manual hinting version I ran last week was even better, closer to just one minute. I suspect this is mostly the split-above splits picking the big LHS, but I'm not sure. I'm happy to just go with this as-is, since it is a big win already, or go back to the caller-provided hint version.

erikgrinaker · 2022-03-30T09:31:29Z

> That said, the manual hinting version I ran last week was even better, closer to just one minute. I suspect this is mostly the split-above splits picking the big LHS, but I'm not sure. I'm happy to just go with this as-is, since it is a big win already, or go back to the caller-provided hint version.

Don't see why we couldn't do both, if it's beneficial: provide a parameter to prefer the RHS, but if not set and a seek finds the RHS to be empty, then scan the RHS anyway.
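The "do both" suggestion amounts to a small decision function. The sketch below is hypothetical: `Side`, `chooseScanSide`, and the `preferRHS` parameter are illustrative names, not part of the AdminSplit API. The emptiness check is passed as a function so that the seek only happens when the caller gave no hint.

```go
package main

import "fmt"

// Side identifies which half of the split is scanned for stats.
type Side int

const (
	ScanLeft Side = iota
	ScanRight
)

// chooseScanSide prefers the RHS when the caller hints at it. Without a hint
// it falls back to the heuristic: scan the RHS only if a seek finds it empty.
func chooseScanSide(preferRHS bool, rhsIsEmpty func() bool) Side {
	if preferRHS || rhsIsEmpty() {
		return ScanRight
	}
	return ScanLeft
}

func main() {
	empty := func() bool { return true }
	nonEmpty := func() bool { return false }

	fmt.Println(chooseScanSide(false, empty) == ScanRight)   // heuristic picks RHS
	fmt.Println(chooseScanSide(false, nonEmpty) == ScanLeft) // default stays LHS
	fmt.Println(chooseScanSide(true, nonEmpty) == ScanRight) // hint overrides
}
```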

dt · 2022-03-30T11:49:17Z

> Don't see why we couldn't do both

Yeah, I think we could (should) do both, but I'm also happy to say we land this PR as-is with its smaller footprint of not having an API change, get it and the other 22.1-bound stuff all backported, and then come back and benchmark the marginal difference that adding a manual hint buys to decide if it is worth it.

dt · 2022-03-30T17:29:53Z

Is this good to merge? Are we okay with merging it now and then maybe adding the explicit hint in a follow-up if it looks like that'd be an even bigger win?

79020: kv/bulk: chunk SSTs to row boundaries r=dt a=dt

Co-authored-by: David Taylor

dt · 2022-03-30T22:04:26Z

Went back and re-ran TPC-C 50k import with this, just since I'd mostly been playing with TPC-E lately, and there too it is looking good: this PR + #78523, we see numbers like 18m55s sending, 1m52s splitting where they used to be very close.

dt · 2022-03-31T18:23:47Z

@tbg I've run this patch for many many roachprod hours and it looks very good and I'd like to get it into 22.1 along with other ingest patches; would you be content with merging it as-is for now and revisiting additional testing later (e.g. when we add an explicit side hint to the api) ?

tbg · 2022-03-31T19:42:02Z

I'm fine merging as is if Nathan is fine with it.

nvanbenschoten

Sorry for the delay here. I've added in a metamorphic constant that dictates the default choice for which side of the split to scan. That gives us plenty of test coverage to verify that stats are consistent on both halves of the split regardless of which side is scanned first. I debated reviving the plumbing that was in an earlier version of this PR (deb759c), but something feels wrong about adding knobs for the express purpose of being able to test that those knobs are respected. Without them, there's nothing in AdminSplit's external API to test.
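As a rough illustration of the metamorphic approach, here is a self-contained Go sketch. It does not use CockroachDB's own metamorphic helpers: `metamorphicBool` and `splitCounts` are hypothetical, and key counts over a sorted slice stand in for MVCCStats. The point is that a test build can randomize the default side while the resulting stats must come out identical either way.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// metamorphicBool returns defaultValue outside of tests. In a test build it
// returns a pseudo-random choice so both code paths get exercised.
func metamorphicBool(testBuild bool, rng *rand.Rand, defaultValue bool) bool {
	if !testBuild {
		return defaultValue
	}
	return rng.Intn(2) == 0
}

// splitCounts returns the key counts left and right of splitKey. Only one
// side is "scanned"; the other is derived from the total.
func splitCounts(sortedKeys []string, splitKey string, scanRightFirst bool) (lhs, rhs int) {
	total := len(sortedKeys)
	i := sort.SearchStrings(sortedKeys, splitKey)
	if scanRightFirst {
		rhs = len(sortedKeys[i:]) // stand-in for scanning the RHS
		return total - rhs, rhs
	}
	lhs = len(sortedKeys[:i]) // stand-in for scanning the LHS
	return lhs, total - lhs
}

func main() {
	keys := []string{"a", "c", "j", "k", "s"}
	for seed := int64(0); seed < 4; seed++ {
		rng := rand.New(rand.NewSource(seed))
		scanRight := metamorphicBool(true, rng, false)
		lhs, rhs := splitCounts(keys, "j", scanRight)
		fmt.Println("scan right first:", scanRight, "lhs:", lhs, "rhs:", rhs)
	}
}
```

Whatever side the randomized default picks, existing stats-consistency tests would see the same result, which is how this style of constant adds coverage without a dedicated knob.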

TFTRs! Since David is waiting on this for #77157, I'll go ahead and merge.

bors r+

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 947 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: the "or whether it contains at least one key" could be misconstrued as returning true if there is a key, but this is not the intention.

Done.

craig · 2022-04-04T06:50:27Z

Build succeeded:

GitHub CI (Cockroach)

nvanbenschoten requested a review from dt March 22, 2022 03:49

nvanbenschoten force-pushed the nvanbenschoten/splitScatter branch from b5ccfdc to deb759c Compare March 23, 2022 03:52

nvanbenschoten requested a review from tbg March 23, 2022 03:52

nvanbenschoten marked this pull request as ready for review March 23, 2022 03:52

nvanbenschoten requested a review from a team as a code owner March 23, 2022 03:52

dt approved these changes Mar 24, 2022

erikgrinaker approved these changes Mar 25, 2022

nvanbenschoten force-pushed the nvanbenschoten/splitScatter branch from deb759c to 2db4c5d Compare March 25, 2022 18:56

nvanbenschoten force-pushed the nvanbenschoten/splitScatter branch from 2db4c5d to aa43bb4 Compare March 25, 2022 19:05

dt approved these changes Mar 25, 2022

nvanbenschoten changed the title ~~kv: allow manual range split to choose which side to scan for stats~~ kv: scan empty right-hand side of split for stats Mar 25, 2022

tbg approved these changes Mar 28, 2022

nvanbenschoten commented Mar 28, 2022

dt mentioned this pull request Mar 30, 2022

kv/bulk: chunk SSTs to row boundaries #79020

Merged

dt mentioned this pull request Apr 1, 2022

importccl: investigate potential import performance regression #77157

Closed

nvanbenschoten force-pushed the nvanbenschoten/splitScatter branch from aa43bb4 to 1dd8808 Compare April 4, 2022 05:03

nvanbenschoten added the backport-22.1.x label Apr 4, 2022

nvanbenschoten commented Apr 4, 2022

craig bot merged commit fa126a4 into cockroachdb:master Apr 4, 2022

blathers-crl bot mentioned this pull request Apr 4, 2022

release-22.1: kv: scan empty right-hand side of split for stats #79311

Merged

nvanbenschoten deleted the nvanbenschoten/splitScatter branch April 11, 2022 18:07
