Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: scan empty right-hand side of split for stats #78218

Merged

Conversation

nvanbenschoten
Copy link
Member

@nvanbenschoten nvanbenschoten commented Mar 22, 2022

See conversation in https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1647551964065369.
Part of #77157.

Bulk ingestion operations like IMPORT and index backfills have a
fast-path for in-order ingestion where they periodically manually split
and scatter the empty head of the keyspace being ingested into. In
tests, we've seen that this split-and-scatter step can be expensive.
This appears to be due in part to the stats recomputation we perform
during range splits.

Currently, this stats computation always scans the left hand side of the
split. This is unfortunate for bulk-issued manual splits, because those
manual splits are intentionally performed on the right border of the
range, meaning that their left hand side contains the entire ~500MB
range and their right hand side is empty.

This commit extends the range split logic by adding a heuristic that
chooses to scan the right side of the split first computing stats in
cases where the right side is entirely empty.

The "scan first" part is subtle, because there are cases where a split
needs to scan both sides when computing stats. Specifically, it needs to
do so in cases where the range has estimates in its MVCCStats. For an
explanation, see split_stats_helper.go. It's not clear to me whether
this commit is sufficient to help bulk ingestion or whether we'll need
to do something about these stats estimates as well.

@nvanbenschoten nvanbenschoten requested a review from dt March 22, 2022 03:49
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/splitScatter branch from b5ccfdc to deb759c Compare March 23, 2022 03:52
@nvanbenschoten nvanbenschoten requested a review from tbg March 23, 2022 03:52
@nvanbenschoten nvanbenschoten marked this pull request as ready for review March 23, 2022 03:52
@nvanbenschoten nvanbenschoten requested a review from a team as a code owner March 23, 2022 03:52
@nvanbenschoten
Copy link
Member Author

I confirmed that IMPORT (e.g. cockroach workload fixtures import tpcc) does not create ranges with estimates in their stats, so IMPORT will benefit from this change once it's plumbed up into SSTBatcher.

Copy link
Member

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Hooking this up in a TPC-E 50K trade table IMPORT shaved about 900-1200s off the 7100s import time 🎉

@dt
Copy link
Member

dt commented Mar 24, 2022

One thing that would be pretty convenient to IMPORT, if I could tempt you into adding another commit to this, would be including these stats in the AdminSplitResponse, at least for the RHS, which seems like it shouldn't cost us anything since we're computing them anyway (I poked at this briefly but wasn't sure how to thread that back from the split trigger).

Even more nifty would be to additionally, if that RHS stats total is >0, grab iter and seekge to find the first key and send that in the response as well, e.g. first_key or something, so that the splitter knows how much is in the new range they made and where it starts.

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find. If it's often the case that one of the sides will be empty, maybe we could also just do a seek first to check?

@dt
Copy link
Member

dt commented Mar 25, 2022

I wonder if we should flip the default?

like, manual splits are usually either in the middle of a range in which case, doesn't matter which we pick, or, very often, are sent to carve out a new table span, index span, partition, etc, in which case it is much more likely that the RHS is empty.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/splitScatter branch from deb759c to 2db4c5d Compare March 25, 2022 18:56
@nvanbenschoten
Copy link
Member Author

One thing that would be pretty convenient to IMPORT, if I could tempt you into adding another commit to this, would be including these stats in the AdminSplitResponse, at least for the RHS, which seems like it shouldn't cost us anything since we're computing them anyway (I poked at this briefly but wasn't sure how to thread that back from the split trigger).

Even more nifty would be to additionally, if that RHS stats total is >0, grab iter and seekge to find the first key and send that in the response as well, e.g. first_key or something, so that the splitter knows how much is in the new range they made and where it starts.

There's since been lots of discussion about this in https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1647551964065369.

If it's often the case that one of the sides will be empty, maybe we could also just do a seek first to check?

Done. PTAL.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/splitScatter branch from 2db4c5d to aa43bb4 Compare March 25, 2022 19:05
@dt
Copy link
Member

dt commented Mar 25, 2022

Nice!

maybe in the future we re-introduce some form of caller hint as to which side is smaller for small-but-non-empty cases, but I suspect this version gets us >90% of the benefit without changing the signatures, so I'm very 👍 on it.

@nvanbenschoten nvanbenschoten changed the title kv: allow manual range split to choose which side to scan for stats kv: scan empty right-hand side of split for stats Mar 25, 2022
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this was a smaller change to the arithmetic than I expected. Do we have good coverage of both cases? I don't see any changes to tests. I assume we'd want to force the heuristic in some of the tests that today split in the middle & verify the stats (I know such tests exists but don't remember the name). LGTM pending that kind of test coverage.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 947 at r1 (raw file):

// isGlobalKeyspaceEmpty returns whether the global keyspace of the provided
// range is entirely empty or whether it contains at least one key.

nit: the "or whether it contains at least one key" could be misconstrued as returning true if there is a key, but this is not the intention.

Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had explicit test coverage, but it was removed when I removed the caller hint and made this heuristic-based. It's now harder to control the choice of split directly from far away. We could add the plumbing back in as a testing knob, but that's pretty heavy-weight. What do you think of using a metamorphic constant to occasionally default to scanning the RHS in tests to improve coverage?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@tbg
Copy link
Member

tbg commented Mar 28, 2022

What do you think of using a metamorphic constant to occasionally default to scanning the RHS in tests to improve coverage?

I can't form the most coherent argument around this yet, but it seems smelly to use a randomized constant for this. It's a straightforward code path, not a large number of options. The hint is either left or right and both need to be tested directly. This would be different if we had some sort of "heuristic cutoff" that would take a float64 or something like that, then we could randomize the threshold since it shouldn't make a difference anywhere. But even then we'd want to have a direct test for some hard-coded thresholds.
Is it so bad to plumb the knob? And aren't we anticipating letting the caller influence the heuristic anyway, at which point we'll do most of that plumbing anyway?

@dt
Copy link
Member

dt commented Mar 29, 2022

I think we might actually want the caller-supplied flag version back?

I'm still poking around here, but one case that jumps out so far -- when a range has been "filled" from left to right, say from a to k, and we're about to split at j before continuing to fill at j, if we happen to know there's some existing data in the RHS above the span we're currently filling, at say s:

If we just split at j, the existence of s would cause us to count the generally bigger (since we just filled it) LHS data in [a, k). If we send a split to s first, as #78523 does, then that split will count [a, k), since it too has a non-empty RHS (by definition). Typically we observe counting [s, z) is cheaper than [a, k) since we're filling left to right at least at the flush level, so our caller would prefer to just hint to count the RHS -- indeed, potentially we could do even better on the calling side and track how much we've added to the LHS and know in some cases that it is more expensive to count than whatever could be in the RHS above s.

That said, that above-split case should be the minority of split sent even in #78523, so I'd still expect to see significant benefit from auto-RHS, but in practice I'm seeing almost none, so something else must be up here. I'll keep digging.

So far on this run with logging for which side we scan, after the 100 initial splits mostly did take the rhs stats path, most of the growth since has been in the rhs=false case:

➜  cockroach git:(both) ✗ roachprod run david-both "grep 'scan right false' logs/cockroach.log | wc -l"
david-both: grep 'scan right false' log... 10/10
   1: 55
   2: 65
   3: 65
   4: 65
   5: 78
   6: 64
   7: 77
   8: 62
   9: 79
  10: 76
➜  cockroach git:(both) ✗ roachprod run david-both "grep 'scan right true' logs/cockroach.log | wc -l"
david-both: grep 'scan right true' logs... 10/10
   1: 6
   2: 95
   3: 10
   4: 4
   5: 8
   6: 11
   7: 10
   8: 8
   9: 5
  10: 21

I'm logging which key it was that existed, so maybe I can try to piece together why we're seeing this. It's a little tricky since I can't tell the above split from a pre-split at this point, so will need to sleuth a little to tell which of these are expected or not.

@dt
Copy link
Member

dt commented Mar 30, 2022

Aha, I think I understand what's up. In a multi-column family table like TPC-E's trade where the first family is tiny, we almost never have a truly empty RHS, since we add an sst with keys like row1/c0, row1/c1, row2/c0 ... rowX/c0, then we decide it is too big to fit rowX/c1 so we stop and split before we add that next key. We've added rowX/c0 at this point, however that split can't be at rowX/c1 since we don't allow mid-row splits; instead it has to be all the way back at just rowX, so rowX/c0 is in the RHS, and it isn't empty.

I'll poke at the flush logic and maybe just let it exceed the size as needed until it gets to a split key and see if that changes things.

dt added a commit to dt/cockroach that referenced this pull request Mar 30, 2022
Previously an sstable might end due to size at /table/i/rowX/col/Y, if
some, but not all, families for rowX fit in that file. This is OK as far
as KV and SQL are concerned, since after we add the next file which will
start with rowX/colZ, the row is complete from the point of view of any
scan. However it does mean that if, after adding this file we determine
that we need to split before adding the next file, that split, as it
must be at a row boundary, will be at rowX, not rowX/colZ. This too is
OK, but has the slight downside of meaning that when we scatter the new
RHS, starting at rowX, we have to move the colY family KV we just added
in the prior prior file. While it is typically a trivial amount of data,
it does make the RHS non-empty and thus require _some_ cost to move.

This changes the size-based limit that triggers a file flush to wait for
the next row boundary after the size is exceeded, so that SST bounds now
also fall on row, and thus any future range split, bounds.

This is particularly relevant in conjunction with cockroachdb#78218.

Release note: none.
@dt
Copy link
Member

dt commented Mar 30, 2022

Okay, so I did a run of master, this change, my split-above branch #78523, this change rebased on that one, and another of this change on both split-above change and another tiny fix to round SST bounds up to row bounds in #79020.

As seen earlier this change on its own is basically same as master -- there's always a key in the RHS make it pick LHS.
Split-above has some big advantages over master on its own in balance and not getting into runaway linear splits after every file due to full RHS, but still spends 20min in splits. Rebasing this on split-above didn't change the split time, however rebasing this on split-above plus #79020 looks great: total split time is down from 20ish minutes to 3-4, shaving 15min+ off of the 1h40m import.

So this change looks great, once IMPORT makes a could tweaks to get it empty RHSs more often.

That said, the manual hinting version I ran last week was even better, closer to just one minute. I suspect this is mostly the split-above splits picking the big LHS, but I'm not sure. I'm happy to just go with this as-is, since it is a big win already, or go back to the caller-provided hint version.

@erikgrinaker
Copy link
Contributor

That said, the manual hinting version I ran last week was even better, closer to just one minute. I suspect this is mostly the split-above splits picking the big LHS, but I'm not sure. I'm happy to just go with this as-is, since it is a big win already, or go back to the caller-provided hint version.

Don't see why we couldn't do both, if it's beneficial: provide a parameter to prefer the RHS, but if not set and a seek finds the RHS to be empty, then scan the RHS anyway.

@dt
Copy link
Member

dt commented Mar 30, 2022

Don't see why we couldn't do both

Yeah, I think we could (should) do both, but I'm also happy to say we land this PR as-is with its smaller footprint of not having an API change, get it and the other 22.1-bound stuff all backported, and then come back and benchmark the marginal difference that adding a manual hint buys to decide if it is worth it.

dt added a commit to dt/cockroach that referenced this pull request Mar 30, 2022
Previously an sstable might end due to size at /table/i/rowX/col/Y, if
some, but not all, families for rowX fit in that file. This is OK as far
as KV and SQL are concerned, since after we add the next file which will
start with rowX/colZ, the row is complete from the point of view of any
scan. However it does mean that if, after adding this file we determine
that we need to split before adding the next file, that split, as it
must be at a row boundary, will be at rowX, not rowX/colZ. This too is
OK, but has the slight downside of meaning that when we scatter the new
RHS, starting at rowX, we have to move the colY family KV we just added
in the prior prior file. While it is typically a trivial amount of data,
it does make the RHS non-empty and thus require _some_ cost to move.

This changes the size-based limit that triggers a file flush to wait for
the next row boundary after the size is exceeded, so that SST bounds now
also fall on row, and thus any future range split, bounds.

This is particularly relevant in conjunction with cockroachdb#78218.

Release note: none.
@dt
Copy link
Member

dt commented Mar 30, 2022

Is this good to merge? Are we okay with merging it now and then maybe adding the explicit hint in a follow-up if it looks like that'd be an even bigger win?

craig bot pushed a commit that referenced this pull request Mar 30, 2022
79020: kv/bulk: chunk SSTs to row boundaries r=dt a=dt

Previously an sstable might end due to size at /table/i/rowX/col/Y, if
some, but not all, families for rowX fit in that file. This is OK as far
as KV and SQL are concerned, since after we add the next file which will
start with rowX/colZ, the row is complete from the point of view of any
scan. However it does mean that if, after adding this file we determine
that we need to split before adding the next file, that split, as it
must be at a row boundary, will be at rowX, not rowX/colZ. This too is
OK, but has the slight downside of meaning that when we scatter the new
RHS, starting at rowX, we have to move the colY family KV we just added
in the prior prior file. While it is typically a trivial amount of data,
it does make the RHS non-empty and thus require _some_ cost to move.

This changes the size-based limit that triggers a file flush to wait for
the next row boundary after the size is exceeded, so that SST bounds now
also fall on row, and thus any future range split, bounds.

This is particularly relevant in conjunction with #78218.

Release note: none.

Co-authored-by: David Taylor <[email protected]>
@dt
Copy link
Member

dt commented Mar 30, 2022

Went back and re-ran TPC-C 50k import with this, just since I'd mostly been playing with TPC-E lately, and there too it is looking good: this PR + #78523, we see numbers like 18m55s sending, 1m52s splitting where they used to be very close.

@dt
Copy link
Member

dt commented Mar 31, 2022

@tbg I've run this patch for many many roachprod hours and it looks very good and I'd like to get it into 22.1 along with other ingest patches; would you be content with merging it as-is for now and revisiting additional testing later (e.g. when we add an explicit side hint to the api) ?

@tbg
Copy link
Member

tbg commented Mar 31, 2022

I'm fine merging as is if Nathan is fine with it.

See conversation in https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1647551964065369.

Bulk ingestion operations like IMPORT and index backfills have a
fast-path for in-order ingestion where they periodically manually split
and scatter the empty head of the keyspace being ingested into. In
tests, we've seen that this split-and-scatter step can be expensive.
This appears to be due in part to the stats recomputation we perform
during range splits.

Currently, this stats computation always scans the left hand side of the
split. This is unfortunate for bulk-issued manual splits, because those
manual splits are intentionally performed on the right border of the
range, meaning that their left hand side contains the entire ~500MB
range and their right hand side is empty.

This commit extends the range split logic by adding a heuristic that
chooses to scan the right side of the split first computing stats in
cases where the right side is entirely empty.

The "scan first" part is subtle, because there are cases where a split
needs to scan both sides when computing stats. Specifically, it needs to
do so in cases where the range has estimates in its MVCCStats. For an
explanation, see `split_stats_helper.go`. It's not clear to me whether
this commit is sufficient to help bulk ingestion or whether we'll need
to do something about these stats estimates as well.
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay here. I've added in a metamorphic constant that dictates the default choice for which side of the split to scan. That gives us plenty of test coverage to verify that stats are consistent on both halves of the split regardless of which side is scanned first. I debated reviving the plumbing that was in an earlier version of this PR (deb759c), but something feels wrong about adding knobs for the express purpose of being able to test that those knobs are respected. Without them, there's nothing in AdminSplit's external API to test.

TFTRs! Since David is waiting on this for #77157, I'll go ahead and merge.

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 947 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: the "or whether it contains at least one key" could be misconstrued as returning true if there is a key, but this is not the intention.

Done.

@craig
Copy link
Contributor

craig bot commented Apr 4, 2022

Build succeeded:

@craig craig bot merged commit fa126a4 into cockroachdb:master Apr 4, 2022
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/splitScatter branch April 11, 2022 18:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants