Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: add range key support for SST iterators and writers #82799

Merged
merged 2 commits into from
Jun 24, 2022

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Jun 12, 2022

This patch adds NewPebbleSSTIterator(), which constructs an MVCC
iterator for SSTs. Unlike the existing sstIterator, it supports range
keys and merged iteration across multiple SSTs. It is based on a new
pebble.NewExternalIter() API for SST iteration, and reuses
pebbleIterator for the CRDB logic. Value verification has been split
out to a separate, general VerifyingMVCCIterator.

This iterator currently has significantly worse performance than the
existing SST iterator (as much as 100%), so it is not used yet outside
of tests. This will be optimized later.

The patch also adds support for writing range keys in SSTWriter. This
is only enabled when the SST format is TableFormatPebblev2 or newer,
which is only the case once the cluster version reaches
EnablePebbleFormatVersionRangeKeys.

Resolves #82586.

Release note: None

@erikgrinaker erikgrinaker requested review from nicktrav, jbowens, aliher1911 and a team June 12, 2022 18:30
@erikgrinaker erikgrinaker requested a review from a team as a code owner June 12, 2022 18:30
@erikgrinaker erikgrinaker self-assigned this Jun 12, 2022
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-sst-iter branch 2 times, most recently from 12a7f24 to 5984d9f Compare June 13, 2022 13:12
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-sst-iter branch from 5984d9f to 532d334 Compare June 15, 2022 13:20
@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Jun 15, 2022

Looks like TestStoreRangeMergeRaftSnapshot is failing here, because ClearRangeWithHeuristic unconditionally drops a Pebble range tombstones across the range key span, thus generating additional SSTs when clearing range data.

This is unnecessary, and should be avoided. However, it will require exposing range keys via EngineIterator first, see #82935.

Copy link
Collaborator

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still reading - flushing some comments.

Reviewed 4 of 11 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @jbowens)


pkg/storage/sst_writer.go line 172 at r2 (raw file):

		return errors.Wrapf(err, "failed to encode MVCC value for range key %s", rangeKey)
	}
	fw.DataSize += int64(len(rangeKey.StartKey)) + int64(len(rangeKey.EndKey)) + int64(len(valueRaw))

We don't account for the timestamp in the size? I see there's no change from the existing functionality - just curious. Is it more like "user data", and doesn't care about the on-disk size?


pkg/storage/sst_writer.go line 183 at r2 (raw file):

func (fw *SSTWriter) ExperimentalClearMVCCRangeKey(rangeKey MVCCRangeKey) error {
	if !fw.supportsRangeKeys {
		return nil // noop

I assume this is to avoid having litter the client code with if supportsRangeKey checks? Instead we'll unconditionally call this method and handle the switching in the one place.

Given we're not panic-ing, is there any risk of subtle errors creeping in here - say some code was expecting the table to have range keys, but it didn't, so we just chugged along. Maybe that's ok.

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @nicktrav)


pkg/storage/sst_writer.go line 172 at r2 (raw file):

Previously, nicktrav (Nick Travers) wrote…

We don't account for the timestamp in the size? I see there's no change from the existing functionality - just curious. Is it more like "user data", and doesn't care about the on-disk size?

Doesn't seem like it, no -- that bugged me a bit too. But I didn't want to get sidetracked by potentially breaking API changes and stuff here, so decided to just go with the status quo. We can address that separately if need be.


pkg/storage/sst_writer.go line 183 at r2 (raw file):

Previously, nicktrav (Nick Travers) wrote…

I assume this is to avoid having litter the client code with if supportsRangeKey checks? Instead we'll unconditionally call this method and handle the switching in the one place.

Given we're not panic-ing, is there any risk of subtle errors creeping in here - say some code was expecting the table to have range keys, but it didn't, so we just chugged along. Maybe that's ok.

It's because the code often doesn't have access to e.g. settings to make this determination, and doesn't really care either.

Usually this is called via some other method, e.g. ClearRawRange(), which then calls through to ExperimentalClearMVCCRangeKey(). The original caller doesn't care whether we drop a tombstone or not, it only cares that we delete everything in the span -- if we don't support range keys, then there cannot be any range keys there to clear, so we can omit the tombstone and everyone's happy. However, we need ClearRawRange to work both before and after range keys are enabled.

Note that PutMVCCRangeKey is a different story -- in that case the original caller definitely needs to care about whether range keys are supported, and we'll error otherwise. But not clearing range keys when there cannot exist any range keys to clear gives the intended outcome anyway.

That said, this all makes sense in the Engine case (where this code was lifted from), since we have global synchronization via the local Pebble instance. The SSTWriter is subtly different, in that it may actually matter whether we drop a RangeKeyDelete into an SST that's then ingested on a different node (who may not have had its version gate flipped yet). This is controlled by EnablePebbleFormatVersionRangeKeys, which does a migration to make sure everyone supports these SSTs.

I've looked over the uses of MakeIngestionSSTWriter, and it seems like in the places that matter (e.g. when applying Raft snapshots and during range merges and such) we're always building and applying the SSTs locally under an exclusive lock. So I don't think there can be a race there where we start building the SST, then the version gate flips, someone writes a range key, and we fail to clear it. It's worth confirming this, but I don't think we need to block this PR on it, and it's unclear how it'd affect the SSTWriter in any case -- it'd have to rely on external synchronization, I think.

Am I missing anything?

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-sst-iter branch from 532d334 to bb295ac Compare June 16, 2022 13:43
Copy link
Collaborator

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flushing again - I'm picking through the test cases slowly.

Reviewed 3 of 11 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @jbowens)


pkg/storage/sst_writer.go line 172 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Doesn't seem like it, no -- that bugged me a bit too. But I didn't want to get sidetracked by potentially breaking API changes and stuff here, so decided to just go with the status quo. We can address that separately if need be.

sgtm


pkg/storage/sst_writer.go line 183 at r2 (raw file):
Thanks for the detail.

Am I missing anything?

I don't think so. I think this is fine.


pkg/storage/testdata/mvcc_histories/sst_iter line 5 at r2 (raw file):

# separately.
#
# TODO(erikgrinaker): Add overlapping range keys when it doesn't panic.

I assume this is a Cockroach-level limitation? The pebble writer should already support this.


pkg/storage/testdata/mvcc_histories/sst_iter line 12 at r2 (raw file):

sst_put k=b ts=1 v=b1
sst_put k=c ts=2
sst_put_rangekey k=b end=f ts=4

Worth interleaving the rangekeys and point keys a bit to highlight the fact that they don't necessarily need to be added after all the point keys have been added? i.e. this range key could be the first key added and it should still produce the same output.


pkg/storage/testdata/mvcc_histories/sst_iter_multi line 16 at r2 (raw file):

# 4. Write a new range key that partially replaces one in 3 with different localTs.
# 5. Clear the [h-j) span.
#

Worth mentioning that the output of this test command has the tables in reverse order?


pkg/storage/testdata/mvcc_histories/sst_iter_multi line 41 at r2 (raw file):

----
>> at end:
>> sst-0:

Should we have some way of representing a range key del / range del?


pkg/storage/testdata/mvcc_histories/sst_writer line 37 at r2 (raw file):

# Writing the same key multiple times uses the first key.
#
# TODO(erikgrinaker): Is this expected?

This surprised me too - but looking at the raw sstable that's written, I see the two keys.

Copy link
Collaborator

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Nothing major other than the few questions / observations about the test cases.

:lgtm:

Reviewed 4 of 11 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @jbowens)


pkg/storage/testdata/mvcc_histories/sst_writer line 66 at r2 (raw file):

# Clearing a span does not affect SST contents.
#
# TODO(erikgrinaker): Is this correct?

I left a comment elsewhere - it would be nice to see the range dels / range key dels in the output, given there are physical keys in the sstable. Maybe something like a "raw iteration mode", that uses just a boring old sstable.Reader for these sst tests?

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @nicktrav)


pkg/storage/testdata/mvcc_histories/sst_iter line 5 at r2 (raw file):

Previously, nicktrav (Nick Travers) wrote…

I assume this is a Cockroach-level limitation? The pebble writer should already support this.

No, it's cockroachdb/pebble#1760


pkg/storage/testdata/mvcc_histories/sst_iter line 12 at r2 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Worth interleaving the rangekeys and point keys a bit to highlight the fact that they don't necessarily need to be added after all the point keys have been added? i.e. this range key could be the first key added and it should still produce the same output.

Yeah, may as well.


pkg/storage/testdata/mvcc_histories/sst_iter_multi line 16 at r2 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Worth mentioning that the output of this test command has the tables in reverse order?

Yeah, maybe it's better to keep them oldest-to-newest, as they're written. This shows them in the order they're fed to NewExternalIter, and earlier SSTs take precedence over latter SSTs, so we have to reverse them to get the expected order where later writes replace earlier writes -- but I think we should optimize for test case understandability here, even though the API may use them differently.


pkg/storage/testdata/mvcc_histories/sst_writer line 37 at r2 (raw file):

Previously, nicktrav (Nick Travers) wrote…

This surprised me too - but looking at the raw sstable that's written, I see the two keys.

I think it's just because it uses the first key it finds while iterating.


pkg/storage/testdata/mvcc_histories/sst_writer line 66 at r2 (raw file):

Previously, nicktrav (Nick Travers) wrote…

I left a comment elsewhere - it would be nice to see the range dels / range key dels in the output, given there are physical keys in the sstable. Maybe something like a "raw iteration mode", that uses just a boring old sstable.Reader for these sst tests?

Probably a good idea. I figured that'd be an implementation detail at this level, but would be useful to dump them.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-sst-iter branch from bb295ac to 5658912 Compare June 16, 2022 19:47
Copy link
Collaborator

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911, @jbowens, and @nicktrav)

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-sst-iter branch from 5658912 to 66bbb6e Compare June 17, 2022 13:28
@erikgrinaker erikgrinaker requested a review from a team as a code owner June 17, 2022 13:28
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-sst-iter branch 2 times, most recently from 63ecb99 to 760d847 Compare June 17, 2022 14:52
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like TestStoreRangeMergeRaftSnapshot is failing here, because ClearRangeWithHeuristic unconditionally drops a Pebble range tombstones across the range key span, thus generating additional SSTs when clearing range data.

This is unnecessary, and should be avoided. However, it will require exposing range keys via EngineIterator first, see #82935.

Had to pull in #83031 for this, sorry about the noise. For now, I've made a quick hack in ClearRangeWithHeuristic to get tests to pass, as a separate commit. Wrote up #83032 to revisit the range clearing APIs and heuristics.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911, @jbowens, and @nicktrav)


pkg/storage/testdata/mvcc_histories/sst_writer line 37 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I think it's just because it uses the first key it finds while iterating.

Can you confirm that this behavior makes sense? I think this might be what makes a rangedel ineffectual as well, since it sees the SET before the RANGEDEL, or some such. Have a look at this case:

https://github.com/cockroachdb/cockroach/blob/760d847dfcff1de66e91347219e334b086d79f48/pkg/storage/testdata/mvcc_histories/sst_writer#L80-L114


pkg/storage/testdata/mvcc_histories/sst_writer line 66 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Probably a good idea. I figured that'd be an implementation detail at this level, but would be useful to dump them.

Done. This is much better, good idea.

Copy link
Collaborator

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did another pass, and addressed the open comments. :lgtm:

Reviewed 2 of 8 files at r6, 3 of 6 files at r7, 1 of 1 files at r8, 11 of 11 files at r9, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @jbowens)


pkg/storage/testdata/mvcc_histories/sst_writer line 37 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Can you confirm that this behavior makes sense? I think this might be what makes a rangedel ineffectual as well, since it sees the SET before the RANGEDEL, or some such. Have a look at this case:

https://github.com/cockroachdb/cockroach/blob/760d847dfcff1de66e91347219e334b086d79f48/pkg/storage/testdata/mvcc_histories/sst_writer#L80-L114

Range dels with sequence number m only apply to keys with seqnum n if n < m (see here). As all the keys in the table are written out with the same sequence number, the range del wont apply.

That's also why we see the rangekeyset and the rangekeydel next to each other over the span c-e (see here).

In the case you write out a new table with just the sst_clear_range and have that shadow the first table, it should take effect. I think you already have coverage for this in sst_iter_multi, but I guess you could just put another sst_finish after the sst_clear_range to put that in a standalone table?


pkg/storage/testdata/mvcc_histories/sst_writer line 50 at r9 (raw file):

iter_scan: "a"/1.000000000,0=/BYTES/a1 {a-c}/[3.000000000,0={localTs=2.000000000,0}/<empty>]
iter_scan: .
>> at end:

Super nitpicky, but I was expecting to see the raw sstable output first, then followed by the scan output, given we do sst_finish, followed by sst_iter_new. Maybe there's something subtle in the test harness that prevents that, so not a big issue.

This patch adds a quick hack to `ClearRangeWithHeuristic` to avoid
dropping a Pebble range tombstone across the range key span if there are
no range keys.

Similar heuristics exist in `Pebble.ExperimentalClearAllRangeKeys`, but
we may be writing to an `SSTWriter` here which can't have these
heuristics. This scheme needs to be revamped, but for now we add this
quick hack to pass tests and unblock `SSTWriter` range key support.

Release note: None
This patch adds `NewPebbleSSTIterator()`, which constructs an MVCC
iterator for SSTs. Unlike the existing `sstIterator`, it supports range
keys and merged iteration across multiple SSTs. It is based on a new
`pebble.NewExternalIter()` API for SST iteration, and reuses
`pebbleIterator` for the CRDB logic. Value verification has been split
out to a separate, general `VerifyingMVCCIterator`.

This iterator currently has significantly worse performance than the
existing SST iterator (as much as 100%), so it is not used yet outside
of tests. This will be optimized later.

The patch also adds support for writing range keys in `SSTWriter`. This
is only enabled when the SST format is `TableFormatPebblev2` or newer,
which is only the case once the cluster version reaches
`EnablePebbleFormatVersionRangeKeys`.

Release note: None
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-sst-iter branch from c970e69 to cf162f9 Compare June 24, 2022 11:02
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911, @jbowens, and @nicktrav)


pkg/storage/testdata/mvcc_histories/sst_writer line 37 at r2 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Range dels with sequence number m only apply to keys with seqnum n if n < m (see here). As all the keys in the table are written out with the same sequence number, the range del wont apply.

That's also why we see the rangekeyset and the rangekeydel next to each other over the span c-e (see here).

In the case you write out a new table with just the sst_clear_range and have that shadow the first table, it should take effect. I think you already have coverage for this in sst_iter_multi, but I guess you could just put another sst_finish after the sst_clear_range to put that in a standalone table?

Thanks, that makes sense. Added test cases both for same SST and different SST.


pkg/storage/testdata/mvcc_histories/sst_writer line 50 at r9 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Super nitpicky, but I was expecting to see the raw sstable output first, then followed by the scan output, given we do sst_finish, followed by sst_iter_new. Maybe there's something subtle in the test harness that prevents that, so not a big issue.

This is an artifact of the test harness -- it outputs the persisted state at the end of each test, after commands have been run. It can output the intermediate states after each mutation by using run trace, but that quickly gets noisy.

Split the test into two instead: one which sets up the data, and one which iterates.

@erikgrinaker
Copy link
Contributor Author

TFTR!

bors r=nicktrav

@craig
Copy link
Contributor

craig bot commented Jun 24, 2022

Build succeeded:

@craig craig bot merged commit 851e329 into cockroachdb:master Jun 24, 2022
@erikgrinaker erikgrinaker deleted the mvcc-range-tombstones-sst-iter branch June 24, 2022 16:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

storage: support range keys in SSTWriter and sstIterator
3 participants