-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: add range key support for SST iterators and writers #82799
storage: add range key support for SST iterators and writers #82799
Conversation
12a7f24
to
5984d9f
Compare
5984d9f
to
532d334
Compare
Looks like This is unnecessary, and should be avoided. However, it will require exposing range keys via |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still reading - flushing some comments.
Reviewed 4 of 11 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @jbowens)
pkg/storage/sst_writer.go
line 172 at r2 (raw file):
return errors.Wrapf(err, "failed to encode MVCC value for range key %s", rangeKey) } fw.DataSize += int64(len(rangeKey.StartKey)) + int64(len(rangeKey.EndKey)) + int64(len(valueRaw))
We don't account for the timestamp in the size? I see there's no change from the existing functionality - just curious. Is it more like "user data", and doesn't care about the on-disk size?
pkg/storage/sst_writer.go
line 183 at r2 (raw file):
func (fw *SSTWriter) ExperimentalClearMVCCRangeKey(rangeKey MVCCRangeKey) error { if !fw.supportsRangeKeys { return nil // noop
I assume this is to avoid having litter the client code with if supportsRangeKey
checks? Instead we'll unconditionally call this method and handle the switching in the one place.
Given we're not panic-ing, is there any risk of subtle errors creeping in here - say some code was expecting the table to have range keys, but it didn't, so we just chugged along. Maybe that's ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @nicktrav)
pkg/storage/sst_writer.go
line 172 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
We don't account for the timestamp in the size? I see there's no change from the existing functionality - just curious. Is it more like "user data", and doesn't care about the on-disk size?
Doesn't seem like it, no -- that bugged me a bit too. But I didn't want to get sidetracked by potentially breaking API changes and stuff here, so decided to just go with the status quo. We can address that separately if need be.
pkg/storage/sst_writer.go
line 183 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
I assume this is to avoid having litter the client code with
if supportsRangeKey
checks? Instead we'll unconditionally call this method and handle the switching in the one place.Given we're not panic-ing, is there any risk of subtle errors creeping in here - say some code was expecting the table to have range keys, but it didn't, so we just chugged along. Maybe that's ok.
It's because the code often doesn't have access to e.g. settings to make this determination, and doesn't really care either.
Usually this is called via some other method, e.g. ClearRawRange()
, which then calls through to ExperimentalClearMVCCRangeKey()
. The original caller doesn't care whether we drop a tombstone or not, it only cares that we delete everything in the span -- if we don't support range keys, then there cannot be any range keys there to clear, so we can omit the tombstone and everyone's happy. However, we need ClearRawRange
to work both before and after range keys are enabled.
Note that PutMVCCRangeKey
is a different story -- in that case the original caller definitely needs to care about whether range keys are supported, and we'll error otherwise. But not clearing range keys when there cannot exist any range keys to clear gives the intended outcome anyway.
That said, this all makes sense in the Engine
case (where this code was lifted from), since we have global synchronization via the local Pebble instance. The SSTWriter
is subtly different, in that it may actually matter whether we drop a RangeKeyDelete
into an SST that's then ingested on a different node (who may not have had its version gate flipped yet). This is controlled by EnablePebbleFormatVersionRangeKeys
, which does a migration to make sure everyone supports these SSTs.
I've looked over the uses of MakeIngestionSSTWriter
, and it seems like in the places that matter (e.g. when applying Raft snapshots and during range merges and such) we're always building and applying the SSTs locally under an exclusive lock. So I don't think there can be a race there where we start building the SST, then the version gate flips, someone writes a range key, and we fail to clear it. It's worth confirming this, but I don't think we need to block this PR on it, and it's unclear how it'd affect the SSTWriter
in any case -- it'd have to rely on external synchronization, I think.
Am I missing anything?
532d334
to
bb295ac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flushing again - I'm picking through the test cases slowly.
Reviewed 3 of 11 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @jbowens)
pkg/storage/sst_writer.go
line 172 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Doesn't seem like it, no -- that bugged me a bit too. But I didn't want to get sidetracked by potentially breaking API changes and stuff here, so decided to just go with the status quo. We can address that separately if need be.
sgtm
pkg/storage/sst_writer.go
line 183 at r2 (raw file):
Thanks for the detail.
Am I missing anything?
I don't think so. I think this is fine.
pkg/storage/testdata/mvcc_histories/sst_iter
line 5 at r2 (raw file):
# separately. # # TODO(erikgrinaker): Add overlapping range keys when it doesn't panic.
I assume this is a Cockroach-level limitation? The pebble writer should already support this.
pkg/storage/testdata/mvcc_histories/sst_iter
line 12 at r2 (raw file):
sst_put k=b ts=1 v=b1 sst_put k=c ts=2 sst_put_rangekey k=b end=f ts=4
Worth interleaving the rangekeys and point keys a bit to highlight the fact that they don't necessarily need to be added after all the point keys have been added? i.e. this range key could be the first key added and it should still produce the same output.
pkg/storage/testdata/mvcc_histories/sst_iter_multi
line 16 at r2 (raw file):
# 4. Write a new range key that partially replaces one in 3 with different localTs. # 5. Clear the [h-j) span. #
Worth mentioning that the output of this test command has the tables in reverse order?
pkg/storage/testdata/mvcc_histories/sst_iter_multi
line 41 at r2 (raw file):
---- >> at end: >> sst-0:
Should we have some way of representing a range key del / range del?
pkg/storage/testdata/mvcc_histories/sst_writer
line 37 at r2 (raw file):
# Writing the same key multiple times uses the first key. # # TODO(erikgrinaker): Is this expected?
This surprised me too - but looking at the raw sstable that's written, I see the two keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Nothing major other than the few questions / observations about the test cases.
Reviewed 4 of 11 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @jbowens)
pkg/storage/testdata/mvcc_histories/sst_writer
line 66 at r2 (raw file):
# Clearing a span does not affect SST contents. # # TODO(erikgrinaker): Is this correct?
I left a comment elsewhere - it would be nice to see the range dels / range key dels in the output, given there are physical keys in the sstable. Maybe something like a "raw iteration mode", that uses just a boring old sstable.Reader
for these sst tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @nicktrav)
pkg/storage/testdata/mvcc_histories/sst_iter
line 5 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
I assume this is a Cockroach-level limitation? The pebble writer should already support this.
No, it's cockroachdb/pebble#1760
pkg/storage/testdata/mvcc_histories/sst_iter
line 12 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Worth interleaving the rangekeys and point keys a bit to highlight the fact that they don't necessarily need to be added after all the point keys have been added? i.e. this range key could be the first key added and it should still produce the same output.
Yeah, may as well.
pkg/storage/testdata/mvcc_histories/sst_iter_multi
line 16 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Worth mentioning that the output of this test command has the tables in reverse order?
Yeah, maybe it's better to keep them oldest-to-newest, as they're written. This shows them in the order they're fed to NewExternalIter
, and earlier SSTs take precedence over latter SSTs, so we have to reverse them to get the expected order where later writes replace earlier writes -- but I think we should optimize for test case understandability here, even though the API may use them differently.
pkg/storage/testdata/mvcc_histories/sst_writer
line 37 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
This surprised me too - but looking at the raw sstable that's written, I see the two keys.
I think it's just because it uses the first key it finds while iterating.
pkg/storage/testdata/mvcc_histories/sst_writer
line 66 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
I left a comment elsewhere - it would be nice to see the range dels / range key dels in the output, given there are physical keys in the sstable. Maybe something like a "raw iteration mode", that uses just a boring old
sstable.Reader
for these sst tests?
Probably a good idea. I figured that'd be an implementation detail at this level, but would be useful to dump them.
bb295ac
to
5658912
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911, @jbowens, and @nicktrav)
5658912
to
66bbb6e
Compare
63ecb99
to
760d847
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like TestStoreRangeMergeRaftSnapshot is failing here, because ClearRangeWithHeuristic unconditionally drops a Pebble range tombstones across the range key span, thus generating additional SSTs when clearing range data.
This is unnecessary, and should be avoided. However, it will require exposing range keys via EngineIterator first, see #82935.
Had to pull in #83031 for this, sorry about the noise. For now, I've made a quick hack in ClearRangeWithHeuristic
to get tests to pass, as a separate commit. Wrote up #83032 to revisit the range clearing APIs and heuristics.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911, @jbowens, and @nicktrav)
pkg/storage/testdata/mvcc_histories/sst_writer
line 37 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I think it's just because it uses the first key it finds while iterating.
Can you confirm that this behavior makes sense? I think this might be what makes a rangedel ineffectual as well, since it sees the SET before the RANGEDEL, or some such. Have a look at this case:
pkg/storage/testdata/mvcc_histories/sst_writer
line 66 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Probably a good idea. I figured that'd be an implementation detail at this level, but would be useful to dump them.
Done. This is much better, good idea.
760d847
to
c970e69
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did another pass, and addressed the open comments.
Reviewed 2 of 8 files at r6, 3 of 6 files at r7, 1 of 1 files at r8, 11 of 11 files at r9, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @jbowens)
pkg/storage/testdata/mvcc_histories/sst_writer
line 37 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Can you confirm that this behavior makes sense? I think this might be what makes a rangedel ineffectual as well, since it sees the SET before the RANGEDEL, or some such. Have a look at this case:
Range dels with sequence number m
only apply to keys with seqnum n
if n < m
(see here). As all the keys in the table are written out with the same sequence number, the range del wont apply.
That's also why we see the rangekeyset
and the rangekeydel
next to each other over the span c-e
(see here).
In the case you write out a new table with just the sst_clear_range
and have that shadow the first table, it should take effect. I think you already have coverage for this in sst_iter_multi
, but I guess you could just put another sst_finish
after the sst_clear_range
to put that in a standalone table?
pkg/storage/testdata/mvcc_histories/sst_writer
line 50 at r9 (raw file):
iter_scan: "a"/1.000000000,0=/BYTES/a1 {a-c}/[3.000000000,0={localTs=2.000000000,0}/<empty>] iter_scan: . >> at end:
Super nitpicky, but I was expecting to see the raw sstable output first, then followed by the scan output, given we do sst_finish
, followed by sst_iter_new
. Maybe there's something subtle in the test harness that prevents that, so not a big issue.
This patch adds a quick hack to `ClearRangeWithHeuristic` to avoid dropping a Pebble range tombstone across the range key span if there are no range keys. Similar heuristics exist in `Pebble.ExperimentalClearAllRangeKeys`, but we may be writing to an `SSTWriter` here which can't have these heuristics. This scheme needs to be revamped, but for now we add this quick hack to pass tests and unblock `SSTWriter` range key support. Release note: None
This patch adds `NewPebbleSSTIterator()`, which constructs an MVCC iterator for SSTs. Unlike the existing `sstIterator`, it supports range keys and merged iteration across multiple SSTs. It is based on a new `pebble.NewExternalIter()` API for SST iteration, and reuses `pebbleIterator` for the CRDB logic. Value verification has been split out to a separate, general `VerifyingMVCCIterator`. This iterator currently has significantly worse performance than the existing SST iterator (as much as 100%), so it is not used yet outside of tests. This will be optimized later. The patch also adds support for writing range keys in `SSTWriter`. This is only enabled when the SST format is `TableFormatPebblev2` or newer, which is only the case once the cluster version reaches `EnablePebbleFormatVersionRangeKeys`. Release note: None
c970e69
to
cf162f9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911, @jbowens, and @nicktrav)
pkg/storage/testdata/mvcc_histories/sst_writer
line 37 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Range dels with sequence number
m
only apply to keys with seqnumn
ifn < m
(see here). As all the keys in the table are written out with the same sequence number, the range del wont apply.That's also why we see the
rangekeyset
and therangekeydel
next to each other over the spanc-e
(see here).In the case you write out a new table with just the
sst_clear_range
and have that shadow the first table, it should take effect. I think you already have coverage for this insst_iter_multi
, but I guess you could just put anothersst_finish
after thesst_clear_range
to put that in a standalone table?
Thanks, that makes sense. Added test cases both for same SST and different SST.
pkg/storage/testdata/mvcc_histories/sst_writer
line 50 at r9 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Super nitpicky, but I was expecting to see the raw sstable output first, then followed by the scan output, given we do
sst_finish
, followed bysst_iter_new
. Maybe there's something subtle in the test harness that prevents that, so not a big issue.
This is an artifact of the test harness -- it outputs the persisted state at the end of each test, after commands have been run. It can output the intermediate states after each mutation by using run trace
, but that quickly gets noisy.
Split the test into two instead: one which sets up the data, and one which iterates.
TFTR! bors r=nicktrav |
Build succeeded: |
This patch adds
NewPebbleSSTIterator()
, which constructs an MVCCiterator for SSTs. Unlike the existing
sstIterator
, it supports rangekeys and merged iteration across multiple SSTs. It is based on a new
pebble.NewExternalIter()
API for SST iteration, and reusespebbleIterator
for the CRDB logic. Value verification has been splitout to a separate, general
VerifyingMVCCIterator
.This iterator currently has significantly worse performance than the
existing SST iterator (as much as 100%), so it is not used yet outside
of tests. This will be optimized later.
The patch also adds support for writing range keys in
SSTWriter
. Thisis only enabled when the SST format is
TableFormatPebblev2
or newer,which is only the case once the cluster version reaches
EnablePebbleFormatVersionRangeKeys
.Resolves #82586.
Release note: None