-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: add experimental MVCC range tombstone primitives #76131
storage: add experimental MVCC range tombstone primitives #76131
Conversation
857fa5c
to
8ed1537
Compare
8ed1537
to
b83ced0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 20 files at r1, 6 of 8 files at r2, 1 of 1 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @jbowens)
pkg/kv/kvserver/spanset/batch.go, line 181 at r3 (raw file):
// HasPointAndRange implements SimpleMVCCIterator. func (i *MVCCIterator) HasPointAndRange() (bool, bool) { panic("not implemented")
why is this not passing through to the underlying storage.MVCCIterator
?
pkg/storage/engine.go, line 78 at r3 (raw file):
// key and/or a range key. Range keys are only emitted when requested via // IterOptions.KeyTypes. HasPointAndRange() (bool, bool)
So the Valid
bool would imply at least one of these 2 bools is true?
It may be time to do part of the cleanup mentioned below (not the part about renaming NextEngineKey). The cleanup would include returning (bool, error) from the Seek methods and eliminate Valid. I am hopeful that it is a mechanical step and could happen in a PR preceding this one.
// TODO(sumeer): change MVCCIterator.Next() to match the
// return values, change all its callers.
pkg/storage/engine.go, line 628 at r3 (raw file):
// ExperimentalPutMVCCRangeKey writes a value to an MVCC range key. It is // currently only used for range tombstones, which have a value of nil.
I have a mild preference to specializing this for MVCC-range-tombstones (or whatever we are calling them so as to not confuse with the various other similar terms). Then one wouldn't be passing a nil value.
Btw, does this also mean that we are not writing the original key range as a value, or that's just an implementation detail not exposed here.
pkg/storage/pebble_iterator.go, line 587 at r3 (raw file):
// RangeBounds implements the MVCCIterator interface. func (p *pebbleIterator) RangeBounds() (roachpb.Key, roachpb.Key) { return p.iter.RangeBounds()
I think there is a problem here.
The []bytes returned by pebble.Iterator
are not necessarily roachpb.Keys. They are an encoded key, that when parsed will give a roachpb.Key
. One could parse both the start and end, but it is also legal for Pebble to return something like [a@20, a@10)@50, which if we parse and use only the roachpb.Key
will give us [a, a) which is empty.
I think the solution here is to borrow an idea from an earlier version of the Pebble range keys RFC, but instead apply it at the CockroachDB level: make this appear as a point key a@50 to the client of pebbleIterator. There will need to be some de-duping in this interface since if the next range key is [a@10,a@5)@50 we don't want to emit a@50 again.
The next range key must be of the form [a@10,)@50 since we don't allow unsetting of arbitrary key spans (they must be prefixes). Say it is [a@10,b@20)@50. We could produce [a,b)@50. There are a couple of problems with this. If the point key is b@25, it looks as if the range does not actually overlap with the point. We know the actual end of this range key will be at a key > b. We could expose an inclusive [a,b]@50 perhaps?
I think something reasonable can be worked out here that preserves a model to the caller that the range keys are of the form [roachpb.Key,roachpb.Key)@ts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @jbowens, @sumeerbhola, and @ts)
pkg/kv/kvserver/spanset/batch.go, line 181 at r3 (raw file):
Previously, sumeerbhola wrote…
why is this not passing through to the underlying
storage.MVCCIterator
?
It will. I was doing the minimum necessary to get feedback on this draft, I'll fix it up either when I complete this PR or in a follow-up PR where I do the batcheval latching bits (which is where this comes in).
pkg/storage/engine.go, line 78 at r3 (raw file):
Previously, sumeerbhola wrote…
So the
Valid
bool would imply at least one of these 2 bools is true?
It may be time to do part of the cleanup mentioned below (not the part about renaming NextEngineKey). The cleanup would include returning (bool, error) from the Seek methods and eliminate Valid. I am hopeful that it is a mechanical step and could happen in a PR preceding this one.// TODO(sumeer): change MVCCIterator.Next() to match the // return values, change all its callers.
I suppose yes. But we have less than 3 weeks to get all of the MVCC range tombstone functionality in and migrate all callers, so I won't be spending time on any related cleanups now. We can do that for 22.2.
pkg/storage/engine.go, line 78 at r3 (raw file):
Previously, aliher1911 (Oleg) wrote…
I find this naming confusing because of "And". Maybe
HasValue() (point bool, range bool)
would be better?
This just mirrors the Pebble API, I figured that'd cause less confusion.
pkg/storage/engine.go, line 80 at r3 (raw file):
Yep, I'll be adding ample documentation on this once we have consensus that we'll be doing this PR rather than the alternative (#76090). I may defer that until after stability though, since that's backportable and we have very little time left to complete all of the MVCC bulk ops work.
Is RangeBounds a union of all RangeKeys() overlapping at current timestamp?
The tl;dr is: Range keys and point keys exist independently, but can be iterated over in parallel. Range keys are also fragmented, such that all overlapping range keys have identical start/end bounds. So let's say we're at point c@5 where there is also a stack of range key fragments [a-f) at timestamps 1,2,3. RangeBounds() returns [a-f), Key() return c@5, RangeKeys() return 1:foo,2:bar,3:baz (or for range tombstones, simply nil
).
pkg/storage/engine.go, line 628 at r3 (raw file):
I have a mild preference to specializing this for MVCC-range-tombstones (or whatever we are calling them so as to not confuse with the various other similar terms). Then one wouldn't be passing a nil value.
Yeah, I've been going back and forth on this. But we may have another use-case coming up for 22.2 (ranged intents for IMPORT INTO), so I figured we might as well generalize this. Not a strongly held opinion though, but the functional delta between tombstones or general keys is tiny (value or not).
Btw, does this also mean that we are not writing the original key range as a value, or that's just an implementation detail not exposed here.
Yes, I realized that it didn't make sense, e.g. due to CRDB range splits/merges. For example, if we write [a-d), then split across three ranges [a-b), [b-c), [c-d), the middle bit can get GCed before the others. Similarly, two range tombstones that abut will merge into one. We'd have to rewrite all of the fragment values in these cases, but it just seems much better/simpler to not give them an identity at all. It should be considered a "tombstone continuum" that can be sliced and diced at will.
This also has the nice property of mirroring the point key API, where nil
is tombstone and values are arbitrary byte slices.
pkg/storage/mvcc_key.go, line 296 at r3 (raw file):
Previously, aliher1911 (Oleg) wrote…
nit: it sounds ambiguous mvcc implies a timestamp aready, maybe just a keyspan at a timestamp?
Yeah, this wording is a bit clumsy, will clean it up.
pkg/storage/mvcc_range_tombstone_iterator.go, line 24 at r3 (raw file):
Previously, aliher1911 (Oleg) wrote…
straddle "it" or "the iterator"?
Ah -- should be "it", i.e. the bound. I'll write it out.
pkg/storage/pebble_iterator.go, line 587 at r3 (raw file):
Previously, sumeerbhola wrote…
I think there is a problem here.
The []bytes returned bypebble.Iterator
are not necessarily roachpb.Keys. They are an encoded key, that when parsed will give aroachpb.Key
. One could parse both the start and end, but it is also legal for Pebble to return something like [a@20, a@10)@50, which if we parse and use only theroachpb.Key
will give us [a, a) which is empty.I think the solution here is to borrow an idea from an earlier version of the Pebble range keys RFC, but instead apply it at the CockroachDB level: make this appear as a point key a@50 to the client of pebbleIterator. There will need to be some de-duping in this interface since if the next range key is [a@10,a@5)@50 we don't want to emit a@50 again.
The next range key must be of the form [a@10,)@50 since we don't allow unsetting of arbitrary key spans (they must be prefixes). Say it is [a@10,b@20)@50. We could produce [a,b)@50. There are a couple of problems with this. If the point key is b@25, it looks as if the range does not actually overlap with the point. We know the actual end of this range key will be at a key > b. We could expose an inclusive [a,b]@50 perhaps?
I think something reasonable can be worked out here that preserves a model to the caller that the range keys are of the form [roachpb.Key,roachpb.Key)@ts.
Is that still the case now that we make sure Pebble doesn't split user keys across SSTs? I though that was done so that we were guaranteed to get all suffixes for the range key in a single call?
@jbowens Can you confirm?
pkg/storage/mvcc_key_test.go, line 312 at r3 (raw file):
Previously, aliher1911 (Oleg) wrote…
Should it be
"start at end": {MVCCRangeKey{StartKey: a, EndKey: a, Timestamp: ts1}, ""},
Yes, thanks for spotting this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 20 files at r1, 3 of 8 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @sumeerbhola, and @ts)
pkg/storage/engine.go, line 619 at r2 (raw file):
// boundaries will be cleared. Clears are idempotent. // // This method is primarily intented for MVCC garbage collection and similar
nit: s/intented/intended/
pkg/storage/pebble_iterator.go, line 587 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Is that still the case now that we make sure Pebble doesn't split user keys across SSTs? I though that was done so that we were guaranteed to get all suffixes for the range key in a single call?
@jbowens Can you confirm?
That change referred to 'user keys' from the perspective of Pebble, which are full versioned MVCC keys from the perspective of Cockroach (There can be multiple only if a Pebble snapshot prevented Pebble from dropping earlier keys).
Pebble still allows a@2
and a@1
to be written to separate sstables, which allows versions to be surfaced within range key fragment bounds.
It's a little unfortunate these internal bounds are surfaced, because RangeKeySet
, RangeKeyUnset
and RangeKeyDelete
all require their bounds to be unsuffixed. If the user can only set, unset and delete range keys at prefixes, then a physical range key bound [a@2, b@3)
does imply that the logical range key state must be unchanged in the range [a, b]
(note the inclusive b
end boundary).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @sumeerbhola, and @ts)
pkg/storage/pebble_iterator.go, line 587 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
That change referred to 'user keys' from the perspective of Pebble, which are full versioned MVCC keys from the perspective of Cockroach (There can be multiple only if a Pebble snapshot prevented Pebble from dropping earlier keys).
Pebble still allows
a@2
anda@1
to be written to separate sstables, which allows versions to be surfaced within range key fragment bounds.It's a little unfortunate these internal bounds are surfaced, because
RangeKeySet
,RangeKeyUnset
andRangeKeyDelete
all require their bounds to be unsuffixed. If the user can only set, unset and delete range keys at prefixes, then a physical range key bound[a@2, b@3)
does imply that the logical range key state must be unchanged in the range[a, b]
(note the inclusiveb
end boundary).
@sumeerbhola's earlier iterations of the range keys API required the user provide an ImmediateSucessor(k)
function that returns a key k2
such that there are no valid keys greater than k
and less than k2
. I'm forgetting all the details that earlier proposal, but ImmediateSuccessor
allows for representing an inclusive end boundary like the above [a,b]
as [a, IS(b))
.
Could we allow (*storage.pebbleIterator)
to broaden the range key bounds returned by Pebble using its own implementation of an ImmediateSuccessor
function for roachpb.Key
s? It makes me uncomfortable to broaden the bounds like that, but it sure would be convenient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @sumeerbhola, and @ts)
pkg/storage/engine.go, line 628 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I have a mild preference to specializing this for MVCC-range-tombstones (or whatever we are calling them so as to not confuse with the various other similar terms). Then one wouldn't be passing a nil value.
Yeah, I've been going back and forth on this. But we may have another use-case coming up for 22.2 (ranged intents for IMPORT INTO), so I figured we might as well generalize this. Not a strongly held opinion though, but the functional delta between tombstones or general keys is tiny (value or not).
Btw, does this also mean that we are not writing the original key range as a value, or that's just an implementation detail not exposed here.
Yes, I realized that it didn't make sense, e.g. due to CRDB range splits/merges. For example, if we write [a-d), then split across three ranges [a-b), [b-c), [c-d), the middle bit can get GCed before the others. Similarly, two range tombstones that abut will merge into one. We'd have to rewrite all of the fragment values in these cases, but it just seems much better/simpler to not give them an identity at all. It should be considered a "tombstone continuum" that can be sliced and diced at will.
This also has the nice property of mirroring the point key API, where
nil
is tombstone and values are arbitrary byte slices.
Agreed that using nil makes sense.
pkg/storage/mvcc_range_tombstone_iterator.go, line 60 at r3 (raw file):
// tombstones as soon as possible. Otherwise, a single tombstone across the the // entire key span would require all other tombstones at other timestamps to be // buffered in memory before they could be emitted. However, see the Fragmented
Perhaps add that this optimization means only the tombstones that are currently incomplete need to be buffered, while if we ordered by start key we would additionally need to buffer all complete tombstones that started after currently incomplete ones.
pkg/storage/mvcc_range_tombstone_iterator.go, line 61 at r3 (raw file):
// entire key span would require all other tombstones at other timestamps to be // buffered in memory before they could be emitted. However, see the Fragmented // option which emits non-deterministic fragments in StartKey,Timestamp order.
(I haven't read the implementation)
- What does "defragmentation" mean here. Does it mean that all abutting fragments (from a roachpb.Key perspective) and having the same timestamp will be combined.
- In this non-deterministic case, I assume we have the same issue we are discussing in the other thread?
pkg/storage/pebble_iterator.go, line 587 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
@sumeerbhola's earlier iterations of the range keys API required the user provide an
ImmediateSucessor(k)
function that returns a keyk2
such that there are no valid keys greater thank
and less thank2
. I'm forgetting all the details that earlier proposal, butImmediateSuccessor
allows for representing an inclusive end boundary like the above[a,b]
as[a, IS(b))
.Could we allow
(*storage.pebbleIterator)
to broaden the range key bounds returned by Pebble using its own implementation of anImmediateSuccessor
function forroachpb.Key
s? It makes me uncomfortable to broaden the bounds like that, but it sure would be convenient.
I think we can broaden because we already have an ImmediateSuccessor function for roachpb.Key
, roachpb.Key.Next()
It amuses me that we simply shifted the problem without realizing it. Though it was probably the correct shift to make given that we only have to handle this when iterating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the feedback and prototyping I've done elsewhere, this approach seems preferable to #76090. I'm going to start wrapping this up while we discuss further.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @sumeerbhola, and @ts)
pkg/storage/mvcc_range_tombstone_iterator.go, line 60 at r3 (raw file):
Previously, sumeerbhola wrote…
Perhaps add that this optimization means only the tombstones that are currently incomplete need to be buffered, while if we ordered by start key we would additionally need to buffer all complete tombstones that started after currently incomplete ones.
I agree, will clarify that.
pkg/storage/mvcc_range_tombstone_iterator.go, line 61 at r3 (raw file):
What does "defragmentation" mean here. Does it mean that all abutting fragments (from a roachpb.Key perspective) and having the same timestamp will be combined.
Yes. I'm likely going to generalize this into a range key defragmenter rather than a tombstone defragmenter, which will also check the value (so two abutting range keys at the same suffix with different values are considered different defragmented range keys).
In this non-deterministic case, I assume we have the same issue we are discussing in the other thread?
Yes. In that case, this iterator essentially reduces to the normal SimpleMVCCIterator
range key functionality. This option was carried over from the other draft PR where we this iterator was the only place we surfaced range keys, it should likely be removed before completing this PR.
pkg/storage/pebble_iterator.go, line 587 at r3 (raw file):
Could we allow (*storage.pebbleIterator) to broaden the range key bounds returned by Pebble using its own implementation of an ImmediateSuccessor function for roachpb.Keys? It makes me uncomfortable to broaden the bounds like that, but it sure would be convenient.
So storage.pebbleIterator
will essentially check whether the range keys at the current position have a suffix, and if so defragment until we complete the start/end prefixes? Or is there some simpler option that I'm missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @sumeerbhola, and @ts)
pkg/storage/pebble_iterator.go, line 587 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Could we allow (*storage.pebbleIterator) to broaden the range key bounds returned by Pebble using its own implementation of an ImmediateSuccessor function for roachpb.Keys? It makes me uncomfortable to broaden the bounds like that, but it sure would be convenient.
So
storage.pebbleIterator
will essentially check whether the range keys at the current position have a suffix, and if so defragment until we complete the start/end prefixes? Or is there some simpler option that I'm missing?
To clarify: can this suffix "leakage" occur only at the iterator bounds, or anywhere within the iteration span? My understanding was that it could occur anywhere within the span, but from what you're saying it sounds like it's only at the iterator bounds so setting UpperBound:key.Next() essentially solves it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @sumeerbhola, and @ts)
pkg/storage/pebble_iterator.go, line 587 at r3 (raw file):
Or is there some simpler option that I'm missing?
I'm suggesting something like:
start, end := p.iter.RangeBounds()
if hasTimestamp(start) {
start = stripTimestamp(start)
}
if hasTimestamp(end) {
end = stripTimestamp(end).Next()
}
The suffixes can appear anywhere with the iteration span. Pebble may fragment the range keys into these suffixed keys because of sstable fragmentation. But because we never allow committing range keys with suffixed keys, a range key with suffixed bounds necessarily must be a physical fragment of a logical range key that covers whole user keys. We don't know the full defragmented bounds, but we do know that the entirety of the bounds' prefixes must fall within the defragmented bounds.
One consequence of the above is that you can have overlapping RangeBounds returned by pebbleIterator
. eg, [b@3, c@2)
, [c@2, d)
is translated by pebbleIterator
into [b, c\0)
and [c, d)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @sumeerbhola, and @ts)
pkg/storage/pebble_iterator.go, line 587 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Or is there some simpler option that I'm missing?
I'm suggesting something like:
start, end := p.iter.RangeBounds() if hasTimestamp(start) { start = stripTimestamp(start) } if hasTimestamp(end) { end = stripTimestamp(end).Next() }
The suffixes can appear anywhere with the iteration span. Pebble may fragment the range keys into these suffixed keys because of sstable fragmentation. But because we never allow committing range keys with suffixed keys, a range key with suffixed bounds necessarily must be a physical fragment of a logical range key that covers whole user keys. We don't know the full defragmented bounds, but we do know that the entirety of the bounds' prefixes must fall within the defragmented bounds.
One consequence of the above is that you can have overlapping RangeBounds returned by
pebbleIterator
. eg,[b@3, c@2)
,[c@2, d)
is translated bypebbleIterator
into[b, c\0)
and[c, d)
But in order to get the full set of suffixes and values for the current prefix I’ll have to keep iterating, so seems like we’ll have to do defragmentation here too then. A bit unfortunate, but it’s workable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @sumeerbhola, and @ts)
pkg/storage/pebble_iterator.go, line 587 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
But in order to get the full set of suffixes and values for the current prefix I’ll have to keep iterating, so seems like we’ll have to do defragmentation here too then. A bit unfortunate, but it’s workable.
I don’t follow why we need to defragment? The full set of range keys for the prefix should be visible, even though the RangeBounds are narrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @sumeerbhola, and @ts)
pkg/storage/pebble_iterator.go, line 587 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I don’t follow why we need to defragment? The full set of range keys for the prefix should be visible, even though the RangeBounds are narrow
Ah, of course. It's a bit unintuitive to map the 2D MVCC span to the 1D Pebble keyspan.
What would be a simple and deterministic way to construct this state for tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @sumeerbhola, and @ts)
pkg/storage/engine.go, line 80 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Yep, I'll be adding ample documentation on this once we have consensus that we'll be doing this PR rather than the alternative (#76090). I may defer that until after stability though, since that's backportable and we have very little time left to complete all of the MVCC bulk ops work.
Is RangeBounds a union of all RangeKeys() overlapping at current timestamp?
The tl;dr is: Range keys and point keys exist independently, but can be iterated over in parallel. Range keys are also fragmented, such that all overlapping range keys have identical start/end bounds. So let's say we're at point c@5 where there is also a stack of range key fragments [a-f) at timestamps 1,2,3. RangeBounds() returns [a-f), Key() return c@5, RangeKeys() return 1:foo,2:bar,3:baz (or for range tombstones, simply
nil
).
I think that's why it feels a bit off. For point keys we iterate mvcc values and next moves to next timestamp or nextkey moves to next key with highest timestamp, but rangekeys iterates over key "aggregates" exposed via RangeKeys?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @sumeerbhola, and @ts)
pkg/storage/engine.go, line 80 at r3 (raw file):
Previously, aliher1911 (Oleg) wrote…
I think that's why it feels a bit off. For point keys we iterate mvcc values and next moves to next timestamp or nextkey moves to next key with highest timestamp, but rangekeys iterates over key "aggregates" exposed via RangeKeys?
Yes, they iterate over stacks of key fragments. The fragmentation is desirable at the storage layer, for example to be able to store range keys across multiple SSTs and across multiple CRDB ranges without incurring cross-SST/range read costs when accessing them. The stacking is desirable because it allows us to easily see all range keys that overlap with a given point, and also has to do with mapping the 2D MVCC keyspace onto the flat 1D Pebble keyspace and using range keys across these 1D points.
This fragmentation is one of the learnings from back when we implemented range tombstones in Pebble, you can read more about this in https://github.com/cockroachdb/pebble/blob/master/docs/rocksdb.md#range-deletions.
To iterate over defragmented range keys, which is more intuitive, use MVCCRangeTombstoneIterator
(soon to be renamed MVCCRangeKeyIterator
). Fragmented iteration is primarily useful for two cases: when iterating over point keys but wanting to see range keys that overlap the point, and when needing to emit partial results that can be resumed via a ResumeSpan
(e.g. for Export
requests). We will likely also add a point synthesizing iterator that will convert range keys that overlap points into point keys, which can then be iterated over as normal.
I'll write a few more comments about this. I'm going to flesh them out and write a tech note when I have time, but for now I need to hustle to get everything in before stability.
7e1c55e
to
942c848
Compare
+1 to that. And I was imagining that pebbleIterator would remember that it had output up to |
942c848
to
e4e462d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've cleaned up the PR, the only remaining bit is to handle these odd range bounds -- I'll add that once I get a test case set up.
I was imagining that pebbleIterator would remember that it had output up to c\0, so when it is on point key cc and has stripped the range span to [c,d), it will then truncate it to [c\0,d). Though perhaps no one upstream is relying on this behavior of the range spans being non-overlapping.
Yeah, that makes sense. I think that's a nice and intuitive guarantee to provide in an iterator.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @jbowens, @sumeerbhola, and @ts)
6b34e2a
to
1d961ad
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking of specifically defragmenting the physical fragmentation introduced by sstable splits. At iteration time Pebble would defragment up to the next change in range key state
I think this would be great, as long as we can come up with a solution for exports/backups. I can't think of anything else that would need mid-key range bounds.
The seek/reversal truncation semantics on that PR deserve a bit more consideration. I don't think they should be a problem, and they're probably worth the performance/complexity win -- and I suppose if anything did come up we could likely handle it during MVCCRangeKeyIterator
defragmentation.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @nicktrav, @sumeerbhola, and @ts)
@jbowens @sumeerbhola I'd like to get this merged to unblock some work on other teams, mind giving this a review? For backups, it turns out that overlapping SST range keys is fine (on restore we build new SSTs for ingestion), so I'm going to try the simple thing of mixing range keys and point keys during exports. That shouldn't really affect anything on the Pebble side either way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 20 files at r1, 1 of 8 files at r2, 2 of 12 files at r4, 3 of 10 files at r5, 1 of 1 files at r10, 3 of 4 files at r13, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @nicktrav, @sumeerbhola, and @ts)
pkg/storage/engine.go, line 116 at r13 (raw file):
// internal SST structure (which changes with compactions) and the store's // write history. They will also split and merge along with CRDB ranges, can // be partially removed by GC, and may be truncated by iterator bounds.
nit: leave a TODO to update this comment and the RangeBounds()
comment once determinism is merged, and we're guaranteed to have deterministic non-overlapping bounds?
pkg/storage/engine.go, line 615 at r13 (raw file):
// // It is safe to modify the contents of the arguments after it returns. ClearRawRange(start, end roachpb.Key) error
Where will RangeKeyDelete(a, b)
come into play? Is it in ClearRawRange
? AFAIK, we only need RangeKeyDelete
for replica removal, at which point we want to lay down both a Pebble RANGEDEL
and a RANGEKEYDELETE
.
pkg/storage/engine.go, line 634 at r13 (raw file):
// // Note that when used on batches, subsequent reads may not reflect the result // of the ClearMVCCRange.
I think this came up somewhere else recently, but I believe this comment is outdated. Pebble range tombstones written to an indexed batch will hide covered points.
pkg/storage/engine.go, line 637 at r13 (raw file):
// // It is safe to modify the contents of the arguments after it returns. ClearMVCCRange(start, end MVCCKey) error
Should ClearMVCCRange
's comment be clarified that it exclusively removes point keys, and does not remove MVCC range keys?
pkg/storage/mvcc.go, line 2235 at r13 (raw file):
} return rw.ExperimentalPutMVCCRangeKey(MVCCRangeKey{ StartKey: startKey, EndKey: endKey, Timestamp: timestamp}, nil)
I'm mildly bothered by the fact that these MVCCDeleteRange
range keys have nothing recording the fact that they are a MVCCDeleteRange
and not some future kind of range key. Here, and within range-key masking in Pebble, we're assuming that we'll never have another use case for writing a new kind of range key to the user key space with a suffix.
I suppose if we ever had another use case for range keys within the user key space, we could require it have a non-empty value and adjust Pebble to only mask value-less range keys.
pkg/storage/mvcc_key.go, line 354 at r13 (raw file):
return errors.Errorf("no timestamp") } if k.StartKey.Compare(k.EndKey) > 0 {
Shouldn't this be >= 0
since the EndKey
is exclusive?
pkg/storage/pebble_iterator.go, line 339 at r13 (raw file):
if ok := p.iter.Valid(); ok { // TODO(erikgrinaker): Pebble does not seem to respect bounds for range // keys, so for now we invalidate the iterator here instead.
This sounds like a Pebble bug. Is this definitely still the case? I can look into it.
pkg/storage/mvcc_range_key_iterator.go, line 29 at r13 (raw file):
// this would be pointless: giving e.g. a@4 would skip a range key starting at // a@5, but the range key would logically exist at the adjacent a.Next()@5 so // it would be emitted almost immediately anyway.
Should Pebble require iterator bounds be prefixes if IterKeyType
is IterKeyTypeRangesOnly
or IterKeyTypePointsAndRanges
? Otherwise, the truncation to bounds could surface MVCC keys in RangeBounds()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dismissed @aliher1911 and @sumeerbhola from 3 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @jbowens, @nicktrav, @sumeerbhola, and @ts)
pkg/storage/engine.go, line 116 at r13 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: leave a TODO to update this comment and the
RangeBounds()
comment once determinism is merged, and we're guaranteed to have deterministic non-overlapping bounds?
Will do, thanks.
pkg/storage/engine.go, line 615 at r13 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Where will
RangeKeyDelete(a, b)
come into play? Is it inClearRawRange
? AFAIK, we only needRangeKeyDelete
for replica removal, at which point we want to lay down both a PebbleRANGEDEL
and aRANGEKEYDELETE
.
Thanks for bringing this up. Replica removal sounds right. I haven't thought that far yet, but we'll possibly add a separate method for it. I don't know if we can use ClearRawRange
since that can also be used on timestamped keys. But I'm going to look into range splits/merges and such shortly, and will keep this in mind.
pkg/storage/engine.go, line 634 at r13 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I think this came up somewhere else recently, but I believe this comment is outdated. Pebble range tombstones written to an indexed batch will hide covered points.
Yeah, came out of a recent postmortem. I'll submit a PR to clean it up, thanks.
pkg/storage/engine.go, line 637 at r13 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Should
ClearMVCCRange
's comment be clarified that it exclusively removes point keys, and does not remove MVCC range keys?
Probably, once we clarify the semantics. The biggest part of implementing MVCC range keys will be to review all of the existing MVCC/storage/KV APIs and make them handle range keys in a reasonable way. For now, none of the APIs handle them, so it's kind of meaningless to point it out (we'd have to point it out on everything). This will likely happen during the stability period, and possibly into the 22.1 release.
pkg/storage/mvcc.go, line 2235 at r13 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I'm mildly bothered by the fact that these
MVCCDeleteRange
range keys have nothing recording the fact that they are aMVCCDeleteRange
and not some future kind of range key. Here, and within range-key masking in Pebble, we're assuming that we'll never have another use case for writing a new kind of range key to the user key space with a suffix.I suppose if we ever had another use case for range keys within the user key space, we could require it have a non-empty value and adjust Pebble to only mask value-less range keys.
Yeah. But we'd likely want the caller to control that anyway, e.g. by registering a masking callback or value filter or something, I think?
pkg/storage/mvcc_key.go, line 354 at r13 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Shouldn't this be
>= 0
since theEndKey
is exclusive?
We allow zero-length ranges elsewhere in our APIs, e.g. it's valid to request a scan over the range [a-a) which will be empty. So I've allowed this for range keys too, for consistency. Of course, it'll essentially be a noop.
pkg/storage/pebble_iterator.go, line 339 at r13 (raw file):
Previously, jbowens (Jackson Owens) wrote…
This sounds like a Pebble bug. Is this definitely still the case? I can look into it.
Yes, I believe so. Try commenting it out and run TestMVCCRangeKeyIterator
, and you'll see lots of test cases go into infinite loops. When I looked into it a while ago, it seemed like calling .Next()
when positioned at the last range key fragment of a bounded iterator would place the iterator back at the last fragment again, leading to an infinite loop. But I didn't look at it too closely.
pkg/storage/mvcc_range_key_iterator.go, line 29 at r13 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Should Pebble require iterator bounds be prefixes if
IterKeyType
isIterKeyTypeRangesOnly
orIterKeyTypePointsAndRanges
? Otherwise, the truncation to bounds could surface MVCC keys inRangeBounds()
.
I'm a bit worried about backwards compatibility -- we'll enable IterKeyTypePointsAndRanges
by default on lots of existing iterators on the read path, and I don't want to have to audit them all to make sure they don't use versioned bounds. Maybe we never do, but I'll have to look into it. Can we punt this decision for later? I'll keep it in mind and add some guards/handling for this in the MVCC iterator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For backups, it turns out that overlapping SST range keys is fine (on restore we build new SSTs for ingestion)
Can you point me to the relevant CockroachDB code. I am curious how we are merging these ssts now (I vaguely remember seeing an sst merging iterator, but I can't find it -- I might have imagined it).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @jbowens, @nicktrav, @sumeerbhola, and @ts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
never mind, I found multiIterator
. I think we'll need to change that code since we don't want to repeat the fragmenting logic in the CockroachDB code. I think we should expose a Pebble provided "merging iterator" that can be used to merge the various ssts. And then wrap the MVCC code on top of it. We should probably standardize on InternalIterator
as the one unifying raw read interface, over which such merging operates, instead of things like sstable.Iterator
.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @jbowens, @nicktrav, @sumeerbhola, and @ts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that makes sense -- would be great to avoid duplicating that in CRDB. But for restore, we'd want it to be tolerant of key collisions, which in the range key case would mean merging the range keys (as long as they have identical values). We'd also ideally want to do range key masking here, because restore only cares about the latest visible point key, which I think we might get for free by using InternalIterator
?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @jbowens, @nicktrav, @sumeerbhola, and @ts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @jbowens, @nicktrav, @sumeerbhola, and @ts)
pkg/storage/engine.go, line 78 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I suppose yes. But we have less than 3 weeks to get all of the MVCC range tombstone functionality in and migrate all callers, so I won't be spending time on any related cleanups now. We can do that for 22.2.
Can you add a TODO to fix this later. I worry about technical debt in interfaces (not that you are responsible for what we have already accrued).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @jbowens, @nicktrav, @sumeerbhola, and @ts)
pkg/storage/engine.go, line 78 at r3 (raw file):
Previously, sumeerbhola wrote…
Can you add a TODO to fix this later. I worry about technical debt in interfaces (not that you are responsible for what we have already accrued).
Yes, will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should expose a Pebble provided "merging iterator" that can be used to merge the various ssts.
This one 'merging iterator' would avoid the need for exposing an iter off the sstable.Reader
that wraps using the interleaving iterator, right? cockroachdb/pebble#1513 (review)
Do we want to do this right now? I'm unclear on how much of a lift it'll be to adapt the Pebble point mergingIter
for this use.
Should this interface take vfs.File
s (so that we can construct point and range key iterators as necessary), or we could explicitly take separate sets of point iterators and range key iterators (and I'm assuming no rangedel iterators.).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @nicktrav, @sumeerbhola, and @ts)
pkg/storage/engine.go, line 637 at r13 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Probably, once we clarify the semantics. The biggest part of implementing MVCC range keys will be to review all of the existing MVCC/storage/KV APIs and make them handle range keys in a reasonable way. For now, none of the APIs handle them, so it's kind of meaningless to point it out (we'd have to point it out on everything). This will likely happen during the stability period, and possibly into the 22.1 release.
Sounds good
pkg/storage/mvcc.go, line 2235 at r13 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Yeah. But we'd likely want the caller to control that anyway, e.g. by registering a masking callback or value filter or something, I think?
Yeah, for sure. I guess empty-value mapping to delete makes sense and mirrors point deletes. Seems fine.
pkg/storage/mvcc_key.go, line 354 at r13 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
We allow zero-length ranges elsewhere in our APIs, e.g. it's valid to request a scan over the range [a-a) which will be empty. So I've allowed this for range keys too, for consistency. Of course, it'll essentially be a noop.
Ok, I'll need to double check Pebble's behavior here. We might not tolerate them well at the moment.
pkg/storage/pebble_iterator.go, line 339 at r13 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Yes, I believe so. Try commenting it out and run
TestMVCCRangeKeyIterator
, and you'll see lots of test cases go into infinite loops. When I looked into it a while ago, it seemed like calling.Next()
when positioned at the last range key fragment of a bounded iterator would place the iterator back at the last fragment again, leading to an infinite loop. But I didn't look at it too closely.
Got it, I'll look into it.
pkg/storage/mvcc_range_key_iterator.go, line 29 at r13 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I'm a bit worried about backwards compatibility -- we'll enable
IterKeyTypePointsAndRanges
by default on lots of existing iterators on the read path, and I don't want to have to audit them all to make sure they don't use versioned bounds. Maybe we never do, but I'll have to look into it. Can we punt this decision for later? I'll keep it in mind and add some guards/handling for this in the MVCC iterator.
Yeah, let's punt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @nicktrav, @sumeerbhola, and @ts)
pkg/storage/mvcc_key.go, line 354 at r13 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Ok, I'll need to double check Pebble's behavior here. We might not tolerate them well at the moment.
I'd be fine with checking this in MVCC too and dropping the Pebble call.
1d961ad
to
d1d2ea9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I take it you were asking Sumeer here, but I'll throw in my 2c.
This one 'merging iterator' would avoid the need for exposing an iter off the sstable.Reader that wraps using the interleaving iterator, right? cockroachdb/pebble#1513 (review)
Yes. If this merging iterator allows us to use an arbitrary number of SSTs (including 1) with mixed point and range keys and interleaved iteration then that covers our needs. Masking would be a nice-to-have that we can punt for now.
Do we want to do this right now? I'm unclear on how much of a lift it'll be to adapt the Pebble point mergingIter for this use.
We're going to want interleaved point/range key iteration in SSTs for backup/restore, and we'd like to get the restore code updated and merged before stability if possible (i.e. during next week).
We can either do the SST merging in CRDB or in Pebble, but we're going to need it somewhere. If we do the merging in CRDB then we'll have to reimplement the interleaving logic on top of the merging logic, so it seems advantageous to do it in Pebble where we can hopefully reuse some of the existing logic.
Should this interface take vfs.Files (so that we can construct point and range key iterators as necessary), or we could explicitly take separate sets of point iterators and range key iterators (and I'm assuming no rangedel iterators.).
I don't really have a strong or informed opinion here, but I think we're usually holding binary SST data so vfs.File
seems reasonable. Would there be an advantage to taking iterators?
Dismissed @aliher1911 and @sumeerbhola from 2 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @50, @aliher1911, @erikgrinaker, @jbowens, @nicktrav, @sumeerbhola, and @ts)
372721a
to
5cd3afc
Compare
This patch adds initial experimental primitives for MVCC range tombstones and the range keys they build on, based on experimental Pebble range keys, * Data structures: * `MVCCRangeKey` * `MVCCRangeKeyValue` * `nil` value for range tombstones (as with point tombstones) * Engine support for reading, writing, and clearing range keys: * `Engine.ExperimentalClearMVCCRangeKey()` * `Engine.ExperimentalPutMVCCRangeKey()` * `SimpleMVCCIterator.HasPointAndRange()` * `SimpleMVCCIterator.RangeBounds()` * `SimpleMVCCIterator.RangeKeys()` * `MVCCRangeKeyIterator` * MVCC function for writing range tombstones: * `ExperimentalMVCCDeleteRangeUsingTombstone()` Range tombstones do not have a distinct identity, and should instead be considered a tombstone continuum: they will merge with abutting tombstones, can be partially cleared, can split or merge along with ranges, and so on. Bounded scans will truncate them to the scan bounds. The generalized range keys that range tombstones build on are also exposed via the `Engine` API. This is primarily for internal MVCC use. Exposing this in terms of range key/value pairs rather than range tombstones allows for additional use-cases such as ranged intents. Range tombstones are not yet handled in the rest of the MVCC or KV API, nor are they persisted to disk. Subsequent pull requests will extend their functionality and integrate them with other components. Release note: None
5cd3afc
to
9f3ef09
Compare
bors r=jbowens,sumeerbhola,nicktrav TFTRs! |
Build succeeded: |
76921: storage: revert experimental MVCC range tombstones r=aliher1911 a=erikgrinaker This reverts most of #76131, #76203, and #76478 -- except minor changes that were unrelated to the range tombstones themselves. This leaves a gap for cluster version `Internal:78` -- I think that's probably fine, but I've left a comment. Co-authored-by: Erik Grinaker <[email protected]>
This is an alternative to #76090. Rather than hiding Pebble range keys inside the MVCC API, this exposes them as a first-class primitive for internal use. This allows callers to take advantage of Pebble range key functionality such as combined point/range key iteration and range key masking, which is useful e.g. for
IncrementalMVCCIterator
andSSTIterator
(used by backups and rangefeed catchup scans). It will also allow the upcoming point synthesizing iterator to just be anotherMVCCIterator
implementation that can be plugged in as needed, instead of sitting belowpebbleIterator
.For consistency with point keys, this also stores range tombstones as a range key with a
nil
value instead of a mostly-useless Protobuf that only adds overhead.This patch adds initial experimental primitives for MVCC range
tombstones and the range keys they build on, based on experimental
Pebble range keys,
Data structures:
MVCCRangeKey
MVCCRangeKeyValue
nil
value for range tombstones (as with point tombstones)Engine support for reading, writing, and clearing range keys:
Engine.ExperimentalClearMVCCRangeKey()
Engine.ExperimentalPutMVCCRangeKey()
SimpleMVCCIterator.HasPointAndRange()
SimpleMVCCIterator.RangeBounds()
SimpleMVCCIterator.RangeKeys()
MVCCRangeKeyIterator
MVCC function for writing a range tombstones:
ExperimentalMVCCDeleteRangeUsingTombstone()
Range tombstones do not have a distinct identity, and should instead be
considered a tombstone continuum: they will merge with abutting
tombstones, can be partially cleared, can split or merge along with
ranges, and so on. Bounded scans will truncate them to the scan bounds.
The generalized range keys that range tombstones build on are also
exposed via the
Engine
API. This is primarily for internal MVCC use.Exposing this in terms of range key/value pairs rather than range
tombstones allows for additional use-cases such as ranged intents.
Range tombstones are not yet handled in the rest of the MVCC or KV API,
nor are they persisted to disk. Subsequent pull requests will extend
their functionality and integrate them with other components.
Touches #70412.
Release note: None