docs: draft design sketch for ranged keys #1341
Conversation
I've dumped the current text from the issue here so we can collaborate on the design.
Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @erikgrinaker and @jbowens)
I like pushing this complexity into Pebble rather than trying to do it above Pebble.
Reviewable status: 0 of 1 files reviewed, 6 unresolved discussions (waiting on @erikgrinaker, @jbowens, and @sumeerbhola)
docs/rfcs/20211018_range_keys.md, line 38 at r1 (raw file):
There are two new operations: - `Set([k1, k2), [optional suffix], <value>)`: This represents the
We should come up with the real name for this operation at some point. `Set` and `Del` imply point operations to me. `RangeSet` and `RangeDel` would make sense, except `RangeDel` will likely get confused with `DeleteRange`. `SpanSet` and `SpanDel`?
docs/rfcs/20211018_range_keys.md, line 39 at r1 (raw file):
- `Set([k1, k2), [optional suffix], <value>)`: This represents the mapping `[k1, k2)@suffix => value`.
Should the optional suffix be explicitly supplied, or can we require that the suffixes returned by `Split` on `k1` and `k2` are equal?
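A minimal Go sketch of the latter option, assuming a Pebble-style `Split` function that returns the length of the key's prefix (so everything after that index is the version suffix); `suffixesEqual` is a hypothetical helper, not anything from the RFC:

```go
package rangekeysketch

import "bytes"

// Split is assumed to behave like Pebble's Comparer.Split: it returns the
// length of the MVCC prefix of k, so k[split(k):] is the version suffix
// (possibly empty).
type Split func(k []byte) int

// suffixesEqual reports whether k1 and k2 carry the same suffix, i.e. the
// constraint being floated above as an alternative to an explicit suffix
// argument on the range-key Set operation.
func suffixesEqual(split Split, k1, k2 []byte) bool {
	return bytes.Equal(k1[split(k1):], k2[split(k2):])
}
```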
docs/rfcs/20211018_range_keys.md, line 86 at r1 (raw file):
iterators used for compaction writes. - Point deletes only apply to points.
I think you can incorporate this bullet into the one above about "point keys and range keys do not interact". I'd find it clarifying to be explicit that `Del` operations only apply to range keys.
docs/rfcs/20211018_range_keys.md, line 91 at r1 (raw file):
can remove part of a range key, just like the new Del we introduced earlier. Note that these two delete operations are different since the Del requires that the suffix match. A RANGEDEL on the other hand
What are the semantics of a RANGEDEL applying to a part of a range key? Consider a RANGEDEL `[a@30, d@90)` and a range-key `[c,e)@70`. Does the range-key get masked to become `[d,e)@70`?
Do range-key `Del` operations mask point operations? I think not based on other bullets, but that is another difference between the two operations. Is it problematic for CRDB if `Del` operations mask point operations? I ask because I find it a bit unfortunate that we're going to have both `RANGEDEL` and range-key `Del`. That feels like it will cause confusion. Perhaps it is necessary (I haven't thought through this as much as you have).
docs/rfcs/20211018_range_keys.md, line 153 at r1 (raw file):
Key() []byte PointValue() []byte RangeEndKey) []byte
Nit: s/RangeEndKey)/RangeEndKey()/g
docs/rfcs/20211018_range_keys.md, line 406 at r1 (raw file):
cleanly applied to range keys since it is not limited to a subset of versions. However, currently one can also construct RANGEDELs like [b@30,d@90), which are acceptable for bulk deletion of points, but
Do we ever construct RANGEDELs in CRDB like `[b@30,d@90)`? My recollection is that we're always passing in an empty MVCC timestamp for `DeleteRange` operations.
Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @erikgrinaker and @sumeerbhola)
docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):
[Continuing our discussion from #1339]
There are difficulties here. For example [b@30,c). How does it represent that bb@49 is masked
Good point. I do think we want the sstable and its ranges to be self-describing, even if we think we could work around it. It seems subtle and error prone otherwise. I think your diagram from the range deletions document is a useful visualization here if we redefine the y-axis to be timestamp:
^
| . > . > . > . > .↥
| . > . > . > . > .
| . > . > . > . > .
@50 | V > x↧ > . > . > .
@40 | . > x. > x. > y. > .
@30 | . > x. > x. > y. > .
@20 | . > x. > y. > y. > .
@10 | . > x. > y. > y. > .
-------------------------------------------------------->
k IS(k) IS(IS(k)) IS(IS(IS(k)))
k k1 k2 k3
Say we want to split this range into two sstables, x and y, such that `k1@20` is the first key in the second sstable. We need to be able to encode that a range spans whole user keys, and also that a range acts only within the timestamps of a key.
We could define `[a@t1,b@t2)` as covering the keys in `[a,b)@t1 + b@[t1,t2)`. In the above diagram, sstable x would have a range `[k@50, k1@20)`. The sstable y needs to encode the rest of the key `k1`, and could do that as `[k1@20,ImmediateSuccessor(k1))`. Then, the remainder of the whole user keys covered could be represented as `[ImmediateSuccessor(k1)@50,k3)`.
Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @erikgrinaker, @petermattis, and @sumeerbhola)
docs/rfcs/20211018_range_keys.md, line 91 at r1 (raw file):
What are the semantics of a RANGEDEL applying to a part of a range key? Consider a RANGEDEL [a@30, d@90) and a range-key [c,e)@70. Does the range-key get masked to become [d,e)@70?
Good question. This is why the later text introduces RANGEDEL_PREFIX that can only be of the form [a, d).
Do range-key Del operations mask point operations? I think not based on other bullets, but that is another difference between the two operations. Is it problematic for CRDB if Del operations mask point operations?
I realized I've used masking in a different sense below for MVCC version masking. So here let me use the word overwrite. I'm actually not sure whether having range keys not overwrite point keys and vice versa is the right thing to do. And now that we need ImmediateSuccessor should there really be a difference between b and [b,ImmediateSuccessor(b))? The ImmediateSuccessor also makes it viable to construct user keys to accomplish such overwrites. One practical issue is that point keys are deterministic -- if we blur the difference between point and range keys, code that is trying to anchor itself at deterministic places (like the metamorphic test) may no longer be able to do so. We could still preserve the difference from an iterator and storage perspective while doing the overwrites. In that world a range-key Del operation would also apply to point keys.
docs/rfcs/20211018_range_keys.md, line 406 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Do we ever construct RANGEDELs in CRDB like `[b@30,d@90)`? My recollection is that we're always passing in an empty MVCC timestamp for `DeleteRange` operations.
We do, in https://github.com/cockroachdb/cockroach/blob/c8c186f2d27f24c489aba5f6a38b57b48f628148/pkg/storage/mvcc.go#L2171-L2213
That is the only place (it's easy to tell now since the `Writer` makes the caller distinguish between `ClearMVCCRangeAndIntents(start, end roachpb.Key)` and `ClearMVCCRange(start, end MVCCKey)`).
Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @41, @50, @erikgrinaker, @petermattis, and @sumeerbhola)
docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):
We could define [a@t1,b@t2) as covering the keys in [a,b)@t1 + b@[t1,t2). In the above diagram, sstable x would have a range [k@50, k1@20). The sstable y needs to encode the rest of the key k1, and could do that as [k1@20,ImmediateSuccessor(k1)). Then, the remainder of the whole user keys covered could be represented as [ImmediateSuccessor(k1)@50,k3).
In this example are point keys for k1 spanning the two sstables?
Another alternative:
- [k,k1)@50 in first sstable. Does not apply to any k1 prefix.
- [k1, ImmediateSuccessor(k1))@50 in first sstable, because we want it to version-mask point key k1@49 in the first sstable
- First point key in the second sstable is k@40. So include [k1, ImmediateSuccessor(k1))@41 in the second sstable.
The semantics of a range key are simpler in this case. But we can't in the future merge [k, ImmediateSuccessor(k1))@50, [k1, ImmediateSuccessor(k1))@41, unlike your idea.
What bothers me about both options is that we are starting to construct versions that were not in the original write, and I think that it will increase complexity.
I'm still wondering whether we can get away with [k1, ImmediateSuccessor(k1))@50 in the first sstable being able to version-mask the k@40 in the second sstable. Given that range keys do not follow the LSM invariant, I suspect we'll need to keep internal state in the iterator about ranges from sstables that are no longer loaded. I need to think more about how the iterator would function ignoring any issues wrt improperly split range keys.
Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @41, @50, @erikgrinaker, @petermattis, and @sumeerbhola)
docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):
In this example are point keys for k1 spanning the two sstables?
Yeah.
Given that range keys do not follow the LSM invariant
This might be a dumb question, but why do they not follow the LSM invariant with respect to the keys they cover? Can we do more or less exactly what we do with range deletions?
We have a versioned keyspace like: `a`, `a@t10`, `a@t8`, `a@t3`, `b`, `b@t11`, `b@t12`. We could choose to define Pebble range keys as occupying the one-dimensional keyspace just like Pebble range deletions. A range key `[a@t10, b@t11)` covers all keys `k` s.t. `a@t10 ≤ k < b@t11` in the one-dimensional space. It doesn't mask all of them. Those semantics are up to the iterator consumer.
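A tiny illustration of that one-dimensional definition (a sketch, not anything from the RFC; `cmp` stands in for the user comparator):

```go
// covers reports whether the user key k falls inside the range key
// [start, end) purely in one-dimensional key order, with no knowledge of
// prefixes or MVCC timestamps. cmp must be the same Pebble-style Compare
// function that orders point keys (so that, e.g., b@t12 sorts before b@t11).
func covers(cmp func(a, b []byte) int, start, end, k []byte) bool {
	return cmp(start, k) <= 0 && cmp(k, end) < 0
}
```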
Keys with higher sequence numbers but the same prefix may descend beneath other keys with lower sequence numbers and the same prefix. So keys with the same prefix may be jumbled up and down the LSM, regardless of sequence order. However, since range keys cover that same one-dimensional space, the LSM invariant can be maintained with respect to the relationship between the range key and the point keys it covers. Range keys are fragmented across sstables and contribute to sstable boundaries like range deletions do. [It seems like we could also do the same "effective-within" logical truncation that we do with range deletions to avoid constructing synthetic keys using `ImmediateSuccessor`?]
With this structure the `mergingIter` would keep one sstable's range key block open at every level of the LSM just like it does with range deletions.
Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @41, @50, @erikgrinaker, @petermattis, @sumeerbhola, and @v)
docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):
but why do they not follow the LSM invariant with respect to the keys they cover? Can we do more or less exactly what we do with range deletions?
I misspoke in my previous comment. I had meant to say "don't follow the LSM invariant for prefixes". They do follow the LSM invariant for the user key since [k1, k2)@v#s where k1 != k2 contributes k1@v#s to the sstable start bound and k2#inf to the sstable end bound.
I thought a bit about the mergingIter case and I think what is written in the current text of the doc suffices. Consider the two cases (let's ignore equal versions for point and range keys for now -- we need to clean up the spec in that area):
- [k1,k2)@v#s where k1 != k2: this guarantees that there is no other sstable on the same level that has (a) point key k1@v2 and v2 < v, (b) point keys with prefixes in [k1, k2) and prefix > k1. So once the masking iterator has "reached" k1@v it can stay at this sstable (for this level) until the heap for points reaches k2. Which means this sstable stays loaded (at this level) while we need to use [k1,k2)@v#s to compute the masking behavior.
- [k1,IS(k1))@v#s: This behaves like a point k1@v#s. There can be a point at this level with k1@v2 and v2 < v, which is in the next sstable. But before we transition to that next sstable, the merging iterator only needs to remember that it has emitted something for prefix k1 (which is the last prefix it has emitted), and so it should skip past all other k1 prefixed keys. It doesn't even need to remember @v since the heap ordering will ensure that any k1 it encounters in the future will have version < v.
Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @41, @50, @erikgrinaker, @petermattis, @sumeerbhola, and @v)
docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):
Previously, sumeerbhola wrote…
but why do they not follow the LSM invariant with respect to the keys they cover? Can we do more or less exactly what we do with range deletions?
I misspoke in my previous comment. I had meant to say "don't follow the LSM invariant for prefixes". They do follow the LSM invariant for the user key since [k1, k2)@v#s where k1 != k2 contributes k1@v#s to the sstable start bound and k2#inf to the sstable end bound.
I thought a bit about the mergingIter case and I think what is written in the current text of the doc suffices. Consider the two cases (lets ignore equal versions for point and range keys for now -- we need to clean up the spec in that area):
- [k1,k2)@v#s where k1 != k2: this guarantees that there is no other sstable on the same level that has (a) point key k1@v2 and v2 < v, (b) point keys with prefixes in [k1, k2) and prefix > k1. So once the masking iterator has "reached" k1@v it can stay at this sstable (for this level) until the heap for points reaches k2. Which means this sstable stays loaded (at this level) while we need to use [k1,k2)@v#s to compute the masking behavior.
- [k1,IS(k1))@v#s: This behaves like a point k1@v#s. There can be a point at this level with k1@v2 and v2 < v, which is in the next sstable. But before we transition to that next sstable, the merging iterator only needs to remember that it has emitted something for prefix k1 (which is the last prefix it has emitted), and so it should skip past all other k1 prefixed keys. It doesn't even need to remember @v since the heap ordering will ensure that any k1 it encounters in the future will have version < v.
That makes sense, but isn't it still more complicated than fragmenting these ranges like we do range deletions, such that each sstable is self-describing?
I might be overlooking something, but it seems like fragmenting these ranges within the one-dimensional space and ignoring prefixes would also allow us to not require `ImmediateSuccessor` implementations. Reusing the same concepts as range tombstones also reduces complexity.
When discussing range tombstones for CRDB, we noted that a key span which sees frequent range deletes would keep accumulating fragments over time: any new range key that spans an existing range key has to write the same number of fragments or more. Without an opposing mechanism that opportunistically merges fragments (e.g. during compaction/GC), the number of fragments could build up until it becomes pathological. I could not see any mention of this in the document; should we consider merging fragments during compaction where possible to avoid this?
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @41, @50, @petermattis, @sumeerbhola, and @v)
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @41, @50, @petermattis, @sumeerbhola, and @v)
docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):
We probably need a more synchronous discussion.
That makes sense, but isn't it still more complicated than fragmenting these ranges like we do range deletions, such that each sstable is self-describing?
I'm not sure what that would really be like.
RANGEDELs are "easier" since they are not made visible to the user via the `Iterator` interface, so the improper splitting is hidden. One can't also delete parts of a RANGEDEL with another user operation, like we can do with these range keys.
Which is what makes me happiest with designs that only store and expose range keys of the form [k1,k2)@v, and have the sstable bounds be self-describing as it pertains to these range keys.
should we consider merging fragments during compaction where possible to avoid this?
good point -- that was an omission that should be easy to rectify.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @41, @50, @petermattis, @sumeerbhola, and @v)
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @41, @50, @petermattis, @sumeerbhola, and @v)
docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):
Previously, sumeerbhola wrote…
We probably need a more synchronous discussion.
That makes sense, but isn't it still more complicated than fragmenting these ranges like we do range deletions, such that each sstable is self-describing?
I'm not sure what that would really be like.
RANGEDELs are "easier" since they are not made visible to the user via the `Iterator` interface, so the improper splitting is hidden. One can't also delete parts of a RANGEDEL with another user operation, like we can do with these range keys.
Which is what makes me happiest with designs that only store and expose range keys of the form [k1,k2)@v, and have the sstable bounds be self-describing as it pertains to these range keys.
I suspect my thinking is still anchored to some of the earlier thinking, before we went down the path of interleaving range keys in the iterator output.
I was originally thinking of range keys as "coloring" a span of the 1-dimensional keyspace. You can use range keys to say all point keys in the span `[a,c)` have color blue. All point keys in the span `[b,d)` have color green. Or all keys in the span `[d@t11, d@t3)` have color orange. During iteration, the iterator surfaces a point key and all covering 'colors' applied through range keys. The user above Pebble is responsible for applying whatever semantics they attribute to the colors. These range keys are conceptually similar to range tombstones, since their spans are defined over the 1-dimensional key space only and know nothing of key prefixes. Unlike range tombstones, they don't overwrite each other. They stack.
In the case of MVCC delete range, the timestamp of a delete range is stored outside of the span (eg, in the value, or the "param" in your earlier iteration). You might have a range key that was fragmented into the span `[d@t11, d@t6)` but separately encodes a timestamp `<@t4` that the user-level semantics interpret as irrelevant.
The Pebble iterator needs to expose every covering color to the user, because it knows nothing of the semantics of a color. To avoid this, I think you could still add Pebble knowledge of these separate suffixes associated with a range key so that Pebble only surfaces 1 range key for a suffix greater than the current point key's suffix. I think the user would need to provide a comparator for suffixes to be able to do this.
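A rough sketch of what such a suffix comparator could look like, pretending suffixes are the human-readable "@<timestamp>" strings used in this thread (CockroachDB's real suffixes are binary-encoded MVCC timestamps, so this is purely illustrative and not a proposed API):

```go
package rangekeysketch

import (
	"math"
	"strconv"
)

// compareSuffixes orders version suffixes so that newer timestamps sort
// first, matching MVCC key ordering. An empty suffix (an unversioned key)
// sorts before every versioned suffix.
func compareSuffixes(a, b []byte) int {
	ta, tb := suffixTS(a), suffixTS(b)
	switch {
	case ta > tb:
		return -1 // newer sorts earlier
	case ta < tb:
		return 1
	default:
		return 0
	}
}

func suffixTS(s []byte) int64 {
	if len(s) == 0 {
		return math.MaxInt64
	}
	ts, _ := strconv.ParseInt(string(s[1:]), 10, 64) // skip the leading '@'
	return ts
}
```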
These 1-dimensional range keys are nice because they act very similar to RANGEDELs and can easily be split to accommodate compactions splitting prefixes across multiple sstables.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @41, @50, @petermattis, @sumeerbhola, @t, and @v)
docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):
(I need to think more about what you are proposing, but some preliminary questions below)
Going back to your earlier example:
We could define [a@t1,b@t2) as covering the keys in [a,b)@t1 + b@[t1,t2). In the above diagram, sstable x would have a range [k@50, k1@20). The sstable y needs to encode the rest of the key k1, and could do that as [k1@20,ImmediateSuccessor(k1)). Then, the remainder of the whole user keys covered could be represented as [ImmediateSuccessor(k1)@50,k3).
- IIUC, when the user originally writes [k1,k2)@t, in this model it would actually be [k1@t,k2) with value <@t. Is that correct?
- One of the reasons to not expose all the colors is that point keys are no longer coincident with range keys when we are dealing with MVCC keys. For example, with [a,c)@50, when we expose point a@60 through the iterator, the range key does not overlap, but when we expose b@60, there is an overlap based on just the key range (if we use the 1-dimensional keyspace). Semantically, it seems dubious to have this difference since both a@60 and b@60 are not masked, and the iterator should not behave differently for the two (whether in masked mode or not).
- Hence the non-masked mode has the repetition semantics in order to be consistent, and output a@60, [a,c)@50, b@60, [b,c)@50. When exactly this repetition occurs needs some cleanup I think -- the text currently says only if there is a lower versioned point that succeeds. It may be that we want to repeat if there is any point at the prefix, regardless of what version it has.
- I am not sure we need the generality (for the user) implied by "In the case of MVCC delete range, the timestamp of a delete range is stored outside of the span" and "needs to expose every covering color to the user, because it knows nothing of the semantics of a color". Pebble has documented split as being for MVCC, and Pebble needs some semantic knowledge for doing efficient masking. I suspect you are leaning in the same direction with the statement about "comparator for suffixes".
The main question I have is what are we getting in terms of (a) benefit to user semantics, or (b) implementation benefit, by supporting [k1@t1,k2@t2), compared to [k1,k2)@t. I believe the latter is easier to explain. It is very likely that I am missing something.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @41, @50, @petermattis, @sumeerbhola, @t, and @v)
docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):
IIUC, when the user originally writes [k1,k2)@t, in this model it would actually be [k1@t,k2) with value <@t. Is that correct?
Yeah, that's right.
Semantically, it seems dubious to have this difference since both a@60 and b@60 are not masked, and the iterator should not behave differently for the two (whether in masked mode or not).
I think the primary difference between these two paradigms is whether or not Pebble has first-class MVCC support, or continues to work within a 1-dimensional realm.
Pebble needs some semantic knowledge for doing efficient masking
I'm not sure about this, but I thought that it only needed to expose the ability to a) dynamically mask using a block property filter and b) to expose the covering ranges. This might be what I'm overlooking.
The main question I have is what are we getting in terms of (a) benefit to user semantics, or (b) implementation benefit, by supporting [k1@t1,k2@t2), compared to [k1,k2)@t. I believe the latter is easier to explain.
The benefits I see are:
- Pebble continues to deal with 1-dimensional keyspace only. Although Pebble knows about `Comparer.Suffix`, its usage is extremely limited. MVCC is a user-level projection onto Pebble's 1-dimensional keyspace, and prefix-range keys are a divergence from that.
- The range keys and RANGEDELs mirror each other closely. I suspect this conceptual similarity will make implementation easier, in compactions, fragmenting and maybe iteration. Also, it removes the need for `RANGEDEL_PREFIX` since range keys and `RANGEDEL`s occupy the same 1-dimensional keyspace.
Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @41, @50, @erikgrinaker, @petermattis, @sumeerbhola, @t, and @v)
docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
IIUC, when the user originally writes [k1,k2)@t, in this model it would actually be [k1@t,k2) with value <@t. Is that correct?
Yeah, that's right.
Semantically, it seems dubious to have this difference since both a@60 and b@60 are not masked, and the iterator should not behave differently for the two (whether in masked mode or not).
I think the primary difference between these two paradigms is whether or not Pebble has first-class MVCC support, or continues to work within a 1-dimensional realm.
Pebble needs some semantic knowledge for doing efficient masking
I'm not sure about this, but I thought that it only needed to expose the ability to a) dynamically mask using a block property filter and b) to expose the covering ranges. This might be what I'm overlooking.
The main question I have is what are we getting in terms of (a) benefit to user semantics, or (b) implementation benefit, by supporting [k1@t1,k2@t2), compared to [k1,k2)@t. I believe the latter is easier to explain.
The benefits I see are:
- Pebble continues to deal with 1-dimensional keyspace only. Although Pebble knows about `Comparer.Suffix`, its usage is extremely limited. MVCC is a user-level projection onto Pebble's 1-dimensional keyspace, and prefix-range keys are a divergence from that.
- The range keys and RANGEDELs mirror each other closely. I suspect this conceptual similarity will make implementation easier, in compactions, fragmenting and maybe iteration. Also, it removes the need for `RANGEDEL_PREFIX` since range keys and `RANGEDEL`s occupy the same 1-dimensional keyspace.
I tried to flesh out the behavior of this 1-dimensional key space model.
I considered `[k1@t,k2)#s, <@t` in this model, which is what the user will write. The semantic equivalent in the other model (which offers less flexibility) is `[k1,k2)@t#s`. The behavior wrt 1-dimensional keys is the same for both: the start is `k1@t#s` and the end is `k2#inf`.
Now we start considering the differences.
- Say we have `[a@50,d)#s <@50` and we have a compaction output where point `c@40#s` is in the first sstable and `c@39#s` is in the second sstable. We would write `[a@50,c@39)#s <@50` to the first sst and `[c@39,d)#s <@50` to the second sst. The other model would write `[a,c)@50#s` and `[c,IS(c))@50#s` to the first sst and `[IS(c),d)@50#s` to the second sst. Both accomplish the requirement that the two ssts are non-overlapping: the first model ends the first sst at `c@39#inf` and starts the second sst at `c@39#s`, and the second model ends the first sst at `c@40#s` and starts the second sst at `c@39#s`. The second model needs the information from the first sst to mask the c in the second sst (we've discussed how this can be handled in an earlier comment). The first model does not need that. Note that in this example I used the same seqnum for all these points and ranges -- say they came about due to sst ingestion.
- Non-masked iteration: say there is also a point c@60 in the previous example. And consider the case where we haven't fragmented the range key. How will we present this output in the iterator? One possibility is to break at 1-dimensional key boundaries: so `[a@50,c@60), (c@60, [c@60,c@40)), (c@40, [c@40,c@39)), ...` This is fine, but IMO less satisfying because c@60 sees this mixing while a hypothetical a@60 doesn't, and both are not masked.
- For the same non-masked iteration how will it behave from the user's perspective when we have the two sstables. Will we expose the `[a@50,c@39)#s <@50` and `[c@39,d)#s <@50` to the user, or will we try to merge them. If we expose these, I worry about complexity in explaining this. Successfully merging these will require looking ahead in the iterator, which may not be practical since it involves switching to a different sstable in the same level.
- Now say we had an `[a@10,d)#s <@10` written to the LSM prior to the write of `[a@50,d)#s <@50`. How should Pebble interpret this newer write in the 1-dimensional key space? In the 1-dimensional key space the second write subsumes the first. What prevents it from being compacted away? Is it because the values are different?
- The same issue arises when someone writes a `del [a@50,d)#s <@50`. We need to prevent it from deleting `[a@10,d)#s <@10` too. The value does not save us since there is no value written with the del.
Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @41, @50, @erikgrinaker, @petermattis, @sumeerbhola, @t, and @v)
docs/rfcs/20211018_range_keys.md, line 370 at r1 (raw file):
@sumeerbhola I fleshed out some more thoughts on how I think this timestamp-carrying value would work, and I appended them to this document. The 'coloring' design aims to maintain as much parity between delete ranges and range keys as possible, so that all our learnings from one apply directly to the other.
For the same non-masked iteration how will it behave from the user's perspective when we have the two sstables. Will we expose the [a@50,c@39)#s < @50 and [c@39,d)#s < @50 to the user, or will we try to merge them. If we expose these, I worry about complexity in explaining this. Successfully merging these will require looking ahead in the iterator, which may not be practical since it involves switching to a different sstable in the same level.
I think the explanation might be clear within the context of Pebble not understanding MVCC timestamps at all. Pebble surfaces range keys that cover point keys. The user is responsible for imposing the second dimension of timestamps. It must look at the point key, then look at the range keys (that it knows overlap in key prefix, because otherwise Pebble wouldn't have surfaced it) to determine if they overlap within the timestamp dimension.
The same issue arises when someone writes a del [a@50,d)#s <@50. We need to prevent it from deleting [a@10,d)#s <@10 too. The value does not save us since there is no value written with the del.
I think there would be a param/value written with the del in that alternative design, because the param/value is what differentiates range keys at different timestamps.
I'm somewhat lost -- could you add a few examples?
I'm somewhat lost -- could you add a few examples?
Sorry, I'm doing a poor job explaining here. Maybe walking through an
evolution from range tombstones into these range keys will make their
similarities and differences clearer, and highlight anything I'm
overlooking:
Imagine we duplicate the entire implementation of range deletions
alongside the existing one. There's a separate block within an sstable,
a separate RANGEKEY key kind. All the internal iterator state,
fragmentation, etc is identical and parallel.
We omit the IsDeleted calls that check whether point keys are covered by
these alternate reality RANGEKEY keys, so they don't drop points. We
omit the cascading iterator seeking optimizations, and we omit the
elision of these keys. These alternate reality range keys don't do
anything: they're just a span that we can set that gets fragmented and
compacted alongside other keys.
There would be some unknown additional complexity in compactions, just
because there's now two streams of range keys instead of one: Anywhere
we consult `c.rangeDelFrag` we also need to consult `c.rangeKeyFrag`. We
need to flush both up to any sstable split point. All the truncation
semantics are the same. All the sequence number semantics with respect
to other keys of the same kind are the same as well (eg, RANGEDELs
cause other overlapping RANGEDEL fragments within the same snapshot to
be dropped. If they're separated by a snapshot, both are kept.)
Now we add the ability to associate a value with these alternate
range keys. A normal range tombstone deleting [a,c]#3 looks like
a.RANGEDEL#3: c
For these alternate keys, we prefix the end key with a varint encoding
of the end key's length and stick the value after:
a.RANGEKEY#3: <varint-encoding(1)>c<value>
When we feed these range keys into the fragmenter, we need to first
parse the varint. We provide the start and end key but also the value.
The fragmenter just passes the opaque value around, preserving the value
on both sides of a fragmentation point. We can use the same fragmenter
implementation for range deletions and range keys. Range tombstones just
always have a nil value.
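A hedged Go sketch of the value layout described above ("<varint(len(end))><end><value>"); the comment doesn't pin down an on-disk format, so treat the names and layout here as illustrative only. Per the text, range tombstones would keep their existing encoding and simply carry a nil user value through the fragmenter.

```go
package rangekeysketch

import "encoding/binary"

// encodeRangeKeyValue packs the end key and the user value into the internal
// value of a RANGEKEY, mirroring the varint-prefixed layout sketched above.
func encodeRangeKeyValue(end, value []byte) []byte {
	buf := make([]byte, 0, binary.MaxVarintLen64+len(end)+len(value))
	buf = binary.AppendUvarint(buf, uint64(len(end)))
	buf = append(buf, end...)
	buf = append(buf, value...)
	return buf
}

// decodeRangeKeyValue is the inverse: it splits a packed value back into the
// end key and the user value, as the fragmenter would need to do before
// passing the opaque value around.
func decodeRangeKeyValue(packed []byte) (end, value []byte, ok bool) {
	n, w := binary.Uvarint(packed)
	if w <= 0 || uint64(len(packed)-w) < n {
		return nil, nil, false
	}
	end, value = packed[w:w+int(n)], packed[w+int(n):]
	return end, value, true
}
```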
Now we have a range key that associates a key range with a value.
There's no way to read these range keys, but since we magically
duplicated all the range deletion iterator code, our levelIters are
maintaining pointers into their current open sstables' RANGEKEY blocks
and keeping it in sync with the levelIters point key.
Since we duplicated RANGEDEL code and semantics with respect to
tombstones within the same snapshot overwriting each other, new RANGEKEY
keys overwrite each other in compactions. For example if #5 and #10 are
in the same snapshot stripe and meet in a compaction, #10 wins:
b.RANGEKEY#10:c,bar and a.RANGEKEY#5:z,foo
compacted become
a.RANGEKEY#5:b,foo b.RANGEKEY#10:c,bar c.RANGEKEY#5:z,foo
These are functional range keys. We could now add a method to the pebble
Iterator called RangeValue() []byte
that just returns the value of the
highest-levelled range key that overlaps the iterator's current point
key position (and has a seqnum less than our iterator's seq num). This
would just need to iterate through the levelIters, checking if any of
their RANGEKEY block pointers are pointing to a key covering the point.
Now we have a simple interface that exposes a 'range' value associated
with every point key and reuses almost all of the mechanics, semantics,
etc. of range tombstones.
We removed the elision of these RANGEKEY keys earlier, so now there's no
way to unset these RANGEKEYs. We account for this by allowing only
RANGEKEYs with zero-length values to be elided when in the bottommost
level and the last snapshot stripe. Removing these RANGEKEYs is done by
setting an empty value.
These range keys allow us to associate a value with a key range, and
query a point's range value just as efficiently as we can check if a
point is deleted by a range tombstone.
But they're very hard to use to implement MVCC delete range. Writing a
RANGEKEY overwrites the RangeValue for all points covered by the
RANGEKEY. We need to be able to mark deleted `[a,f)@t2` and `[b,c)@t3` and remember that [b,c) was deleted at t2 and t3.
To solve that problem, we revisit the decision to inherit RANGEDEL's
semantics of RANGEDEL keys overwriting each other. When two overlapping
range keys meet in a compaction, the fragmenter splits them into two
fragments with identical bounds but different sequence numbers and
different values:
b.RANGEKEY#10:c,t3
b.RANGEKEY#5:c,t2
We inherited the behavior of RANGEDELs that drops the second key.
Instead, we alter the compaction iterator handling of RANGEKEYs to
invoke a user-defined 'MergeRangeValue' operation on the two values,
returning a combined value. The compaction invokes the user-defined MergeRangeValue operation, passing `t3`, `t2`. The pkg/storage `MergeRangeValue` function returns an encoded representation of both timestamps `[t3, t2]`. Instead of outputting just the original fragment `b.RANGEKEY#10:c,t3`, the compaction outputs `b.RANGEKEY#10:c,[t3,t2]`.
We're keeping the same number of fragments as before, and the logic
surrounding the range key bounds (inherited from RANGEDELs) is
unchanged. The only difference is that values are merged.
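A sketch of what a pkg/storage-side MergeRangeValue could look like for the MVCC delete-range use case, pretending range values are plain lists of timestamps (the real encoding isn't specified here, and the function name is the one hypothesized above). Keeping the result de-duplicated makes the merge idempotent, which matters for the snapshot discussion further down.

```go
package rangekeysketch

import "sort"

// mergeRangeValues combines the timestamp sets carried by two overlapping
// range-key fragments, e.g. [t3] merged with [t2] yields [t3, t2]. The result
// is sorted newest-first and de-duplicated, so re-merging an operand that is
// already present is a no-op.
func mergeRangeValues(newer, older []int64) []int64 {
	all := append(append([]int64(nil), newer...), older...)
	sort.Slice(all, func(i, j int) bool { return all[i] > all[j] })
	out := make([]int64, 0, len(all))
	for i, ts := range all {
		if i == 0 || ts != all[i-1] {
			out = append(out, ts)
		}
	}
	return out
}
```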
Now we can write a `RANGEKEY([a,f),@t2)` and a `RANGEKEY([b,c),@t3)` and ensure we don't lose information about `@t2`. We still need to change the iterator interface though. If the two RANGEKEYs are still in different levels of the LSM, an iterator positioned at `b` could have a L2 levelIter whose `rangeKeyIter` is pointing at `RANGEKEY([b,c),@t3)` and a L6 levelIter whose `rangeKeyIter` is pointing at `RANGEKEY([a,f),@t2)`.
A straightforward solution is to implement `(*Iterator).RangeValue()` by calling the user-defined `MergeRangeValue`, merging all of these overlapping range keys' values, surfacing the merged value to the user.
There are two problems:
- Merging at read-time might be inefficient. The user knows the semantics of their own range values, and they might not need to read all of the operands.
- The API doesn't expose how long this RangeValue extends for. The completely merged value is only valid until the earliest end boundary of all the merged fragments. The iterator could find which one is earliest and surface it, but that requires O(fragments) key comparisons, which again might not be necessary depending on the range value's user-prescribed semantics.
Instead, we can expose each of these fragments, including start and end
bounds and intermediary values, to the end user. We can expose them
ordered, and provide a convenience function that does compute the merged
value using `MergeRangeValue`. In the case of CockroachDB's MVCC Delete Range, this allows CockroachDB to:
a) Access the fragment start and end bounds associated with range values.
b) Find the first fragment encoding a timestamp value that is beneath the
read timestamp and use that timestamp as the effective delete-range
timestamp, avoiding unnecessarily merging all the fragments in the common
case of recent read timestamps.
Much of the semantics and implementation of these RANGEKEYs are just
copied over from RANGEDELs. These RANGEKEYs only deal with opaque Pebble
user keys, just like RANGEDELs, and know essentially nothing of the MVCC
semantics.
It is possible to have fragmented RANGEKEYs that aren't useful for CockroachDB. For example, consider a `DEL-RANGE([a,b)@t20)`. If a compaction is splitting the various `a@` MVCC keys into separate sstables, you could end up with a sstable with keys:
a@90:foo1
a@80:foo2
a@80:foo3
...
a@50:foo
The bounds of this sstable might be `[a@90,a@40)`. The 1-dimensional Pebble key range of the `DEL-RANGE` RANGEKEY overlaps this sstable, so this sstable contains a single RANGEKEY fragment whose "effective, within" bounds are `[a@90,a@40)`. From the perspective of the user (CockroachDB's MVCC package), this isn't meaningful. None of the keys contained within the sstable are deleted by the RANGEKEY key. Whenever a CockroachDB MVCC iterator asks whether any of these point keys are deleted by a MVCC Delete Range, it would find the @t20 timestamp as the earliest timestamp less than the read timestamp and find that the relevant user key's timestamp is `>@t20`, deciding the key is not deleted by a MVCC delete range.
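The check described in this paragraph might look roughly like this at the MVCC layer (hypothetical names, timestamps as plain integers, and the covering fragment's deletion timestamps assumed to be sorted newest-first):

```go
package rangekeysketch

// deletedByRangeKey reports whether a point version written at keyTS should
// be treated as deleted by an MVCC range tombstone whose covering fragment
// carries the deletion timestamps delTSs, for a read at readTS. In the
// sstable above, delTSs = [20] and every a@ version has keyTS > 20, so
// nothing is treated as deleted.
func deletedByRangeKey(keyTS, readTS int64, delTSs []int64) bool {
	for _, delTS := range delTSs {
		if delTS <= readTS { // first deletion timestamp visible to this read
			return keyTS < delTS
		}
	}
	return false
}
```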
It seems this redundancy is inherent to the MVCC package's projection of
a two-dimensional MVCC key space onto Pebble's one-dimensional key
space. This redundancy is similar to the requirement that a CockroachDB
MVCC iterator skip over redundant MVCC versions in order to arrive at a
live one. Any alternative that avoids this artifact would require
Pebble becoming more intimately intertwined with the two-dimensional
MVCC key space, and would add significant design and implementation
distance between our existing Pebble range tombstones and Pebble
range keys.
Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @41, @50, @erikgrinaker, @petermattis, @sumeerbhola, @t, and @v)
One bit I didn't give enough attention to above is overlapping RANGEKEY fragments separated by a snapshot within the same sstable. Consider an open snapshot at #7
and these two range keys meet in a compaction:
b.RANGEKEY#10:c,t3
b.RANGEKEY#5:c,t2
We need to preserve the history at the earlier snapshot, but that competes with our merging, which ensures that an iterator at any given sequence number only needs to examine 1 RANGEKEY per level to figure out the 'range value' of a point. Since these two fragments are unmerged, a read at sequence number #12
needs both fragments' values. A couple options:
- Merge their values as we would if there was no snapshot, but still preserve the older key. The sstable would contain `b.RANGEKEY#10:c,[t3,t2]` and `b.RANGEKEY#5:c,t2`. An iterator at `#12` only needs to read the first fragment, and the invariant that an iterator only needs to expose at most O(num-levels) range fragments is maintained. This might add complexity to compactions though, because when this sstable is compacted, we don't know that `b.RANGEKEY#10:c,[t3,t2]` and `b.RANGEKEY#5:c,t2` came from the same sstable and don't require merging. In the case of CockroachDB's MVCC Delete Range, I don't think this case matters since value merging can be made idempotent for that use case.
  - Require MergeRangeValue implement semantics such that `MergeRangeValue(a, b) = c = MergeRangeValue(c, b)`.
  - Plumb knowledge of the originating sstable, so that compactions can avoid unnecessary merging.
- Store the fragments unmerged and surface all of them to the iterator. Instead of O(N) range fragments, the iterator may need to consider O(N+M) fragments (where N is the number of levels and M is the number of taken snapshots). Additionally, an iterator may now need to `Next` in a `levelIter`'s `rangeKeyIter` to check for additional unmerged fragments.
I lean towards either of option 1's variants because option 2 complicates iteration.
Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @41, @50, @erikgrinaker, @petermattis, @sumeerbhola, @t, and @v)
I am still thinking about this -- some initial thoughts.
Is this central to this idea? Empty value implying deletion is a bit magical from an interface design perspective.
Are we landing in this difficulty because we want to promote version t2 from #5 to #10 for read time efficiency? Snapshots are not common enough, so why should we bother with this promotion unless #5 and #10 are in the same snapshot stripe.
Some more thoughts:
- With the nonprescriptive approach, I don't think the generality is a goal itself. I do think there is some small value to maintaining the delineation between pkg/storage's MVCC implementation and Pebble, and maintaining a clear separation of concerns between those levels. It's not always possible: eg, bloom filters necessitating Comparer.Split. I agree we should drop generality if it makes implementation cleaner or easier.
- I'm not sure, we might still be disagreeing about the underlying implementation approach. I might just still be misunderstanding the implementation of the prescriptive approach or haven't fully grokked how the pieces fit together. My motivation for pushing the MVCC complexity up is to keep the Pebble implementation simple. In my mind, the prescriptive approach doesn't eliminate that MVCC complexity, it works it into Pebble compaction, iteration, and sstable bounds logic, which I think are already pretty complicated.
- The pkg/storage RangeValueMerger in the nonprescriptive approach is complicated. But it's also just a pure function that is easy to exhaustively unit test, which I think makes the complexity more manageable than if worked into Pebble.
- Regarding masking in the nonprescriptive approach, I do think masking would be requested by the CockroachDB iterator when it finds a range key that it wants to use for masking. I didn't think it was too complicated, and it mirrors the existing block property filters generality, but I definitely could be overlooking something here.
- Regarding unsetting range keys, assigning significance to the empty value isn't integral. An alternative is to allow the RangeValueMerger to return a bool signifying that the range key may be dropped. The intention is that CockroachDB can write range keys like `+t3` or `-t3`, which are merged by the range value merger. When a range key has no timestamps, the range key is useless and we want some mechanism for it to be dropped. I think you're right that making it explicit is cleaner and less surprising than assigning special significance to the empty value.
- "Are we landing in this difficulty because we want to promote version t2 from #5 to #10 for read time efficiency?" — It's for iterator implementation simplicity. I'm hoping to maintain that for any given point key, there is at most 1 overlapping range key fragment in each level that an iterator needs to consider/surface (this is true for range tombstones). A pebble.Iterator could ask each mergingIterLevel for its single overlapping range key fragment, if any. Otherwise, I think we need to Next through a level's fragments and then Prev back to the beginning for the next user key. My understanding in the prescriptive approach is that within each level, you may have n range keys that overlap the current user key (eg `[a,e)@100#5, [a,e)@40#4, [a,e)@30#3, [a,e)@30#2`), and each prefix change requires Prev-ing back to the beginning.
My understanding of our views is that:
- We both think the nonprescriptive approach pushes some of the complexity up, forcing pkg/storage to participate directly in the implementation of MVCC range tombstones, and we should get something of value if we do that.
- We have differing estimates of the Pebble-level implementation complexity in the prescriptive approach. I'm estimating high complexity, which leads me to want to trade reduced Pebble implementation complexity for higher pkg/storage complexity, whereas you estimate the prescriptive approach is similar complexity in the Pebble implementation.
I think the biggest things that I see as complex in the prescriptive approach are sstable bounds/compaction logic, and the merging and interleaving of range keys during iteration. I'll try to dig deeper into these and thoroughly document how I understand these to work in the prescriptive approach.
Reviewable status: 0 of 1 files reviewed, 8 unresolved discussions (waiting on @41, @50, @erikgrinaker, @petermattis, @sumeerbhola, @t, and @v)
docs/rfcs/20211018_range_keys.md, line 381 at r5 (raw file):
- [b,ImmediateSuccessor(b))@50 is effectively representing a point b@50. Hence the end bound for the first sstable is b@40#s1, which does not overlap with the b@30#s2 start of the second sstable.
I want to make sure I understand how fragmentation and sstable bounds work here in the case of multiple overlapping range keys.
Point keys: a@100, a@30, b@100, b@40, b@30, b@10, c@40, d@40, e@30
Range keys: [a,e)@50, [b,d)@20
Prefix-fragmented range keys: [a,b)@50, [b,d)@50, [b,d)@20, [d,e)@50
During a compaction, say the prefix b needs to be split across sstables.
sstable one:
point keys: a@100, a@30, b@100, b@40, b@30
range keys:
- [a,b)@50 contributes smallest a@50, largest b#inf
- [b,IS(b))@50 contributes smallest b@50, largest b@50
sstable two
point keys: b@10, c@40, d@40, e@30
range keys:
- [b,d)@20 contributes smallest b@20, largest d#inf
- [IS(b),d)@50 contributes smallest IS(b)@50, largest d#inf
- [d,e)@50 contributes smallest d@50, largest e#inf
The resulting sstable bounds are `[a@100,b@30]` and `[b@20, e@30]`. Is this right? This produces a gap between sstable bounds, unlike range tombstones, but that's ok, because we only care about placing the range keys in the right spot for iteration—not completely covering a 1-dimensional space of Pebble's user key space?
One awkward bit is that we need to produce the fragment `[IS(b),d)@50` when we're processing the split point, but that fragment actually sorts after the next range key `[b,d)@20`. Do we need to produce point ranges using immediate successor for all remaining range keys with the prefix that is split across sstables? So in this case, we'd also put `[b,IS(b))@20` in the first sstable and `[IS(b),d)@20` in the second sstable? And I guess the compaction iterator would need to buffer all range keys at the current prefix?
(I was hoping we could get to a consensus as we understand each approach more deeply, and I am still hopeful of that, but we'll also need to timebox this eventually, and pick something. Let's talk synchronously on how to approach that.)
I believe so.
We currently produce a gap between sstable bounds with range tombstones too, and it is worse since the "apply within sstable bounds" semantically means that some seqnums are no longer deleted as discussed in https://github.com/cockroachdb/pebble/blob/master/docs/range_deletions.md#loss-of-power. Of course that gap (for range tombstones) will go away when we no longer have user keys spanning two sstables on the same level. With range keys I would argue there isn't a gap:
We don't need to do `[b,IS(b))@20` since its start key `b@20` doesn't belong in the first sstable.
- For the MVCC/Pebble boundary, do you anticipate implementation of MVCC point tombstone masking also being moved into Pebble? Pebble definitely needs some level of knowledge of MVCC, but I believe it can be reduced to narrowly-scoped user-implemented interfaces, like `Comparer` does today.
- "Just like `a@50` and `a@30` are separate keys, `[a,b)@50` and `[a,b)@30` are separate keys." — I think that's up for us to decide? In the value merging design, range keys are more descriptions of point keys than they are keys with substance. I don't think the requirement that they be keys is necessary, or necessarily a clearer interface. If these range keys are keys rather than descriptions of ranges of point keys, they exist in a parallel but also intertwined/interleaved keyspace which is at least a concept I have trouble reconciling.
- "It is not like block property filters which are specified at iterator creation time" — I guess I don't understand why that makes them very dissimilar. Doesn't your example here with two overlapping range keys apply to the prescriptive approach? In both cases we need to make a decision about which range key to use for masking, it's just a matter of whether we're making that decision in Pebble or in pkg/storage? I don't think Pebble should remember the masking: the CockroachDB iterator should request it each time. This might just be my preference for simple Pebble implementation showing :).
- Regarding determinism, the 'RangeValue' of every point key is deterministic and can be used in metamorphic testing. Just like in the prescriptive approach, the metamorphic testing can assert on the expected range value of a point, rather than the intermediary iterator states. The fragment boundaries are not deterministic, but that's also the case in the prescriptive approach.
We don't need to do `[b,IS(b))@20` since its start key `b@20` doesn't belong in the first sstable
What do we do about `[b,d)@20` ordering before `[IS(b),d)@50` then? In this model, we buffer all keys with the same prefix in the fragmenter. We output `[b,IS(b))@50` in the first sstable, but need to remember that we need to also output `[IS(b),d)@20` in the second sstable. We can't write it yet though even though we're starting the second sstable, because the fragmenter is still on the `b` prefix and we have unflushed fragments beginning with `b`. We also don't know whether there might be fragments further down the input iterators starting with `IS(b)`. We need to additionally buffer which fragments have been partially flushed as a point-fragment? This is the type of complexity we're incurring because we're deviating from range tombstones' semantics with respect to sstable boundaries. If we reuse range deletion range semantics, I believe we can reuse the existing fragmenter and sstable boundaries logic near verbatim.
Separately, I've been mulling over the `MVCCStats` issue, and I haven't thought of anything great. I think one option is to prevent MVCC range tombstones from contributing to `MVCCStats` at all (other than updating covered keys' `MVCCStats` contributions). We want MVCC range tombstones to contribute to `MVCCStats` so long as CockroachDB still needs to do work to clean up the range keys. Can we tweak them such that CockroachDB doesn't need to clean up these range keys manually?
Riffing off the earlier 'coloring' concept, we could redefine range keys so that they're not first-class keys themselves, but instead a 'coloring' on top of and describing point keys. They have no substance themselves and only exist to describe point keys. If there are no described point keys beneath a range key, the range key has no effect and may be elided. As long as CockroachDB garbage collects all the point keys described by a range key, the range key will automatically be dropped too. This is the same mechanic as Pebble range tombstones.
In both the previously discussed designs, this would require changes so that Pebble can discern when all point keys deleted by a MVCC range tombstone fragment have been dropped. In the prescriptive approach, I think this would require a comparator for suffixes? In the value-merging approach, I think it would require the user provide something akin to a `Describes(rangeValue []byte, pointKey []byte) bool` function.
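For the MVCC delete-range case, such a Describes hook might reduce to something like the sketch below, reusing the illustrative assumptions from earlier (range values carrying a list of deletion timestamps, point suffixes carrying a single timestamp); the decoded arguments, rather than the raw []byte signature suggested above, are purely for readability:

```go
package rangekeysketch

// describes reports whether a range-key fragment whose value carries the
// deletion timestamps delTSs still "describes" a point version written at
// pointTS: it does if any deletion timestamp is newer than the point. Once
// GC has removed every such point, the fragment describes nothing and could
// be elided, mirroring how range tombstones disappear with their points.
func describes(delTSs []int64, pointTS int64) bool {
	for _, delTS := range delTSs {
		if delTS > pointTS {
			return true
		}
	}
	return false
}
```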
Reviewable status: 0 of 1 files reviewed, 8 unresolved discussions (waiting on @41, @50, @erikgrinaker, @petermattis, @sumeerbhola, @t, and @v)
Another thought regarding Pebble performing masking automatically during iteration, doesn't that preclude the possibility of eventually allowing MVCC range deletes to be transactional? Or at least it will require us to provide a mechanism for Pebble to consult intents?
Reviewable status: 0 of 1 files reviewed, 8 unresolved discussions (waiting on @41, @50, @erikgrinaker, @petermattis, @sumeerbhola, @t, and @v)
Reviewable status: 0 of 1 files reviewed, 44 unresolved discussions (waiting on @20, @41, @50, @erikgrinaker, @jbowens, @nvanbenschoten, @petermattis, @sumeerbhola, @t, and @v)
docs/rfcs/20211018_range_keys.md, line 167 at r24 (raw file):
Previously, sumeerbhola wrote…
The current RANGEDEL would continue to apply only to points?
Yeah
docs/rfcs/20211018_range_keys.md, line 195 at r24 (raw file):
Previously, sumeerbhola wrote…
if both `hasPoint` and `hasRange` are true, does the value returned by `Key()` equal `start`?
I was leaning towards no, because I thought it would be preferable to have the fragment bounds stable over the lifetime of an iterator. What do you think?
docs/rfcs/20211018_range_keys.md, line 349 at r24 (raw file):
Previously, sumeerbhola wrote…
should this say RangeKeyMask instead of IteratorMask?
I was imagining `IteratorMask` would be an interface (with unexported methods). It seems like we could conceivably fold `BlockPropertyFilter`s under this field too and expose constructors for combining several iterator masks, kinda like the functional options pattern.
docs/rfcs/20211018_range_keys.md, line 379 at r24 (raw file):
Previously, sumeerbhola wrote…
s/form/from/
Done.
docs/rfcs/20211018_range_keys.md, line 444 at r24 (raw file):
Previously, sumeerbhola wrote…
defined
Done.
docs/rfcs/20211018_range_keys.md, line 491 at r24 (raw file):
Previously, sumeerbhola wrote…
we could impose the constraint that a prefix cannot be split across multiple range key sstables.
Would this help with anything? It doesn't seem to help with the iterator interface, given that we've now designed an interface which exposes range keys aligned with the point key.
I think it eliminates almost all my concerns with the original representation of range keys (`[k1,k2)@t1#s` as `[email protected]#s → k2`), because it avoids complicating compaction or sstable boundaries. We could remove the merging of keys and tweak the iterator interface to expose an iterator over the range key fragments covering a point:

    RangeKeys() *RangeKeyIterator

    type RangeKey struct {
        Suffix []byte
        Value  []byte
    }

    func (r *RangeKeyIterator) Next() RangeKey {
    }
The advantage would be that if you have many large-valued range keys all overlapping, you don't need to merge them into a single key that must sit in a single sstable block. Instead of requiring O(sum-overlapping-key-values) memory, it requires O(max-overlapping-key-value) memory per level. The downside is that you need to prev back to the first fragment with the fragment bounds whenever you next, even if the next key falls within the same fragment bounds.
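Purely for illustration, here is a self-contained sketch of how a caller might consume such a per-point iterator. The types restate the sketch above, and the convention that `Next` returns a zero `RangeKey` (nil `Suffix`) once the covering fragments are exhausted is an assumption made only so this example runs; it is not part of the proposal.

```go
package main

import "fmt"

// RangeKey and RangeKeyIterator restate the interface sketch above so
// that this usage example is self-contained; they are placeholders,
// not a real Pebble API.
type RangeKey struct {
	Suffix []byte
	Value  []byte
}

type RangeKeyIterator struct {
	keys []RangeKey // fragments covering the current point
	i    int
}

// Next returns the next covering fragment, or a zero RangeKey (nil
// Suffix) once the fragments are exhausted — an assumed convention.
func (r *RangeKeyIterator) Next() RangeKey {
	if r.i >= len(r.keys) {
		return RangeKey{}
	}
	rk := r.keys[r.i]
	r.i++
	return rk
}

func main() {
	// Hypothetical fragments covering some point key, newest suffix first.
	rki := &RangeKeyIterator{keys: []RangeKey{
		{Suffix: []byte("@100"), Value: []byte("v2")},
		{Suffix: []byte("@50"), Value: []byte("v1")},
	}}
	for rk := rki.Next(); rk.Suffix != nil; rk = rki.Next() {
		fmt.Printf("covered by range key %s => %s\n", rk.Suffix, rk.Value)
	}
}
```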
We could also choose to initially buffer all the overlapping fragments in memory, represent them as `[email protected]#s → k2` and expose them in an iterator interface that just reads from the buffered fragments. Later, if we begin to encounter issues with many overlapping large-valued range keys and iterator memory usage, we have an out.
docs/rfcs/20211018_range_keys.md, line 494 at r24 (raw file):
Previously, sumeerbhola wrote…
of higher levels, iterators, etc all needing to deal with the concept of two parallel LSMs.
This does seem messy wrt code changes. This would spread into the manifest too.
We could infer the LSM to which the sstable belongs from the `Smallest` and `Largest` key kinds in the file metadata to avoid requiring manifest format changes.
docs/rfcs/20211018_range_keys.md, line 1032 at r24 (raw file):
Previously, sumeerbhola wrote…
r3 has 3 ranges?
oops, done.
Reviewable status: 0 of 1 files reviewed, 44 unresolved discussions (waiting on @20, @41, @50, @erikgrinaker, @jbowens, @nvanbenschoten, @petermattis, @sumeerbhola, @t, and @v)
docs/rfcs/20211018_range_keys.md, line 195 at r24 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I was leaning towards no, because I thought it would be preferable to have the fragment bounds stable over the lifetime of an iterator. What do you think?
I don't have an opinion. Whatever is simpler.
docs/rfcs/20211018_range_keys.md, line 349 at r24 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I was imagining `IteratorMask` would be an interface (with unexported methods). It seems like we could conceivably fold `BlockPropertyFilter`s under this field too and expose constructors for combining several iterator masks, kinda like the functional options pattern.
So `RangeKeyMask` is an implementation of `IteratorMask`?
I don't mind the generalization but we should generalize only if it doesn't introduce much code complexity. The reason to generalize `BlockPropertyFilter`s up front was that we were making an sstable format change. `RangeKeyMask` is a purely code-level abstraction that we could mark as "experimental and subject to incompatible changes, including removal", so that only CockroachDB uses it. Then we could generalize it over time if needed.
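Concretely, the interface-plus-constructor shape under discussion might look something like the sketch below. Every name and signature here is an assumption for illustration (and the folding-in of `BlockPropertyFilter`s is omitted), not a committed API.

```go
// IteratorMask sketches the "interface with unexported methods" idea:
// callers can only obtain implementations via Pebble-provided
// constructors, leaving room to generalize the interface later.
type IteratorMask interface {
	// iteratorMask is unexported so that only implementations defined
	// in this package satisfy the interface.
	iteratorMask()
}

// rangeKeyMask hides point keys covered by range keys, relative to the
// configured suffix.
type rangeKeyMask struct {
	suffix []byte
}

func (rangeKeyMask) iteratorMask() {}

// RangeKeyMask constructs the mask in a functional-options style: the
// caller never names the concrete type, only the IteratorMask value it
// places in the iterator options.
func RangeKeyMask(suffix []byte) IteratorMask {
	return rangeKeyMask{suffix: suffix}
}
```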
Document the new Split requirements necessary for range keys support. See cockroachdb#1341.
Reviewable status: 0 of 1 files reviewed, 61 unresolved discussions (waiting on @20, @41, @50, @erikgrinaker, @jbowens, @nicktrav, @nvanbenschoten, @petermattis, @sumeerbhola, @t, and @v)
a discussion (no related file):
Flushing some comments. Still reading.
docs/rfcs/20211018_range_keys.md, line 271 at r27 (raw file):
(a,[a,e)), (b,[a,e)), (c,[a,e)), (d,[a,e)), e
Should this also emit the `f` point key?
docs/rfcs/20211018_range_keys.md, line 318 at r27 (raw file):
(point, range)
( [a,e)@50)
It's not immediately apparent why the range key `[a,e)@50` is emitted first, without a point key. Perhaps a comment?
edit: @sumeerbhola was able to clear this up in an offline discussion. It might be worth referring back to the section that talks about widening the span on the lower bound to cover `a`, and not the initial lower bound of `a@50`.
docs/rfcs/20211018_range_keys.md, line 340 at r27 (raw file):
Only range keys with suffixes that sort after the mask's suffix mask point keys. A range key that meets this condition only masks points with suffixes that sort after the range key's suffix.
I found these two sentences rather dense and hard to unpack. I was almost expecting to see another sentence that followed that could help relieve some of the cognitive overhead, i.e. "Put another way ... blah". The "sort after" term also requires some mental gymnastics, as we are doing a sort-descending.
In looking at the examples, I think I have arrived at a mental model where the `RangeKeyMask` sets an "upper bound" on the suffix. A point key contained in the range that falls both "below" the mask suffix and the next lowest range key suffix will be masked. Anything "above" the range key suffix will not be masked. Something like:

    RangeKey                                RangeKeyMask(@50, ...)
                             [a,c)@30                 |
         apple@10      a@20       |        a@40       |
              x         x         |         o         |
    |---------|---------|---------|---------|---------|---------| suffix
    @0        @10       @20       @30       @40       @50       @60
                                  |
               Masked <-----------|-----------> Unmasked
Am I right in thinking about this in terms of overlapping regions? Could this help with explaining the masking concept?
docs/rfcs/20211018_range_keys.md, line 375 at r27 (raw file):
Example: A user may construct an iterator with a mask `RangeKeyMask(@t50, ...)`.
It may be useful if there was an additional, extensive example that shows how this works (i.e. by demonstrating which keys are masked / elided / not emitted).
It may also be nice to have it in the style of the `(point, range)` tabulated example you have above.
docs/rfcs/20211018_range_keys.md, line 434 at r27 (raw file):
Fragmentation may occur at user key contained within the `[k1,k2)` range, including at intermediary keys that contain suffixes higher or lower than the suffix of the range key.
This point seems important enough to justify having an explicit example highlighting what kind of fragmentation is permissible.
docs/rfcs/20211018_range_keys.md, line 440 at r27 (raw file):
or
Is the intention of this paragraph to say that we will support both? Or is this to motivate picking one or the other?
docs/rfcs/20211018_range_keys.md, line 447 at r27 (raw file):
range keys have no effect on point keys during compactions.
Is it worth stating the inverse too? Point keys have no impact on range keys?
docs/rfcs/20211018_range_keys.md, line 450 at r27 (raw file):
shared
Can you drop in an inline definition of what we mean by "shared" here? Are you talking about the case where an sstable contains both point key blocks and range key blocks? Or something else?
docs/rfcs/20211018_range_keys.md, line 451 at r27 (raw file):
that contains range keys
This reads a little confusingly. Do you mean something like "reads must iterate through the range key blocks of sstables with range keys that fall within the bounds of the iterator"? It read as if there was more to the sentence.
docs/rfcs/20211018_range_keys.md, line 457 at r27 (raw file):
range-key LSM
In the paragraph before these bullet points, we talk about this "parallel range-key LSM". I wonder if we can think of it as something like a "logical range-key LSM". My understanding is that it's still in the same "physical" LSM, just stored in blocks that are separate to the point keys.
docs/rfcs/20211018_range_keys.md, line 486 at r27 (raw file):
requiring ingests to target
I was a little confused by this phrasing. Do ingests have the ability to themselves target a level in the LSM? Could / should this be something along the lines of "... resulting in ingestions writing to at least L5 or higher". Seems like the main point is that because these keys hang around longer and aren't deleted / elided, this forces newer ranges to be placed in higher levels.
docs/rfcs/20211018_range_keys.md, line 491 at r27 (raw file):
to be order of magnitude
nit: to be an order of magnitude
docs/rfcs/20211018_range_keys.md, line 521 at r27 (raw file):
k1.RANGESET#s2 → (k2,[(@t2,v2),(@t1,v1)])
Given the description below, I wonder if we should update this to be something like:
k1.RANGESET#s2 → (k2,[(@t2,len(v2)),(@t1,len(v1))][v1,v2])
It took me a bit to understand what was meant by "value itself is encoded at the end of the `RANGESET`'s value", and how the reverse order part is useful (i.e. can index from the back, on the fly). It may be a little verbose though.
docs/rfcs/20211018_range_keys.md, line 575 at r27 (raw file):
As described above, overlapping `RANGESET`s and `RANGEUNSET`s within the same snapshot stripe and sstable are merged
Should the definition of "overlapping" here be more strict, in the sense that only when the start and end key are the same can we merge into a single key-value pair? For example, `k1.RANGESET#s1 -> (k2,[@t1])` can merge with `k1.RANGESET#s2 -> (k2,[@t2])` into a single key-value. But can we merge `k1.RANGESET#s1 -> (k2,[@t1])` with `k1.RANGESET#s2 -> (k3,[@t2])` (i.e. different end-key)?
docs/rfcs/20211018_range_keys.md, line 623 at r27 (raw file):
must not be masked
We talk about masking here in this example, but don't specify what the mask is. Is "mask" the correct term to be using here? Or does it only make sense to refer to masking when we are specifying a mask (i.e. `RangeKeyMask`)?
docs/rfcs/20211018_range_keys.md, line 625 at r27 (raw file):
not be masked because it's not covered by the new range key, and indeed that's the case because the covering range key's fragment is unchanged `[a,j)@{t1}#10`.
Might be nice to talk about what happens to the point key `m@t2#11` in the right-hand half of the fragmented range.
My current understanding is that `RangeKeyMask(@t3, ...)` would mask this point key, whereas `RangeKeyMask(@t1, ...)` would not mask it?
Reviewable status: 0 of 1 files reviewed, 61 unresolved discussions (waiting on @20, @41, @50, @erikgrinaker, @jbowens, @nicktrav, @nvanbenschoten, @petermattis, @sumeerbhola, @t, and @v)
docs/rfcs/20211018_range_keys.md, line 271 at r27 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Should this also emit the `f` point key?
Yeah, updated.
docs/rfcs/20211018_range_keys.md, line 318 at r27 (raw file):
Previously, nicktrav (Nick Travers) wrote…
It's not immediately apparent why the range key `[a,e)@50` is emitted first, without a point key. Perhaps a comment?
edit: @sumeerbhola was able to clear this up in an offline discussion. It might be worth referring back to the section that talks about widening the span on the lower bound to cover `a`, and not the initial lower bound of `a@50`.
Added a line below this example:
Remember that the suffixless `a` sorts before all suffixed keys with the same
prefix (eg, `a@100`).
docs/rfcs/20211018_range_keys.md, line 340 at r27 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Only range keys with suffixes that sort after the mask's suffix mask point keys. A range key that meets this condition only masks points with suffixes that sort after the range key's suffix.
I found these two sentences rather dense and hard to unpack. I was almost expecting to see another sentence that followed that could help relieve some of the cognitive overhead, i.e. "Put another way ... blah". The "sort after" term also requires some mental gymnastics, as we are doing a sort-descending.
In looking at the examples, I think I have arrived at a mental model where the `RangeKeyMask` sets an "upper bound" on the suffix. A point key contained in the range that falls both "below" the mask suffix and the next lowest range key suffix will be masked. Anything "above" the range key suffix will not be masked. Something like: (ASCII diagram reproduced in the comment above)
Am I right in thinking about this in terms of overlapping regions? Could this help with explaining the masking concept?
Yeah, that's right. I think it might be useful to include a two-dimensional ascii graph, eg, with user key on the x-axis and suffix on the y-axis. That might help demonstrate how masking works.
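For concreteness, the predicate implied by that mental model can be sketched in a few lines. Integer timestamps are used purely for illustration (a larger timestamp sorts earlier, so "sorts after" means "is numerically smaller"), and treating a range key whose suffix equals the mask's suffix as a mask is an assumption of this sketch rather than something the RFC text pins down.

```go
package main

import "fmt"

// masked reports whether a point at pointTS, covered by a range key at
// rangeKeyTS, is hidden by a RangeKeyMask configured with maskTS.
func masked(maskTS, rangeKeyTS, pointTS int) bool {
	return rangeKeyTS <= maskTS && // the range key qualifies as a mask
		pointTS < rangeKeyTS // the point is older than the range key
}

func main() {
	// The example above: RangeKeyMask(@50, ...) with range key [a,c)@30.
	fmt.Println(masked(50, 30, 10)) // apple@10: masked
	fmt.Println(masked(50, 30, 20)) // a@20:     masked
	fmt.Println(masked(50, 30, 40)) // a@40:     not masked
}
```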
docs/rfcs/20211018_range_keys.md, line 440 at r27 (raw file):
Previously, nicktrav (Nick Travers) wrote…
or
Is the intention of this paragraph to say that we will support both? Or is this to motivate picking one or the other?
To motivate picking one or the other. Clarified the language here.
docs/rfcs/20211018_range_keys.md, line 450 at r27 (raw file):
Previously, nicktrav (Nick Travers) wrote…
shared
Can you drop in an inline definition of what we mean by "shared" here? Are you talking about the case where an sstable contains both point key blocks and range key blocks? Or something else?
Clarified up above.
docs/rfcs/20211018_range_keys.md, line 457 at r27 (raw file):
Previously, nicktrav (Nick Travers) wrote…
range-key LSM
In the paragraph before these bullet points, we talk about this "parallel range-key LSM". I wonder if we can think of it as something like a "logical range-key LSM". My understanding is that it's still in the same "physical" LSM, just stored in blocks that are separate to the point keys.
In this context, it is referring to a separate physical LSM.
docs/rfcs/20211018_range_keys.md, line 486 at r27 (raw file):
Previously, nicktrav (Nick Travers) wrote…
requiring ingests to target
I was a little confused by this phrasing. Do ingests have the ability to themselves target a level in the LSM? Could / should this be something along the lines of "... resulting in ingestions writing to at least L5 or higher". Seems like the main point is that because these keys hang around longer and aren't deleted / elided, this forces newer ranges to be placed in higher levels.
"Target" is a bit of jargon derived from the code, because during ingestion we determine the ingestTargetLevel
for each level. I'll update the language to use less jargon.
docs/rfcs/20211018_range_keys.md, line 575 at r27 (raw file):
Previously, nicktrav (Nick Travers) wrote…
As described above, overlapping `RANGESET`s and `RANGEUNSET`s within the same snapshot stripe and sstable are merged
Should the definition of "overlapping" here be more strict, in the sense that only when the start and end key are the same can we merge into a single key-value pair? For example, `k1.RANGESET#s1 -> (k2,[@t1])` can merge with `k1.RANGESET#s2 -> (k2,[@t2])` into a single key-value. But can we merge `k1.RANGESET#s1 -> (k2,[@t1])` with `k1.RANGESET#s2 -> (k3,[@t2])` (i.e. different end-key)?
they're always exactly overlapping because they're fragmented. updated to resurface that point here.
docs/rfcs/20211018_range_keys.md, line 623 at r27 (raw file):
Previously, nicktrav (Nick Travers) wrote…
must not be masked
We talk about masking here in this example, but don't specify what the mask is. Is "mask" the correct term to be using here? Or does it only make sense to refer to masking when we are specifying a mask (i.e. `RangeKeyMask`)?
Yeah, "Masking" only refers to `RangeKeyMask`. The range keys themselves are the "masks", and the `RangeKeyMask` configured by the user specifies which range keys may act as masks.
Document the new Split requirements necessary for range keys support. See #1341.
Reviewable status: 0 of 1 files reviewed, 52 unresolved discussions (waiting on @20, @41, @50, @erikgrinaker, @jbowens, @nicktrav, @nvanbenschoten, @petermattis, @sumeerbhola, @t, and @v)
docs/rfcs/20211018_range_keys.md, line 623 at r27 (raw file):
range keys themselves are the "masks", and the RangeKeyMask configured by the user specifies which range keys may act as masks.
I think this is a really nice way of re-stating things, if you could incorporate somehow.
Range keys (see cockroachdb#1341) will be stored in their own, single block of an sstable. Add a new, optional meta block, indexed as "pebble.range_key" in the metablock index, to the sstable structure. This block is only present when at least one range key has been written to the sstable. Add the ability to add range keys to an sstable via `(*sstable.Writer).Write`. Update existing data-driven tests to support printing of the range key summary. Add additional test coverage demonstrating writing of range keys with an `sstable.Writer`. Add minimal functionality to `sstable.Reader` to support writing the data-driven test cases for the writer. Additional read-oriented functionality will be added in a subsequent patch. Related to cockroachdb#1339.
44eace9 to ade6a02
Reviewable status: 0 of 2 files reviewed, 62 unresolved discussions (waiting on @20, @41, @50, @erikgrinaker, @jbowens, @nicktrav, @nvanbenschoten, @petermattis, @sumeerbhola, @t, and @v)
docs/rfcs/20211018_range_keys.md
line 93 at r29 (raw file):
### Writes
This design introduces a three new write operations:
nit: introduces three ...
docs/rfcs/20211018_range_keys.md
line 294 at r29 (raw file):
A pebble.Iterator, which shows user keys, will visit the following iterator positions during forward iteration:
nit: the fruit example from the previous section is nice, and the only thing it lacks is point keys. It may be better to expand on that example here rather than introduce these a, b, ... keys.
docs/rfcs/20211018_range_keys.md
line 314 at r29 (raw file):
the range key to those bounds. For example, if the upper bound was c, the sequence seen would be `(a,[a,c))`, `(b,[a,c))`. The same applies to lower bound and backward iteration.
nit: these 3 bullets stating iteration properties of range keys can probably be stated with the pure range key example from the previous section.
docs/rfcs/20211018_range_keys.md
line 475 at r29 (raw file):
numbers, like other internal keys. These keys are written to the same sstables as point keys and stored in a separate block, similar to the existing range deletion tombstones. Log-structured merge tree level invariants are valid across
nit: I think this "same sstables" and "separate block" sentence is repeating what was said earlier in the section.
docs/rfcs/20211018_range_keys.md
line 478 at r29 (raw file):
range and point keys (just like they are for range tombstones and point keys). That is, `RangeKeySet([k1,k2))#s2` cannot be at a lower level than `RangeKeySet(k)#s1` where `k \in [k1,k2)` and `s1 < s2`.
RangeKeySet(k)? Did you mean to use a point key?
We should probably mention both LSM invariants among range keys, and between range and point keys.
IIUC, we have both, but we don't really rely on the latter in our implementation.
docs/rfcs/20211018_range_keys.md
line 511 at r29 (raw file):
* End key, `varstring`, encodes the end user key of the fragment.
* A series of (suffix, value-length) tuples representing the logical range keys
s/value-length/value/
(iirc, we are no longer doing the opposite order stuff that was in the previous iteration of this doc)
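To give a concrete feel for the layout under discussion, here is a hedged sketch that varstring-encodes the end key followed by (suffix, value) tuples. The field order, framing, and use of uvarint length prefixes are assumptions made for this example, not the adopted format.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendVarstring appends a uvarint length prefix followed by the bytes.
func appendVarstring(dst, s []byte) []byte {
	dst = binary.AppendUvarint(dst, uint64(len(s)))
	return append(dst, s...)
}

// encodeRangeSetValue sketches one possible physical layout for a
// RANGESET value: the fragment's end key, then (suffix, value) tuples.
func encodeRangeSetValue(end []byte, suffixes, values [][]byte) []byte {
	buf := appendVarstring(nil, end)
	for i := range suffixes {
		buf = appendVarstring(buf, suffixes[i])
		buf = appendVarstring(buf, values[i])
	}
	return buf
}

func main() {
	v := encodeRangeSetValue(
		[]byte("k2"),
		[][]byte{[]byte("@t2"), []byte("@t1")},
		[][]byte{[]byte("v2"), []byte("v1")},
	)
	fmt.Printf("%d bytes: %q\n", len(v), v)
}
```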
docs/rfcs/20211018_range_keys.md
line 649 at r29 (raw file):
`Split(k) < len(k)`.
# TKTK: Discuss merging iterator
todo in a later PR?
docs/rfcs/20211018_range_keys.md
line 705 at r29 (raw file):
a@100, [a,e)@50, a@30, b@100, [b,e)@50, b@40, b@30, [c,e)@50, c@40, [d,e)@50, d@40, e@30
this needs to be rephrased since we are exposing range keys together with point keys.
docs/rfcs/20211018_range_keys.md
line 797 at r29 (raw file):
- With shared sstables, `SeekPrefixGE` cannot use bloom filters to entirely eliminate sstables that contain range keys. Pebble does not use bloom filters
this "does not use bloom filters in L6" is not true anymore.
docs/rfcs/20211018_range_keys.md
line 828 at r29 (raw file):
We decide to share sstables, because preserving the LSM invariant between range keys and point keys is expected to be useful in the long-term.
We don't have this invariant now in CockroachDB, yes? I vaguely remember dt mentioning a case where a range key that covers a point key is written before the point key.
I thought we were doing this because it's simpler from a code perspective to not maintain a separate LSM stack.
Describe design of range keys. Informs cockroachdb#1339 Co-authored-by: Jackson Owens <[email protected]>
ade6a02 to d4c52b2
Reviewable status: 0 of 2 files reviewed, 62 unresolved discussions (waiting on @20, @41, @50, @erikgrinaker, @jbowens, @nicktrav, @nvanbenschoten, @petermattis, @sumeerbhola, @t, and @v)
docs/rfcs/20211018_range_keys.md
line 797 at r29 (raw file):
Previously, sumeerbhola wrote…
this "does not use bloom filters in L6" is not true anymore.
changed to "does not always use bloom filters in L6", since most of the time we still don't use them.
docs/rfcs/20211018_range_keys.md
line 828 at r29 (raw file):
Previously, sumeerbhola wrote…
We don't have this invariant now in CockroachDB, yes? I vaguely remember dt mentioning a case where a range key that covers a point key is written before the point key.
I thought we were doing this because it's simpler from a code perspective to not maintain a separate LSM stack.
We don't have it between suffixes and sequence numbers, but otherwise we have it.
stamping myself, since Sumeer opened the pull request
Informs #1339