-
Notifications
You must be signed in to change notification settings - Fork 466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sstable: add meta block for range keys; support writing range keys #1400
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a quick pass: looks good! I think the big item remaining is the public interface, but we can also defer the interface to subsequent PRs.
Reviewed 5 of 14 files at r1.
Reviewable status: 5 of 14 files reviewed, 8 unresolved discussions (waiting on @jbowens and @nicktrav)
sstable/table.go, line 162 at r1 (raw file):
rocksDBFormatVersion2 = 2 // FIXME(travers): Are we ok with this naming?
sgtm
sstable/writer.go, line 245 at r1 (raw file):
} return w.addPoint(key, value) }
I alluded to this when we talked last week, but there's a complication to the interface that we export in that CockroachDB needs to be able to construct sstables for ingestion that contain range keys, but we don't want to expose the internals of fragmentation, coalescing or the value-encoding to CockroachDB.
When CockroachDB is constructing a sstable for ingest, it'll read individual fragmented range key sets (containing a single suffix, value pair and bounds). As it reads them, it'll need to pass them to the Writer
. However, we don't want to just write those verbatim to the sstable block as internal range keys: we'll want to fragment and coalesce the logical range keys, so that we write coalesced range keys to the block.
On the other hand, during a compaction Pebble will use a merging range key iter to produce coalesced, nonoverlapping CoalescedSpan
s that need to be written to the sstable as up to 1 key per range-key kind, in-order. This use case could make use of an interface like the one exposed here.
I think we'll want to support two separate interfaces for these (and perhaps prohibit range keys from being added through Add
to avoid accidental misuse). Maybe something like
(*Writer) AddInternalRangeKey(key base.InternalKey, value []byte)
and
(*Writer) RangeKeySet(start, end, suffix, value []byte)
(*Writer) RangeKeyUnset(start, end, suffix []byte)
(*Writer) RangeKeyDelete(start, end []byte)
sstable/writer.go, line 396 at r1 (raw file):
if !ok { // We panic here as we should have previously decoded and validated this // key and value when it was first added to teh range key block.
nit: s/teh/the/
sstable/writer.go, line 440 at r1 (raw file):
// NOTE: the inequality excludes zero, as we allow the end key of the // lower span be the same as the start key of the upper span, even // though the range end key is considered an exclusive bound.
nit: instead of "even though the range end key is considered an exclusive bound," should this say "because the range end key is considered an exclusive bound."? The exclusivity of the end bound is what prevents them from overlapping.
sstable/writer.go, line 464 at r1 (raw file):
// FIXME(travers): More granular properties for RANGEKEYSET, *UNSET, *DEL // counts?
I think more granular properties are a good idea: It might be useful for compaction heuristics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 14 files at r1.
Reviewable status: 6 of 14 files reviewed, 8 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)
sstable/writer.go, line 464 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I think more granular properties are a good idea: It might be useful for compaction heuristics.
+1
156a588
to
cf53b51
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defer the interface to subsequent PRs.
That would probably be good for directing focus towards internal implementation / external use. I will update this patch to add (*Writer) AddInternalRangeKey(key base.InternalKey, value []byte)
, as you suggested below. This would be for "internal" use (compactions, etc.).
Then in a follow up, I'll add the per-key-kind functions that cockroach can use, which will also contain the logic to fragment, etc.
Dismissed @jbowens from a discussion.
Reviewable status: 4 of 14 files reviewed, 2 unresolved discussions (waiting on @jbowens and @sumeerbhola)
sstable/table.go, line 162 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
sgtm
👍
sstable/writer.go, line 245 at r1 (raw file):
I think we'll want to support two separate interfaces for these (and perhaps prohibit range keys from being added through Add to avoid accidental misuse)
Sounds good! It should be straight forward to adapt what I have here to make use of the first, which will have more stricter requirements on ordering, etc.
I will tackle the remainder separately, as they will need to make use of the fragmenter, less restrictions on key order, new test cases, etc.
sstable/writer.go, line 390 at r1 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Note to self, this comment is old. Update it.
Done.
sstable/writer.go, line 405 at r1 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Note to self, fix the span in the error message.
Done.
sstable/writer.go, line 440 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: instead of "even though the range end key is considered an exclusive bound," should this say "because the range end key is considered an exclusive bound."? The exclusivity of the end bound is what prevents them from overlapping.
Updated.
sstable/writer.go, line 464 at r1 (raw file):
Previously, sumeerbhola wrote…
+1
Great. I'll add counters for the separate key kinds.
f2d0bb1
to
428046b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be good for a proper look now. Have addressed all FIXMEs and exposed the new function on the Writer
for Pebble-internal range key addition.
Reviewable status: 3 of 14 files reviewed, 2 unresolved discussions (waiting on @jbowens and @sumeerbhola)
sstable/writer.go, line 464 at r1 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Great. I'll add counters for the separate key kinds.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 14 files at r1, 1 of 3 files at r2, 2 of 7 files at r3.
Reviewable status: 7 of 14 files reviewed, 6 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)
sstable/properties.go, line 120 at r3 (raw file):
NumRangeKeyDels uint64 `prop:"pebble.num.range-key-dels"` // The number of range keys in this table. NumRangeKeys uint64 `prop:"pebble.num.range-keys"`
I think we should drop the aggregate NumRangeKeys
property, and we can compute it by summing the individual kinds if we need it.
sstable/writer.go, line 385 at r3 (raw file):
// AddInternalRangeKey adds a range key set, unset, or delete key/value pair to // the table being writer.
nit: s/writer/written/
sstable/writer.go, line 388 at r3 (raw file):
// // Range keys must be added ordered by their start key, and in their fragmented // span order. Other than spans that are perfectly aligned (same start and end
Wondering if we should adjust the phrasing to indicate that the key
parameter, indicating the Start
boundary, must be strictly ascending.
sstable/writer.go, line 417 at r3 (raw file):
} switch c := base.InternalCompare(w.compare, prevStartKey, curStartKey); {
I didn't catch this on my first pass, but this InternalCompare
comparison should always return -1
with properly ordered range keys. If the user keys are equal, it will compare on sequence numbers, and if those are equal, it will compare on kind. So the c == 0
case is always an error case because the user is adding a duplicate key.
We could replace the switch with:
if base.InternalCompare(w.compare, prevStartKey, curStartKey) >= 0 {
w.err = errors.Errorf("pebble: range keys starts must be added in ascending order: %s, %s",
prevStartKey.Pretty(w.formatKey), key.Pretty(w.formatKey))
}
and lift the fragmentation bounds checks from the default:
case into the main function block.
sstable/writer.go, line 844 at r3 (raw file):
} w.meta.LargestRangeKey = base.MakeInternalKey(endKey, base.InternalKeySeqNumMax, key.Kind()).Clone() rangeKeyBH, err = w.writeBlock(w.rangeKeyBlock.finish(), NoCompression)
@sumeerbhola — do you know why we don't compress the range-del block? I'm wondering if whatever reason also applies here.
Compression is more desirable for range keys, since they carry values but are still fragmented.
sstable/writer.go, line 1048 at r3 (raw file):
rangeDelBlock: blockWriter{ restartInterval: 1, },
I think we need to initialize the rangeKeyBlock
field and its restartInterval
to 1
. This disables key prefix compression. I think we should also update this blockIter
comment to include range keys in its description of key stability because we'll require that stability for range keys too:
// A blockIter provides an additional guarantee around key stability when a
// block has a restart interval of 1 (i.e. when there is no prefix
// compression). Key stability refers to whether the InternalKey.UserKey bytes
// returned by a positioning call will remain stable after a subsequent
// positioning call. The normal case is that a positioning call will invalidate
// any previously returned InternalKey.UserKey. If a block has a restart
// interval of 1 (no prefix compression), blockIter guarantees that
// InternalKey.UserKey will point to the key as stored in the block itself
// which will remain valid until the blockIter is closed. The key stability
// guarantee is used by the range tombstone code which knows that range
// tombstones are always encoded with a restart interval of 1. This per-block
// key stability guarantee is sufficient for range tombstones as they are
// always encoded in a single block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 7 of 14 files reviewed, 6 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)
sstable/writer.go, line 844 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
@sumeerbhola — do you know why we don't compress the range-del block? I'm wondering if whatever reason also applies here.
Compression is more desirable for range keys, since they carry values but are still fragmented.
Hmm, I suspect it's because we cache compressed blocks, so we want to avoid the CPU cost of decompressing the range-del block on every read in the sstable. That concern also applies to range keys, because every MVCC read will need to consult the range-key block if there is one...
I think we should keep it uncompressed for now, but I think this deserves more thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thought of something—We'll want to feed these range keys to the block property collectors, but probably through a new range-key specific method on the BlockPropertyCollector
. Some block-level property filters might want to adjust the table-level properties based on the range keys.
I think tackling that in a followup PR is fine too.
Reviewable status: 7 of 14 files reviewed, 6 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added an item to the list for wiring up the block property collectors.
Reviewable status: 7 of 14 files reviewed, 5 unresolved discussions (waiting on @jbowens and @sumeerbhola)
sstable/properties.go, line 120 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I think we should drop the aggregate
NumRangeKeys
property, and we can compute it by summing the individual kinds if we need it.
Done.
sstable/writer.go, line 388 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Wondering if we should adjust the phrasing to indicate that the
key
parameter, indicating theStart
boundary, must be strictly ascending.
Done. Matched the verbiage that we use elsewhere.
sstable/writer.go, line 417 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I didn't catch this on my first pass, but this
InternalCompare
comparison should always return-1
with properly ordered range keys. If the user keys are equal, it will compare on sequence numbers, and if those are equal, it will compare on kind. So thec == 0
case is always an error case because the user is adding a duplicate key.We could replace the switch with:
if base.InternalCompare(w.compare, prevStartKey, curStartKey) >= 0 { w.err = errors.Errorf("pebble: range keys starts must be added in ascending order: %s, %s", prevStartKey.Pretty(w.formatKey), key.Pretty(w.formatKey)) }
and lift the fragmentation bounds checks from the
default:
case into the main function block.
Ah nice catch. That was left over from before making use of base.InternalCompare
. Cleaned this up.
sstable/writer.go, line 844 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Hmm, I suspect it's because we cache compressed blocks, so we want to avoid the CPU cost of decompressing the range-del block on every read in the sstable. That concern also applies to range keys, because every MVCC read will need to consult the range-key block if there is one...
I think we should keep it uncompressed for now, but I think this deserves more thought.
Added a TODO.
sstable/writer.go, line 1048 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I think we need to initialize the
rangeKeyBlock
field and itsrestartInterval
to1
. This disables key prefix compression. I think we should also update thisblockIter
comment to include range keys in its description of key stability because we'll require that stability for range keys too:// A blockIter provides an additional guarantee around key stability when a // block has a restart interval of 1 (i.e. when there is no prefix // compression). Key stability refers to whether the InternalKey.UserKey bytes // returned by a positioning call will remain stable after a subsequent // positioning call. The normal case is that a positioning call will invalidate // any previously returned InternalKey.UserKey. If a block has a restart // interval of 1 (no prefix compression), blockIter guarantees that // InternalKey.UserKey will point to the key as stored in the block itself // which will remain valid until the blockIter is closed. The key stability // guarantee is used by the range tombstone code which knows that range // tombstones are always encoded with a restart interval of 1. This per-block // key stability guarantee is sufficient for range tombstones as they are // always encoded in a single block.
Good catch. Updated.
428046b
to
394e6ba
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 14 files at r1, 3 of 9 files at r4, all commit messages.
Reviewable status: 8 of 15 files reviewed, all discussions resolved (waiting on @jbowens and @sumeerbhola)
Range keys (see cockroachdb#1341) will be stored in their own, single block of an sstable. Add a new, optional meta block, indexed as "pebble.range_key" in the metablock index, to the sstable structure. This block is only present when at least one range key has been written to the sstable. Add the ability to add range keys to an sstable via `(*sstable.Writer).Write`. Update existing data-driven tests to support printing of the range key summary. Add additional test coverage demonstrating writing of range keys with an `sstable.Writer`. Add minimal functionality to `sstable.Reader` to support writing the data-driven test cases for the writer. Additional read-oriented functionality will be added in a subsequent patch. Related to cockroachdb#1339.
394e6ba
to
fde3810
Compare
@sumeerbhola and I chatted offline - I'll merge this as-is, and will follow-up on any comments that come up post-merge. TFTRs! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 7 of 15 files reviewed, 1 unresolved discussion
sstable/table.go, line 89 at r5 (raw file):
[index block] (for single level index) [meta range key block] (optional) [meta properties block]
The meta properties block is after the rangedel block
The properties block is written after any range deletion block that may be present in the sstable. Fix the docs. Follow-up from cockroachdb#1400.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 7 of 15 files reviewed, 1 unresolved discussion
sstable/table.go, line 89 at r5 (raw file):
Previously, sumeerbhola wrote…
The meta properties block is after the rangedel block
Woops! Thanks for catching. Fix: #1413.
The properties block is written after any range deletion block that may be present in the sstable. Fix the docs. Follow-up from #1400.
Range keys (see #1341) will be stored in their own, single block of an
sstable. Add a new, optional meta block, indexed as "pebble.range_key"
in the metablock index, to the sstable structure. This block is only
present when at least one range key has been written to the sstable.
Add the ability to add range keys to an sstable via
(*sstable.Writer).Write
.Update existing data-driven tests to support printing of the range key
summary. Add additional test coverage demonstrating writing of range
keys with an
sstable.Writer
.Add minimal functionality to
sstable.Reader
to support writing thedata-driven test cases for the writer. Additional read-oriented
functionality will be added in a subsequent patch.
Related to #1339.