-
Notifications
You must be signed in to change notification settings - Fork 471
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sstable: allow range key property collection #1426
sstable: allow range key property collection #1426
Conversation
Currently, when point keys are added to an sstable, the key is run through each of the block property collectors configured on the `sstable.Writer`. Range keys are not included. Allow range keys to run through the same block property collector pipeline by looping through all configured collectors on the writer and passing the range key. Any struct that implements `sstable.BlockPropertyCollector` is allowed to accept any type of key kind. It is up to the implementation to filter out keys that are not applicable. This allows for flexibility in crafting collectors that can take specific types of keys, or combinations of key kinds (e.g. point- and range-key specific collectors, or table-wide "global" collectors). Rename the existing `DataBlockIntervalCollector` interface to `BlockIntervalCollector`, and rename the `FinishDataBlock` method to `FinishBlock`. Both of these change are intended to make the interface applicable to keys other than point keys that may have dedicated blocks in the sstable, as is the case for range keys. Rename the existing `BlockIntervalCollector` struct to `DataBlockIntervalCollector`. To make the struct specific to point keys, ignore range key kinds in calls to `Add. Add a new `BlockPropertyCollector` helper implementation, `RangeKeyBlockIntervalCollector`, that operates exclusively on range keys. All other keys kinds are ignored. The collector is intended to support maintaining an upper and lower bound on the MVCC timestamps present in a range key block in an sstable. Add a test implementation of a `BlockIntervalCollector` and a data-driven that demonstrates maintaining upper and lower bounds on point and range keys with integer suffixes (e.g. `[email protected]:foo`, `[email protected]:bar [(@100=baz)]`, etc.). One downside with this implementation is that the `BlockPropertyCollector` interface contains methods such as `FinishDataBlock` and `FinishIndexBlock` that are not applicable to range keys (range keys are all in a single block, and do not have an index block, respectively). However, retaining these specific methods allows for implementations to be created that could support both point and range keys (for example, if creating a collector whose properties are intended to be applicable to all key kinds). Alternatives approaches considered, with relative downsides: - adding a new member field to `BlockIntervalCollector` for tracking the range key interval. The downside with this approach is that it requires that both the point and range key intervals be encoded into the properties for the table and block with the same name. This make it difficult to disambiguate the intervals on the read path. - have member fields for both point and range key collectors in `sstable.Writer`. The downside with this approach is that it requires more intrusive changes to the `Writer` to call the specific type of property collector on the write path, which does not scale nicely to support various block types that require property collection (i.e. potentially supporting range-dels in the future). There is also some nuance to managing separate collections of collectors with the `shortID` mapping that is used when encoding the properties in the properties and index blocks (i.e. need to be careful not to re-use the same ordinals, etc.).
5ab4d42
to
e9d6878
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 1 unresolved discussion (waiting on @jbowens and @nicktrav)
sstable/block_property.go, line 119 at r1 (raw file):
// maintains state for the current data block, and resets that state in // FinishDataBlock. This interface can be used to reduce parsing costs. type DataBlockIntervalCollector interface {
what I was suggesting in the VC was something like
type collectorKeyKind int8
const (
pointOnly collectorKeyKind = iota
rangeKeysOnly
pointAndRangeKeys
)
type BlockIntervalCollector struct {
name ...
dbic ...
keyKind collectorKeyKind
blockInterval ...
indexInterval ...
tableInterval ...
}
So the BlockIntervalCollector does the demultiplexing based on the keyKind it is configured with, instead of adding another implementation which is mostly a subset of the current one. The other downside of these two implementations is that neither supports the case of someone wanting to have the same property collector handle both point and range keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flushing a couple discussion points from a chat between @sumeerbhola and I:
-
specific to MVCC timestamp collection: currently I was assuming that we'd have two collectors tracking point and range keys separately. When the table is finalized, we'd be writing bounds for both point and range keys separately (i.e. two distinct properties). @sumeerbhola pointed out that a single collector may suffice - the argument for being that MVCC bounds represent a single property of the blocks in the table. For such a collector, the range key interval would be unioned with the point key interval when the table is finalized. A potential downside here is that we may increase the false positive rate for filtering point keys if the we had a range key spanning the width of the table, but only had a handful of relevant data blocks in the table.
-
documentation: how a
BlockPropertyCollector
is called - we're keeping the current interface as it is. However, there may be implementations that ignore range or point keys (as is the case in the test examples). We should document that an implementation should expect to receive keys of any kind, and it's up to the implementation to decide what to do with the keys. @jbowens - this is along the lines of our assumption earlier this week. -
documentation: the contract for a
BlockPropertyCollector
implementation is such thatAdd
,FinishDataBlock
will be called at least once.FinishTable
will be called exactly once. The implementation should also expect that the other methods may be called - i.e.AddPrevDataBlockToIndexBlock
andFinishIndexBlock
- though these may not be applicable to the specific implementation (builds on the previous point). -
ergonomics: make use of an enum for determining what type of keys a
BlockIntervalCollector
should collect. This allows for reverting to make use of a single struct for collection, rather than two. @sumeerbhola left a comment to this effect in the review.
Reviewable status: 0 of 5 files reviewed, 1 unresolved discussion (waiting on @jbowens and @sumeerbhola)
sstable/block_property.go, line 119 at r1 (raw file):
Previously, sumeerbhola wrote…
what I was suggesting in the VC was something like
type collectorKeyKind int8 const ( pointOnly collectorKeyKind = iota rangeKeysOnly pointAndRangeKeys ) type BlockIntervalCollector struct { name ... dbic ... keyKind collectorKeyKind blockInterval ... indexInterval ... tableInterval ... }
So the BlockIntervalCollector does the demultiplexing based on the keyKind it is configured with, instead of adding another implementation which is mostly a subset of the current one. The other downside of these two implementations is that neither supports the case of someone wanting to have the same property collector handle both point and range keys.
Thanks. I think this is strictly better. Will update.
Superseded by #1452. |
Note for reviewers: treat this as preliminary. It's possible that I'm missing something obvious that would allow for a simpler implementation.
Some assumptions on my side that I'd appreciate some confirmation on:
TODOs:
RangeKeyBlockIntervalCollector
Currently, when point keys are added to an sstable, the key is run
through each of the block property collectors configured on the
sstable.Writer
. Range keys are not included.Allow range keys to run through the same block property collector
pipeline by looping through all configured collectors on the writer and
passing the range key. Any struct that implements
sstable.BlockPropertyCollector
is allowed to accept any type of keykind. It is up to the implementation to filter out keys that are not
applicable. This allows for flexibility in crafting collectors that can
take specific types of keys, or combinations of key kinds (e.g. point-
and range-key specific collectors, or table-wide "global" collectors).
Rename the existing
DataBlockIntervalCollector
interface toBlockIntervalCollector
, and rename theFinishDataBlock
method toFinishBlock
. Both of these change are intended to make the interfaceapplicable to keys other than point keys that may have dedicated blocks
in the sstable, as is the case for range keys.
Rename the existing
BlockIntervalCollector
struct toDataBlockIntervalCollector
. To make the struct specific to point keys,ignore range key kinds in calls to `Add.
Add a new
BlockPropertyCollector
helper implementation,RangeKeyBlockIntervalCollector
, that operates exclusively on rangekeys. All other keys kinds are ignored. The collector is intended to
support maintaining an upper and lower bound on the MVCC timestamps
present in a range key block in an sstable.
Add a test implementation of a
BlockIntervalCollector
and adata-driven that demonstrates maintaining upper and lower bounds on
point and range keys with integer suffixes (e.g.
[email protected]:foo
,[email protected]:bar [(@100=baz)]
, etc.).One downside with this implementation is that the
BlockPropertyCollector
interface contains methods such asFinishDataBlock
andFinishIndexBlock
that are not applicable torange keys (range keys are all in a single block, and do not have an
index block, respectively). However, retaining these specific methods
allows for implementations to be created that could support both point
and range keys (for example, if creating a collector whose properties
are intended to be applicable to all key kinds).
Alternatives approaches considered, with relative downsides:
adding a new member field to
BlockIntervalCollector
for tracking therange key interval. The downside with this approach is that it
requires that both the point and range key intervals be encoded into
the properties for the table and block with the same name. This make
it difficult to disambiguate the intervals on the read path.
have member fields for both point and range key collectors in
sstable.Writer
. The downside with this approach is that it requiresmore intrusive changes to the
Writer
to call the specific type ofproperty collector on the write path, which does not scale nicely to
support various block types that require property collection (i.e.
potentially supporting range-dels in the future). There is also some
nuance to managing separate collections of collectors with the
shortID
mapping that is used when encoding the properties in theproperties and index blocks (i.e. need to be careful not to re-use the
same ordinals, etc.).