db: Finer-grained Time-Bound Iteration (TBI) #1190
Once we have knowledge of MVCC in Pebble, we could explore compaction-time discarding of keys expired due to row-level TTLs instead. Like you mentioned, it's a simpler case than general compaction-time GC. Maybe it would be a good stepping stone? MVCCStats are still an obstacle.
Yes, that is possible, though (a) compaction-time TTL enforcement may miss things that we would need to delete by doing an efficient scan, so an efficient scan would still be useful; (b) there is (possibly) a requirement to have TTLs play nicely with CDC (normal GC deletes obsoleted versions, while TTLs can delete the latest version), which is hard to satisfy with compaction-time TTL (there is a lot of discussion in the margins of https://docs.google.com/document/d/1aq4jft6LVZ4yb2M4v0mGHbIvTu6Xm3_bbAISFof_HG0/edit#heading=h.zgshrzthknjr).
I am inclined towards sticking it in the usual single-level or two-level index, since this is per-data-block (or per group of data blocks, for the two-level index) information that we already have a place for. I don't think we really need the ability to search the index efficiently along both the key and time dimensions. We would continue to binary search in the index using only the key dimension, and then linearly step forward/backward until we find a block that overlaps in time.
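To make that search pattern concrete, here is a minimal self-contained Go sketch (all type and function names are hypothetical, not Pebble's actual index iterator API): binary search the index entries on the key dimension only, then step forward linearly until an entry's time annotation overlaps the query interval.

```go
package main

import (
	"fmt"
	"sort"
)

// indexEntry is a hypothetical stand-in for an index-block entry: the
// separator key for a data block plus its [lower, upper) time annotation.
type indexEntry struct {
	sepKey       string
	lower, upper uint64 // [lower, upper); lower == 0 means "universal set"
}

// overlaps reports whether the entry's annotation may contain keys in the
// query interval [qLower, qUpper). A lower of 0 is the reserved universal
// set and always overlaps.
func (e indexEntry) overlaps(qLower, qUpper uint64) bool {
	if e.lower == 0 {
		return true
	}
	return e.lower < qUpper && qLower < e.upper
}

// seek binary-searches on the key dimension only, then steps forward
// linearly until it finds a block that overlaps in time.
func seek(index []indexEntry, key string, qLower, qUpper uint64) (int, bool) {
	i := sort.Search(len(index), func(i int) bool { return index[i].sepKey >= key })
	for ; i < len(index); i++ {
		if index[i].overlaps(qLower, qUpper) {
			return i, true
		}
	}
	return 0, false
}

func main() {
	index := []indexEntry{
		{"b", 10, 20},
		{"d", 5, 8},
		{"f", 0, 0}, // older table: universal set, never filtered
		{"h", 30, 40},
	}
	if i, ok := seek(index, "c", 25, 50); ok {
		fmt.Println("first overlapping block separator:", index[i].sepKey)
	}
}
```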
Block interval annotations are an optional user-facing feature that can be used to filter data blocks from an Iterator before they are loaded. Interval annotations are of the form [lower,upper) where both lower and upper are uint64, and represent a set such that the block can contain keys that belong to this set, but no keys outside this set. That is, the set is not necessarily tight. The interval [0,x) for any x is reserved and represents the universal set. A [0,x) annotation is not written out to the sstable, and is useful in two cases:

- These annotations are written as part of the BlockHandle in index blocks (first level and second level), where they refer to either data blocks (when written in a first level index block) or first level index blocks (when written in a second level index block). BlockHandles for other kinds of blocks default to the [0,0) annotation and avoid writing anything for the annotation.
- Tables written prior to this feature default to blocks having a [0,0) annotation, which is correctly interpreted as the universal set since we do not have any useful filtering information. This is needed in order to turn on this feature without rewriting any data.

The implementation requires [lower,upper) to satisfy upper >= lower. When lower > 0, these are encoded as either:

- lower, upper-lower when upper-lower > 0
- lower

The implementation does not require any particular lower value for representation of the empty set [lower,lower), but from an encoding perspective [1,1) is the most efficient.

Bi-directional compatibility is not a goal, and files written with block interval annotations cannot be read by old code that does not know about the existence of this feature.

The main use case for block interval annotations is as a fine-grained version of CockroachDB's time-bound iterator, which is used to ignore blocks that do not contain relevant MVCC timestamps.

Fixes cockroachdb#1190
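A minimal Go sketch of the encoding rules described above, assuming a varint encoding for illustration (the actual on-disk layout within the BlockHandle may differ): nothing is written for a reserved [0,x) interval, `lower` followed by `upper-lower` is written when the interval is non-empty, and just `lower` for an empty interval.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeInterval appends the [lower, upper) annotation using the scheme
// described above: the reserved [0, x) interval is not written at all;
// otherwise lower is written, followed by upper-lower only when it is > 0.
// The varint encoding is an assumption for illustration.
func encodeInterval(dst []byte, lower, upper uint64) []byte {
	if upper < lower {
		panic("invalid interval: upper < lower")
	}
	if lower == 0 {
		return dst // universal set: nothing written
	}
	dst = binary.AppendUvarint(dst, lower)
	if upper-lower > 0 {
		dst = binary.AppendUvarint(dst, upper-lower)
	}
	return dst
}

// decodeInterval reverses encodeInterval, given exactly the encoded bytes.
// An empty encoding decodes to [0,0), i.e. the universal set.
func decodeInterval(src []byte) (lower, upper uint64) {
	if len(src) == 0 {
		return 0, 0
	}
	lower, n := binary.Uvarint(src)
	if n <= 0 {
		panic("corrupt interval")
	}
	diff := uint64(0)
	if n < len(src) {
		diff, _ = binary.Uvarint(src[n:])
	}
	return lower, lower + diff
}

func main() {
	for _, iv := range [][2]uint64{{0, 7}, {1, 1}, {10, 25}} {
		enc := encodeInterval(nil, iv[0], iv[1])
		lo, hi := decodeInterval(enc)
		fmt.Printf("in=[%d,%d) encodedBytes=%d out=[%d,%d)\n", iv[0], iv[1], len(enc), lo, hi)
	}
}
```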
Block property collectors (and the corresponding filters) are an optional user-facing feature that can be used to ignore data in the context of an Iterator. The finest granularity of operation is a data block, and the properties are aggregated into the upper level index block and the sstable, so an entire lower level index block (and its data blocks), or an entire sstable, can be ignored. Multiple collectors can be configured in a DB, with the recommendation to keep the total size of a block's properties under 50-100 bytes.

The collector can extract properties from the key or value. The first use case for this will be to extract the MVCC timestamp in CockroachDB and use it for fine-grained time-bound iteration. One can imagine extracting min-max values of other columns, either in the key or value, when the key-value represents tabular data.

To efficiently handle multiple collectors, a collector needs to be identified with a short integer id, in addition to a unique name. The id is encoded in index entries. The name is encoded in the properties block for the sstable.

Helper collector and filter implementations are provided for properties represented as a set of the form [lower,upper), where both lower and upper are uint64.

The feature is incompatible with older versions of Pebble, since block handles can be longer and older versions have checks that no extra bytes follow a block handle, which will fail.

Fixes cockroachdb#1190
Block property collectors (and the corresponding filters) are an optional user-facing feature that can be used to ignore data in the context of an Iterator. The finest granularity of operation is a data block, and the properties are aggregated into the upper level index block and the sstable, so an entire lower level index block (and its data blocks), or an entire sstable, can be ignored. Multiple collectors can be configured in a DB, with the recommendation to keep the total size of a block's properties under 50-100 bytes.

The collector can extract properties from the key or value. The first use case for this will be to extract the MVCC timestamp in CockroachDB and use it for fine-grained time-bound iteration. One can imagine extracting min-max values of other columns, either in the key or value, when the key-value represents tabular data.

To efficiently handle multiple collectors, a collector needs to be identified with a short integer id, in addition to a unique name. The name => id mapping is specific to an sstable and is stored in the user properties. The id is encoded in index entries instead of the name.

Helper collector and filter implementations are provided for properties represented as a set of the form [lower,upper), where both lower and upper are uint64.

The feature is incompatible with older versions of Pebble, since block handles can be longer and older versions have checks that no extra bytes follow a block handle, which will fail.

Fixes cockroachdb#1190
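As a rough, self-contained Go sketch of the collector shape described above (the interface and names here are hypothetical and simplified relative to Pebble's actual block property collector API): a per-block [lower,upper) interval is accumulated as keys are added, emitted when a data block finishes, and unioned into a table-wide interval.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// blockPropertyCollector is a hypothetical interface mirroring the design
// described above: a property is collected per data block and then
// aggregated into the index block and, finally, the whole sstable.
type blockPropertyCollector interface {
	Name() string
	Add(key, value []byte) error
	FinishDataBlock() ([]byte, error) // property for the block just finished
	FinishTable() ([]byte, error)     // aggregated property for the sstable
}

// uint64Interval is the [lower, upper) helper form mentioned above; it is
// empty when lower == upper.
type uint64Interval struct {
	lower, upper uint64
}

func (i *uint64Interval) extend(v uint64) {
	if i.lower == i.upper { // currently empty
		i.lower, i.upper = v, v+1
		return
	}
	if v < i.lower {
		i.lower = v
	}
	if v+1 > i.upper {
		i.upper = v + 1 // upper is exclusive
	}
}

func (i *uint64Interval) union(o uint64Interval) {
	if o.lower == o.upper {
		return
	}
	if i.lower == i.upper {
		*i = o
		return
	}
	if o.lower < i.lower {
		i.lower = o.lower
	}
	if o.upper > i.upper {
		i.upper = o.upper
	}
}

func (i uint64Interval) encode() []byte {
	// Varint encoding of lower and the width; illustrative only.
	b := binary.AppendUvarint(nil, i.lower)
	return binary.AppendUvarint(b, i.upper-i.lower)
}

// intervalCollector collects a [lower, upper) interval per block using a
// user-supplied extractor (e.g. the MVCC timestamp suffix of a key).
type intervalCollector struct {
	name    string
	extract func(key, value []byte) uint64
	block   uint64Interval
	table   uint64Interval
}

func (c *intervalCollector) Name() string { return c.name }

func (c *intervalCollector) Add(key, value []byte) error {
	c.block.extend(c.extract(key, value))
	return nil
}

func (c *intervalCollector) FinishDataBlock() ([]byte, error) {
	prop := c.block.encode()
	c.table.union(c.block)
	c.block = uint64Interval{}
	return prop, nil
}

func (c *intervalCollector) FinishTable() ([]byte, error) {
	return c.table.encode(), nil
}

func main() {
	var _ blockPropertyCollector = (*intervalCollector)(nil)
	c := &intervalCollector{
		name: "mvcc-timestamp-interval", // hypothetical collector name
		// Stand-in extractor: treat the value's first byte as a timestamp.
		extract: func(_, value []byte) uint64 { return uint64(value[0]) },
	}
	_ = c.Add([]byte("a"), []byte{12})
	_ = c.Add([]byte("b"), []byte{7})
	blockProp, _ := c.FinishDataBlock()
	tableProp, _ := c.FinishTable()
	fmt.Printf("block prop %d bytes, table prop %d bytes, table interval [%d,%d)\n",
		len(blockProp), len(tableProp), c.table.lower, c.table.upper)
}
```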
The integration on the CockroachDB side is also done. The only remaining item is cockroachdb/cockroach#73768.
Have we considered also pushing down filtering of keys within blocks? Filtering by timestamp at the level of a block or sstable iterator seems like it would be a win, avoiding unnecessary full-key comparisons during merging.
I didn't consider it, but it seems worth experimenting with and benchmarking. If most data is in L6, which is where block property filters are more likely to have false positives, the mergingIter heap is likely to have fewer than O(number of levels) elements, with the L6 level iterator staying at the top of the heap, so each step is one key comparison. That saving can be significant if keys are long.
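A toy Go sketch of what that pushdown might look like (hypothetical types, not an existing Pebble iterator): the block iterator skips keys whose timestamp falls outside the filter interval, so the merging heap never sees, or compares, those keys.

```go
package main

import "fmt"

// kv is a hypothetical decoded block entry carrying its MVCC-style timestamp.
type kv struct {
	key string
	ts  uint64
}

// filteredBlockIter steps over a block's entries but only surfaces keys whose
// timestamp falls in [lo, hi), so a merging iterator above it never has to
// compare the skipped keys.
type filteredBlockIter struct {
	entries []kv
	lo, hi  uint64
	i       int
}

func (it *filteredBlockIter) Next() (kv, bool) {
	for ; it.i < len(it.entries); it.i++ {
		if e := it.entries[it.i]; e.ts >= it.lo && e.ts < it.hi {
			it.i++
			return e, true
		}
	}
	return kv{}, false
}

func main() {
	it := &filteredBlockIter{
		entries: []kv{{"a", 5}, {"b", 42}, {"c", 11}, {"d", 47}},
		lo:      40, hi: 50,
	}
	for e, ok := it.Next(); ok; e, ok = it.Next() {
		fmt.Println(e.key, e.ts) // only "b" and "d" are surfaced
	}
}
```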
In the CockroachDB context, we have used the TablePropertyCollector and IterOptions.TableFilter hooks to implement a time-bound iterator (TBI), which only iterates over sstables that contain data intersecting a (Tstart, Tend] time interval of the MVCC timestamps. This is currently used for more efficient export requests (incremental backups) and is planned to be used for rangefeed catchup scans, both of which need to look at only recent data with respect to timestamps.

Another use case that has come up is finding expired data when enforcing row-level TTLs, where (Tstart, Tend] will be in the distant past (imagine expiring 1-year-old data, and doing this expiry every day, so we would effectively be looking for (now-366d, now-365d]). Since TBI filters at the granularity of sstables, and L6 contains 90% of the data, the expectation is that the current TBI will be ineffective for this use case (the finer-grained approach described here would likely be effective).
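For context, here is a minimal self-contained Go sketch of that sstable-granularity scheme, under the assumption that the collector stores min/max MVCC timestamps as user properties and the iterator-level filter checks them against (Tstart, Tend]; the property names, string encoding, and hook signatures here are illustrative, not the actual CockroachDB/Pebble ones.

```go
package main

import (
	"fmt"
	"strconv"
)

// timeBoundCollector records the min and max MVCC timestamp seen while an
// sstable is written, in the spirit of a table property collector.
type timeBoundCollector struct {
	min, max uint64
	seen     bool
}

func (c *timeBoundCollector) Add(ts uint64) {
	if !c.seen || ts < c.min {
		c.min = ts
	}
	if !c.seen || ts > c.max {
		c.max = ts
	}
	c.seen = true
}

// Finish returns the user properties that would be stored in the sstable.
// Property names and decimal encoding are illustrative only.
func (c *timeBoundCollector) Finish() map[string]string {
	return map[string]string{
		"ts.min": strconv.FormatUint(c.min, 10),
		"ts.max": strconv.FormatUint(c.max, 10),
	}
}

// tableFilter returns a per-table predicate in the spirit of
// IterOptions.TableFilter: keep only tables whose [min, max] timestamp range
// intersects the query interval (tStart, tEnd].
func tableFilter(tStart, tEnd uint64) func(userProps map[string]string) bool {
	return func(userProps map[string]string) bool {
		min, err1 := strconv.ParseUint(userProps["ts.min"], 10, 64)
		max, err2 := strconv.ParseUint(userProps["ts.max"], 10, 64)
		if err1 != nil || err2 != nil {
			return true // missing/old properties: cannot filter, must read
		}
		return max > tStart && min <= tEnd
	}
}

func main() {
	c := &timeBoundCollector{}
	for _, ts := range []uint64{100, 250, 180} {
		c.Add(ts)
	}
	props := c.Finish()
	keep := tableFilter(300, 400)(props)
	fmt.Println(props, "keep for (300,400]:", keep) // this table is filtered out
}
```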
In 21.1 we introduced read-triggered compactions, which use spare compaction resources and knowledge of read amplification encountered by reads to prioritize compacting down sstables in parts of the key space that would most benefit from reduced read amplification. The assumption here is that compacted data is never harmful for reads. The above TBI optimization violates this assumption -- if all data is compacted down to L6, and the keys that have been updated in (Tstart, Tend] are sparse, but evenly distributed across the key space, every sstable will be included in the TBI. This bad interaction between TBI and read-triggered compactions has been observed in practice in CockroachDB's telemetry cluster. TBI is also fragile under manual compactions, which we sometimes do when restoring a poorly shaped LSM to health.
We cannot truly lay the blame for TBI's fragility on read-triggered compactions: using (spare) resources to reduce read amplification should always be a worthy goal of the system, and anything that goes against that should be fixed. One approach would be to explicitly keep older data in separate ssblocks (sstable blocks). In addition to being a very significant change to how we lay out data, it also means more merging when using regular iterators. The following documents a simpler, high-level approach derived from a detailed proposal by @jbowens. Note that all such approaches require Pebble to have some knowledge of timestamps (unlike the current TBI approach).
Assumption: Filtering at the granularity of sstables is too coarse, but filtering at the granularity of ssblocks should be sufficient if most rows in an sstable are not relevant to the TBI time interval. This assumption needs to be validated with normal workloads. In the worst case we could get unlucky and need 1 row from each ssblock. Note that we would continue to also filter by sstable.
Approach: