
db: Finer-grained Time-Bound Iteration (TBI) #1190

Closed
sumeerbhola opened this issue Jul 12, 2021 · 6 comments · Fixed by #1315
sumeerbhola commented Jul 12, 2021

In the CockroachDB context, we have used the TablePropertyCollector and IterOptions.TableFilter hooks to implement a time-bound iterator (TBI), which only iterates over sstables that contain data intersecting with a (Tstart, Tend] time interval of the MVCC timestamps. This is currently used for more efficient export requests (incremental backups) and is planned to be used for rangefeed catchup scans, both of which need to look only at recent data with respect to timestamps.
Another use case that has come up is finding expired data when enforcing row-level TTLs, where (Tstart, Tend] will be in the distant past (imagine expiring 1-year-old data, and doing this expiry every day, so we would effectively be looking for (now-366d, now-365d]). Since TBI filters at the granularity of sstables, and L6 contains 90% of the data, the expectation is that the current TBI will be ineffective for this use case (the finer-grained approach described here would likely be effective).

In 21.1 we introduced read-triggered compactions, which use spare compaction resources and knowledge of read amplification encountered by reads to prioritize compacting down sstables in parts of the key space that would most benefit from reduced read amplification. The assumption here is that compacted data is never harmful for reads. The above TBI optimization violates this assumption -- if all data is compacted down to L6, and the keys that have been updated in (Tstart, Tend] are sparse, but evenly distributed across the key space, every sstable will be included in the TBI. This bad interaction between TBI and read-triggered compactions has been observed in practice in CockroachDB's telemetry cluster. TBI is also fragile under manual compactions, which we sometimes do when restoring a poorly shaped LSM to health.

We cannot truly lay the blame for TBI's fragility on read-triggered compactions: using (spare) resources to reduce read amplification should always be a worthy goal of the system, and anything that goes against that should be fixed. One approach would be to explicitly keep older data in separate ssblocks. In addition to being a very significant change to how we lay out data, it would also mean more merging when using regular iterators. The following documents a simpler, high-level approach derived from a detailed proposal by @jbowens. Note that all such approaches require Pebble to have some knowledge of timestamps (unlike the current TBI approach).

Assumption: Filtering at the granularity of sstables is too coarse, but filtering at the granularity of ssblocks should be sufficient if most rows in an sstable are not relevant to the TBI time interval. This assumption needs to be validated with normal workloads. In the worst case we could get unlucky and need 1 row from each ssblock. Note that we would continue to also filter by sstable.

Approach:

  • Maintain (in a time index in the sstable) what is conceptually a set containing [min,max] timestamps for each ssblock.
  • When seeking to block i, based on the key index, look for the first block >= i whose timestamp interval overlaps with the TBI interval, and seek to that block.
    • When a seek is not able to use a Next optimization, both the key index block and the data block are seeked. We are not able to reduce the number of such seeks with this finer-grained TBI optimization. But note that we will continue to use the coarse-grained optimization too, and when the coarse-grained optimization is ineffective, it is often due to more compactions, which means we have fewer files and therefore fewer seeks.
    • Most of the use cases involve few seeks and mostly Next/Prev since they are scanning over a whole CockroachDB range, so the value of this optimization is in not loading irrelevant blocks and not having CockroachDB iterate over key-value pairs in these irrelevant blocks.
  • When Next goes past the end of a block, skip blocks that don’t overlap with the TBI interval.

jbowens commented Jul 13, 2021

Another use case that has come up is finding expired data when enforcing row-level TTLs, where (Tstart, Tend] will be in the distant past (imagine expiring 1-year-old data, and doing this expiry every day, so we would effectively be looking for (now-366d, now-365d]). Since TBI filters at the granularity of sstables, and L6 contains 90% of the data, the expectation is that the current TBI will be ineffective for this use case (the finer-grained approach described here would likely be effective).

Once we have knowledge of MVCC in Pebble, we could explore compaction-time discarding of keys expired due to row-level TTLs instead. Like you mentioned, it's a simpler case than general compaction-time GC. Maybe it would be a good stepping stone? MVCCStats are still an obstacle.

@sumeerbhola

Once we have knowledge of MVCC in Pebble, we could explore compaction-time discarding of keys expired due to row-level TTLs instead. Like you mentioned it's a simpler case than general compaction-time GC. Maybe it would be a good stepping stone? MVCCStats are still an obstacle.

Yes, that is possible, though (a) compaction-time TTL enforcement may miss things that we would need to delete by doing an efficient scan, so an efficient scan would still be useful, and (b) there is (possibly) a requirement to have TTLs play nicely with CDC (normal GC deletes obsoleted versions, while TTLs can delete the latest version), which is hard to satisfy with compaction-time TTL (there is a lot of discussion in the margins of https://docs.google.com/document/d/1aq4jft6LVZ4yb2M4v0mGHbIvTu6Xm3_bbAISFof_HG0/edit#heading=h.zgshrzthknjr).

@sumeerbhola

Maintain (in a time index in the sstable) what is conceptually a set containing [min,max] timestamps for each ssblock.

I am inclined towards sticking it in the usual single-level or two-level index, since this is per data block (group of data block for the two-level index) info that we already have a place for. I don't think we really need the ability to search in the index efficiently along both the key and time dimension. We would continue to binary search in the index only using the key dimension, and then linearly step forward/backward until we find a block that overlaps in time.

@sumeerbhola sumeerbhola self-assigned this Aug 6, 2021
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Aug 19, 2021
Block interval annotations are an optional user-facing feature that can be
used to filter data blocks from an Iterator before they are loaded.

Interval annotations are of the form [lower,upper) where both lower and
upper are uint64, and represent a set such that the block can contain keys
that belong to this set, but no keys outside this set. That is, the set is
not necessarily tight.

The interval [0,x) for any x is reserved and represents the universal set.
A [0,x) annotation is not written out to the sstable, and is useful in
two cases:
- These annotations are written as part of the BlockHandle in index blocks
  (first level and second level), where they refer to either data blocks
  (when written in a first level index block) or first level index blocks
  (when written in a second level index block). BlockHandles for other
  kinds of blocks default to the [0,0) annotation and avoid writing
  anything for the annotation.
- Tables written prior to this feature default to blocks having a [0,0)
  annotation, which is correctly interpreted as the universal set
  since we do not have any useful filtering information. This is needed
  in order to turn on this feature without rewriting any data.

The implementation requires [lower,upper) to satisfy upper >= lower. When
lower > 0, the interval is encoded as either:
- lower followed by upper-lower, when upper-lower > 0
- lower alone, when upper-lower == 0 (the empty set)

The implementation does not require any particular lower value for
representation of the empty set [lower,lower), but from an encoding
perspective [1,1) is the most efficient.

Bi-directional compatibility is not a goal, and files written with block
interval annotations cannot be read by old code that does not know
about the existence of this feature.

The main use case for block interval annotations is as a fine-grained
version of CockroachDB's time-bound iterator, used to ignore blocks
that do not contain relevant MVCC timestamps.

Fixes cockroachdb#1190
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Sep 28, 2021
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Oct 5, 2021
Block property collectors (and the corresponding filters) are an optional
user-facing feature that can be used to ignore data in the context of
an Iterator. The finest granularity of operation is a data block, and the
properties are aggregated into the upper level index block, and the
sstable, so an entire lower level index block (and its data blocks), or
an entire sstable can be ignored.

Multiple collectors can be configured in a DB, with recommendations to
keep the total size of properties of a block to < 50-100 bytes. The
collector can extract properties from the key or value. The first use case
for this will be to extract the MVCC timestamp in CockroachDB, and use
it for fine grained time-bound iteration. One can imagine extracting
min-max values of other columns either in the key or value, when the
key-value represents tabular data.

To efficiently handle multiple collectors, a collector needs to be
identified with a short integer id, in addition to a unique name. The
id is encoded in index entries. The name is encoded in the properties
block for the sstable.

Helper collector and filter implementations are provided for properties
represented as a set of the form [lower,upper), where both lower and
upper are uint64.

The feature is incompatible with older versions of Pebble since block
handles can be longer and older versions have checks regarding no
extra bytes in a block handle, which will fail.

Fixes cockroachdb#1190
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Oct 6, 2021
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Oct 7, 2021
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Oct 11, 2021
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Oct 14, 2021
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Oct 15, 2021
Block property collectors (and the corresponding filters) are an optional
user-facing feature that can be used to ignore data in the context of
an Iterator. The finest granularity of operation is a data block, and the
properties are aggregated into the upper level index block, and the
sstable, so an entire lower level index block (and its data blocks), or
an entire sstable can be ignored.

Multiple collectors can be configured in a DB, with recommendations to
keep the total size of properties of a block to < 50-100 bytes. The
collector can extract properties from the key or value. The first use case
for this will be to extract the MVCC timestamp in CockroachDB, and use
it for fine grained time-bound iteration. One can imagine extracting
min-max values of other columns either in the key or value, when the
key-value represents tabular data.

To efficiently handle multiple collectors, a collector needs to be
identified with a short integer id, in addition to a unique name. The
name => id mapping is specific to an sstable and stored in the user
properties. The id is encoded in index entries instead of the name.

Helper collector and filter implementations are provided for properties
represented as a set of the form [lower,upper), where both lower and
upper are uint64.

The feature is incompatible with older versions of Pebble since block
handles can be longer and older versions have checks regarding no
extra bytes in a block handle, which will fail.

Fixes cockroachdb#1190
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Oct 19, 2021
sumeerbhola added a commit that referenced this issue Oct 19, 2021
@sumeerbhola

The integration on the CockroachDB side is also done. The only remaining item is cockroachdb/cockroach#73768


jbowens commented Jan 7, 2022

Have we considered also pushing down filtering of keys within blocks? Filtering by timestamp at the level of a block or sstable iterator seems like it would be a win, avoiding unnecessary full-key comparisons during merging.

@sumeerbhola

Have we considered also pushing down filtering of keys within blocks? Filtering by timestamp at the level of a block or sstable iterator seems like it would be a win, avoiding unnecessary full-key comparisons during merging.

I didn't consider it, but it seems worth experimenting with and benchmarking. If most data is in L6, which is where block property filters are more likely to have false positives, there is a likelihood that the mergingIter heap will have fewer than O(number of levels) elements and that the L6 level iterator will stay at the top of the heap, which would mean each step is one key comparison. That saving can be significant if keys are long.
