
db: Finer-grained Time-Bound Iteration (TBI) #1190

Closed
sumeerbhola opened this issue Jul 12, 2021 · 6 comments · Fixed by #1315
sumeerbhola commented Jul 12, 2021

In the CockroachDB context, we have used the TablePropertyCollector and IterOptions.TableFilter hooks to implement a time-bound iterator (TBI), which only iterates over sstables that contain data intersecting with a (Tstart, Tend] time interval of the MVCC timestamps. This is currently used for more efficient export requests (incremental backups) and is planned to be used for rangefeed catchup scans, both of which need to look only at recent data with respect to timestamps.
Another use case that has come up is finding expired data when enforcing row-level TTLs, where (Tstart, Tend] will be in the distant past (imagine expiring 1-year-old data, and doing this expiry every day, so we would effectively be looking for (now-366d, now-365d]). Since TBI filters at the granularity of sstables, and L6 contains 90% of the data, the expectation is that the current TBI will be ineffective for this use case (the finer-grained approach described here would likely be effective).

In 21.1 we introduced read-triggered compactions, which use spare compaction resources and knowledge of read amplification encountered by reads to prioritize compacting down sstables in parts of the key space that would most benefit from reduced read amplification. The assumption here is that compacted data is never harmful for reads. The above TBI optimization violates this assumption -- if all data is compacted down to L6, and the keys that have been updated in (Tstart, Tend] are sparse, but evenly distributed across the key space, every sstable will be included in the TBI. This bad interaction between TBI and read-triggered compactions has been observed in practice in CockroachDB's telemetry cluster. TBI is also fragile under manual compactions, which we sometimes do when restoring a poorly shaped LSM to health.

We cannot truly lay the blame for TBI's fragility on read-triggered compactions: using (spare) resources to reduce read amplification should always be a worthy goal of the system, and anything that goes against that should be fixed. One approach would be to explicitly keep older data in separate ssblocks. In addition to being a very significant change to how we lay out data, it would also mean more merging when using regular iterators. The following documents a simpler, high-level approach derived from a detailed proposal by @jbowens. Note that all such approaches require Pebble to have some knowledge of timestamps (unlike the current TBI approach).

Assumption: Filtering at the granularity of sstables is too coarse, but filtering at the granularity of ssblocks should be sufficient if most rows in an sstable are not relevant to the TBI time interval. This assumption needs to be validated with normal workloads. In the worst case we could get unlucky and need 1 row from each ssblock. Note that we would continue to also filter by sstable.

Approach:

  • Maintain (in a time index in the sstable) what is conceptually a set containing [min,max] timestamps for each ssblock.
  • When seeking to block i, based on the key index, look for the first block >= i whose timestamp interval overlaps with the TBI interval, and seek to that block.
    • When a seek is not able to use a Next optimization, both the key index block and the data block are seeked. We are not able to reduce the number of such seeks with this finer-grained TBI optimization. But note that we will continue to use the coarse-grained optimization too, and when the coarse-grained optimization is ineffective, it is often due to more compactions, which means we have fewer files and therefore fewer seeks.
    • Most of the use cases involve few seeks and mostly Next/Prev since they are scanning over a whole CockroachDB range, so the value of this optimization is in not loading irrelevant blocks and not having CockroachDB iterate over key-value pairs in these irrelevant blocks.
  • When Next goes past the end of a block, skip blocks that don’t overlap with the TBI interval.

jbowens commented Jul 13, 2021

Another use case that has come up is finding expired data when enforcing row-level TTLs, where (Tstart, Tend] will be in the distant past (imagine expiring 1-year-old data, and doing this expiry every day, so we would effectively be looking for (now-366d, now-365d]). Since TBI filters at the granularity of sstables, and L6 contains 90% of the data, the expectation is that the current TBI will be ineffective for this use case (the finer-grained approach described here would likely be effective).

Once we have knowledge of MVCC in Pebble, we could explore compaction-time discarding of keys expired due to row-level TTLs instead. Like you mentioned, it's a simpler case than general compaction-time GC. Maybe it would be a good stepping stone? MVCCStats are still an obstacle.

@sumeerbhola

Once we have knowledge of MVCC in Pebble, we could explore compaction-time discarding of keys expired due to row-level TTLs instead. Like you mentioned it's a simpler case than general compaction-time GC. Maybe it would be a good stepping stone? MVCCStats are still an obstacle.

Yes, that is possible, though (a) compaction-time TTL enforcement may miss things that we would need to delete by doing an efficient scan, so an efficient scan would still be useful, and (b) there is (possibly) a requirement to have TTLs play nicely with CDC (normal GC deletes obsoleted versions, while TTLs can delete the latest version), which is hard to satisfy with compaction-time TTL (there is a lot of discussion in the margins of https://docs.google.com/document/d/1aq4jft6LVZ4yb2M4v0mGHbIvTu6Xm3_bbAISFof_HG0/edit#heading=h.zgshrzthknjr).

@sumeerbhola

Maintain (in a time index in the sstable) what is conceptually a set containing [min,max] timestamps for each ssblock.

I am inclined towards sticking it in the usual single-level or two-level index, since this is per data block (group of data block for the two-level index) info that we already have a place for. I don't think we really need the ability to search in the index efficiently along both the key and time dimension. We would continue to binary search in the index only using the key dimension, and then linearly step forward/backward until we find a block that overlaps in time.

@sumeerbhola sumeerbhola self-assigned this Aug 6, 2021
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Aug 19, 2021
Block interval annotations are an optional user-facing feature that can be
used to filter data blocks from an Iterator before they are loaded.

Interval annotations are of the form [lower,upper) where both lower and
upper are uint64, and represent a set such that the block can contain keys
that belong to this set, but no keys outside this set. That is, the set is
not necessarily tight.

The interval [0,x) for any x is reserved and represents the universal set.
A [0,x) annotation is not written out to the sstable, and is useful in
two cases:
- These annotations are written as part of the BlockHandle in index blocks
  (first level and second level), where they refer to either data blocks
  (when written in a first level index block) or first level index blocks
  (when written in a second level index block). BlockHandles for other
  kinds of blocks default to the [0,0) annotation and avoid writing
  anything for the annotation.
- Tables written prior to this feature default to blocks having a [0,0)
  annotation, which is correctly interpreted as the universal set
  since we do not have any useful filtering information. This is needed
  in order to turn on this feature without rewriting any data.

The implementation requires [lower,upper) to satisfy upper >= lower. When
lower > 0, the interval is encoded as either:
- lower followed by upper-lower, when upper-lower > 0
- lower alone, when upper-lower == 0 (the empty set)

The implementation does not require any particular lower value for
representation of the empty set [lower,lower), but from an encoding
perspective [1,1) is the most efficient.

Bi-directional compatibility is not a goal, and files written with block
interval annotations cannot be read by old code that does not know
about the existence of this feature.

The main use case for block interval annotations is as a fine-grained
version of CockroachDB's time-bound iterator, used to ignore blocks
that do not contain relevant MVCC timestamps.

Fixes cockroachdb#1190
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Sep 28, 2021
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Oct 5, 2021
Block property collectors (and the corresponding filters) are an optional
user-facing feature that can be used to ignore data in the context of
an Iterator. The finest granularity of operation is a data block, and the
properties are aggregated into the upper level index block, and the
sstable, so an entire lower level index block (and its data blocks), or
an entire sstable can be ignored.

Multiple collectors can be configured in a DB, with recommendations to
keep the total size of properties of a block to < 50-100 bytes. The
collector can extract properties from the key or value. The first use case
for this will be to extract the MVCC timestamp in CockroachDB, and use
it for fine grained time-bound iteration. One can imagine extracting
min-max values of other columns either in the key or value, when the
key-value represents tabular data.

To efficiently handle multiple collectors, a collector needs to be
identified with a short integer id, in addition to a unique name. The
id is encoded in index entries. The name is encoded in the properties
block for the sstable.

Helper collector and filter implementations are provided for properties
represented as a set of the form [lower,upper), where both lower and
upper are uint64.

The feature is incompatible with older versions of Pebble since block
handles can be longer and older versions have checks regarding no
extra bytes in a block handle, which will fail.

Fixes cockroachdb#1190
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Oct 6, 2021
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Oct 7, 2021
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Oct 11, 2021
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Oct 14, 2021
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Oct 15, 2021
Block property collectors (and the corresponding filters) are an optional
user-facing feature that can be used to ignore data in the context of
an Iterator. The finest granularity of operation is a data block, and the
properties are aggregated into the upper level index block, and the
sstable, so an entire lower level index block (and its data blocks), or
an entire sstable can be ignored.

Multiple collectors can be configured in a DB, with recommendations to
keep the total size of properties of a block to < 50-100 bytes. The
collector can extract properties from the key or value. The first use case
for this will be to extract the MVCC timestamp in CockroachDB, and use
it for fine grained time-bound iteration. One can imagine extracting
min-max values of other columns either in the key or value, when the
key-value represents tabular data.

To efficiently handle multiple collectors, a collector needs to be
identified with a short integer id, in addition to a unique name. The
name => id mapping is specific to an sstable and stored in the user
properties. The id is encoded in index entries instead of the name.

Helper collector and filter implementations are provided for properties
represented as a set of the form [lower,upper), where both lower and
upper are uint64.

The feature is incompatible with older versions of Pebble since block
handles can be longer and older versions have checks regarding no
extra bytes in a block handle, which will fail.

Fixes cockroachdb#1190
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Oct 19, 2021
sumeerbhola added a commit that referenced this issue Oct 19, 2021
@sumeerbhola

The integration on the CockroachDB side is also done. The only remaining item is cockroachdb/cockroach#73768


jbowens commented Jan 7, 2022

Have we considered also pushing down filtering of keys within blocks? Filtering by timestamp at the level of a block or sstable iterator seems like it would be a win, avoiding unnecessary full-key comparisons during merging.

@sumeerbhola

Have we considered also pushing down filtering of keys within blocks? Filtering by timestamp at the level of a block or sstable iterator seems like it would be a win, avoiding unnecessary full-key comparisons during merging.

I didn't consider it, but it seems worth experimenting with and benchmarking. If most data is in L6, which is where block property filters are more likely to have false positives, there is a likelihood that the mergingIter heap will have fewer than O(number of levels) elements and that the L6 level iterator will stay at the top of the heap, which would mean each step is one key comparison. That saving can be significant if keys are long.
