db: block interval annotations for fine-grained time-bound iteration #1231

sumeerbhola · 2021-08-19T15:38:32Z

Block interval annotations are an optional user-facing feature that can be
used to filter data blocks from an Iterator before they are loaded.

Interval annotations are of the form [lower,upper) where both lower and
upper are uint64, and represent a set such that the block can contain keys
that belong to this set, but no keys outside this set. That is, the set is
not necessarily tight.

The interval [0,x) for any x is reserved and represents the universal set.
A [0,x) annotation is not written out to the sstable, and is useful in
two cases:

These annotations are written as part of the BlockHandle in index blocks
(first level and second level), where they refer to either data blocks
(when written in a first level index block) or first level index blocks
(when written in a second level index block). BlockHandles for other
kinds of blocks default to the [0,0) annotation and avoid writing
anything for the annotation.
Tables written prior to this feature default to blocks having a [0,0)
annotation, which is correctly interpreted as the universal set
since we do not have any useful filtering information. This is needed
in order to turn on this feature without rewriting any data.

The implementation requires [lower,upper) to satisfy upper >= lower. And
when lower > 0, these are encoded as either:

lower, upper-lower when upper-lower > 0
lower

The implementation does not require any particular lower value for
representation of the empty set [lower,lower), but from an encoding
perspective [1,1) is the most efficient.

Bi-directional compatibility is not a goal, and files written with block
interval annotations cannot be read by old code that does not know
about the existence of this feature.

The main use case for block interval annotations is as a fine-grained
version of CockroachDB's time-bound iterator, that is used to ignore
blocks that do not contain MVCC timestamps that are relevant. See

Fixes #1190

cockroach-teamcity · 2021-08-19T15:38:39Z

This change is

sumeerbhola

This is lacking comprehensive tests and benchmarks. Sending it out for initial opinions.

Reviewable status: 0 of 14 files reviewed, all discussions resolved (waiting on @itsbilal, @jbowens, and @petermattis)

itsbilal

General idea looks great to me! Only left some minor comments. Are you thinking of stuffing the annotator name in the sstable properties block?

Reviewed 14 of 14 files at r1, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @jbowens, @petermattis, and @sumeerbhola)

internal/base/block_interval_annotator.go, line 18 at r1 (raw file):

// Interval annotations are of the form [lower,upper) where both lower and
// upper are uint64, and represent a set such that the block can contain keys
// that belong to this set, but no keys outside this set. That is, the set is

It took me a while to understand this. Maybe a phrase like If block annotations are happening, for a given annotator function / format, all keys in a block will be present in the set would help.

internal/base/block_interval_annotator.go, line 26 at r1 (raw file):

// - These annotations are written as part of the BlockHandle in index blocks
//   (first level and second level), where they refer to either data blocks
//   (when written in a first level index block) or first level index blocks

Nit: Could also use lower/only level and upper level (or top level as in the twoLevelIterator code). The first/second thing is a little unclear. Feel free to dismiss.

internal/base/block_interval_annotator.go, line 60 at r1 (raw file):

// block. Represents [Lower, Upper).
type BlockInterval struct {
	Lower uint64

I imagine we're using uint64s instead of the more generic []bytes here so that packing it into the block handle is easier / reliable, correct?

The other extreme could be moving logic about merging / etc outside of Pebble completely, and only handling one []byte here . I'm thinking about a type BlockInterval interface that implements the below methods plus Encode, a struct type that specifies decode / creator / annotator functions (similar to Comparer), and a max size for how long an Interval (or maybe more generally a BlockFilter could be). That might be cleaner but it might actually be less performant due to a greater number of UVarint encodes/decodes so feel free to dismiss this.

The reason why I thought of that was because the two uint64s seem to pigeonhole us into handling timestamps well and little else, but maybe that kind of specialization is worthwhile here.

internal/base/block_interval_annotator.go, line 93 at r1 (raw file):

// block in an sstable. This is not precisely set intersection (see the code
// below).
func BlockIntervalsIntersect(iter, block BlockInterval) bool {

Why not make this func (bi *BlockInterval) Intersect(other BlockInterval) bool to match the Union function?

internal/base/block_interval_annotator.go, line 95 at r1 (raw file):

func BlockIntervalsIntersect(iter, block BlockInterval) bool {
	if isEmptySet(iter) {
		// Bizarre iterator that is interested in nothing.

😂

internal/base/block_interval_annotator.go, line 112 at r1 (raw file):

	}
	// INVARIANT: Neither set is empty or universal set.
	return !(iter.Upper <= block.Lower || iter.Lower >= block.Upper)

Nit: This seems more readable when written as return iter.Upper > block.Lower && iter.Lower < block.Upper, but either works.

internal/base/block_interval_annotator.go, line 115 at r1 (raw file):

}

func isEmptySet(x BlockInterval) bool {

Nit: can take x as receiver.

sstable/writer.go, line 104 at r1 (raw file):

	compression             Compression
	separator               Separator
	successor               Successor

Nit: indents should match

petermattis

Would the term block handle annotations be better than block interval annotations? The former connotes to me that the annotation is on the block handle, while the latter connotes the type of annotation, but not where it applies.

Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)

options.go, line 85 at r1 (raw file):

	// function can be used by multiple iterators, if the iterator is cloned.
	TableFilter func(userProps map[string]string) bool
	// BlockInterval represents the set the caller is interested in, and

I think this comment could use a little bit more detail. the set is vague. Perhaps we can point to the comment in block_interal_annotator.go.

If we generalize this functionality more, this could become a BlockFilter func() which takes a BlockHandle and returns a bool. That is, we move the BlockIntervalIntersects call into this callback.

options.go, line 462 at r1 (raw file):

	// TODO: Also plumb this into the ingested sstable construction code in
	// CockroachDB.
	BlockIntervalAnnotatorFunc BlockIntervalAnnotatorFunc

Nit: The Func suffix naming is a bit inconsistent with some of the other options. This seems to have been introduced with KeyValidationFunc, but contrast that to DebugCheck and the function pointers in Comparer.

internal/base/block_interval_annotator.go, line 17 at r1 (raw file):

//
// Interval annotations are of the form [lower,upper) where both lower and
// upper are uint64, and represent a set such that the block can contain keys

represent a set of 0-based block indexes?

internal/base/block_interval_annotator.go, line 60 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I imagine we're using uint64s instead of the more generic []bytes here so that packing it into the block handle is easier / reliable, correct?

The other extreme could be moving logic about merging / etc outside of Pebble completely, and only handling one []byte here . I'm thinking about a type BlockInterval interface that implements the below methods plus Encode, a struct type that specifies decode / creator / annotator functions (similar to Comparer), and a max size for how long an Interval (or maybe more generally a BlockFilter could be). That might be cleaner but it might actually be less performant due to a greater number of UVarint encodes/decodes so feel free to dismiss this.

The reason why I thought of that was because the two uint64s seem to pigeonhole us into handling timestamps well and little else, but maybe that kind of specialization is worthwhile here.

Ditto on this comment. I made a similar comment elsewhere about further abstracting from a concrete BlockInterval.

Related, are we at all worried that Lower and Upper contain only the wall time and not a full HLC timestamp?

sstable/filter.go, line 39 at r1 (raw file):

type BlockHandle struct {
	Offset, Length uint64
	BlockInterval base.BlockInterval

I wonder if we should make this a more general facility for BlockHandle annotations. Some of the infrastructure being added here is already generalized, yet then tied concretely to BlockInterval. I'm imagining something where the BlockHandle annotation is encoded in []byte and there is an options callback to decode and process these annotations. I haven't thought through this in detail, but it seems feasible to do this.

Block interval annotations are an optional user-facing feature that can be used to filter data blocks from an Iterator before they are loaded. Interval annotations are of the form [lower,upper) where both lower and upper are uint64, and represent a set such that the block can contain keys that belong to this set, but no keys outside this set. That is, the set is not necessarily tight. The interval [0,x) for any x is reserved and represents the universal set. A [0,x) annotation is not written out to the sstable, and is useful in two cases: - These annotations are written as part of the BlockHandle in index blocks (first level and second level), where they refer to either data blocks (when written in a first level index block) or first level index blocks (when written in a second level index block). BlockHandles for other kinds of blocks default to the [0,0) annotation and avoid writing anything for the annotation. - Tables written prior to this feature default to blocks having a [0,0) annotation, which is correctly interpreted as the universal set since we do not have any useful filtering information. This is needed in order to turn on this feature without rewriting any data. The implementation requires [lower,upper) to satisfy upper >= lower. And when lower > 0, these are encoded as either: - lower, upper-lower when upper-lower > 0 - lower The implementation does not require any particular lower value for representation of the empty set [lower,lower), but from an encoding perspective [1,1) is the most efficient. Bi-directional compatibility is not a goal, and files written with block interval annotations cannot be read by old code that does not know about the existence of this feature. The main use case for block interval annotations is as a fine-grained version of CockroachDB's time-bound iterator, that is used to ignore blocks that do not contain MVCC timestamps that are relevant. See Fixes cockroachdb#1190

sumeerbhola

TFTRs! I've responded to the comments here. I'm going to open a new PR that addresses all these comments.

Are you thinking of stuffing the annotator name in the sstable properties block?

Good point. I will be renaming this to BlockPropertyCollector and requiring that the block property name be unique across all user properties, and then put the aggregated value across all blocks into the user properties. Then BlockPropertyCollector effectively subsumes table properties. And we can get rid of all the special casing regarding what is interpreted as an empty or universal set, since the presence of the block property name in the sstable properties happens if and only if the sstable was written with that property collector. If it isn't written with that property collector, a BlockPropertyFilter specified for an iterator is a noop.

Would the term block handle annotations be better than block interval annotations? The former connotes to me that the annotation is on the block handle, while the latter connotes the type of annotation, but not where it applies.

I have a mild preference that the name specify what it is describing, which is a block interval, and not where it is stored, since that can happen at various levels, both index levels, and the sstable (after aggregation). So it is really describing the finest granularity at which this interval applies.
Based on other comments, these will now be called BlockPropertyCollector and BlockPropertyFilter. The interval case is a special case, where the set is described using a [uint64, uint64) interval, and we will provide a helper class in Pebble that does most of the heavy lifting for that case.

Reviewable status: 9 of 14 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @jbowens, and @petermattis)

options.go, line 85 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think this comment could use a little bit more detail. the set is vague. Perhaps we can point to the comment in block_interal_annotator.go.

If we generalize this functionality more, this could become a BlockFilter func() which takes a BlockHandle and returns a bool. That is, we move the BlockIntervalIntersects call into this callback.

Good point about generalizing. This will be generalized to BlockPropertyFilters []BlockPropertyFilter, where all filters need to match. There can be multiple properties on a block, if there were multiple BlockPropertyCollectors.

options.go, line 462 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: The Func suffix naming is a bit inconsistent with some of the other options. This seems to have been introduced with KeyValidationFunc, but contrast that to DebugCheck and the function pointers in Comparer.

This func will go away and be replaced by BlockPropertyCollectors []func() BlockPropertyCollector.

internal/base/block_interval_annotator.go, line 17 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

represent a set of 0-based block indexes?

I didn't quite understand the questions. Maybe it is to do with the fact that the comment below talks about [0,x) being a special case, which was all rather complicated and unnecessary. That complexity goes away in the new code.

internal/base/block_interval_annotator.go, line 18 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

It took me a while to understand this. Maybe a phrase like If block annotations are happening, for a given annotator function / format, all keys in a block will be present in the set would help.

Done (in the new code).

internal/base/block_interval_annotator.go, line 26 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Nit: Could also use lower/only level and upper level (or top level as in the twoLevelIterator code). The first/second thing is a little unclear. Feel free to dismiss.

Done

internal/base/block_interval_annotator.go, line 60 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ditto on this comment. I made a similar comment elsewhere about further abstracting from a concrete BlockInterval.

Related, are we at all worried that Lower and Upper contain only the wall time and not a full HLC timestamp?

Generalization is a good point. It goes together with not limiting this to a single block property. I've gone ahead and done that, since I believe the cost can be contained -- we just need to encode and decode an extra varint for the short-id of the block property. I've added significant helper implementations that only outsource the core logic for property extraction, so that we can test this thoroughly, and keep the encoding/decoding logic for the integer set case inside these Pebble provided helpers.

internal/base/block_interval_annotator.go, line 93 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Why not make this func (bi *BlockInterval) Intersect(other BlockInterval) bool to match the Union function?

This was a peculiar implementation because of the special cases, which go away in the new code, and this becomes a method on the interval struct.

internal/base/block_interval_annotator.go, line 112 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Nit: This seems more readable when written as return iter.Upper > block.Lower && iter.Lower < block.Upper, but either works.

Done

internal/base/block_interval_annotator.go, line 115 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Nit: can take x as receiver.

Will be removed

sstable/filter.go, line 39 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I wonder if we should make this a more general facility for BlockHandle annotations. Some of the infrastructure being added here is already generalized, yet then tied concretely to BlockInterval. I'm imagining something where the BlockHandle annotation is encoded in []byte and there is an options callback to decode and process these annotations. I haven't thought through this in detail, but it seems feasible to do this.

Done

sstable/writer.go, line 104 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Nit: indents should match

Done. I was relying on goland, which didn't seem to mind.

sumeerbhola · 2021-10-05T18:54:35Z

#1315 is the replacement

sumeerbhola requested review from itsbilal, jbowens and petermattis August 19, 2021 15:38

sumeerbhola commented Aug 19, 2021

View reviewed changes

itsbilal reviewed Aug 23, 2021

View reviewed changes

petermattis reviewed Aug 26, 2021

View reviewed changes

sumeerbhola force-pushed the block_tbi branch from 8115164 to 1c69944 Compare September 28, 2021 13:03

sumeerbhola commented Oct 5, 2021

View reviewed changes

sumeerbhola closed this Oct 5, 2021

sumeerbhola deleted the block_tbi branch October 26, 2021 13:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

db: block interval annotations for fine-grained time-bound iteration #1231

db: block interval annotations for fine-grained time-bound iteration #1231

sumeerbhola commented Aug 19, 2021

cockroach-teamcity commented Aug 19, 2021

sumeerbhola left a comment

itsbilal left a comment

petermattis left a comment

sumeerbhola left a comment

sumeerbhola commented Oct 5, 2021

db: block interval annotations for fine-grained time-bound iteration #1231

db: block interval annotations for fine-grained time-bound iteration #1231

Conversation

sumeerbhola commented Aug 19, 2021

cockroach-teamcity commented Aug 19, 2021

sumeerbhola left a comment

Choose a reason for hiding this comment

itsbilal left a comment

Choose a reason for hiding this comment

petermattis left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

sumeerbhola commented Oct 5, 2021