Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db: consider reworking rangedel, megingIter and levelIter interactions #2863

Open
jbowens opened this issue Aug 30, 2023 · 1 comment · May be fixed by #3600
Open

db: consider reworking rangedel, megingIter and levelIter interactions #2863

jbowens opened this issue Aug 30, 2023 · 1 comment · May be fixed by #3600

Comments

@jbowens
Copy link
Collaborator

jbowens commented Aug 30, 2023

Currently, there's an intricate coordination dance performed between mergingIter and levelIter in order to make range deletions work. This dance has several complexities that make it fragile and difficult to grok. The introduction of range-key masking and "truthful" block-property filters also forces us into some awkward contortions to deal with the fact that a file's point iterator and range deletion iterator may become exhausted at different times.

I suspect we could significantly clean things up by using the keyspan.InterleavingIter within the levelIter implementation. It would work something like:

  • If a file has range deletions, the levelIter initializes a keyspan.InterleavingIter with the file's point iterator and rangedel iterator. It configures the InterleavingIter to interleave both start and end boundaries of range deletions.
  • The levelIter keeps a file open until it exhausts the InterleavingIter. Since range deletions' start/end boundaries are interleaved, this ensures that we don't prematurely close the range deletion iterator during iteration. We can remove the code that manufactures "ignorable boundary keys." The interleaving iterator handles interleaving the rangedel end boundary if necessary.
  • The merging iterator no longer keeps a direct handle on a file's range deletion iterator. Instead, every mergingIterLevel has a getRangeDel() *keyspan.Span func that returns the range deletion covering that level's iterator's current position, if any. When a levelIter is setup, this is set to levelIter.Span which in turn calls InterelavingIter.Span.
  • The filteredIter interface and related tracking within the sstable iterator is removed. It's no longer necessary because range deletions are unconditionally interleaved, which ensures we never miss a range deletion due to block-property filters excluding point keys that exist beyond the smallest/largest range deletion bound.
  • The merging iterator iteration is adjusted to skip over range deletion boundaries that are interleaved and rise to the top of the heap, much like the existing code skips over the "ignorable boundary keys".
  • The mechanisms of applying range deletions also change. The mergingIterLevel no longer buffers a tombstone. It instead calls getRangeDel() which accesses the InterleavingIter's buffered tombstone instead. We remove mergingIter.init[Min,Max]RangeDelIters, because the InterleavingIter is transparently ensuring the range deletion iterators are appropriately positioned.

The memtable and indexed batch levels will also have to be adjusted to make use of a keyspan.InterleavingIter.

Jira issue: PEBBLE-82

@jbowens
Copy link
Collaborator Author

jbowens commented Aug 30, 2023

cc @sumeerbhola

jbowens added a commit to jbowens/pebble that referenced this issue Jan 25, 2024
For a couple years now, Pebble has not allowed the creation of untruncated
range tombstones outside the context of virtual sstables. Additionally, since
v22.2 untruncated range tombstones should all have been compacted and rewritten
as truncated range tombstones. Virtual SSTables (which similarly allow a
physical range deletion to exceed the bounds of the containing logical sstable)
handle this case by truncating the tombstones at iteration time in the iterator
returned by keyspan.Truncate.

This leaves the merging iterator's delicate handling of untruncated range
tombstones obsolete. This commit removes that complexity. In addition, as an
extra safety precaution, the table stats collector's invariants-build assertion
that range deletions are contained within their files' bounds is lifted into
production builds as well. This provides a guarantee during store open that all
tombstones are appropriately truncated.

Relates to cockroachdb#2863.
jbowens added a commit to jbowens/pebble that referenced this issue Jan 25, 2024
For a couple years now, Pebble has not allowed the creation of untruncated
range tombstones outside the context of virtual sstables. Additionally, since
v22.2 untruncated range tombstones should all have been compacted and rewritten
as truncated range tombstones. Virtual SSTables (which similarly allow a
physical range deletion to exceed the bounds of the containing logical sstable)
handle this case by truncating the tombstones at iteration time in the iterator
returned by keyspan.Truncate.

This leaves the merging iterator's delicate handling of untruncated range
tombstones obsolete. This commit removes that complexity. In addition, as an
extra safety precaution, the table stats collector's invariants-build assertion
that range deletions are contained within their files' bounds is lifted into
production builds as well. This provides a guarantee during store open that all
tombstones are appropriately truncated.

Relates to cockroachdb#2863.
jbowens added a commit to jbowens/pebble that referenced this issue Jan 26, 2024
For a couple years now, Pebble has not allowed the creation of untruncated
range tombstones outside the context of virtual sstables. Additionally, since
v22.2 untruncated range tombstones should all have been compacted and rewritten
as truncated range tombstones. Virtual SSTables (which similarly allow a
physical range deletion to exceed the bounds of the containing logical sstable)
handle this case by truncating the tombstones at iteration time in the iterator
returned by keyspan.Truncate.

This leaves the merging iterator's delicate handling of untruncated range
tombstones obsolete. This commit removes that complexity. In addition, as an
extra safety precaution, the table stats collector's invariants-build assertion
that range deletions are contained within their files' bounds is lifted into
production builds as well. This provides a guarantee during store open that all
tombstones are appropriately truncated.

Relates to cockroachdb#2863.
jbowens added a commit to jbowens/pebble that referenced this issue Jan 26, 2024
For a couple years now, Pebble has not allowed the creation of untruncated
range tombstones outside the context of virtual sstables. Additionally, since
v22.2 untruncated range tombstones should all have been compacted and rewritten
as truncated range tombstones. Virtual SSTables (which similarly allow a
physical range deletion to exceed the bounds of the containing logical sstable)
handle this case by truncating the tombstones at iteration time in the iterator
returned by keyspan.Truncate.

This leaves the merging iterator's delicate handling of untruncated range
tombstones obsolete. This commit removes that complexity. In addition, as an
extra safety precaution, the table stats collector's invariants-build assertion
that range deletions are contained within their files' bounds is lifted into
production builds as well. This provides a guarantee during store open that all
tombstones are appropriately truncated.

Relates to cockroachdb#2863.
jbowens added a commit to jbowens/pebble that referenced this issue Jan 26, 2024
This commit refactors the interface of the table cache `newIters` method to
support creation of any of a sstable's iterators, including its point iterator,
range deletion iterator and range key iterator. This is motivated by the
refactor planned in cockroachdb#2863, which in turn requires refactoring some of the
newIters call sites that determine whether ingested sstables overlap existing
tables.

There's also a minor benefit that a compaction setup no longer needs to open a
point iterator and then immediately close it when initializing the range
deletion iterators.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
              │ old-NewItersAlloc.txt │       new-NewItersAlloc.txt        │
              │        sec/op         │   sec/op     vs base               │
NewItersAlloc             427.0n ± 1%   412.1n ± 1%  -3.47% (p=0.000 n=10)

              │ old-NewItersAlloc.txt │     new-NewItersAlloc.txt      │
              │         B/op          │    B/op     vs base            │
NewItersAlloc              0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

              │ old-NewItersAlloc.txt │     new-NewItersAlloc.txt      │
              │       allocs/op       │ allocs/op   vs base            │
NewItersAlloc              0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal
```
jbowens added a commit to jbowens/pebble that referenced this issue Jan 26, 2024
This commit refactors the interface of the table cache `newIters` method to
support creation of any of a sstable's iterators, including its point iterator,
range deletion iterator and range key iterator. This is motivated by the
refactor planned in cockroachdb#2863, which in turn requires refactoring some of the
newIters call sites that determine whether ingested sstables overlap existing
tables.

There's also a minor benefit that a compaction setup no longer needs to open a
point iterator and then immediately close it when initializing the range
deletion iterators.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
              │ old-NewItersAlloc.txt │       new-NewItersAlloc.txt        │
              │        sec/op         │   sec/op     vs base               │
NewItersAlloc             427.0n ± 1%   412.1n ± 1%  -3.47% (p=0.000 n=10)

              │ old-NewItersAlloc.txt │     new-NewItersAlloc.txt      │
              │         B/op          │    B/op     vs base            │
NewItersAlloc              0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

              │ old-NewItersAlloc.txt │     new-NewItersAlloc.txt      │
              │       allocs/op       │ allocs/op   vs base            │
NewItersAlloc              0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal
```
jbowens added a commit to jbowens/pebble that referenced this issue Jan 26, 2024
This commit refactors the interface of the table cache `newIters` method to
support creation of any of a sstable's iterators, including its point iterator,
range deletion iterator and range key iterator. This is motivated by the
refactor planned in cockroachdb#2863, which in turn requires refactoring some of the
newIters call sites that determine whether ingested sstables overlap existing
tables.

There's also a minor benefit that a compaction setup no longer needs to open a
point iterator and then immediately close it when initializing the range
deletion iterators.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
              │ old-NewItersAlloc.txt │       new-NewItersAlloc.txt        │
              │        sec/op         │   sec/op     vs base               │
NewItersAlloc             427.0n ± 1%   412.1n ± 1%  -3.47% (p=0.000 n=10)

              │ old-NewItersAlloc.txt │     new-NewItersAlloc.txt      │
              │         B/op          │    B/op     vs base            │
NewItersAlloc              0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

              │ old-NewItersAlloc.txt │     new-NewItersAlloc.txt      │
              │       allocs/op       │ allocs/op   vs base            │
NewItersAlloc              0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal
```
jbowens added a commit to jbowens/pebble that referenced this issue Jan 26, 2024
This commit refactors the interface of the table cache `newIters` method to
support creation of any of a sstable's iterators, including its point iterator,
range deletion iterator and range key iterator. This is motivated by the
refactor planned in cockroachdb#2863, which in turn requires refactoring some of the
newIters call sites that determine whether ingested sstables overlap existing
tables.

There's also a minor benefit that a compaction setup no longer needs to open a
point iterator and then immediately close it when initializing the range
deletion iterators.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
              │ old-NewItersAlloc.txt │       new-NewItersAlloc.txt        │
              │        sec/op         │   sec/op     vs base               │
NewItersAlloc             427.0n ± 1%   412.1n ± 1%  -3.47% (p=0.000 n=10)

              │ old-NewItersAlloc.txt │     new-NewItersAlloc.txt      │
              │         B/op          │    B/op     vs base            │
NewItersAlloc              0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

              │ old-NewItersAlloc.txt │     new-NewItersAlloc.txt      │
              │       allocs/op       │ allocs/op   vs base            │
NewItersAlloc              0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal
```
jbowens added a commit that referenced this issue Jan 29, 2024
For a couple years now, Pebble has not allowed the creation of untruncated
range tombstones outside the context of virtual sstables. Additionally, since
v22.2 untruncated range tombstones should all have been compacted and rewritten
as truncated range tombstones. Virtual SSTables (which similarly allow a
physical range deletion to exceed the bounds of the containing logical sstable)
handle this case by truncating the tombstones at iteration time in the iterator
returned by keyspan.Truncate.

This leaves the merging iterator's delicate handling of untruncated range
tombstones obsolete. This commit removes that complexity. In addition, as an
extra safety precaution, the table stats collector's invariants-build assertion
that range deletions are contained within their files' bounds is lifted into
production builds as well. This provides a guarantee during store open that all
tombstones are appropriately truncated.

Relates to #2863.
jbowens added a commit to jbowens/pebble that referenced this issue Jan 31, 2024
This commit refactors the interface of the table cache `newIters` method to
support creation of any of a sstable's iterators, including its point iterator,
range deletion iterator and range key iterator. This is motivated by the
refactor planned in cockroachdb#2863, which in turn requires refactoring some of the
newIters call sites that determine whether ingested sstables overlap existing
tables.

There's also a minor benefit that a compaction setup no longer needs to open a
point iterator and then immediately close it when initializing the range
deletion iterators.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
              │ old-NewItersAlloc.txt │       new-NewItersAlloc.txt        │
              │        sec/op         │   sec/op     vs base               │
NewItersAlloc             427.0n ± 1%   412.1n ± 1%  -3.47% (p=0.000 n=10)

              │ old-NewItersAlloc.txt │     new-NewItersAlloc.txt      │
              │         B/op          │    B/op     vs base            │
NewItersAlloc              0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

              │ old-NewItersAlloc.txt │     new-NewItersAlloc.txt      │
              │       allocs/op       │ allocs/op   vs base            │
NewItersAlloc              0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal
```
jbowens added a commit to jbowens/pebble that referenced this issue Jan 31, 2024
This commit refactors the interface of the table cache `newIters` method to
support creation of any of a sstable's iterators, including its point iterator,
range deletion iterator and range key iterator. This is motivated by the
refactor planned in cockroachdb#2863, which in turn requires refactoring some of the
newIters call sites that determine whether ingested sstables overlap existing
tables.

There's also a minor benefit that a compaction setup no longer needs to open a
point iterator and then immediately close it when initializing the range
deletion iterators.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
              │ old-NewItersAlloc.txt │       new-NewItersAlloc.txt        │
              │        sec/op         │   sec/op     vs base               │
NewItersAlloc             427.0n ± 1%   412.1n ± 1%  -3.47% (p=0.000 n=10)

              │ old-NewItersAlloc.txt │     new-NewItersAlloc.txt      │
              │         B/op          │    B/op     vs base            │
NewItersAlloc              0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

              │ old-NewItersAlloc.txt │     new-NewItersAlloc.txt      │
              │       allocs/op       │ allocs/op   vs base            │
NewItersAlloc              0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal
```
jbowens added a commit that referenced this issue Jan 31, 2024
This commit refactors the interface of the table cache `newIters` method to
support creation of any of a sstable's iterators, including its point iterator,
range deletion iterator and range key iterator. This is motivated by the
refactor planned in #2863, which in turn requires refactoring some of the
newIters call sites that determine whether ingested sstables overlap existing
tables.

There's also a minor benefit that a compaction setup no longer needs to open a
point iterator and then immediately close it when initializing the range
deletion iterators.

```
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/pebble
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
              │ old-NewItersAlloc.txt │       new-NewItersAlloc.txt        │
              │        sec/op         │   sec/op     vs base               │
NewItersAlloc             427.0n ± 1%   412.1n ± 1%  -3.47% (p=0.000 n=10)

              │ old-NewItersAlloc.txt │     new-NewItersAlloc.txt      │
              │         B/op          │    B/op     vs base            │
NewItersAlloc              0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

              │ old-NewItersAlloc.txt │     new-NewItersAlloc.txt      │
              │       allocs/op       │ allocs/op   vs base            │
NewItersAlloc              0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal
```
jbowens added a commit to jbowens/pebble that referenced this issue Mar 29, 2024
The levelIter has a concept of ignorable boundary keys. When a levelIter
exhausts a file's point keys, it's possible that the same file still contains
range deletions relevant to other levels of the LSM. The levelIter will, under
some conditions, interleave the largest boundary key of the sstable into
iteration as an 'ignorable boundary key,' so that the file's range deletions
remain accessible until all other levels progress beyound the file's boundary.

When block-property filters are in use, a file's point key iterator may become
exhausted early, before the file's range deletions are irrelevant, even if the
file's largest key is not a range deletion. To work around this subtlety, the
sstable iterator previously would surface knowledge of whether any point keys
may have been skipped through a MaybeFilteredKeys method. The levelIter used
this method to determine when to interleave an ignorable largest boundary in
case there may be additional relevant range deletions.

This commit removes the conditioning on the output of MaybeFilteredKeys,
instead unconditionally interleaving a synthetic boundary at a file's largest
point key if the levelIter user is using the file's range deletion iterator.
This simplifies the logic and removes a fragile reliance on the sstable's
iterators accounting of when it may have filtered keys.

Future work (cockroachdb#2863) will remove the need to interleave synthetic boundaries at
all, instead interleaving the range deletion bounds themselves among point
keys.

Informs cockroachdb#2863.
Close cockroachdb#3334.
jbowens added a commit to jbowens/pebble that referenced this issue Mar 29, 2024
The levelIter has a concept of ignorable boundary keys. When a levelIter
exhausts a file's point keys, it's possible that the same file still contains
range deletions relevant to other levels of the LSM. The levelIter will, under
some conditions, interleave the largest boundary key of the sstable into
iteration as an 'ignorable boundary key,' so that the file's range deletions
remain accessible until all other levels progress beyound the file's boundary.

When block-property filters are in use, a file's point key iterator may become
exhausted early, before the file's range deletions are irrelevant, even if the
file's largest key is not a range deletion. To work around this subtlety, the
sstable iterator previously would surface knowledge of whether any point keys
may have been skipped through a MaybeFilteredKeys method. The levelIter used
this method to determine when to interleave an ignorable largest boundary in
case there may be additional relevant range deletions.

This commit removes the conditioning on the output of MaybeFilteredKeys,
instead unconditionally interleaving a synthetic boundary at a file's largest
point key if the levelIter user is using the file's range deletion iterator.
This simplifies the logic and removes a fragile reliance on the sstable's
iterators accounting of when it may have filtered keys.

Future work (cockroachdb#2863) will remove the need to interleave synthetic boundaries at
all, instead interleaving the range deletion bounds themselves among point
keys.

Informs cockroachdb#2863.
Close cockroachdb#3334.
jbowens added a commit to jbowens/pebble that referenced this issue Mar 29, 2024
The levelIter has a concept of ignorable boundary keys. When a levelIter
exhausts a file's point keys, it's possible that the same file still contains
range deletions relevant to other levels of the LSM. The levelIter will, under
some conditions, interleave the largest boundary key of the sstable into
iteration as an 'ignorable boundary key,' so that the file's range deletions
remain accessible until all other levels progress beyound the file's boundary.

When block-property filters are in use, a file's point key iterator may become
exhausted early, before the file's range deletions are irrelevant, even if the
file's largest key is not a range deletion. To work around this subtlety, the
sstable iterator previously would surface knowledge of whether any point keys
may have been skipped through a MaybeFilteredKeys method. The levelIter used
this method to determine when to interleave an ignorable largest boundary in
case there may be additional relevant range deletions.

This commit removes the conditioning on the output of MaybeFilteredKeys,
instead unconditionally interleaving a synthetic boundary at a file's largest
point key if the levelIter user is using the file's range deletion iterator.
This simplifies the logic and removes a fragile reliance on the sstable's
iterators accounting of when it may have filtered keys.

Future work (cockroachdb#2863) will remove the need to interleave synthetic boundaries at
all, instead interleaving the range deletion bounds themselves among point
keys.

Informs cockroachdb#2863.
Close cockroachdb#3334.
jbowens added a commit to jbowens/pebble that referenced this issue Apr 1, 2024
The levelIter has a concept of ignorable boundary keys. When a levelIter
exhausts a file's point keys, it's possible that the same file still contains
range deletions relevant to other levels of the LSM. The levelIter will, under
some conditions, interleave the largest boundary key of the sstable into
iteration as an 'ignorable boundary key,' so that the file's range deletions
remain accessible until all other levels progress beyound the file's boundary.

When block-property filters are in use, a file's point key iterator may become
exhausted early, before the file's range deletions are irrelevant, even if the
file's largest key is not a range deletion. To work around this subtlety, the
sstable iterator previously would surface knowledge of whether any point keys
may have been skipped through a MaybeFilteredKeys method. The levelIter used
this method to determine when to interleave an ignorable largest boundary in
case there may be additional relevant range deletions.

This commit removes the conditioning on the output of MaybeFilteredKeys,
instead unconditionally interleaving a synthetic boundary at a file's largest
point key if the levelIter user is using the file's range deletion iterator.
This simplifies the logic and removes a fragile reliance on the sstable's
iterators accounting of when it may have filtered keys.

Future work (cockroachdb#2863) will remove the need to interleave synthetic boundaries at
all, instead interleaving the range deletion bounds themselves among point
keys.

Informs cockroachdb#2863.
Close cockroachdb#3334.
jbowens added a commit that referenced this issue Apr 5, 2024
The levelIter has a concept of ignorable boundary keys. When a levelIter
exhausts a file's point keys, it's possible that the same file still contains
range deletions relevant to other levels of the LSM. The levelIter will, under
some conditions, interleave the largest boundary key of the sstable into
iteration as an 'ignorable boundary key,' so that the file's range deletions
remain accessible until all other levels progress beyound the file's boundary.

When block-property filters are in use, a file's point key iterator may become
exhausted early, before the file's range deletions are irrelevant, even if the
file's largest key is not a range deletion. To work around this subtlety, the
sstable iterator previously would surface knowledge of whether any point keys
may have been skipped through a MaybeFilteredKeys method. The levelIter used
this method to determine when to interleave an ignorable largest boundary in
case there may be additional relevant range deletions.

This commit removes the conditioning on the output of MaybeFilteredKeys,
instead unconditionally interleaving a synthetic boundary at a file's largest
point key if the levelIter user is using the file's range deletion iterator.
This simplifies the logic and removes a fragile reliance on the sstable's
iterators accounting of when it may have filtered keys.

Future work (#2863) will remove the need to interleave synthetic boundaries at
all, instead interleaving the range deletion bounds themselves among point
keys.

Informs #2863.
Close #3334.
jbowens added a commit to jbowens/pebble that referenced this issue Apr 12, 2024
This commit refactors the implementation of getIter to be a little more
understandable and avoid the unnecessary use of levelIter. Current supported
format major versions guarantee that a user key is not split across sstables
within a level. This ensures that Get (which only retrieves one individual user
key) need only consult 1 sstable per level.

This is somewhat motivated by cockroachdb#2863. Removing getIter's dependency on levelIter
will make that refactor easier.
jbowens added a commit to jbowens/pebble that referenced this issue Apr 12, 2024
This commit refactors the implementation of getIter to be a little more
understandable and avoid the unnecessary use of levelIter. Current supported
format major versions guarantee that a user key is not split across sstables
within a level. This ensures that Get (which only retrieves one individual user
key) need only consult 1 sstable per level.

This is somewhat motivated by cockroachdb#2863. Removing getIter's dependency on levelIter
will make that refactor easier.
jbowens added a commit to jbowens/pebble that referenced this issue Apr 12, 2024
This commit refactors the implementation of getIter to be a little more
understandable and avoid the unnecessary use of levelIter. Current supported
format major versions guarantee that a user key is not split across sstables
within a level. This ensures that Get (which only retrieves one individual user
key) need only consult 1 sstable per level.

This is somewhat motivated by cockroachdb#2863. Removing getIter's dependency on levelIter
will make that refactor easier.
jbowens added a commit to jbowens/pebble that referenced this issue Apr 22, 2024
Adapt the InterleavingIter to support a mode that interleaves span end keys.

Informs cockroachdb#2863.
jbowens added a commit that referenced this issue Apr 22, 2024
Adapt the InterleavingIter to support a mode that interleaves span end keys.

Informs #2863.
jbowens added a commit to jbowens/pebble that referenced this issue Apr 24, 2024
The merging iterator and the level iterator are more tightly coupled than
desired. Most of this coupling is a consequence of range deletions. The merging
iterator is responsible for applying range deletions across levels and requires
that the level iterator not progress to a new file until the file's range
deletions are no longer relevant to other levels. It does this by interleaving
"synthetic" keys. When these keys reach the top of the heap, it signals that
all other levels have progressed to a point where the current file's range
deletions are no longer relevant.

Previously, knowledge of which keys were interleaved "synthetic" keys was
propagated indirectly—outside the internal iterator interface—through a pointer
to a levelIterBoundaryContext struct with boolean fields. This commit instead
always interleaves synthetic keys as range deletion exclusive sentinels. The
merging iterator can infer which keys are synthetic by calling
InternalKey.IsExclusiveSentinel. Propagating this information directly through
the internal iterator interface is more direct and understandable.

Beyond needing to know which keys are synthetic, the merging iterator also must
know when user-imposed iteration bounds have been reached. Ordinarily when
bounds have been reached, internal iterators become exhausted. Since the
levelIter's file's range deletions may still be relevant, it cannot and it
interleaves a "synthetic" key at the iteration bound. When this key reaches the
top of the merging iterator's heap, there are no more keys within bounds and
the merging iterator is exhausted. To make this determination, we add a simple
key comparison. The use of a key comparison is expected to be acceptable,
because it's only performed when a synthetic key reaches the top of the heap.
This additional key comparison should ultimately be eliminated when cockroachdb#2863 is
complete.

Informs cockroachdb#2863.
jbowens added a commit to jbowens/pebble that referenced this issue Apr 24, 2024
The merging iterator and the level iterator are more tightly coupled than
desired. Most of this coupling is a consequence of range deletions. The merging
iterator is responsible for applying range deletions across levels and requires
that the level iterator not progress to a new file until the file's range
deletions are no longer relevant to other levels. It does this by interleaving
"synthetic" keys. When these keys reach the top of the heap, it signals that
all other levels have progressed to a point where the current file's range
deletions are no longer relevant.

Previously, knowledge of which keys were interleaved "synthetic" keys was
propagated indirectly—outside the internal iterator interface—through a pointer
to a levelIterBoundaryContext struct with boolean fields. This commit instead
always interleaves synthetic keys as range deletion exclusive sentinels. The
merging iterator can infer which keys are synthetic by calling
InternalKey.IsExclusiveSentinel. Propagating this information directly through
the internal iterator interface is more direct and understandable.

Beyond needing to know which keys are synthetic, the merging iterator also must
know when user-imposed iteration bounds have been reached. Ordinarily when
bounds have been reached, internal iterators become exhausted. Since the
levelIter's file's range deletions may still be relevant, it cannot and it
interleaves a "synthetic" key at the iteration bound. When this key reaches the
top of the merging iterator's heap, there are no more keys within bounds and
the merging iterator is exhausted. To make this determination, we add a simple
key comparison. The use of a key comparison is expected to be acceptable,
because it's only performed when a synthetic key reaches the top of the heap.
This additional key comparison should ultimately be eliminated when cockroachdb#2863 is
complete.

Informs cockroachdb#2863.
jbowens added a commit to jbowens/pebble that referenced this issue Apr 25, 2024
The merging iterator and the level iterator are more tightly coupled than
desired. Most of this coupling is a consequence of range deletions. The merging
iterator is responsible for applying range deletions across levels and requires
that the level iterator not progress to a new file until the file's range
deletions are no longer relevant to other levels. It does this by interleaving
"synthetic" keys. When these keys reach the top of the heap, it signals that
all other levels have progressed to a point where the current file's range
deletions are no longer relevant.

Previously, knowledge of which keys were interleaved "synthetic" keys was
propagated indirectly—outside the internal iterator interface—through a pointer
to a levelIterBoundaryContext struct with boolean fields. This commit instead
always interleaves synthetic keys as range deletion exclusive sentinels. The
merging iterator can infer which keys are synthetic by calling
InternalKey.IsExclusiveSentinel. Propagating this information directly through
the internal iterator interface is more direct and understandable.

Beyond needing to know which keys are synthetic, the merging iterator also must
know when user-imposed iteration bounds have been reached. Ordinarily when
bounds have been reached, internal iterators become exhausted. Since the
levelIter's file's range deletions may still be relevant, it cannot and it
interleaves a "synthetic" key at the iteration bound. When this key reaches the
top of the merging iterator's heap, there are no more keys within bounds and
the merging iterator is exhausted. To make this determination, we add a simple
key comparison. The use of a key comparison is expected to be acceptable,
because it's only performed when a synthetic key reaches the top of the heap.
This additional key comparison should ultimately be eliminated when cockroachdb#2863 is
complete.

Informs cockroachdb#2863.
jbowens added a commit that referenced this issue Apr 26, 2024
The merging iterator and the level iterator are more tightly coupled than
desired. Most of this coupling is a consequence of range deletions. The merging
iterator is responsible for applying range deletions across levels and requires
that the level iterator not progress to a new file until the file's range
deletions are no longer relevant to other levels. It does this by interleaving
"synthetic" keys. When these keys reach the top of the heap, it signals that
all other levels have progressed to a point where the current file's range
deletions are no longer relevant.

Previously, knowledge of which keys were interleaved "synthetic" keys was
propagated indirectly—outside the internal iterator interface—through a pointer
to a levelIterBoundaryContext struct with boolean fields. This commit instead
always interleaves synthetic keys as range deletion exclusive sentinels. The
merging iterator can infer which keys are synthetic by calling
InternalKey.IsExclusiveSentinel. Propagating this information directly through
the internal iterator interface is more direct and understandable.

Beyond needing to know which keys are synthetic, the merging iterator also must
know when user-imposed iteration bounds have been reached. Ordinarily when
bounds have been reached, internal iterators become exhausted. Since the
levelIter's file's range deletions may still be relevant, it cannot and it
interleaves a "synthetic" key at the iteration bound. When this key reaches the
top of the merging iterator's heap, there are no more keys within bounds and
the merging iterator is exhausted. To make this determination, we add a simple
key comparison. The use of a key comparison is expected to be acceptable,
because it's only performed when a synthetic key reaches the top of the heap.
This additional key comparison should ultimately be eliminated when #2863 is
complete.

Informs #2863.
@jbowens jbowens self-assigned this May 2, 2024
jbowens added a commit to jbowens/pebble that referenced this issue May 2, 2024
Refactor the overlap checking to avoid using the levelIter to populate the
range deletion iterator. This was subtle and confusing. This refactor will also
benefit the implementation of cockroachdb#2112.

Informs cockroachdb#2112.
Informs cockroachdb#2863.
jbowens added a commit to jbowens/pebble that referenced this issue May 2, 2024
Refactor the overlap checking to avoid using the levelIter to populate the
range deletion iterator. This was subtle and confusing. This refactor will also
benefit the implementation of cockroachdb#2112.

Informs cockroachdb#2112.
Informs cockroachdb#2863.
jbowens added a commit to jbowens/pebble that referenced this issue May 3, 2024
Refactor the overlap checking to avoid using the levelIter to populate the
range deletion iterator. This was subtle and confusing. This refactor will also
benefit the implementation of cockroachdb#2112.

Informs cockroachdb#2112.
Informs cockroachdb#2863.
jbowens added a commit that referenced this issue May 3, 2024
Refactor the overlap checking to avoid using the levelIter to populate the
range deletion iterator. This was subtle and confusing. This refactor will also
benefit the implementation of #2112.

Informs #2112.
Informs #2863.
jbowens added a commit to jbowens/pebble that referenced this issue May 6, 2024
Previously when the interleaving iterator interleaved a span boundary, the
semantics of calling NextPrefix were indeterminate. Ordinarily NextPrefix
guarantees that it advances to a key with a new prefix. The existing
interleaving iterator implementation didn't guarantee this when positioned on a
span boundary. Guaranteeing it would require otherwise unnecessary key
comparisons.

In existing use cases (for range keys, which always begin and end at prefix
keys), we never call NextPrefix while positioned at a span boundary. With the
planned refactor of cockroachdb#2863, we'll begin to use the interleaving iterator to
interleave range deletions which may have bounds within the keys of a prefix.
This assertion will ensure we don't unintentionally rely on currently undefined
behavior of calling NextPrefix on a span boundary.
jbowens added a commit to jbowens/pebble that referenced this issue May 6, 2024
Previously, levelIter would sometimes interleave fake keys at file boundaries
or user-provided iteration boundaries when a file contains range deletions.
This commit reworks the levelIter to instead interleave the individual bounds
of the range deletions themselves, using an interleaving iterator. This
simplifies the levelIter. Because range deletions are now read in two places:
once for interleaving bounds and once to expose a rangeDelIter to the
mergingIter, this commit requires opening a file's range deletion iterator
twice. This is an intermediary state until cockroachdb#2863 is completed when the
mergingIter will consult the levelIter's interleaving iterator to determine
which range deletions overlap the current iterator position.

Informs cockroachdb#2863.
jbowens added a commit to jbowens/pebble that referenced this issue May 6, 2024
Previously, levelIter would sometimes interleave fake keys at file boundaries
or user-provided iteration boundaries when a file contains range deletions.
This commit reworks the levelIter to instead interleave the individual bounds
of the range deletions themselves, using an interleaving iterator. This
simplifies the levelIter. Because range deletions are now read in two places:
once for interleaving bounds and once to expose a rangeDelIter to the
mergingIter, this commit requires opening a file's range deletion iterator
twice. This is an intermediary state until cockroachdb#2863 is completed when the
mergingIter will consult the levelIter's interleaving iterator to determine
which range deletions overlap the current iterator position.

Informs cockroachdb#2863.
jbowens added a commit that referenced this issue May 6, 2024
Previously when the interleaving iterator interleaved a span boundary, the
semantics of calling NextPrefix were indeterminate. Ordinarily NextPrefix
guarantees that it advances to a key with a new prefix. The existing
interleaving iterator implementation didn't guarantee this when positioned on a
span boundary. Guaranteeing it would require otherwise unnecessary key
comparisons.

In existing use cases (for range keys, which always begin and end at prefix
keys), we never call NextPrefix while positioned at a span boundary. With the
planned refactor of #2863, we'll begin to use the interleaving iterator to
interleave range deletions which may have bounds within the keys of a prefix.
This assertion will ensure we don't unintentionally rely on currently undefined
behavior of calling NextPrefix on a span boundary.
jbowens added a commit to jbowens/pebble that referenced this issue May 7, 2024
Previously, levelIter would sometimes interleave fake keys at file boundaries
or user-provided iteration boundaries when a file contains range deletions.
This commit reworks the levelIter to instead interleave the individual bounds
of the range deletions themselves, using an interleaving iterator. This
simplifies the levelIter. Because range deletions are now read in two places:
once for interleaving bounds and once to expose a rangeDelIter to the
mergingIter, this commit requires opening a file's range deletion iterator
twice. This is an intermediary state until cockroachdb#2863 is completed when the
mergingIter will consult the levelIter's interleaving iterator to determine
which range deletions overlap the current iterator position.

Informs cockroachdb#2863.
jbowens added a commit to jbowens/pebble that referenced this issue May 8, 2024
Previously, levelIter would sometimes interleave fake keys at file boundaries
or user-provided iteration boundaries when a file contains range deletions.
This commit reworks the levelIter to instead interleave the individual bounds
of the range deletions themselves, using an interleaving iterator. This
simplifies the levelIter. Because range deletions are now read in two places:
once for interleaving bounds and once to expose a rangeDelIter to the
mergingIter, this commit requires opening a file's range deletion iterator
twice. This is an intermediary state until cockroachdb#2863 is completed when the
mergingIter will consult the levelIter's interleaving iterator to determine
which range deletions overlap the current iterator position.

Informs cockroachdb#2863.
jbowens added a commit to jbowens/pebble that referenced this issue May 9, 2024
Previously, levelIter would sometimes interleave fake keys at file boundaries
or user-provided iteration boundaries when a file contains range deletions.
This commit reworks the levelIter to instead interleave the individual bounds
of the range deletions themselves, using an interleaving iterator. This
simplifies the levelIter. Because range deletions are now read in two places:
once for interleaving bounds and once to expose a rangeDelIter to the
mergingIter, this commit requires opening a file's range deletion iterator
twice. This is an intermediary state until cockroachdb#2863 is completed when the
mergingIter will consult the levelIter's interleaving iterator to determine
which range deletions overlap the current iterator position.

Informs cockroachdb#2863.
jbowens added a commit to jbowens/pebble that referenced this issue May 10, 2024
Previously, levelIter would sometimes interleave fake keys at file boundaries
or user-provided iteration boundaries when a file contains range deletions.
This commit reworks the levelIter to instead interleave the individual bounds
of the range deletions themselves, using an interleaving iterator. This
simplifies the levelIter. Because range deletions are now read in two places:
once for interleaving bounds and once to expose a rangeDelIter to the
mergingIter, this commit requires opening a file's range deletion iterator
twice. This is an intermediary state until cockroachdb#2863 is completed when the
mergingIter will consult the levelIter's interleaving iterator to determine
which range deletions overlap the current iterator position.

Informs cockroachdb#2863.
jbowens added a commit that referenced this issue May 10, 2024
Previously, levelIter would sometimes interleave fake keys at file boundaries
or user-provided iteration boundaries when a file contains range deletions.
This commit reworks the levelIter to instead interleave the individual bounds
of the range deletions themselves, using an interleaving iterator. This
simplifies the levelIter. Because range deletions are now read in two places:
once for interleaving bounds and once to expose a rangeDelIter to the
mergingIter, this commit requires opening a file's range deletion iterator
twice. This is an intermediary state until #2863 is completed when the
mergingIter will consult the levelIter's interleaving iterator to determine
which range deletions overlap the current iterator position.

Informs #2863.
jbowens added a commit to jbowens/pebble that referenced this issue May 10, 2024
Rework the merging iterator's handling of range deletions to use the
interleaved range deletion boundary keys to determine when the iterator is
positioned within a level's range deletion. This removes the direct
manipulation of a range deletion keyspan.FragmentIterator from the mergingIter,
delegating that to the child iterator's keyspan.InterleavingIter.

This factoring is a little cleaner and decouples the mergingIter from the
details of range deletion iterators, and in particular, the levelIter's
individual sstables. It also should reduce key comparisons, especially during
scans, by avoiding unnecessary key comparisons with range deletions that are
loaded but outside the keyspace being iterated over.

Close cockroachdb#2863.
@jbowens jbowens linked a pull request May 10, 2024 that will close this issue
jbowens added a commit to jbowens/pebble that referenced this issue May 10, 2024
Rework the merging iterator's handling of range deletions to use the
interleaved range deletion boundary keys to determine when the iterator is
positioned within a level's range deletion. This removes the direct
manipulation of a range deletion keyspan.FragmentIterator from the mergingIter,
delegating that to the child iterator's keyspan.InterleavingIter.

This factoring is a little cleaner and decouples the mergingIter from the
details of range deletion iterators, and in particular, the levelIter's
individual sstables. It also should reduce key comparisons, especially during
scans, by avoiding unnecessary key comparisons with range deletions that are
loaded but outside the keyspace being iterated over.

Close cockroachdb#2863.
jbowens added a commit to jbowens/pebble that referenced this issue May 13, 2024
Rework the merging iterator's handling of range deletions to use the
interleaved range deletion boundary keys to determine when the iterator is
positioned within a level's range deletion. This removes the direct
manipulation of a range deletion keyspan.FragmentIterator from the mergingIter,
delegating that to the child iterator's keyspan.InterleavingIter.

This factoring is a little cleaner and decouples the mergingIter from the
details of range deletion iterators, and in particular, the levelIter's
individual sstables. It also should reduce key comparisons, especially during
scans, by avoiding unnecessary key comparisons with range deletions that are
loaded but outside the keyspace being iterated over.

Close cockroachdb#2863.
@jbowens jbowens moved this to 24.2 candidates in [Deprecated] Storage Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: 24.2 candidates
Development

Successfully merging a pull request may close this issue.

1 participant