Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: Set isIgnorableBoundaryKey=true for rangedel boundaries in levelIter #2123

Merged

Conversation

itsbilal
Copy link
Contributor

@itsbilal itsbilal commented Nov 11, 2022

Merge keys put the top level Iterator's position to iterPosNext, and
in the process of Next()ing to the next user key, one of the child
levelIters could have advanced past the sstable that contained the
Merge that is being returned.

If that sstable contained a range delete covering a key in an sstable
in a lower level, and the lower level's key was not exposed to the
mergingIter due to a bloom filter mismatch, and we do two consecutive
SeekPrefixGEs where the second one gets TrySeekUsingNext, we could
end up surfacing the deleted key in the lower sstable as the
higher level's levelIter has already moved past the sstable
containing the range deletion.

This change sets isIgnorableBoundaryKey = true when returning
largestBoundary in the levelIter to prevent the mergingIter from
next()ing past it in the first SeekPrefixGE.

Fixes #2118.

@itsbilal itsbilal requested review from jbowens and a team November 11, 2022 21:11
@itsbilal itsbilal self-assigned this Nov 11, 2022
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@itsbilal
Copy link
Contributor Author

@jbowens happy to hear your thoughts on this; maybe this is an overly-defensive disabling of the optimization when we can do something like setting isIgnorableBoundaryKey to true in the levelIter.skipEmptyFileForward case that pauses at the rangedel sentinel key. For some reason levelIter flips that flag to true in SeekPrefixGE's rangedel-sentinel pausing case but not in Next()/skipEmptyFileForward().

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a hard time wrapping my head around what the bug is. It looks like the unit test doesn't exercise the bug, because the test still passes after reverting iterator.go?

Reviewable status: 0 of 2 files reviewed, all discussions resolved (waiting on @itsbilal)

@itsbilal itsbilal force-pushed the fix-seekprefixge-merge-usingnext branch from e7ba212 to 1e881f1 Compare November 13, 2022 15:16
Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbowens you were right. I found the last missing piece of the puzzle; the lower level SST has to have a bloom filter mismatch in the first SeekPrefixGE, otherwise the deleted key gets surfaced to the mergingIter and we don't double-next() the higher level sstable. The test fails now if you revert the other change. I also made the isIgnorableBoundaryKey change as that passes the test too.

Reviewable status: 0 of 4 files reviewed, all discussions resolved (waiting on @itsbilal)

@itsbilal itsbilal changed the title *: Don't TrySeekUsingNext after a Merge with prefix seek *: levelIter: Don't iterate past largestBoundary in Next() Nov 13, 2022
@itsbilal
Copy link
Contributor Author

Nevermind, this trips up another test. I'll revert to the previous fix soon.

@itsbilal itsbilal force-pushed the fix-seekprefixge-merge-usingnext branch from 1e881f1 to f740894 Compare November 14, 2022 03:23
@itsbilal itsbilal changed the title *: levelIter: Don't iterate past largestBoundary in Next() *: Don't TrySeekUsingNext after a Merge with prefix seek Nov 14, 2022
@itsbilal itsbilal requested a review from jbowens November 14, 2022 03:24
@itsbilal
Copy link
Contributor Author

Reverted to the original fix and test improved, PTAL!

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 4 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @jbowens)


iterator.go line 1298 at r3 (raw file):

	}
	if lastPositioningOp == seekPrefixGELastPositioningOp && i.pos != iterPosNext {
		// iterPosNext is unsafe to optimize for TrySeekUsingNext. This is because

I'm having some difficulty with this change, both in understanding and whether this ought to be the fix. Ideally we should be able to state an invariant about the InternalIterator tree and use that to decide on the optimization here (like we do in our current optimizations in iterator.go) and not have to reason about the details about the InternalIterator implementation. I worry that with the latter we will move much closer to not being able to understand or improve these optimizations in the future.
I don't quite understand the gap in the current pausing logic in level_iter.go and merging_iter.go that is causing this -- I suspect the fix belongs there and not in iterator.go.

btw, I'm unsure about how this works in the presence of InterleavingIter. nextPointCurrentUserKey can set i.pos = iterPosNext because the range key iterator is the current position and the point iter it at the next position. I guess this is harmless from a correctness perspective, but will mean we don't use TrySeekUsingNext?

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 4 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @sumeerbhola)


iterator.go line 1298 at r3 (raw file):

Previously, sumeerbhola wrote…

I'm having some difficulty with this change, both in understanding and whether this ought to be the fix. Ideally we should be able to state an invariant about the InternalIterator tree and use that to decide on the optimization here (like we do in our current optimizations in iterator.go) and not have to reason about the details about the InternalIterator implementation. I worry that with the latter we will move much closer to not being able to understand or improve these optimizations in the future.
I don't quite understand the gap in the current pausing logic in level_iter.go and merging_iter.go that is causing this -- I suspect the fix belongs there and not in iterator.go.

btw, I'm unsure about how this works in the presence of InterleavingIter. nextPointCurrentUserKey can set i.pos = iterPosNext because the range key iterator is the current position and the point iter it at the next position. I guess this is harmless from a correctness perspective, but will mean we don't use TrySeekUsingNext?

Ditto. Is the bug that we don't do this logic when the level iter gets Next-ed during prefix iteration mode?

And currently we don't have knowledge of whether we're in prefix mode during levelIter.Next, so we can't readily just add that logic to the Next exhaustion case?

@itsbilal itsbilal force-pushed the fix-seekprefixge-merge-usingnext branch from f740894 to 7ca96e0 Compare November 14, 2022 16:21
@itsbilal
Copy link
Contributor Author

iterator.go line 1298 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Ditto. Is the bug that we don't do this logic when the level iter gets Next-ed during prefix iteration mode?

And currently we don't have knowledge of whether we're in prefix mode during levelIter.Next, so we can't readily just add that logic to the Next exhaustion case?

Just pushed a detailed example into the comments, hope that helps. Also reverted to the isIgnorableBoundaryKey fix that I had yesterday, which is in line with Sumeer's comment. It changed some other datadriven tests like I had mentioned earlier but it seems safe to me (more eyes here would help).

Not sure if levelIter needs to know about prefix iteration, but we shouldn't be double-Nexting past the sstable containing the range delete in any case (I think).

@itsbilal itsbilal changed the title *: Don't TrySeekUsingNext after a Merge with prefix seek *: Set isIgnorableBoundaryKey=true for rangedel boundaries in levelIter Nov 14, 2022
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 files reviewed, 2 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


level_iter.go line 960 at r4 (raw file):

				if l.boundaryContext != nil {
					l.boundaryContext.isIgnorableBoundaryKey = true
				}

This isIgnorableBoundaryKey change makes sense to me.

If I'm not mistaken, we could adjust the handling of range tombstones to always rely on isIgnorableBoundaryKey and remove mergingIter.elideRangeTombstones. We've be relying on the levelIter always setting isIgnorableBoundaryKey = true whenever it returns one of these boundary keys (including in skipEmptyFileBackward). The mergingIter would not need to perform any checks for range deletion key kinds. When using within an Iterator, the mergingIter will never return range tombstone keys, as it does today with elideRangeTombstones=true. When used in compactions, it will continue to return interleaved range deletions, just not any of the artificial boundary keys.


iterator.go line 1298 at r3 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Just pushed a detailed example into the comments, hope that helps. Also reverted to the isIgnorableBoundaryKey fix that I had yesterday, which is in line with Sumeer's comment. It changed some other datadriven tests like I had mentioned earlier but it seems safe to me (more eyes here would help).

Not sure if levelIter needs to know about prefix iteration, but we shouldn't be double-Nexting past the sstable containing the range delete in any case (I think).

Thanks, this example helped.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 files reviewed, 2 unresolved discussions (waiting on @itsbilal and @jbowens)


level_iter.go line 960 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

This isIgnorableBoundaryKey change makes sense to me.

If I'm not mistaken, we could adjust the handling of range tombstones to always rely on isIgnorableBoundaryKey and remove mergingIter.elideRangeTombstones. We've be relying on the levelIter always setting isIgnorableBoundaryKey = true whenever it returns one of these boundary keys (including in skipEmptyFileBackward). The mergingIter would not need to perform any checks for range deletion key kinds. When using within an Iterator, the mergingIter will never return range tombstone keys, as it does today with elideRangeTombstones=true. When used in compactions, it will continue to return interleaved range deletions, just not any of the artificial boundary keys.

I went and reread #1725 (comment). Is the improvement you are suggesting changing the bullet "Pausing when reach a file bound that is a rangedel: mergingIter skips these keys since it ignores based on InternalKeyKindRangeDelete"?

btw, while we are here I think we should improve the commentary around isSyntheticIterBoundsKey and isIgnorableBoundaryKey. I tend to forget why both concepts are needed, and maybe others do too.
My understanding is that the former (a) plays a role in switching directions that needs it to be a separate concept (but the details for that I have forgotten), (b) allows for stating an invariant for the whole mergingIter when the synthetic key is at the top, since the bounds that are reached apply to the whole iterator heap (in comparison isIgnorableBoundaryKey is a local pause for a level). (b) is why we can break here

pebble/merging_iter.go

Lines 687 to 688 in 1bda21f

if m.levels[item.index].isSyntheticIterBoundsKey {
break
while we can't do so for isIgnorableBoundaryKey

@jbowens
Copy link
Collaborator

jbowens commented Nov 14, 2022

I went and reread #1725 (comment). Is the improvement you are suggesting changing the bullet "Pausing when reach a file bound that is a rangedel: mergingIter skips these keys since it ignores based on InternalKeyKindRangeDelete"?

Yeah. Instead of skipping them based on InternalKeyKindRangeDelete, it'd skip them based on isIgnorableBoundaryKey.

My understanding is that isIgnorableBoundaryKey is set to indicate that a key is an artificial boundary key that exists only to prevent the level iterator from proceeding to the next file before the merging iterator is done with its range deletions. The merging iterator skips these ignorable boundary keys. Previously, before the introduction of isIgnorableBoundaryKey, we would always use Kind() == InternalKeyKindRangeDelete to determine whether a way as a boundary key that could be skipped. Currently, on master, we use both, inconsistently.


I'm having a hard time recalling when isSyntheticIterBoundsKey (879f3bf) is necessary.

@itsbilal
Copy link
Contributor Author

itsbilal commented Nov 14, 2022

My understanding is that isIgnorableBoundaryKey is set to indicate that a key is an artificial boundary key that exists only to prevent the level iterator from proceeding to the next file before the merging iterator is done with its range deletions. The merging iterator skips these ignorable boundary keys. Previously, before the introduction of isIgnorableBoundaryKey, we would always use Kind() == InternalKeyKindRangeDelete to determine whether a way as a boundary key that could be skipped. Currently, on master, we use both, inconsistently.

There's one additional use of isIgnorableBoundaryKey and it's in mergingIter.findNextEntry where we check if we're in prefix iteration and at an ignorable boundary key that's at a different prefix than the prefix we're searching for. See:

if m.prefix != nil && (reseeked || m.levels[item.index].isIgnorableBoundaryKey) {

This is the case we want to go into to avoid this bug, but weren't before this change.

I'll make the change to remove elideRangeTombstones in a separate commit just to keep the bugfix isolated in history.

Merge keys put the top level Iterator's position to `iterPosNext`, and
in the process of Next()ing to the next user key, one of the child
levelIters could have advanced past the sstable that contained the
Merge that is being returned.

If that sstable contained a range delete covering a key in an sstable
in a lower level, and the lower level's key was not exposed to the
mergingIter due to a bloom filter mismatch, and we do two consecutive
SeekPrefixGEs where the second one gets TrySeekUsingNext, we could
end up surfacing the deleted key in the lower sstable as the
higher level's levelIter has already moved past the sstable
containing the range deletion.

This change sets isIgnorableBoundaryKey = true when returning
largestBoundary in the levelIter to prevent the mergingIter from
next()ing past it in the first SeekPrefixGE.

Fixes cockroachdb#2118.
@itsbilal itsbilal force-pushed the fix-seekprefixge-merge-usingnext branch from 7ca96e0 to f9835a2 Compare November 14, 2022 20:52
Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)


level_iter.go line 960 at r4 (raw file):

Previously, sumeerbhola wrote…

I went and reread #1725 (comment). Is the improvement you are suggesting changing the bullet "Pausing when reach a file bound that is a rangedel: mergingIter skips these keys since it ignores based on InternalKeyKindRangeDelete"?

btw, while we are here I think we should improve the commentary around isSyntheticIterBoundsKey and isIgnorableBoundaryKey. I tend to forget why both concepts are needed, and maybe others do too.
My understanding is that the former (a) plays a role in switching directions that needs it to be a separate concept (but the details for that I have forgotten), (b) allows for stating an invariant for the whole mergingIter when the synthetic key is at the top, since the bounds that are reached apply to the whole iterator heap (in comparison isIgnorableBoundaryKey is a local pause for a level). (b) is why we can break here

pebble/merging_iter.go

Lines 687 to 688 in 1bda21f

if m.levels[item.index].isSyntheticIterBoundsKey {
break
while we can't do so for isIgnorableBoundaryKey

@jbowens made the change, PTAL!

Simplify the mergingIter range tombstone elision logic to not
need to know whether it's part of an Iterator or a compaction
iterator. Previously if elideRangeTombstone was true, the mergingIter
was part of an `Iterator` and we'd skip any RANGEDEL keys coming from
child iterators, which really only happened to be synthetic bound
keys or ignorable bound keys (set as such in i.boundaryContext).
For compaction iterators, we'd continue to surface all RANGEDELs
observed (and elideRangeTombstone was false).

Now, rangedels from child iterators that are
`!isIgnorableBoundaryKey` are surfaced by the merging iter, and
the last remaining case where we wanted to surface a key
but `isIgnorableBoundaryKey=false` was in skipEmptyFileBackward
and that has also been changed to isIgnorableBoundaryKey=true.
This removes the need for elideRangeTombstone so that has been
cleared up, and any tests that relied on elideRangeTombstone
have been updated.
@itsbilal itsbilal force-pushed the fix-seekprefixge-merge-usingnext branch from f9835a2 to 116ecc4 Compare November 14, 2022 22:06
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @itsbilal and @sumeerbhola)

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @itsbilal and @sumeerbhola)

@itsbilal
Copy link
Contributor Author

TFTR @jbowens !

Should we backport the first commit to 22.2? I think we should.

@itsbilal itsbilal merged commit fa09809 into cockroachdb:master Nov 16, 2022
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r2, 1 of 4 files at r4, 1 of 1 files at r5, 6 of 7 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions


merging_iter.go line 736 at r7 (raw file):

		if item.key.Visible(m.snapshot, m.batchSnapshot) &&
			(!m.levels[item.index].isIgnorableBoundaryKey) &&
			(item.key.Kind() != InternalKeyKindRangeDelete || !m.elideRangeTombstones) {

liking the simplification!

jbowens added a commit to jbowens/pebble that referenced this pull request Dec 7, 2022
Builds off of cockroachdb#2151. This commit refactors the merging iterator's
handling of prefix-related invariants for enabling TrySeekUsingNext
optimizations when using SeekPrefixGE.

The TrySeekUsingNext optimization uses the successively larger seek keys
to attempt to use next to find the appropriate key, instead of seeking.
In cockroachdb#2048, we introduced an optimization that would avoid re-seeking
within a level's file metadata, instead using the presence of the
TrySeekUsingNext flag to signal that we can Next through the file
metadata. This saves key comparisons, especially in larger LSMs with
many files within a level.

Ever since the introduction of cockroachdb#2048, we've encountered a range of
metamorphic test failures due to interactions between this optimization
and prefix iteration. We've made various fixes in cockroachdb#2087, cockroachdb#2111, cockroachdb#2123,

Most of these failures revolve around the progression of a level
iterator beyond the iterator's current iteration prefix. This is
problematic, because the incomplete view of the LSM during prefix
iteration can result in inconsistent positioning outside of the current
iterator prefix. The internal iterators do not guarantee that keys or
range deletions outside the prefix are observed.

This commit fixes the latest instance of this (cockroachdb#2174), and attempts to
restructure the code to clarify the required invariants. Now, any method
that may advance the iterator during prefix iteration (eg, nextEntry or
isNextEntryDeleted) check that the iteration prefix has not already been
exceeded. This check is only performed when the merging iterator is
advancing internal iterators of its own volition (eg, due to non-visible
keys or keys deleted by range tombstones). It does NOT perform this
check when the top-level Iterator Nexts the merging iterator because
of a user Next, a point-delete, etc. This means these prefix checks are
only redundant with the top-level Iterator Nexts in two cases:

  a) the heap root is deleted by a range tombstone
  b) the heap root is not visible at the iterator's sequence number

Both of these cases are expected to be rare, and the redundant prefix
check is expected to be insignificant.

Fix cockroachdb#2174.
jbowens added a commit to jbowens/pebble that referenced this pull request Dec 8, 2022
Builds off of cockroachdb#2151. This commit refactors the merging iterator's
handling of prefix-related invariants for enabling TrySeekUsingNext
optimizations when using SeekPrefixGE.

The TrySeekUsingNext optimization uses successively larger seek keys to
attempt to use next to find the appropriate key, instead of seeking.  In
a level's file metadata, instead using the presence of the
TrySeekUsingNext flag to signal that we can Next through the file
metadata. This saves key comparisons, especially in larger LSMs with
many files within a level.

Ever since the introduction of cockroachdb#2048, we've encountered a range of
metamorphic test failures due to interactions between this optimization
and prefix iteration. We've made various fixes in cockroachdb#2087, cockroachdb#2111, cockroachdb#2123.

Most of these failures revolve around the progression of a level
iterator beyond the iterator's current iteration prefix. This is
problematic, because the incomplete view of the LSM during prefix
iteration can result in inconsistent positioning outside of the current
iterator prefix. The internal iterators do not guarantee that keys or
range deletions outside the prefix are observed.

This commit fixes the latest instance of this (cockroachdb#2174), and attempts to
restructure the code to clarify the required invariants. Now, any method
that may advance the iterator during prefix iteration (eg, nextEntry or
isNextEntryDeleted) check that the iteration prefix has not already been
exceeded. This check is only performed when the merging iterator is
advancing internal iterators of its own volition (eg, due to non-visible
keys or keys deleted by range tombstones). It does NOT perform this
check when the top-level Iterator Nexts the merging iterator because
of a user Next, a point-delete, etc. This means these prefix checks are
only redundant with the top-level Iterator Nexts in two cases:

  a) the heap root is deleted by a range tombstone
  b) the heap root is not visible at the iterator's sequence number

Both of these cases are expected to be rare, and the redundant prefix
check is expected to be insignificant.

Fix cockroachdb#2174.
jbowens added a commit to jbowens/pebble that referenced this pull request Dec 8, 2022
Builds off of cockroachdb#2151. This commit refactors the merging iterator's
handling of prefix-related invariants for enabling TrySeekUsingNext
optimizations when using SeekPrefixGE.

The TrySeekUsingNext optimization uses successively larger seek keys to
attempt to use next to find the appropriate key, instead of seeking.  In
a level's file metadata, instead using the presence of the
TrySeekUsingNext flag to signal that we can Next through the file
metadata. This saves key comparisons, especially in larger LSMs with
many files within a level.

Ever since the introduction of cockroachdb#2048, we've encountered a range of
metamorphic test failures due to interactions between this optimization
and prefix iteration. We've made various fixes in cockroachdb#2087, cockroachdb#2111, cockroachdb#2123.

Most of these failures revolve around the progression of a level
iterator beyond the iterator's current iteration prefix. This is
problematic, because the incomplete view of the LSM during prefix
iteration can result in inconsistent positioning outside of the current
iterator prefix. The internal iterators do not guarantee that keys or
range deletions outside the prefix are observed.

This commit fixes the latest instance of this (cockroachdb#2174), and attempts to
restructure the code to clarify the required invariants. Now, any method
that may advance the iterator during prefix iteration (eg, nextEntry or
isNextEntryDeleted) check that the iteration prefix has not already been
exceeded. This check is only performed when the merging iterator is
advancing internal iterators of its own volition (eg, due to non-visible
keys or keys deleted by range tombstones). It does NOT perform this
check when the top-level Iterator Nexts the merging iterator because
of a user Next, a point-delete, etc. This means these prefix checks are
only redundant with the top-level Iterator Nexts in two cases:

  a) the heap root is deleted by a range tombstone
  b) the heap root is not visible at the iterator's sequence number

Both of these cases are expected to be rare, and the redundant prefix
check is expected to be insignificant.

An adjustment to avoid skipping masked points beyond the prefix provides
a large benefit to prefix-iteration within a large swath of masked point
keys.

```
name                                                                     old time/op  new time/op  delta
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/seekprefix-16   52.2ms ±10%  50.7ms ± 2%     ~     (p=0.690 n=5+5)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/next-16         4.10ms ± 2%  4.07ms ± 2%     ~     (p=0.690 n=5+5)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/backward-16             6.09ms ± 1%  6.02ms ± 1%   -1.19%  (p=0.016 n=5+5)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/seekprefix-16   52.5ms ± 2%  54.3ms ± 5%   +3.36%  (p=0.032 n=5+5)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/next-16         3.77ms ± 5%  4.47ms ± 6%  +18.46%  (p=0.008 n=5+5)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/backward-16             5.21ms ± 3%  5.40ms ± 3%   +3.71%  (p=0.016 n=5+5)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/seekprefix-16   59.7ms ±12%  54.2ms ± 7%   -9.09%  (p=0.032 n=5+5)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/next-16         2.03ms ± 4%  2.08ms ±22%     ~     (p=0.690 n=5+5)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/backward-16             2.80ms ± 7%  2.63ms ± 4%   -6.04%  (p=0.032 n=5+5)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/seekprefix-16   1.70s ± 1%   0.06s ± 6%  -96.76%  (p=0.016 n=4+5)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/next-16         280µs ± 2%   278µs ± 1%     ~     (p=0.548 n=5+5)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/backward-16             514µs ± 3%   505µs ± 0%     ~     (p=0.151 n=5+5)
```

Fix cockroachdb#2174.
jbowens added a commit to jbowens/pebble that referenced this pull request Dec 8, 2022
Builds off of cockroachdb#2151. This commit refactors the merging iterator's
handling of prefix-related invariants for enabling TrySeekUsingNext
optimizations when using SeekPrefixGE.

The TrySeekUsingNext optimization uses successively larger seek keys to
attempt to use next to find the appropriate key, instead of seeking.  In
a level's file metadata, instead using the presence of the
TrySeekUsingNext flag to signal that we can Next through the file
metadata. This saves key comparisons, especially in larger LSMs with
many files within a level.

Ever since the introduction of cockroachdb#2048, we've encountered a range of
metamorphic test failures due to interactions between this optimization
and prefix iteration. We've made various fixes in cockroachdb#2087, cockroachdb#2111, cockroachdb#2123.

Most of these failures revolve around the progression of a level
iterator beyond the iterator's current iteration prefix. This is
problematic, because the incomplete view of the LSM during prefix
iteration can result in inconsistent positioning outside of the current
iterator prefix. The internal iterators do not guarantee that keys or
range deletions outside the prefix are observed.

This commit fixes the latest instance of this (cockroachdb#2174), and attempts to
restructure the code to clarify the required invariants. Now, any method
that may advance the iterator during prefix iteration (eg, nextEntry or
isNextEntryDeleted) check that the iteration prefix has not already been
exceeded. This check is only performed when the merging iterator is
advancing internal iterators of its own volition (eg, due to non-visible
keys or keys deleted by range tombstones). It does NOT perform this
check when the top-level Iterator Nexts the merging iterator because
of a user Next, a point-delete, etc. This means these prefix checks are
only redundant with the top-level Iterator Nexts in two cases:

  a) the heap root is deleted by a range tombstone
  b) the heap root is not visible at the iterator's sequence number

Both of these cases are expected to be rare, and the redundant prefix
check is expected to be insignificant.

An adjustment to avoid skipping masked points beyond the prefix provides
a large benefit to prefix-iteration within a large swath of masked point
keys.

```
name                                                                     old time/op  new time/op  delta
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/seekprefix-24   66.0ms ± 1%  65.6ms ± 2%     ~     (p=0.182 n=9+10)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/next-24         4.99ms ± 2%  5.01ms ± 3%     ~     (p=0.842 n=9+10)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/backward-24             6.51ms ± 6%  6.64ms ± 6%     ~     (p=0.190 n=10+10)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/seekprefix-24   68.2ms ± 1%  66.6ms ± 1%   -2.39%  (p=0.000 n=10+9)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/next-24         4.54ms ± 1%  4.48ms ± 2%   -1.32%  (p=0.009 n=10+10)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/backward-24             5.60ms ± 3%  5.45ms ± 1%   -2.76%  (p=0.006 n=10+8)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/seekprefix-24   74.8ms ± 1%  66.6ms ± 1%  -11.03%  (p=0.000 n=10+9)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/next-24         2.72ms ± 2%  2.75ms ± 1%   +1.38%  (p=0.013 n=10+9)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/backward-24             3.46ms ± 1%  3.43ms ± 2%   -0.96%  (p=0.023 n=10+10)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/seekprefix-24   2.29s ± 1%   0.07s ± 2%  -97.01%  (p=0.000 n=9+10)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/next-24         356µs ± 1%   359µs ± 1%   +0.86%  (p=0.003 n=10+8)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/backward-24             716µs ± 1%   717µs ± 1%     ~     (p=0.739 n=10+10)
```

Fix cockroachdb#2174.
jbowens added a commit to jbowens/pebble that referenced this pull request Dec 9, 2022
Builds off of cockroachdb#2151. This commit refactors the merging iterator's
handling of prefix-related invariants for enabling TrySeekUsingNext
optimizations when using SeekPrefixGE.

The TrySeekUsingNext optimization notes successively larger seek keys,
and attempts to use next to find the appropriate key in the later seeks,
avoiding seeking from scratch. In a level's file metadata, we now use
the presence of the TrySeekUsingNext flag to signal that we can Next
through the file metadata. This saves key comparisons, especially in
larger LSMs with many files within a level.

Ever since the introduction of cockroachdb#2048, we've encountered a slew of
metamorphic test failures due to interactions between this optimization
and prefix iteration. We've made various fixes in cockroachdb#2087, cockroachdb#2111, cockroachdb#2123
and cockroachdb#2151.

Most of these failures revolve around the progression of a level
iterator beyond the iterator's current iteration prefix. This is
problematic, because the incomplete view of the LSM during prefix
iteration can result in inconsistent positioning outside of the current
iterator prefix. The internal iterators do not guarantee that keys or
range deletions outside the prefix are observed.

This commit fixes the latest instance of this (cockroachdb#2174), and attempts to
restructure the code to clarify the required invariants. Now, any method
that may advance the iterator during prefix iteration (eg, nextEntry or
isNextEntryDeleted) check that the iteration prefix has not already been
exceeded. This check is only performed when the merging iterator is
advancing internal iterators of its own volition (eg, due to non-visible
keys or keys deleted by range tombstones). It does NOT perform this
check when the top-level Iterator Nexts the merging iterator because
of a user Next, a point-delete, etc. This means these prefix checks are
only redundant with the top-level Iterator Nexts in two cases:

  a) the heap root is deleted by a range tombstone
  b) the heap root is not visible at the iterator's sequence number

Both of these cases are expected to be rare, and the redundant prefix
check is expected to be insignificant.

Additionally, to obey these new stricter invariants, the interleaving
iterator is adjusted to exhaust the iterator when the remainder of a
prefix is masked by a range key. Besides ensuring invariants are
respected, this change provides a large benefit to prefix iteration
within a large swath of masked point keys.

```
name                                                                     old time/op  new time/op  delta
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/seekprefix-24   59.8ms ± 2%  60.3ms ± 2%   +0.92%  (p=0.013 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/next-24         4.50ms ± 6%  4.52ms ± 3%     ~     (p=0.388 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/backward-24             5.64ms ± 7%  5.51ms ± 3%   -2.26%  (p=0.001 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/seekprefix-24   64.0ms ± 5%  60.2ms ± 4%   -5.89%  (p=0.000 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/next-24         4.11ms ± 3%  4.16ms ± 5%   +1.06%  (p=0.040 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/backward-24             4.94ms ± 3%  4.89ms ± 3%   -0.97%  (p=0.011 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/seekprefix-24   69.0ms ± 2%  60.1ms ± 2%  -12.88%  (p=0.000 n=22+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/next-24         2.55ms ± 2%  2.59ms ± 3%   +1.90%  (p=0.000 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/backward-24             3.20ms ± 2%  3.21ms ± 3%     ~     (p=0.418 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/seekprefix-24   2.19s ± 3%   0.06s ± 2%  -97.27%  (p=0.000 n=24+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/next-24         346µs ± 2%   346µs ± 2%     ~     (p=0.898 n=24+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/backward-24             687µs ± 3%   696µs ± 2%   +1.31%  (p=0.000 n=25+25)
```

Fix cockroachdb#2174.
jbowens added a commit to jbowens/pebble that referenced this pull request Dec 9, 2022
Builds off of cockroachdb#2151. This commit refactors the merging iterator's
handling of prefix-related invariants for enabling TrySeekUsingNext
optimizations when using SeekPrefixGE.

The TrySeekUsingNext optimization notes successively larger seek keys,
and attempts to use next to find the appropriate key in the later seeks,
avoiding seeking from scratch. In a level's file metadata, we now use
the presence of the TrySeekUsingNext flag to signal that we can Next
through the file metadata. This saves key comparisons, especially in
larger LSMs with many files within a level.

Ever since the introduction of cockroachdb#2048, we've encountered a slew of
metamorphic test failures due to interactions between this optimization
and prefix iteration. We've made various fixes in cockroachdb#2087, cockroachdb#2111, cockroachdb#2123
and cockroachdb#2151.

Most of these failures revolve around the progression of a level
iterator beyond the iterator's current iteration prefix. This is
problematic, because the incomplete view of the LSM during prefix
iteration can result in inconsistent positioning outside of the current
iterator prefix. The internal iterators do not guarantee that keys or
range deletions outside the prefix are observed.

This commit fixes the latest instance of this (cockroachdb#2174), and attempts to
restructure the code to clarify the required invariants. Now, any method
that may advance the iterator during prefix iteration (eg, nextEntry or
isNextEntryDeleted) check that the iteration prefix has not already been
exceeded. This check is only performed when the merging iterator is
advancing internal iterators of its own volition (eg, due to non-visible
keys or keys deleted by range tombstones). It does NOT perform this
check when the top-level Iterator Nexts the merging iterator because
of a user Next, a point-delete, etc. This means these prefix checks are
only redundant with the top-level Iterator Nexts in two cases:

  a) the heap root is deleted by a range tombstone
  b) the heap root is not visible at the iterator's sequence number

Both of these cases are expected to be rare, and the redundant prefix
check is expected to be insignificant.

Additionally, to obey these new stricter invariants, the interleaving
iterator is adjusted to exhaust the iterator when the remainder of a
prefix is masked by a range key. Besides ensuring invariants are
respected, this change provides a large benefit to prefix iteration
within a large swath of masked point keys.

```
name                                                                     old time/op  new time/op  delta
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/seekprefix-24   59.8ms ± 2%  60.3ms ± 2%   +0.92%  (p=0.013 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/next-24         4.50ms ± 6%  4.52ms ± 3%     ~     (p=0.388 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/backward-24             5.64ms ± 7%  5.51ms ± 3%   -2.26%  (p=0.001 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/seekprefix-24   64.0ms ± 5%  60.2ms ± 4%   -5.89%  (p=0.000 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/next-24         4.11ms ± 3%  4.16ms ± 5%   +1.06%  (p=0.040 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/backward-24             4.94ms ± 3%  4.89ms ± 3%   -0.97%  (p=0.011 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/seekprefix-24   69.0ms ± 2%  60.1ms ± 2%  -12.88%  (p=0.000 n=22+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/next-24         2.55ms ± 2%  2.59ms ± 3%   +1.90%  (p=0.000 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/backward-24             3.20ms ± 2%  3.21ms ± 3%     ~     (p=0.418 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/seekprefix-24   2.19s ± 3%   0.06s ± 2%  -97.27%  (p=0.000 n=24+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/next-24         346µs ± 2%   346µs ± 2%     ~     (p=0.898 n=24+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/backward-24             687µs ± 3%   696µs ± 2%   +1.31%  (p=0.000 n=25+25)
```

Fix cockroachdb#2174.
jbowens added a commit to jbowens/pebble that referenced this pull request Dec 12, 2022
Builds off of cockroachdb#2151. This commit refactors the merging iterator's
handling of prefix-related invariants for enabling TrySeekUsingNext
optimizations when using SeekPrefixGE.

The TrySeekUsingNext optimization notes successively larger seek keys,
and attempts to use next to find the appropriate key in the later seeks,
avoiding seeking from scratch. In a level's file metadata, we now use
the presence of the TrySeekUsingNext flag to signal that we can Next
through the file metadata. This saves key comparisons, especially in
larger LSMs with many files within a level.

Ever since the introduction of cockroachdb#2048, we've encountered a slew of
metamorphic test failures due to interactions between this optimization
and prefix iteration. We've made various fixes in cockroachdb#2087, cockroachdb#2111, cockroachdb#2123
and cockroachdb#2151.

Most of these failures revolve around the progression of a level
iterator beyond the iterator's current iteration prefix. This is
problematic, because the incomplete view of the LSM during prefix
iteration can result in inconsistent positioning outside of the current
iterator prefix. The internal iterators do not guarantee that keys or
range deletions outside the prefix are observed.

This commit fixes the latest instance of this (cockroachdb#2174), and attempts to
restructure the code to clarify the required invariants. Now, any method
that may advance the iterator during prefix iteration (eg, nextEntry or
isNextEntryDeleted) check that the iteration prefix has not already been
exceeded. This check is only performed when the merging iterator is
advancing internal iterators of its own volition (eg, due to non-visible
keys or keys deleted by range tombstones). It does NOT perform this
check when the top-level Iterator Nexts the merging iterator because
of a user Next, a point-delete, etc. This means these prefix checks are
only redundant with the top-level Iterator Nexts in two cases:

  a) the heap root is deleted by a range tombstone
  b) the heap root is not visible at the iterator's sequence number

Both of these cases are expected to be rare, and the redundant prefix
check is expected to be insignificant.

Additionally, to obey these new stricter invariants, the interleaving
iterator is adjusted to exhaust the iterator when the remainder of a
prefix is masked by a range key. Besides ensuring invariants are
respected, this change provides a large benefit to prefix iteration
within a large swath of masked point keys.

```
name                                                                     old time/op  new time/op  delta
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/seekprefix-24   59.8ms ± 2%  60.3ms ± 2%   +0.92%  (p=0.013 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/next-24         4.50ms ± 6%  4.52ms ± 3%     ~     (p=0.388 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/backward-24             5.64ms ± 7%  5.51ms ± 3%   -2.26%  (p=0.001 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/seekprefix-24   64.0ms ± 5%  60.2ms ± 4%   -5.89%  (p=0.000 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/next-24         4.11ms ± 3%  4.16ms ± 5%   +1.06%  (p=0.040 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/backward-24             4.94ms ± 3%  4.89ms ± 3%   -0.97%  (p=0.011 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/seekprefix-24   69.0ms ± 2%  60.1ms ± 2%  -12.88%  (p=0.000 n=22+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/next-24         2.55ms ± 2%  2.59ms ± 3%   +1.90%  (p=0.000 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/backward-24             3.20ms ± 2%  3.21ms ± 3%     ~     (p=0.418 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/seekprefix-24   2.19s ± 3%   0.06s ± 2%  -97.27%  (p=0.000 n=24+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/next-24         346µs ± 2%   346µs ± 2%     ~     (p=0.898 n=24+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/backward-24             687µs ± 3%   696µs ± 2%   +1.31%  (p=0.000 n=25+25)
```

Fix cockroachdb#2174.
jbowens added a commit that referenced this pull request Dec 12, 2022
Builds off of #2151. This commit refactors the merging iterator's
handling of prefix-related invariants for enabling TrySeekUsingNext
optimizations when using SeekPrefixGE.

The TrySeekUsingNext optimization notes successively larger seek keys,
and attempts to use next to find the appropriate key in the later seeks,
avoiding seeking from scratch. In a level's file metadata, we now use
the presence of the TrySeekUsingNext flag to signal that we can Next
through the file metadata. This saves key comparisons, especially in
larger LSMs with many files within a level.

Ever since the introduction of #2048, we've encountered a slew of
metamorphic test failures due to interactions between this optimization
and prefix iteration. We've made various fixes in #2087, #2111, #2123
and #2151.

Most of these failures revolve around the progression of a level
iterator beyond the iterator's current iteration prefix. This is
problematic, because the incomplete view of the LSM during prefix
iteration can result in inconsistent positioning outside of the current
iterator prefix. The internal iterators do not guarantee that keys or
range deletions outside the prefix are observed.

This commit fixes the latest instance of this (#2174), and attempts to
restructure the code to clarify the required invariants. Now, any method
that may advance the iterator during prefix iteration (eg, nextEntry or
isNextEntryDeleted) check that the iteration prefix has not already been
exceeded. This check is only performed when the merging iterator is
advancing internal iterators of its own volition (eg, due to non-visible
keys or keys deleted by range tombstones). It does NOT perform this
check when the top-level Iterator Nexts the merging iterator because
of a user Next, a point-delete, etc. This means these prefix checks are
only redundant with the top-level Iterator Nexts in two cases:

  a) the heap root is deleted by a range tombstone
  b) the heap root is not visible at the iterator's sequence number

Both of these cases are expected to be rare, and the redundant prefix
check is expected to be insignificant.

Additionally, to obey these new stricter invariants, the interleaving
iterator is adjusted to exhaust the iterator when the remainder of a
prefix is masked by a range key. Besides ensuring invariants are
respected, this change provides a large benefit to prefix iteration
within a large swath of masked point keys.

```
name                                                                     old time/op  new time/op  delta
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/seekprefix-24   59.8ms ± 2%  60.3ms ± 2%   +0.92%  (p=0.013 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/next-24         4.50ms ± 6%  4.52ms ± 3%     ~     (p=0.388 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/backward-24             5.64ms ± 7%  5.51ms ± 3%   -2.26%  (p=0.001 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/seekprefix-24   64.0ms ± 5%  60.2ms ± 4%   -5.89%  (p=0.000 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/next-24         4.11ms ± 3%  4.16ms ± 5%   +1.06%  (p=0.040 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/backward-24             4.94ms ± 3%  4.89ms ± 3%   -0.97%  (p=0.011 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/seekprefix-24   69.0ms ± 2%  60.1ms ± 2%  -12.88%  (p=0.000 n=22+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/next-24         2.55ms ± 2%  2.59ms ± 3%   +1.90%  (p=0.000 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/backward-24             3.20ms ± 2%  3.21ms ± 3%     ~     (p=0.418 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/seekprefix-24   2.19s ± 3%   0.06s ± 2%  -97.27%  (p=0.000 n=24+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/next-24         346µs ± 2%   346µs ± 2%     ~     (p=0.898 n=24+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/backward-24             687µs ± 3%   696µs ± 2%   +1.31%  (p=0.000 n=25+25)
```

Fix #2174.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

github.com/cockroachdb/pebble/internal/metamorphic: TestMeta failed
4 participants