-
Notifications
You must be signed in to change notification settings - Fork 476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
*: Set isIgnorableBoundaryKey=true for rangedel boundaries in levelIter #2123
*: Set isIgnorableBoundaryKey=true for rangedel boundaries in levelIter #2123
Conversation
@jbowens happy to hear your thoughts on this; maybe this is an overly-defensive disabling of the optimization when we can do something like setting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having a hard time wrapping my head around what the bug is. It looks like the unit test doesn't exercise the bug, because the test still passes after reverting iterator.go
?
Reviewable status: 0 of 2 files reviewed, all discussions resolved (waiting on @itsbilal)
e7ba212
to
1e881f1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbowens you were right. I found the last missing piece of the puzzle; the lower level SST has to have a bloom filter mismatch in the first SeekPrefixGE, otherwise the deleted key gets surfaced to the mergingIter and we don't double-next() the higher level sstable. The test fails now if you revert the other change. I also made the isIgnorableBoundaryKey
change as that passes the test too.
Reviewable status: 0 of 4 files reviewed, all discussions resolved (waiting on @itsbilal)
Nevermind, this trips up another test. I'll revert to the previous fix soon. |
1e881f1
to
f740894
Compare
Reverted to the original fix and test improved, PTAL! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 4 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @jbowens)
iterator.go
line 1298 at r3 (raw file):
} if lastPositioningOp == seekPrefixGELastPositioningOp && i.pos != iterPosNext { // iterPosNext is unsafe to optimize for TrySeekUsingNext. This is because
I'm having some difficulty with this change, both in understanding and whether this ought to be the fix. Ideally we should be able to state an invariant about the InternalIterator
tree and use that to decide on the optimization here (like we do in our current optimizations in iterator.go) and not have to reason about the details about the InternalIterator
implementation. I worry that with the latter we will move much closer to not being able to understand or improve these optimizations in the future.
I don't quite understand the gap in the current pausing logic in level_iter.go and merging_iter.go that is causing this -- I suspect the fix belongs there and not in iterator.go.
btw, I'm unsure about how this works in the presence of InterleavingIter
. nextPointCurrentUserKey
can set i.pos = iterPosNext
because the range key iterator is the current position and the point iter it at the next position. I guess this is harmless from a correctness perspective, but will mean we don't use TrySeekUsingNext
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 4 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @sumeerbhola)
iterator.go
line 1298 at r3 (raw file):
Previously, sumeerbhola wrote…
I'm having some difficulty with this change, both in understanding and whether this ought to be the fix. Ideally we should be able to state an invariant about the
InternalIterator
tree and use that to decide on the optimization here (like we do in our current optimizations in iterator.go) and not have to reason about the details about theInternalIterator
implementation. I worry that with the latter we will move much closer to not being able to understand or improve these optimizations in the future.
I don't quite understand the gap in the current pausing logic in level_iter.go and merging_iter.go that is causing this -- I suspect the fix belongs there and not in iterator.go.btw, I'm unsure about how this works in the presence of
InterleavingIter
.nextPointCurrentUserKey
can seti.pos = iterPosNext
because the range key iterator is the current position and the point iter it at the next position. I guess this is harmless from a correctness perspective, but will mean we don't useTrySeekUsingNext
?
Ditto. Is the bug that we don't do this logic when the level iter gets Next-ed during prefix iteration mode?
And currently we don't have knowledge of whether we're in prefix mode during levelIter.Next
, so we can't readily just add that logic to the Next
exhaustion case?
f740894
to
7ca96e0
Compare
Previously, jbowens (Jackson Owens) wrote…
Just pushed a detailed example into the comments, hope that helps. Also reverted to the Not sure if levelIter needs to know about prefix iteration, but we shouldn't be double-Nexting past the sstable containing the range delete in any case (I think). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 2 unresolved discussions (waiting on @itsbilal and @sumeerbhola)
level_iter.go
line 960 at r4 (raw file):
if l.boundaryContext != nil { l.boundaryContext.isIgnorableBoundaryKey = true }
This isIgnorableBoundaryKey
change makes sense to me.
If I'm not mistaken, we could adjust the handling of range tombstones to always rely on isIgnorableBoundaryKey
and remove mergingIter.elideRangeTombstones
. We've be relying on the levelIter always setting isIgnorableBoundaryKey = true
whenever it returns one of these boundary keys (including in skipEmptyFileBackward
). The mergingIter would not need to perform any checks for range deletion key kinds. When using within an Iterator
, the mergingIter will never return range tombstone keys, as it does today with elideRangeTombstones=true
. When used in compactions, it will continue to return interleaved range deletions, just not any of the artificial boundary keys.
iterator.go
line 1298 at r3 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Just pushed a detailed example into the comments, hope that helps. Also reverted to the
isIgnorableBoundaryKey
fix that I had yesterday, which is in line with Sumeer's comment. It changed some other datadriven tests like I had mentioned earlier but it seems safe to me (more eyes here would help).Not sure if levelIter needs to know about prefix iteration, but we shouldn't be double-Nexting past the sstable containing the range delete in any case (I think).
Thanks, this example helped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 2 unresolved discussions (waiting on @itsbilal and @jbowens)
level_iter.go
line 960 at r4 (raw file):
Previously, jbowens (Jackson Owens) wrote…
This
isIgnorableBoundaryKey
change makes sense to me.If I'm not mistaken, we could adjust the handling of range tombstones to always rely on
isIgnorableBoundaryKey
and removemergingIter.elideRangeTombstones
. We've be relying on the levelIter always settingisIgnorableBoundaryKey = true
whenever it returns one of these boundary keys (including inskipEmptyFileBackward
). The mergingIter would not need to perform any checks for range deletion key kinds. When using within anIterator
, the mergingIter will never return range tombstone keys, as it does today withelideRangeTombstones=true
. When used in compactions, it will continue to return interleaved range deletions, just not any of the artificial boundary keys.
I went and reread #1725 (comment). Is the improvement you are suggesting changing the bullet "Pausing when reach a file bound that is a rangedel: mergingIter skips these keys since it ignores based on InternalKeyKindRangeDelete"?
btw, while we are here I think we should improve the commentary around isSyntheticIterBoundsKey
and isIgnorableBoundaryKey
. I tend to forget why both concepts are needed, and maybe others do too.
My understanding is that the former (a) plays a role in switching directions that needs it to be a separate concept (but the details for that I have forgotten), (b) allows for stating an invariant for the whole mergingIter when the synthetic key is at the top, since the bounds that are reached apply to the whole iterator heap (in comparison isIgnorableBoundaryKey
is a local pause for a level). (b) is why we can break here
Lines 687 to 688 in 1bda21f
if m.levels[item.index].isSyntheticIterBoundsKey { | |
break |
isIgnorableBoundaryKey
Yeah. Instead of skipping them based on My understanding is that I'm having a hard time recalling when |
There's one additional use of Line 722 in 1bda21f
This is the case we want to go into to avoid this bug, but weren't before this change. I'll make the change to remove |
Merge keys put the top level Iterator's position to `iterPosNext`, and in the process of Next()ing to the next user key, one of the child levelIters could have advanced past the sstable that contained the Merge that is being returned. If that sstable contained a range delete covering a key in an sstable in a lower level, and the lower level's key was not exposed to the mergingIter due to a bloom filter mismatch, and we do two consecutive SeekPrefixGEs where the second one gets TrySeekUsingNext, we could end up surfacing the deleted key in the lower sstable as the higher level's levelIter has already moved past the sstable containing the range deletion. This change sets isIgnorableBoundaryKey = true when returning largestBoundary in the levelIter to prevent the mergingIter from next()ing past it in the first SeekPrefixGE. Fixes cockroachdb#2118.
7ca96e0
to
f9835a2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
level_iter.go
line 960 at r4 (raw file):
Previously, sumeerbhola wrote…
I went and reread #1725 (comment). Is the improvement you are suggesting changing the bullet "Pausing when reach a file bound that is a rangedel: mergingIter skips these keys since it ignores based on InternalKeyKindRangeDelete"?
btw, while we are here I think we should improve the commentary around
isSyntheticIterBoundsKey
andisIgnorableBoundaryKey
. I tend to forget why both concepts are needed, and maybe others do too.
My understanding is that the former (a) plays a role in switching directions that needs it to be a separate concept (but the details for that I have forgotten), (b) allows for stating an invariant for the whole mergingIter when the synthetic key is at the top, since the bounds that are reached apply to the whole iterator heap (in comparisonisIgnorableBoundaryKey
is a local pause for a level). (b) is why we can break herewhile we can't do so forLines 687 to 688 in 1bda21f
if m.levels[item.index].isSyntheticIterBoundsKey { break isIgnorableBoundaryKey
@jbowens made the change, PTAL!
Simplify the mergingIter range tombstone elision logic to not need to know whether it's part of an Iterator or a compaction iterator. Previously if elideRangeTombstone was true, the mergingIter was part of an `Iterator` and we'd skip any RANGEDEL keys coming from child iterators, which really only happened to be synthetic bound keys or ignorable bound keys (set as such in i.boundaryContext). For compaction iterators, we'd continue to surface all RANGEDELs observed (and elideRangeTombstone was false). Now, rangedels from child iterators that are `!isIgnorableBoundaryKey` are surfaced by the merging iter, and the last remaining case where we wanted to surface a key but `isIgnorableBoundaryKey=false` was in skipEmptyFileBackward and that has also been changed to isIgnorableBoundaryKey=true. This removes the need for elideRangeTombstone so that has been cleared up, and any tests that relied on elideRangeTombstone have been updated.
f9835a2
to
116ecc4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @itsbilal and @sumeerbhola)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @itsbilal and @sumeerbhola)
TFTR @jbowens ! Should we backport the first commit to 22.2? I think we should. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 4 files at r2, 1 of 4 files at r4, 1 of 1 files at r5, 6 of 7 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions
merging_iter.go
line 736 at r7 (raw file):
if item.key.Visible(m.snapshot, m.batchSnapshot) && (!m.levels[item.index].isIgnorableBoundaryKey) && (item.key.Kind() != InternalKeyKindRangeDelete || !m.elideRangeTombstones) {
liking the simplification!
Builds off of cockroachdb#2151. This commit refactors the merging iterator's handling of prefix-related invariants for enabling TrySeekUsingNext optimizations when using SeekPrefixGE. The TrySeekUsingNext optimization uses the successively larger seek keys to attempt to use next to find the appropriate key, instead of seeking. In cockroachdb#2048, we introduced an optimization that would avoid re-seeking within a level's file metadata, instead using the presence of the TrySeekUsingNext flag to signal that we can Next through the file metadata. This saves key comparisons, especially in larger LSMs with many files within a level. Ever since the introduction of cockroachdb#2048, we've encountered a range of metamorphic test failures due to interactions between this optimization and prefix iteration. We've made various fixes in cockroachdb#2087, cockroachdb#2111, cockroachdb#2123, Most of these failures revolve around the progression of a level iterator beyond the iterator's current iteration prefix. This is problematic, because the incomplete view of the LSM during prefix iteration can result in inconsistent positioning outside of the current iterator prefix. The internal iterators do not guarantee that keys or range deletions outside the prefix are observed. This commit fixes the latest instance of this (cockroachdb#2174), and attempts to restructure the code to clarify the required invariants. Now, any method that may advance the iterator during prefix iteration (eg, nextEntry or isNextEntryDeleted) check that the iteration prefix has not already been exceeded. This check is only performed when the merging iterator is advancing internal iterators of its own volition (eg, due to non-visible keys or keys deleted by range tombstones). It does NOT perform this check when the top-level Iterator Nexts the merging iterator because of a user Next, a point-delete, etc. This means these prefix checks are only redundant with the top-level Iterator Nexts in two cases: a) the heap root is deleted by a range tombstone b) the heap root is not visible at the iterator's sequence number Both of these cases are expected to be rare, and the redundant prefix check is expected to be insignificant. Fix cockroachdb#2174.
Builds off of cockroachdb#2151. This commit refactors the merging iterator's handling of prefix-related invariants for enabling TrySeekUsingNext optimizations when using SeekPrefixGE. The TrySeekUsingNext optimization uses successively larger seek keys to attempt to use next to find the appropriate key, instead of seeking. In a level's file metadata, instead using the presence of the TrySeekUsingNext flag to signal that we can Next through the file metadata. This saves key comparisons, especially in larger LSMs with many files within a level. Ever since the introduction of cockroachdb#2048, we've encountered a range of metamorphic test failures due to interactions between this optimization and prefix iteration. We've made various fixes in cockroachdb#2087, cockroachdb#2111, cockroachdb#2123. Most of these failures revolve around the progression of a level iterator beyond the iterator's current iteration prefix. This is problematic, because the incomplete view of the LSM during prefix iteration can result in inconsistent positioning outside of the current iterator prefix. The internal iterators do not guarantee that keys or range deletions outside the prefix are observed. This commit fixes the latest instance of this (cockroachdb#2174), and attempts to restructure the code to clarify the required invariants. Now, any method that may advance the iterator during prefix iteration (eg, nextEntry or isNextEntryDeleted) check that the iteration prefix has not already been exceeded. This check is only performed when the merging iterator is advancing internal iterators of its own volition (eg, due to non-visible keys or keys deleted by range tombstones). It does NOT perform this check when the top-level Iterator Nexts the merging iterator because of a user Next, a point-delete, etc. This means these prefix checks are only redundant with the top-level Iterator Nexts in two cases: a) the heap root is deleted by a range tombstone b) the heap root is not visible at the iterator's sequence number Both of these cases are expected to be rare, and the redundant prefix check is expected to be insignificant. Fix cockroachdb#2174.
Builds off of cockroachdb#2151. This commit refactors the merging iterator's handling of prefix-related invariants for enabling TrySeekUsingNext optimizations when using SeekPrefixGE. The TrySeekUsingNext optimization uses successively larger seek keys to attempt to use next to find the appropriate key, instead of seeking. In a level's file metadata, instead using the presence of the TrySeekUsingNext flag to signal that we can Next through the file metadata. This saves key comparisons, especially in larger LSMs with many files within a level. Ever since the introduction of cockroachdb#2048, we've encountered a range of metamorphic test failures due to interactions between this optimization and prefix iteration. We've made various fixes in cockroachdb#2087, cockroachdb#2111, cockroachdb#2123. Most of these failures revolve around the progression of a level iterator beyond the iterator's current iteration prefix. This is problematic, because the incomplete view of the LSM during prefix iteration can result in inconsistent positioning outside of the current iterator prefix. The internal iterators do not guarantee that keys or range deletions outside the prefix are observed. This commit fixes the latest instance of this (cockroachdb#2174), and attempts to restructure the code to clarify the required invariants. Now, any method that may advance the iterator during prefix iteration (eg, nextEntry or isNextEntryDeleted) check that the iteration prefix has not already been exceeded. This check is only performed when the merging iterator is advancing internal iterators of its own volition (eg, due to non-visible keys or keys deleted by range tombstones). It does NOT perform this check when the top-level Iterator Nexts the merging iterator because of a user Next, a point-delete, etc. This means these prefix checks are only redundant with the top-level Iterator Nexts in two cases: a) the heap root is deleted by a range tombstone b) the heap root is not visible at the iterator's sequence number Both of these cases are expected to be rare, and the redundant prefix check is expected to be insignificant. An adjustment to avoid skipping masked points beyond the prefix provides a large benefit to prefix-iteration within a large swath of masked point keys. ``` name old time/op new time/op delta Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/seekprefix-16 52.2ms ±10% 50.7ms ± 2% ~ (p=0.690 n=5+5) Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/next-16 4.10ms ± 2% 4.07ms ± 2% ~ (p=0.690 n=5+5) Iterator_RangeKeyMasking/range-keys-suffixes=@10/backward-16 6.09ms ± 1% 6.02ms ± 1% -1.19% (p=0.016 n=5+5) Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/seekprefix-16 52.5ms ± 2% 54.3ms ± 5% +3.36% (p=0.032 n=5+5) Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/next-16 3.77ms ± 5% 4.47ms ± 6% +18.46% (p=0.008 n=5+5) Iterator_RangeKeyMasking/range-keys-suffixes=@50/backward-16 5.21ms ± 3% 5.40ms ± 3% +3.71% (p=0.016 n=5+5) Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/seekprefix-16 59.7ms ±12% 54.2ms ± 7% -9.09% (p=0.032 n=5+5) Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/next-16 2.03ms ± 4% 2.08ms ±22% ~ (p=0.690 n=5+5) Iterator_RangeKeyMasking/range-keys-suffixes=@75/backward-16 2.80ms ± 7% 2.63ms ± 4% -6.04% (p=0.032 n=5+5) Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/seekprefix-16 1.70s ± 1% 0.06s ± 6% -96.76% (p=0.016 n=4+5) Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/next-16 280µs ± 2% 278µs ± 1% ~ (p=0.548 n=5+5) Iterator_RangeKeyMasking/range-keys-suffixes=@100/backward-16 514µs ± 3% 505µs ± 0% ~ (p=0.151 n=5+5) ``` Fix cockroachdb#2174.
Builds off of cockroachdb#2151. This commit refactors the merging iterator's handling of prefix-related invariants for enabling TrySeekUsingNext optimizations when using SeekPrefixGE. The TrySeekUsingNext optimization uses successively larger seek keys to attempt to use next to find the appropriate key, instead of seeking. In a level's file metadata, instead using the presence of the TrySeekUsingNext flag to signal that we can Next through the file metadata. This saves key comparisons, especially in larger LSMs with many files within a level. Ever since the introduction of cockroachdb#2048, we've encountered a range of metamorphic test failures due to interactions between this optimization and prefix iteration. We've made various fixes in cockroachdb#2087, cockroachdb#2111, cockroachdb#2123. Most of these failures revolve around the progression of a level iterator beyond the iterator's current iteration prefix. This is problematic, because the incomplete view of the LSM during prefix iteration can result in inconsistent positioning outside of the current iterator prefix. The internal iterators do not guarantee that keys or range deletions outside the prefix are observed. This commit fixes the latest instance of this (cockroachdb#2174), and attempts to restructure the code to clarify the required invariants. Now, any method that may advance the iterator during prefix iteration (eg, nextEntry or isNextEntryDeleted) check that the iteration prefix has not already been exceeded. This check is only performed when the merging iterator is advancing internal iterators of its own volition (eg, due to non-visible keys or keys deleted by range tombstones). It does NOT perform this check when the top-level Iterator Nexts the merging iterator because of a user Next, a point-delete, etc. This means these prefix checks are only redundant with the top-level Iterator Nexts in two cases: a) the heap root is deleted by a range tombstone b) the heap root is not visible at the iterator's sequence number Both of these cases are expected to be rare, and the redundant prefix check is expected to be insignificant. An adjustment to avoid skipping masked points beyond the prefix provides a large benefit to prefix-iteration within a large swath of masked point keys. ``` name old time/op new time/op delta Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/seekprefix-24 66.0ms ± 1% 65.6ms ± 2% ~ (p=0.182 n=9+10) Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/next-24 4.99ms ± 2% 5.01ms ± 3% ~ (p=0.842 n=9+10) Iterator_RangeKeyMasking/range-keys-suffixes=@10/backward-24 6.51ms ± 6% 6.64ms ± 6% ~ (p=0.190 n=10+10) Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/seekprefix-24 68.2ms ± 1% 66.6ms ± 1% -2.39% (p=0.000 n=10+9) Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/next-24 4.54ms ± 1% 4.48ms ± 2% -1.32% (p=0.009 n=10+10) Iterator_RangeKeyMasking/range-keys-suffixes=@50/backward-24 5.60ms ± 3% 5.45ms ± 1% -2.76% (p=0.006 n=10+8) Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/seekprefix-24 74.8ms ± 1% 66.6ms ± 1% -11.03% (p=0.000 n=10+9) Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/next-24 2.72ms ± 2% 2.75ms ± 1% +1.38% (p=0.013 n=10+9) Iterator_RangeKeyMasking/range-keys-suffixes=@75/backward-24 3.46ms ± 1% 3.43ms ± 2% -0.96% (p=0.023 n=10+10) Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/seekprefix-24 2.29s ± 1% 0.07s ± 2% -97.01% (p=0.000 n=9+10) Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/next-24 356µs ± 1% 359µs ± 1% +0.86% (p=0.003 n=10+8) Iterator_RangeKeyMasking/range-keys-suffixes=@100/backward-24 716µs ± 1% 717µs ± 1% ~ (p=0.739 n=10+10) ``` Fix cockroachdb#2174.
Builds off of cockroachdb#2151. This commit refactors the merging iterator's handling of prefix-related invariants for enabling TrySeekUsingNext optimizations when using SeekPrefixGE. The TrySeekUsingNext optimization notes successively larger seek keys, and attempts to use next to find the appropriate key in the later seeks, avoiding seeking from scratch. In a level's file metadata, we now use the presence of the TrySeekUsingNext flag to signal that we can Next through the file metadata. This saves key comparisons, especially in larger LSMs with many files within a level. Ever since the introduction of cockroachdb#2048, we've encountered a slew of metamorphic test failures due to interactions between this optimization and prefix iteration. We've made various fixes in cockroachdb#2087, cockroachdb#2111, cockroachdb#2123 and cockroachdb#2151. Most of these failures revolve around the progression of a level iterator beyond the iterator's current iteration prefix. This is problematic, because the incomplete view of the LSM during prefix iteration can result in inconsistent positioning outside of the current iterator prefix. The internal iterators do not guarantee that keys or range deletions outside the prefix are observed. This commit fixes the latest instance of this (cockroachdb#2174), and attempts to restructure the code to clarify the required invariants. Now, any method that may advance the iterator during prefix iteration (eg, nextEntry or isNextEntryDeleted) check that the iteration prefix has not already been exceeded. This check is only performed when the merging iterator is advancing internal iterators of its own volition (eg, due to non-visible keys or keys deleted by range tombstones). It does NOT perform this check when the top-level Iterator Nexts the merging iterator because of a user Next, a point-delete, etc. This means these prefix checks are only redundant with the top-level Iterator Nexts in two cases: a) the heap root is deleted by a range tombstone b) the heap root is not visible at the iterator's sequence number Both of these cases are expected to be rare, and the redundant prefix check is expected to be insignificant. Additionally, to obey these new stricter invariants, the interleaving iterator is adjusted to exhaust the iterator when the remainder of a prefix is masked by a range key. Besides ensuring invariants are respected, this change provides a large benefit to prefix iteration within a large swath of masked point keys. ``` name old time/op new time/op delta Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/seekprefix-24 59.8ms ± 2% 60.3ms ± 2% +0.92% (p=0.013 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/next-24 4.50ms ± 6% 4.52ms ± 3% ~ (p=0.388 n=25+24) Iterator_RangeKeyMasking/range-keys-suffixes=@10/backward-24 5.64ms ± 7% 5.51ms ± 3% -2.26% (p=0.001 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/seekprefix-24 64.0ms ± 5% 60.2ms ± 4% -5.89% (p=0.000 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/next-24 4.11ms ± 3% 4.16ms ± 5% +1.06% (p=0.040 n=25+24) Iterator_RangeKeyMasking/range-keys-suffixes=@50/backward-24 4.94ms ± 3% 4.89ms ± 3% -0.97% (p=0.011 n=25+24) Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/seekprefix-24 69.0ms ± 2% 60.1ms ± 2% -12.88% (p=0.000 n=22+24) Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/next-24 2.55ms ± 2% 2.59ms ± 3% +1.90% (p=0.000 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@75/backward-24 3.20ms ± 2% 3.21ms ± 3% ~ (p=0.418 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/seekprefix-24 2.19s ± 3% 0.06s ± 2% -97.27% (p=0.000 n=24+25) Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/next-24 346µs ± 2% 346µs ± 2% ~ (p=0.898 n=24+25) Iterator_RangeKeyMasking/range-keys-suffixes=@100/backward-24 687µs ± 3% 696µs ± 2% +1.31% (p=0.000 n=25+25) ``` Fix cockroachdb#2174.
Builds off of cockroachdb#2151. This commit refactors the merging iterator's handling of prefix-related invariants for enabling TrySeekUsingNext optimizations when using SeekPrefixGE. The TrySeekUsingNext optimization notes successively larger seek keys, and attempts to use next to find the appropriate key in the later seeks, avoiding seeking from scratch. In a level's file metadata, we now use the presence of the TrySeekUsingNext flag to signal that we can Next through the file metadata. This saves key comparisons, especially in larger LSMs with many files within a level. Ever since the introduction of cockroachdb#2048, we've encountered a slew of metamorphic test failures due to interactions between this optimization and prefix iteration. We've made various fixes in cockroachdb#2087, cockroachdb#2111, cockroachdb#2123 and cockroachdb#2151. Most of these failures revolve around the progression of a level iterator beyond the iterator's current iteration prefix. This is problematic, because the incomplete view of the LSM during prefix iteration can result in inconsistent positioning outside of the current iterator prefix. The internal iterators do not guarantee that keys or range deletions outside the prefix are observed. This commit fixes the latest instance of this (cockroachdb#2174), and attempts to restructure the code to clarify the required invariants. Now, any method that may advance the iterator during prefix iteration (eg, nextEntry or isNextEntryDeleted) check that the iteration prefix has not already been exceeded. This check is only performed when the merging iterator is advancing internal iterators of its own volition (eg, due to non-visible keys or keys deleted by range tombstones). It does NOT perform this check when the top-level Iterator Nexts the merging iterator because of a user Next, a point-delete, etc. This means these prefix checks are only redundant with the top-level Iterator Nexts in two cases: a) the heap root is deleted by a range tombstone b) the heap root is not visible at the iterator's sequence number Both of these cases are expected to be rare, and the redundant prefix check is expected to be insignificant. Additionally, to obey these new stricter invariants, the interleaving iterator is adjusted to exhaust the iterator when the remainder of a prefix is masked by a range key. Besides ensuring invariants are respected, this change provides a large benefit to prefix iteration within a large swath of masked point keys. ``` name old time/op new time/op delta Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/seekprefix-24 59.8ms ± 2% 60.3ms ± 2% +0.92% (p=0.013 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/next-24 4.50ms ± 6% 4.52ms ± 3% ~ (p=0.388 n=25+24) Iterator_RangeKeyMasking/range-keys-suffixes=@10/backward-24 5.64ms ± 7% 5.51ms ± 3% -2.26% (p=0.001 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/seekprefix-24 64.0ms ± 5% 60.2ms ± 4% -5.89% (p=0.000 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/next-24 4.11ms ± 3% 4.16ms ± 5% +1.06% (p=0.040 n=25+24) Iterator_RangeKeyMasking/range-keys-suffixes=@50/backward-24 4.94ms ± 3% 4.89ms ± 3% -0.97% (p=0.011 n=25+24) Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/seekprefix-24 69.0ms ± 2% 60.1ms ± 2% -12.88% (p=0.000 n=22+24) Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/next-24 2.55ms ± 2% 2.59ms ± 3% +1.90% (p=0.000 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@75/backward-24 3.20ms ± 2% 3.21ms ± 3% ~ (p=0.418 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/seekprefix-24 2.19s ± 3% 0.06s ± 2% -97.27% (p=0.000 n=24+25) Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/next-24 346µs ± 2% 346µs ± 2% ~ (p=0.898 n=24+25) Iterator_RangeKeyMasking/range-keys-suffixes=@100/backward-24 687µs ± 3% 696µs ± 2% +1.31% (p=0.000 n=25+25) ``` Fix cockroachdb#2174.
Builds off of cockroachdb#2151. This commit refactors the merging iterator's handling of prefix-related invariants for enabling TrySeekUsingNext optimizations when using SeekPrefixGE. The TrySeekUsingNext optimization notes successively larger seek keys, and attempts to use next to find the appropriate key in the later seeks, avoiding seeking from scratch. In a level's file metadata, we now use the presence of the TrySeekUsingNext flag to signal that we can Next through the file metadata. This saves key comparisons, especially in larger LSMs with many files within a level. Ever since the introduction of cockroachdb#2048, we've encountered a slew of metamorphic test failures due to interactions between this optimization and prefix iteration. We've made various fixes in cockroachdb#2087, cockroachdb#2111, cockroachdb#2123 and cockroachdb#2151. Most of these failures revolve around the progression of a level iterator beyond the iterator's current iteration prefix. This is problematic, because the incomplete view of the LSM during prefix iteration can result in inconsistent positioning outside of the current iterator prefix. The internal iterators do not guarantee that keys or range deletions outside the prefix are observed. This commit fixes the latest instance of this (cockroachdb#2174), and attempts to restructure the code to clarify the required invariants. Now, any method that may advance the iterator during prefix iteration (eg, nextEntry or isNextEntryDeleted) check that the iteration prefix has not already been exceeded. This check is only performed when the merging iterator is advancing internal iterators of its own volition (eg, due to non-visible keys or keys deleted by range tombstones). It does NOT perform this check when the top-level Iterator Nexts the merging iterator because of a user Next, a point-delete, etc. This means these prefix checks are only redundant with the top-level Iterator Nexts in two cases: a) the heap root is deleted by a range tombstone b) the heap root is not visible at the iterator's sequence number Both of these cases are expected to be rare, and the redundant prefix check is expected to be insignificant. Additionally, to obey these new stricter invariants, the interleaving iterator is adjusted to exhaust the iterator when the remainder of a prefix is masked by a range key. Besides ensuring invariants are respected, this change provides a large benefit to prefix iteration within a large swath of masked point keys. ``` name old time/op new time/op delta Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/seekprefix-24 59.8ms ± 2% 60.3ms ± 2% +0.92% (p=0.013 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/next-24 4.50ms ± 6% 4.52ms ± 3% ~ (p=0.388 n=25+24) Iterator_RangeKeyMasking/range-keys-suffixes=@10/backward-24 5.64ms ± 7% 5.51ms ± 3% -2.26% (p=0.001 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/seekprefix-24 64.0ms ± 5% 60.2ms ± 4% -5.89% (p=0.000 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/next-24 4.11ms ± 3% 4.16ms ± 5% +1.06% (p=0.040 n=25+24) Iterator_RangeKeyMasking/range-keys-suffixes=@50/backward-24 4.94ms ± 3% 4.89ms ± 3% -0.97% (p=0.011 n=25+24) Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/seekprefix-24 69.0ms ± 2% 60.1ms ± 2% -12.88% (p=0.000 n=22+24) Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/next-24 2.55ms ± 2% 2.59ms ± 3% +1.90% (p=0.000 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@75/backward-24 3.20ms ± 2% 3.21ms ± 3% ~ (p=0.418 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/seekprefix-24 2.19s ± 3% 0.06s ± 2% -97.27% (p=0.000 n=24+25) Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/next-24 346µs ± 2% 346µs ± 2% ~ (p=0.898 n=24+25) Iterator_RangeKeyMasking/range-keys-suffixes=@100/backward-24 687µs ± 3% 696µs ± 2% +1.31% (p=0.000 n=25+25) ``` Fix cockroachdb#2174.
Builds off of #2151. This commit refactors the merging iterator's handling of prefix-related invariants for enabling TrySeekUsingNext optimizations when using SeekPrefixGE. The TrySeekUsingNext optimization notes successively larger seek keys, and attempts to use next to find the appropriate key in the later seeks, avoiding seeking from scratch. In a level's file metadata, we now use the presence of the TrySeekUsingNext flag to signal that we can Next through the file metadata. This saves key comparisons, especially in larger LSMs with many files within a level. Ever since the introduction of #2048, we've encountered a slew of metamorphic test failures due to interactions between this optimization and prefix iteration. We've made various fixes in #2087, #2111, #2123 and #2151. Most of these failures revolve around the progression of a level iterator beyond the iterator's current iteration prefix. This is problematic, because the incomplete view of the LSM during prefix iteration can result in inconsistent positioning outside of the current iterator prefix. The internal iterators do not guarantee that keys or range deletions outside the prefix are observed. This commit fixes the latest instance of this (#2174), and attempts to restructure the code to clarify the required invariants. Now, any method that may advance the iterator during prefix iteration (eg, nextEntry or isNextEntryDeleted) check that the iteration prefix has not already been exceeded. This check is only performed when the merging iterator is advancing internal iterators of its own volition (eg, due to non-visible keys or keys deleted by range tombstones). It does NOT perform this check when the top-level Iterator Nexts the merging iterator because of a user Next, a point-delete, etc. This means these prefix checks are only redundant with the top-level Iterator Nexts in two cases: a) the heap root is deleted by a range tombstone b) the heap root is not visible at the iterator's sequence number Both of these cases are expected to be rare, and the redundant prefix check is expected to be insignificant. Additionally, to obey these new stricter invariants, the interleaving iterator is adjusted to exhaust the iterator when the remainder of a prefix is masked by a range key. Besides ensuring invariants are respected, this change provides a large benefit to prefix iteration within a large swath of masked point keys. ``` name old time/op new time/op delta Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/seekprefix-24 59.8ms ± 2% 60.3ms ± 2% +0.92% (p=0.013 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/next-24 4.50ms ± 6% 4.52ms ± 3% ~ (p=0.388 n=25+24) Iterator_RangeKeyMasking/range-keys-suffixes=@10/backward-24 5.64ms ± 7% 5.51ms ± 3% -2.26% (p=0.001 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/seekprefix-24 64.0ms ± 5% 60.2ms ± 4% -5.89% (p=0.000 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/next-24 4.11ms ± 3% 4.16ms ± 5% +1.06% (p=0.040 n=25+24) Iterator_RangeKeyMasking/range-keys-suffixes=@50/backward-24 4.94ms ± 3% 4.89ms ± 3% -0.97% (p=0.011 n=25+24) Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/seekprefix-24 69.0ms ± 2% 60.1ms ± 2% -12.88% (p=0.000 n=22+24) Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/next-24 2.55ms ± 2% 2.59ms ± 3% +1.90% (p=0.000 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@75/backward-24 3.20ms ± 2% 3.21ms ± 3% ~ (p=0.418 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/seekprefix-24 2.19s ± 3% 0.06s ± 2% -97.27% (p=0.000 n=24+25) Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/next-24 346µs ± 2% 346µs ± 2% ~ (p=0.898 n=24+25) Iterator_RangeKeyMasking/range-keys-suffixes=@100/backward-24 687µs ± 3% 696µs ± 2% +1.31% (p=0.000 n=25+25) ``` Fix #2174.
Merge keys put the top level Iterator's position to
iterPosNext
, andin the process of Next()ing to the next user key, one of the child
levelIters could have advanced past the sstable that contained the
Merge that is being returned.
If that sstable contained a range delete covering a key in an sstable
in a lower level, and the lower level's key was not exposed to the
mergingIter due to a bloom filter mismatch, and we do two consecutive
SeekPrefixGEs where the second one gets TrySeekUsingNext, we could
end up surfacing the deleted key in the lower sstable as the
higher level's levelIter has already moved past the sstable
containing the range deletion.
This change sets isIgnorableBoundaryKey = true when returning
largestBoundary in the levelIter to prevent the mergingIter from
next()ing past it in the first SeekPrefixGE.
Fixes #2118.