storage: implement iter-before-seek optimization for LockTableIterator #110754
Conversation
Reviewed 1 of 1 files at r1, 6 of 7 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/lock_table_iterator.go
line 320 at r2 (raw file):
// NOTE: Recall that shared locks are ordered last for a given key.
if i.matchTxnID != uuid.Nil && bytes.Compare(txnID.GetBytes(), i.matchTxnID.GetBytes()) > 0 {
	ltKey.TxnUUID = i.matchTxnID
This has left the ltKey.Strength as Shared, so we will seek within the shared key space for the current ltKey.Key. Seems wasteful if MatchMinStr is also specified and is > Shared.
pkg/storage/lock_table_iterator.go
line 327 at r2 (raw file):
// exposed an interface that called NextPrefix(), we could use that
// instead. This might require adding a NextPrefixWithLimit() method
// to pebble.
s/might/will/
I had overlooked this need in our slack discussion. It's non-trivial, since the WithLimit functions are currently implemented inside pebble.Iterator and not pushed down to the InternalIterators, but NextPrefix is pushed down.
pkg/storage/lock_table_iterator.go
line 332 at r2 (raw file):
	seekKey = EngineKey{Key: seekKeyPrefix}
}
state, err = i.iter.SeekEngineKeyGEWithLimit(seekKey, limit)
I think this is incorrect in prefix iteration mode.
LockTableIterator.SeekEngineKeyGE* specifies the prefix we are interested in. Then all Next* calls until the next SeekEngineKeyGE* call are expected to stay within that prefix. When we have landed on the shared strength keys and gone past the matchTxnID, or there was no matchTxnID (so this lack of match was because the shared strength was not needed), we shouldn't be calling SeekEngineKeyGEWithLimit again and should instead just declare that we are done. By calling SeekEngineKeyGEWithLimit with the next roachpb.Key, we are putting it in prefix iteration mode for that key. It's possible this bug won't get tickled in practice since there may never be a real key that only differs in the \0.
A randomized test that turns off this seek optimization may catch this.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/storage/lock_table_iterator.go
line 327 at r2 (raw file):
Previously, sumeerbhola wrote…
s/might/will/
I had overlooked this need in our slack discussion. It's non-trivial, since the WithLimit functions are currently implemented inside pebble.Iterator and not pushed down to the InternalIterators, but NextPrefix is pushed down.
I'm probably overlooking something, but shouldn't that be okay? In the worst case of many point tombstones both within the prefix and beyond the prefix, the internalIterator.NextPrefix will fall back to a seek to the immediate successor prefix at the leaves (~ equivalent to, if not marginally better than, the comparable SeekGE). Regardless of whether the next key is a tombstone or not, it'll bubble up to the pebble.Iterator, which will stop advancing if the limit has been exceeded.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @nvanbenschoten)
pkg/storage/lock_table_iterator.go
line 327 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I'm probably overlooking something, but shouldn't that be okay? In the worst case of many point tombstones both within the prefix and beyond the prefix, the internalIterator.NextPrefix will fall back to a seek to the immediate successor prefix at the leaves (~ equivalent to, if not marginally better than, the comparable SeekGE). Regardless of whether the next key is a tombstone or not, it'll bubble up to the pebble.Iterator, which will stop advancing if the limit has been exceeded.
You are correct. I am not sure what I was thinking.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @sumeerbhola)
pkg/storage/lock_table_iterator.go
line 320 at r2 (raw file):
Previously, sumeerbhola wrote…
This has left the ltKey.Strength as Shared, so we will seek within the shared key space for the current ltKey.Key. Seems wasteful if MatchMinStr is also specified and is > Shared.
The contract of the LockTableIterator is that a key is returned if it matches either of the filters, not both of the filters. So if MatchTxnID is set, we need to scan to /lock/<key>/shared/<txnid>, even if MatchMinStr is also specified and is > Shared.
pkg/storage/lock_table_iterator.go
line 327 at r2 (raw file):
I'm probably overlooking something, but shouldn't that be okay?
To clarify, you're saying that this will still require a new pebble API, but that it won't be "non trivial" like originally thought, right?
pkg/storage/lock_table_iterator.go
line 332 at r2 (raw file):
Nice catch!
I'm curious how "wrapping" iterators usually implement this logic. To start, it sounds like they need to remember whether they are configured in prefix mode or not, so that would mean adding another field to LockTableIterator that's configured on creation. That's straightforward enough.
Once they do that, how do they handle the "just declare that we are done" part? I'm not aware of a way to poison the wrapped iterator, so do they:
- maintain state about the iterator being exhausted until the next seek and then surface this in the various UnsafeKey/UnsafeValue/etc. methods? It looks like we do this in intentInterleavingIter with the iterValid field, but that's pretty invasive and may only be needed because intentInterleavingIter is joining two iterators together.
- not maintain this state and just return the pebble.IterExhausted error, assuming that users won't call any of the accessor methods when an error is returned from an iteration method.
Put differently, are the results of accessor methods on iterators undefined when the iterators are not valid, or are they defined to detect the invalid state and throw an error?
A randomized test that turns off this seek optimization may catch this.
I modified TestLockTableIteratorEquivalence to generate keys that differ by \0, but it was unable to catch this bug because once these user keys are wrapped in a lock table key, they are no longer offset by .Next(), as you suspected. I'll keep the testing change anyway.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/lock_table_iterator.go
line 327 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm probably overlooking something, but shouldn't that be okay?
To clarify, you're saying that this will still require a new pebble API, but that it won't be "non trivial" like originally thought, right?
Yeah, that's right
pkg/storage/lock_table_iterator.go
line 332 at r2 (raw file):
Put differently, are the results of accessor methods on iterators undefined when the iterators are not valid, or are they defined to detect the invalid state and throw an error?
They're undefined
// Valid must be called after any call to Seek(), Next(), Prev(), or
// similar methods. It returns (true, nil) if the iterator points to
// a valid key (it is undefined to call Key(), Value(), or similar
// methods unless Valid() has returned (true, nil)). It returns
// (false, nil) if the iterator has moved past the end of the valid
// range, or (false, err) if an error has occurred. Valid() will
// never return true with a non-nil error.
Valid() (bool, error)
Some iterators will assert that the iterator is valid in test builds within their accessor methods.
638564c to b6740af
This commit resolves a bug identified in the code review of cockroachdb#110754 where a lock table iterator configured for prefix iteration could seek to the next key prefix when searching for matching keys. Instead of seeking, we now immediately return the IterExhausted state.

This is more than just an optimization. Seeking to the next key prefix would move the underlying iterator (which is also configured for prefix iteration) to the next key prefix, if such a key prefix exists.

This commit also updates TestLockTableIteratorEquivalence to exercise prefix iteration. Without the fix, the test would now fail.

Release note: None
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @sumeerbhola)
pkg/storage/lock_table_iterator.go
line 332 at r2 (raw file):
Thanks!
I modified TestLockTableIteratorEquivalence to generate keys that differ by \0, but it was unable to catch this bug because once these user keys are wrapped in a lock table key, they are no longer offset by .Next(), as you suspected. I'll keep the testing change anyway.
This was actually incorrect. TestLockTableIteratorEquivalence simply wasn't randomly exercising prefix iteration. I updated it to do so, which, combined with the generation of keys that differ by \0, does cause it to catch the bug. I then fixed the bug, which is also reflected in the datadriven tests.
All of this was done in a separate commit to make the diff clear. PTAL at the new commit when you get a chance.
Reviewed 4 of 7 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/lock_table_iterator.go
line 320 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The contract of the LockTableIterator is that a key is returned if it matches either of the filters, not both of the filters. So if MatchTxnID is set, we need to scan to /lock/<key>/shared/<txnid>, even if MatchMinStr is also specified and is > Shared.
What about the case where MatchTxnID is not set? It seems to me that we will iterate over the Shared locks even if MatchMinStr > Shared. Is that accurate? This is fine as a TODO. Looks like there are a few callers that use such a setting.
pkg/storage/lock_table_iterator.go
line 331 at r5 (raw file):
// the next key prefix would move the underlying iterator
// (which is also configured for prefix iteration) to the
// next key prefix, if such a key prefix exists.
This can be confusing to a reader, IMO.
We iterated a few times and when we got to shared and did not match and seek became true and we were past the MatchTxnID we declared IterExhausted.
The fact that seek is true wasn't needed for correctness -- it just happens to be the place where the current code-structure also implies that we are on Shared locks.
Also why it is ok to declare IterExhausted could use a comment. Something like:
// INVARIANT: we are at the Shared locks and the remaining locks on this prefix are all Shared locks
// INVARIANT: there is no MatchTxnID or we are past MatchTxnID.
// Cases:
// - MatchTxnID is set: The failure to match must be because MatchMinStr > Shared. So there are no remaining locks on this prefix that can match.
// - MatchTxnID is not set: Same reason.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/lock_table_iterator.go
line 320 at r2 (raw file):
Previously, sumeerbhola wrote…
What about the case where MatchTxnID is not set? It seems to me that we will iterate over the Shared locks even if MatchMinStr > Shared. Is that accurate? This is fine as a TODO. Looks like there are a few callers that use such a setting.
Yes, we'll iterate lockTableItersBeforeSeek times before seeking, under the assumption that if there are only a small number of shared locks, it's better to iterate over them than to seek. Are you suggesting that we seek to the next key immediately in this case?
Or maybe you're only referring to prefix iteration, and suggesting that we terminate immediately upon seeing a shared lock (or the shared lock following our own if MatchTxnID is set) in the case where we're also performing prefix iteration, but otherwise leave non-prefix iteration alone? That's what your comment below seems to be implying. If so, I'm happy to make that change.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/lock_table_iterator.go
line 320 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yes, we'll iterate lockTableItersBeforeSeek times before seeking, under the assumption that if there are only a small number of shared locks, it's better to iterate over them than to seek. Are you suggesting that we seek to the next key immediately in this case?
Or maybe you're only referring to prefix iteration, and suggesting that we terminate immediately upon seeing a shared lock (or the shared lock following our own if MatchTxnID is set) in the case where we're also performing prefix iteration, but otherwise leave non-prefix iteration alone? That's what your comment below seems to be implying. If so, I'm happy to make that change.
My comment was all muddled. Ignore it. When I took a look this morning, I had forgotten part of the context.
I wasn't meaning to refer to prefix iteration here.
The implication that we could further optimize prefix iteration (if needed) was buried in my other comment ("it just happens to be the place where the current code-structure ..."), though it's not clear to me that it is worth doing now.
This prepares the test file for the next commit, where the lockTableItersBeforeSeek optimization will be introduced. We split the introduction of the test from the introduction of the optimization to demonstrate the effect of the optimization in the commit diff.

Release note: None
Informs cockroachdb#100193.

This commit addresses a TODO left by cockroachdb#110319 to implement an "iter before seek" optimization in the LockTableIterator, similar to the one that exists in the pebbleMVCCScanner. The optimization places an upper bound on the number of iterations that a LockTableIterator that is configured to ignore some or all shared locks will perform across the shared locks on a single user key before seeking past them. This is used to avoid iterating over all shared locks on a key when not necessary.

The optimization achieves the goal of avoiding cases of O(ignored_locks) work in the LockTableIterator, instead performing at most O(matching_locks + locked_keys) work. This is important for iteration over the lock table (e.g. intentInterleavingIter), lock acquisition (MVCCAcquireLock), and lock release (mvccReleaseLockInternal). There is a caveat to these complexity bounds, however, in that they do not consider LSM tombstones. This is being discussed in cockroachdb#110324.

Release note: None
b6740af to 1b81860
TFTR!
bors r+
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/lock_table_iterator.go
line 320 at r2 (raw file):
Previously, sumeerbhola wrote…
My comment was all muddled. Ignore it. When I took a look this morning, I had forgotten part of the context.
I wasn't meaning to refer to prefix iteration here.
The implication that we could further optimize prefix iteration (if needed) was buried in my other comment ("it just happens to be the place where the current code-structure ..."), though it's not clear to me that it is worth doing now.
Ack. I added a comment about how we could detect the prefix iteration case earlier and terminate before deciding to seek, but that we don't for now due to the added complexity that such logic would require.
pkg/storage/lock_table_iterator.go
line 331 at r5 (raw file):
Previously, sumeerbhola wrote…
This can be confusing to a reader, IMO.
We iterated a few times and when we got to shared and did not match and seek became true and we were past the MatchTxnID we declared IterExhausted.
The fact that seek is true wasn't needed for correctness -- it just happens to be the place where the current code-structure also implies that we are on Shared locks.
Also why it is ok to declare IterExhausted could use a comment. Something like:
// INVARIANT: we are at the Shared locks and the remaining locks on this prefix are all Shared locks
// INVARIANT: there is no MatchTxnID or we are past MatchTxnID.
// Cases:
// - MatchTxnID is set: The failure to match must be because MatchMinStr > Shared. So there are no remaining locks on this prefix that can match.
// - MatchTxnID is not set: Same reason.
I added some more commentary around here. I found the invariants presented in your comment to be confusing when combined with the code because they came from the perspective of wanting to terminate the scan because we're a prefix iterator and deciding whether it is safe, instead of deciding to seek to the next key prefix and then correctly handling if we're a prefix iterator. The latter matches how the code is structured and requires fewer assumptions for correctness because it's not trying to immediately detect the cases where all remaining locks in the key prefix will be ignored.
Build succeeded.
Informs #100193.
This commit addresses a TODO left by #110319 to implement an "iter before seek" optimization in the LockTableIterator, similar to the one that exists in the pebbleMVCCScanner. The optimization places an upper bound on the number of iterations that a LockTableIterator that is configured to ignore some or all shared locks will perform across the shared locks on a single user key before seeking past them. This is used to avoid iterating over all shared locks on a key when not necessary.
The optimization achieves the goal of avoiding cases of O(ignored_locks) work in the LockTableIterator, instead performing at most O(matching_locks + locked_keys) work. This is important for iteration over the lock table (e.g. intentInterleavingIter), lock acquisition (MVCCAcquireLock), and lock release (mvccReleaseLockInternal). There is a caveat to these complexity bounds, however, in that they do not consider LSM tombstones. This is being discussed in #110324.
Release note: None