storage: implement iter-before-seek optimization for LockTableIterator #110754
Conversation
Reviewed 1 of 1 files at r1, 6 of 7 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/lock_table_iterator.go
line 320 at r2 (raw file):
// NOTE: Recall that shared locks are ordered last for a given key.
if i.matchTxnID != uuid.Nil && bytes.Compare(txnID.GetBytes(), i.matchTxnID.GetBytes()) > 0 {
	ltKey.TxnUUID = i.matchTxnID
This has left the ltKey.Strength as Shared, so we will seek within the shared key space for the current ltKey.Key. Seems wasteful if MatchMinStr is also specified and is > Shared.
pkg/storage/lock_table_iterator.go
line 327 at r2 (raw file):
// exposed an interface that called NextPrefix(), we could use that
// instead. This might require adding a NextPrefixWithLimit() method
// to pebble.
s/might/will/
I had overlooked this need in our slack discussion. It's non-trivial, since the WithLimit functions are currently implemented inside pebble.Iterator and not pushed down to the InternalIterators, but NextPrefix is pushed down.
pkg/storage/lock_table_iterator.go
line 332 at r2 (raw file):
	seekKey = EngineKey{Key: seekKeyPrefix}
}
state, err = i.iter.SeekEngineKeyGEWithLimit(seekKey, limit)
I think this is incorrect in prefix iteration mode.
LockTableIterator.SeekEngineKeyGE* specifies the prefix we are interested in. Then all Next* calls until the next SeekEngineKeyGE* call are expected to stay within that prefix. When we have landed on the shared strength keys and gone past the matchTxnID, or there was no matchTxnID (so this lack of match was because the shared strength was not needed), we shouldn't be calling SeekEngineKeyGEWithLimit again and should instead just declare that we are done. By calling SeekEngineKeyGEWithLimit with the next roachpb.Key, we are putting it in prefix iteration mode for that key. It's possible this bug won't get tickled in practice since there may never be a real key that only differs in the \0.
A randomized test that turns off this seek optimization may catch this.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/storage/lock_table_iterator.go
line 327 at r2 (raw file):
Previously, sumeerbhola wrote…
s/might/will/
I had overlooked this need in our slack discussion. It's non-trivial, since the WithLimit functions are currently implemented inside pebble.Iterator and not pushed down to the InternalIterators, but NextPrefix is pushed down.
I'm probably overlooking something, but shouldn't that be okay? In the worst case of many point tombstones both within the prefix and beyond the prefix, the internalIterator.NextPrefix will fall back to a seek to the immediate successor prefix at the leaves (~ equivalent to, if not marginally better than, the comparable SeekGE). Regardless of whether the next key is a tombstone or not, it'll bubble up to the pebble.Iterator, which will stop advancing if the limit has been exceeded.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @nvanbenschoten)
pkg/storage/lock_table_iterator.go
line 327 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I'm probably overlooking something, but shouldn't that be okay? In the worst case of many point tombstones both within the prefix and beyond the prefix, the internalIterator.NextPrefix will fall back to a seek to the immediate successor prefix at the leaves (~ equivalent to, if not marginally better than, the comparable SeekGE). Regardless of whether the next key is a tombstone or not, it'll bubble up to the pebble.Iterator, which will stop advancing if the limit has been exceeded.
You are correct. I am not sure what I was thinking.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @sumeerbhola)
pkg/storage/lock_table_iterator.go
line 320 at r2 (raw file):
Previously, sumeerbhola wrote…
This has left the ltKey.Strength as Shared, so we will seek within the shared key space for the current ltKey.Key. Seems wasteful if MatchMinStr is also specified and is > Shared.
The contract of the LockTableIterator is that a key is returned if it matches either of the filters, not both of the filters. So if MatchTxnID is set, we need to scan to /lock/<key>/shared/<txnid>, even if MatchMinStr is also specified and is > Shared.
pkg/storage/lock_table_iterator.go
line 327 at r2 (raw file):
I'm probably overlooking something, but shouldn't that be okay?
To clarify, you're saying that this will still require a new pebble API, but that it won't be "non trivial" like originally thought, right?
pkg/storage/lock_table_iterator.go
line 332 at r2 (raw file):
Nice catch!
I'm curious how "wrapping" iterators usually implement this logic. To start, it sounds like they need to remember whether they are configured in prefix mode or not, so that would mean adding another field to LockTableIterator that's configured on creation. That's straightforward enough.
Once they do that, how do they handle the "just declare that we are done" part? I'm not aware of a way to poison the wrapped iterator, so do they:
- maintain state about the iterator being exhausted until the next seek and then surface this in the various UnsafeKey/UnsafeValue/etc. methods? It looks like we do this in intentInterleavingIter with the iterValid field, but that's pretty invasive and may only be needed because intentInterleavingIter is joining two iterators together.
- not maintain this state and just return the pebble.IterExhausted error, assuming that users won't call any of the accessor methods when an error is returned from an iteration method.
Put differently, are the results of accessor methods on iterators undefined when the iterators are not valid, or are they defined to detect the invalid state and throw an error?
A randomized test that turns off this seek optimization may catch this.
I modified TestLockTableIteratorEquivalence to generate keys that differ by \0, but it was unable to catch this bug because once these user keys are wrapped in a lock table key, they are no longer offset by .Next(), as you suspected. I'll keep the testing change anyway.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/lock_table_iterator.go
line 327 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm probably overlooking something, but shouldn't that be okay?
To clarify, you're saying that this will still require a new pebble API, but that it won't be "non trivial" like originally thought, right?
Yeah, that's right
pkg/storage/lock_table_iterator.go
line 332 at r2 (raw file):
Put differently, are the results of accessor methods on iterators undefined when the iterators are not valid, or are they defined to detect the invalid state and throw an error?
They're undefined
// Valid must be called after any call to Seek(), Next(), Prev(), or
// similar methods. It returns (true, nil) if the iterator points to
// a valid key (it is undefined to call Key(), Value(), or similar
// methods unless Valid() has returned (true, nil)). It returns
// (false, nil) if the iterator has moved past the end of the valid
// range, or (false, err) if an error has occurred. Valid() will
// never return true with a non-nil error.
Valid() (bool, error)
Some iterators will assert that the iterator is valid in test builds within their accessor methods.
638564c to b6740af
This commit resolves a bug identified in the code review of cockroachdb#110754 where a lock table iterator configured for prefix iteration could seek to the next key prefix when searching for matching keys. Instead of seeking, we now immediately return the IterExhausted state.

This is more than just an optimization. Seeking to the next key prefix would move the underlying iterator (which is also configured for prefix iteration) to the next key prefix, if such a key prefix exists.

This commit also updates TestLockTableIteratorEquivalence to exercise prefix iteration. Without the fix, the test would now fail.

Release note: None
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @sumeerbhola)
pkg/storage/lock_table_iterator.go
line 332 at r2 (raw file):
Thanks!
I modified TestLockTableIteratorEquivalence to generate keys that differ by \0, but it was unable to catch this bug because once these user keys are wrapped in a lock table key, they are no longer offset by .Next(), as you suspected. I'll keep the testing change anyway.
This was actually incorrect. TestLockTableIteratorEquivalence simply wasn't randomly exercising prefix iteration. I updated it to do so, which, combined with the generation of keys that differ by \0, does cause it to catch the bug. I then fixed the bug, which is also reflected in the datadriven tests.
All of this was done in a separate commit to make the diff clear. PTAL at the new commit when you get a chance.
Reviewed 4 of 7 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/lock_table_iterator.go
line 320 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The contract of the LockTableIterator is that a key is returned if it matches either of the filters, not both of the filters. So if MatchTxnID is set, we need to scan to /lock/<key>/shared/<txnid>, even if MatchMinStr is also specified and is > Shared.
What about the case where MatchTxnID is not set? It seems to me that we will iterate over the Shared locks even if MatchMinStr > Shared. Is that accurate? This is fine as a TODO. Looks like there are a few callers that use such a setting.
pkg/storage/lock_table_iterator.go
line 331 at r5 (raw file):
// the next key prefix would move the underlying iterator
// (which is also configured for prefix iteration) to the
// next key prefix, if such a key prefix exists.
This can be confusing to a reader, IMO.
We iterated a few times and when we got to shared and did not match and seek became true and we were past the MatchTxnID we declared IterExhausted.
The fact that seek is true wasn't needed for correctness -- it just happens to be the place where the current code-structure also implies that we are on Shared locks.
Also why it is ok to declare IterExhausted could use a comment. Something like:
// INVARIANT: we are at the Shared locks and the remaining locks on this prefix are all Shared locks
// INVARIANT: there is no MatchTxnID or we are past MatchTxnID.
// Cases:
// - MatchTxnID is set: The failure to match must be because MatchMinStr > Shared. So there are no remaining locks on this prefix that can match.
// - MatchTxnID is not set: Same reason.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/lock_table_iterator.go
line 320 at r2 (raw file):
Previously, sumeerbhola wrote…
What about the case where MatchTxnID is not set? It seems to me that we will iterate over the Shared locks even if MatchMinStr > Shared. Is that accurate? This is fine as a TODO. Looks like there are a few callers that use such a setting.
Yes, we'll iterate lockTableItersBeforeSeek times before seeking, under the assumption that if there are only a small number of shared locks, it's better to iterate over them than to seek. Are you suggesting that we seek to the next key immediately in this case?
Or maybe you're only referring to prefix iteration, and suggesting that we terminate immediately upon seeing a shared lock (or the shared lock following our own if MatchTxnID is set) in the case where we're also performing prefix iteration, but otherwise leave non-prefix iteration alone? That's what your comment below seems to be implying. If so, I'm happy to make that change.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/lock_table_iterator.go
line 320 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yes, we'll iterate lockTableItersBeforeSeek times before seeking, under the assumption that if there are only a small number of shared locks, it's better to iterate over them than to seek. Are you suggesting that we seek to the next key immediately in this case?
Or maybe you're only referring to prefix iteration, and suggesting that we terminate immediately upon seeing a shared lock (or the shared lock following our own if MatchTxnID is set) in the case where we're also performing prefix iteration, but otherwise leave non-prefix iteration alone? That's what your comment below seems to be implying. If so, I'm happy to make that change.
My comment was all muddled. Ignore it. When I took a look this morning, I had forgotten part of the context.
I wasn't meaning to refer to prefix iteration here.
The implication that we could further optimize prefix iteration (if needed) was buried in my other comment ("it just happens to be the place where the current code-structure ..."), though it's not clear to me that it is worth doing now.
This prepares the test file for the next commit, where the lockTableItersBeforeSeek optimization will be introduced. We split the introduction of the test from the introduction of the optimization to demonstrate the effect of the optimization in the commit diff.

Release note: None
Informs cockroachdb#100193.

This commit addresses a TODO left by cockroachdb#110319 to implement an "iter before seek" optimization in the LockTableIterator, similar to the one that exists in the pebbleMVCCScanner. The optimization places an upper bound on the number of iterations that a LockTableIterator that is configured to ignore some or all shared locks will perform across the shared locks on a single user key before seeking past them. This is used to avoid iterating over all shared locks on a key when not necessary.

The optimization achieves the goal of avoiding cases of O(ignored_locks) work in the LockTableIterator, instead performing at most O(matching_locks + locked_keys) work. This is important for iteration over the lock table (e.g. intentInterleavingIter), lock acquisition (MVCCAcquireLock), and lock release (mvccReleaseLockInternal). There is a caveat to these complexity bounds, however, in that they do not consider LSM tombstones. This is being discussed in cockroachdb#110324.

Release note: None
b6740af to 1b81860
TFTR!
bors r+
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/lock_table_iterator.go
line 320 at r2 (raw file):
Previously, sumeerbhola wrote…
My comment was all muddled. Ignore it. When I took a look this morning, I had forgotten part of the context.
I wasn't meaning to refer to prefix iteration here.
The implication that we could further optimize prefix iteration (if needed) was buried in my other comment ("it just happens to be the place where the current code-structure ..."), though it's not clear to me that it is worth doing now.
Ack. I added a comment about how we could detect the prefix iteration case earlier and terminate before deciding to seek, but that we don't for now due to the added complexity that such logic would require.
pkg/storage/lock_table_iterator.go
line 331 at r5 (raw file):
Previously, sumeerbhola wrote…
This can be confusing to a reader, IMO.
We iterated a few times and when we got to shared and did not match and seek became true and we were past the MatchTxnID we declared IterExhausted.
The fact that seek is true wasn't needed for correctness -- it just happens to be the place where the current code-structure also implies that we are on Shared locks.
Also why it is ok to declare IterExhausted could use a comment. Something like:
// INVARIANT: we are at the Shared locks and the remaining locks on this prefix are all Shared locks
// INVARIANT: there is no MatchTxnID or we are past MatchTxnID.
// Cases:
// - MatchTxnID is set: The failure to match must be because MatchMinStr > Shared. So there are no remaining locks on this prefix that can match.
// - MatchTxnID is not set: Same reason.
I added some more commentary around here. I found the invariants presented in your comment to be confusing when combined with the code because they came from the perspective of wanting to terminate the scan because we're a prefix iterator and deciding whether it is safe, instead of deciding to seek to the next key prefix and then correctly handling if we're a prefix iterator. The latter matches how the code is structured and requires fewer assumptions for correctness because it's not trying to immediately detect the cases where all remaining locks in the key prefix will be ignored.
Build succeeded.
Informs #100193.
This commit addresses a TODO left by #110319 to implement an "iter before seek" optimization in the LockTableIterator, similar to the one that exists in the pebbleMVCCScanner. The optimization places an upper bound on the number of iterations that a LockTableIterator that is configured to ignore some or all shared locks will perform across the shared locks on a single user key before seeking past them. This is used to avoid iterating over all shared locks on a key when not necessary.
The optimization achieves the goal of avoiding cases of O(ignored_locks) work in the LockTableIterator, instead performing at most O(matching_locks + locked_keys) work. This is important for iteration over the lock table (e.g. intentInterleavingIter), lock acquisition (MVCCAcquireLock), and lock release (mvccReleaseLockInternal). There is a caveat to these complexity bounds, however, in that they do not consider LSM tombstones. This is being discussed in #110324.
Release note: None