
storage: implement iter-before-seek optimization for LockTableIterator #110754

Merged

Conversation

@nvanbenschoten (Member)

Informs #100193.

This commit addresses a TODO left by #110319 to implement an "iter before seek" optimization in the LockTableIterator, similar to the one that exists in the pebbleMVCCScanner. The optimization places an upper bound on the number of iterations that a LockTableIterator that is configured to ignore some or all shared locks will perform across the shared locks on a single user key before seeking past them. This is used to avoid iterating over all shared locks on a key when not necessary.

The optimization achieves the goal of avoiding cases of O(ignored_locks) work in the LockTableIterator, instead performing at most O(matching_locks + locked_keys) work. This is important for iteration over the lock table (e.g. intentInterleavingIter), lock acquisition (MVCCAcquireLock), and lock release (mvccReleaseLockInternal). There is a caveat to these complexity bounds, however, in that they do not consider LSM tombstones. This is being discussed in #110324.

Release note: None
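
In sketch form, the iter-before-seek pattern looks roughly like the following. All identifiers are simplified stand-ins rather than the actual CockroachDB code; itersBeforeSeek plays the role of the lockTableItersBeforeSeek constant, and the iterator interface is reduced to the bare minimum:

	// itersBeforeSeek: Next calls to spend on ignored locks before seeking.
	const itersBeforeSeek = 5

	// lockIter is a simplified stand-in for the underlying engine iterator.
	type lockIter interface {
		Next() bool             // advance one lock; false when exhausted
		SeekGE(key string) bool // jump to the first lock with key >= key
		Key() string
	}

	// skipIgnored advances past a run of ignored shared locks on the current
	// user key. Short runs cost a few cheap Nexts; long runs cost exactly one
	// seek, bounding the work at O(min(run, itersBeforeSeek)) rather than O(run).
	func skipIgnored(it lockIter, seekKey string, ignored func(string) bool) bool {
		for n := 0; n < itersBeforeSeek; n++ {
			if !it.Next() {
				return false // iterator exhausted
			}
			if !ignored(it.Key()) {
				return true // reached a lock (or key) we care about
			}
		}
		// Still inside a long run of ignored locks: one seek past the run
		// beats O(ignored) further Nexts.
		return it.SeekGE(seekKey)
	}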

@nvanbenschoten nvanbenschoten requested a review from a team as a code owner September 15, 2023 23:28
@cockroach-teamcity (Member): This change is Reviewable

@sumeerbhola (Collaborator) left a comment

Reviewed 1 of 1 files at r1, 6 of 7 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/lock_table_iterator.go line 320 at r2 (raw file):

				// NOTE: Recall that shared locks are ordered last for a given key.
				if i.matchTxnID != uuid.Nil && bytes.Compare(txnID.GetBytes(), i.matchTxnID.GetBytes()) > 0 {
					ltKey.TxnUUID = i.matchTxnID

This has left the ltKey.Strength as Shared, so we will seek within the shared key space for the current ltKey.Key. Seems wasteful if MatchMinStr is also specified and is > Shared.


pkg/storage/lock_table_iterator.go line 327 at r2 (raw file):

					// exposed an interface that called NextPrefix(), we could use that
					// instead. This might require adding a NextPrefixWithLimit() method
					// to pebble.

s/might/will/

I had overlooked this need in our Slack discussion. It's non-trivial, since the WithLimit functions are currently implemented inside pebble.Iterator and not pushed down to the InternalIterators, but NextPrefix is pushed down.


pkg/storage/lock_table_iterator.go line 332 at r2 (raw file):

					seekKey = EngineKey{Key: seekKeyPrefix}
				}
				state, err = i.iter.SeekEngineKeyGEWithLimit(seekKey, limit)

I think this is incorrect in prefix iteration mode.

LockTableIterator.SeekEngineKeyGE* specifies the prefix we are interested in. Then all Next* calls until the next SeekEngineKeyGE* call are expected to stay within that prefix. When we have landed on the shared strength keys and gone past the matchTxnID, or there was no matchTxnID (so the lack of a match was because the shared strength was not needed), we shouldn't be calling SeekEngineKeyGEWithLimit again and should instead just declare that we are done -- by calling SeekEngineKeyGEWithLimit with the next roachpb.Key, we are putting the iterator in prefix iteration mode for that key. It's possible this bug won't get tickled in practice, since there may never be a real key that differs only in the trailing \0.

A randomized test that turns off this seek optimization may catch this.

@jbowens (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/storage/lock_table_iterator.go line 327 at r2 (raw file):

Previously, sumeerbhola wrote…

s/might/will/

I had overlooked this need in our Slack discussion. It's non-trivial, since the WithLimit functions are currently implemented inside pebble.Iterator and not pushed down to the InternalIterators, but NextPrefix is pushed down.

I'm probably overlooking something, but shouldn't that be okay? In the worst case of many point tombstones both within the prefix and beyond the prefix, the internalIterator.NextPrefix will fall back to a seek to the immediate successor prefix at the leaves (~equivalent to, if not marginally better than, the comparable SeekGE). Regardless of whether the next key is a tombstone or not, it'll bubble up to the pebble.Iterator, which will stop advancing if the limit has been exceeded.

@sumeerbhola (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @nvanbenschoten)


pkg/storage/lock_table_iterator.go line 327 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I'm probably overlooking something, but shouldn't that be okay? In the worst case of many point tombstones both within the prefix and beyond the prefix, the internalIterator.NextPrefix will fall back to a seek to the immediate successor prefix at the leaves (~equivalent to, if not marginally better than, the comparable SeekGE). Regardless of whether the next key is a tombstone or not, it'll bubble up to the pebble.Iterator, which will stop advancing if the limit has been exceeded.

You are correct. I am not sure what I was thinking.

@nvanbenschoten (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @sumeerbhola)


pkg/storage/lock_table_iterator.go line 320 at r2 (raw file):

Previously, sumeerbhola wrote…

This has left the ltKey.Strength as Shared, so we will seek within the shared key space for the current ltKey.Key. Seems wasteful if MatchMinStr is also specified and is > Shared.

The contract of the LockTableIterator is that a key is returned if it matches either of the filters, not both of the filters. So if MatchTxnID is set, we need to scan to /lock/<key>/shared/<txnid>, even if MatchMinStr is also specified and is > Shared.
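
In sketch form, the disjunctive contract reads as follows (simplified stand-in types; the real code uses lock.Strength and uuid.UUID, and the constant names here are illustrative):

	// strength is a simplified stand-in for lock.Strength; order matters.
	type strength int

	const (
		noStrength strength = iota
		shared
		update
		exclusive
		intent
	)

	// matches reflects the contract: a lock is surfaced if it satisfies
	// either filter, not necessarily both.
	func matches(str strength, txnID string, matchMinStr strength, matchTxnID string) bool {
		return (matchTxnID != "" && txnID == matchTxnID) ||
			(matchMinStr != noStrength && str >= matchMinStr)
	}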


pkg/storage/lock_table_iterator.go line 327 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I'm probably overlooking something, but shouldn't that be okay?

To clarify, you're saying that this will still require a new pebble API, but that it won't be "non-trivial" like originally thought, right?


pkg/storage/lock_table_iterator.go line 332 at r2 (raw file):
Nice catch!

I'm curious how "wrapping" iterators usually implement this logic. To start, it sounds like they need to remember whether they are configured in prefix mode or not, so that would mean adding another field to LockTableIterator that's configured on creation. That's straightforward enough.

Once they do that, how do they handle the "just declare that we are done" part? I'm not aware of a way to poison the wrapped iterator, so do they:

  1. maintain state about the iterator being exhausted until the next seek and then surface this in the various UnsafeKey/UnsafeValue/etc. methods? It looks like we do this in intentInterleavingIter with the iterValid field, but that's pretty invasive and may only be needed because intentInterleavingIter is joining two iterators together.
  2. not maintain this state and just return the pebble.IterExhausted error, assuming that users won't call any of the accessor methods when an error is returned from an iteration method.

Put differently, are the results of accessor methods on iterators undefined when the iterators are not valid, or are they defined to detect the invalid state and throw an error?

Previously, sumeerbhola wrote…

A randomized test that turns off this seek optimization may catch this.

I modified TestLockTableIteratorEquivalence to generate keys that differ by \0, but it was unable to catch this bug because once these user keys are wrapped in a lock table key, they are no longer offset by .Next(), as you suspected. I'll keep the testing change anyway.
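
For context, the immediate lexicographic successor of a byte-string key is the key with a single \0 byte appended, which is why keys that differ only by a trailing \0 are the interesting case here. A minimal sketch of that successor function (not the CockroachDB implementation):

	// next returns the immediate lexicographic successor of key: the same
	// bytes followed by a single 0x00 byte.
	func next(key []byte) []byte {
		return append(append([]byte(nil), key...), 0)
	}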

@jbowens (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/storage/lock_table_iterator.go line 327 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm probably overlooking something, but shouldn't that be okay?

To clarify, you're saying that this will still require a new pebble API, but that it won't be "non-trivial" like originally thought, right?

Yeah, that's right


pkg/storage/lock_table_iterator.go line 332 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Put differently, are the results of accessor methods on iterators undefined when the iterators are not valid, or are they defined to detect the invalid state and throw an error?

They're undefined

	// Valid must be called after any call to Seek(), Next(), Prev(), or
	// similar methods. It returns (true, nil) if the iterator points to
	// a valid key (it is undefined to call Key(), Value(), or similar
	// methods unless Valid() has returned (true, nil)). It returns
	// (false, nil) if the iterator has moved past the end of the valid
	// range, or (false, err) if an error has occurred. Valid() will
	// never return true with a non-nil error.
	Valid() (bool, error)

Some iterators will assert that the iterator is valid in test builds within their accessor methods
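
Concretely, a caller honors that contract along these lines (hypothetical it iterator and process helper; only the Valid-before-accessor shape is the point):

	// After positioning the iterator with a seek, consult Valid() before
	// touching any accessor.
	valid, err := it.Valid()
	for ; valid; valid, err = it.Valid() {
		process(it.Key()) // defined only because Valid() returned (true, nil)
		it.Next()
	}
	if err != nil {
		return err // Valid() never returns true with a non-nil error
	}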

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/lockTableIterSeek branch from 638564c to b6740af on September 29, 2023 21:25
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Sep 29, 2023
This commit resolves a bug identified in the code review of cockroachdb#110754
where a lock table iterator configured for prefix iteration could seek
to the next key prefix when searching for matching keys.

Instead of seeking, we now immediately return the IterExhausted state.
This is more than just an optimization. Seeking to the next key prefix
would move the underlying iterator (which is also configured for prefix
iteration) to the next key prefix, if such a key prefix exists.

This commit also updates TestLockTableIteratorEquivalence to exercise
prefix iteration. Without the fix, the test would now fail.

Release note: None
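
In sketch form, the fix guards the seek on the iterator's prefix configuration (the prefix field name here is an assumption based on the commit message, not the exact diff):

	if i.prefix {
		// All remaining locks on this key prefix are being ignored and we
		// are a prefix iterator: seeking would carry the underlying (also
		// prefix-configured) iterator onto the next key prefix, so report
		// exhaustion instead of seeking.
		state = pebble.IterExhausted
	} else {
		state, err = i.iter.SeekEngineKeyGEWithLimit(seekKey, limit)
	}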
@nvanbenschoten (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @sumeerbhola)


pkg/storage/lock_table_iterator.go line 332 at r2 (raw file):
Thanks!

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I modified TestLockTableIteratorEquivalence to generate keys that differ by \0, but it was unable to catch this bug because once these user keys are wrapped in a lock table key, they are no longer offset by .Next(), as you suspected. I'll keep the testing change anyway.

This was actually incorrect. TestLockTableIteratorEquivalence simply wasn't randomly exercising prefix iteration. I updated it to do so, which combined with the generation of keys that differ by \0 does cause it to catch the bug. I then fixed the bug, which is also reflected in the datadriven tests.

All of this was done in a separate commit to make the diff clear. PTAL at the new commit when you get a chance.
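
The testing pattern being described is roughly the following illustrative pseudotest; every helper name here is hypothetical, and only the optimized-versus-reference comparison structure is taken from the discussion:

	package storage_test

	import (
		"math/rand"
		"testing"

		"github.com/stretchr/testify/require"
	)

	func TestIterEquivalence(t *testing.T) {
		// Random lock table contents, including user keys that differ only
		// by a trailing \x00, and a random choice of prefix iteration.
		eng := makeRandomLockTable(t)
		prefix := rand.Intn(2) == 0

		opt := makeLockTableIter(eng, prefix, itersBeforeSeekEnabled(true))
		ref := makeLockTableIter(eng, prefix, itersBeforeSeekEnabled(false))

		// Drive both iterators through identical random operations and
		// require identical keys, validity, and errors at every step.
		for _, op := range makeRandomOps(t, prefix) {
			require.Equal(t, op.apply(ref), op.apply(opt))
		}
	}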

@sumeerbhola (Collaborator) left a comment

:lgtm:

Reviewed 4 of 7 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/lock_table_iterator.go line 320 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The contract of the LockTableIterator is that a key is returned if it matches either of the filters, not both of the filters. So if MatchTxnID is set, we need to scan to /lock/<key>/shared/<txnid>, even if MatchMinStr is also specified and is > Shared.

What about the case where MatchTxnID is not set? It seems to me that we will iterate over the Shared locks even if MatchMinStr > Shared. Is that accurate? This is fine as a TODO. Looks like there are a few callers that use such a setting.


pkg/storage/lock_table_iterator.go line 331 at r5 (raw file):

						// the next key prefix would move the underlying iterator
						// (which is also configured for prefix iteration) to the
						// next key prefix, if such a key prefix exists.

This can be confusing to a reader, IMO.
We iterated a few times, and when we got to shared and did not match, and seek became true, and we were past the MatchTxnID, we declared IterExhausted.

The fact that seek is true wasn't needed for correctness -- it just happens to be the place where the current code structure also implies that we are on Shared locks.

Also, why it is ok to declare IterExhausted could use a comment. Something like:
// INVARIANT: we are at the Shared locks and the remaining locks on this prefix are all Shared locks.
// INVARIANT: there is no MatchTxnID or we are past MatchTxnID.
// Cases:
// - MatchTxnID is set: The failure to match must be because MatchMinStr > Shared. So there are no remaining locks on this prefix that can match.
// - MatchTxnID is not set: Same reason.

@nvanbenschoten (Member, Author) left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/storage/lock_table_iterator.go line 320 at r2 (raw file):

Previously, sumeerbhola wrote…

What about the case where MatchTxnID is not set? It seems to me that we will iterate over the Shared locks even if MatchMinStr > Shared. Is that accurate? This is fine as a TODO. Looks like there are a few callers that use such a setting.

Yes, we'll iterate lockTableItersBeforeSeek times before seeking, under the assumption that if there are only a small number of shared locks, it's better to iterate over them than to seek. Are you suggesting that we seek to the next key immediately in this case?

Or maybe you're only referring to prefix iteration, and suggesting that we terminate immediately upon seeing a shared lock (or the shared lock following our own if MatchTxnID is set) in the case where we're also performing prefix iteration, but otherwise leave non-prefix iteration alone? That's what your comment below seems to be implying. If so, I'm happy to make that change.

@sumeerbhola (Collaborator) left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/lock_table_iterator.go line 320 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yes, we'll iterate lockTableItersBeforeSeek times before seeking, under the assumption that if there are only a small number of shared locks, it's better to iterate over them than to seek. Are you suggesting that we seek to the next key immediately in this case?

Or maybe you're only referring to prefix iteration, and suggesting that we terminate immediately upon seeing a shared lock (or the shared lock following our own if MatchTxnID is set) in the case where we're also performing prefix iteration, but otherwise leave non-prefix iteration alone? That's what your comment below seems to be implying. If so, I'm happy to make that change.

My comment was all muddled. Ignore it. When I took a look this morning, I had forgotten part of the context.

I wasn't meaning to refer to prefix iteration here.
The implication that we could further optimize prefix iteration (if needed) was buried in my other comment ("it just happens to be the place where the current code-structure ..."), though it's not clear to me that it is worth doing now.

This prepares the test file for the next commit, where the
lockTableItersBeforeSeek optimization will be introduced. We split the
introduction of the test from the introduction of the optimization to
demonstrate the effect of the optimization in the commit diff.

Release note: None
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/lockTableIterSeek branch from b6740af to 1b81860 on October 2, 2023 20:17
@nvanbenschoten (Member, Author) left a comment

TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/storage/lock_table_iterator.go line 320 at r2 (raw file):

Previously, sumeerbhola wrote…

My comment was all muddled. Ignore it. When I took a look this morning, I had forgotten part of the context.

I wasn't meaning to refer to prefix iteration here.
The implication that we could further optimize prefix iteration (if needed) was buried in my other comment ("it just happens to be the place where the current code-structure ..."), though it's not clear to me that it is worth doing now.

Ack. I added a comment about how we could detect the prefix iteration case earlier and terminate before deciding to seek, but that we don't for now due to the added complexity that such logic would require.


pkg/storage/lock_table_iterator.go line 331 at r5 (raw file):

Previously, sumeerbhola wrote…

This can be confusing to a reader, IMO.
We iterated a few times, and when we got to shared and did not match, and seek became true, and we were past the MatchTxnID, we declared IterExhausted.

The fact that seek is true wasn't needed for correctness -- it just happens to be the place where the current code structure also implies that we are on Shared locks.

Also, why it is ok to declare IterExhausted could use a comment. Something like:
// INVARIANT: we are at the Shared locks and the remaining locks on this prefix are all Shared locks.
// INVARIANT: there is no MatchTxnID or we are past MatchTxnID.
// Cases:
// - MatchTxnID is set: The failure to match must be because MatchMinStr > Shared. So there are no remaining locks on this prefix that can match.
// - MatchTxnID is not set: Same reason.

I added some more commentary around here. I found the invariants presented in your comment to be confusing when combined with the code because they came from the perspective of wanting to terminate the scan because we're a prefix iterator and deciding whether it is safe, instead of deciding to seek to the next key prefix and then correctly handling if we're a prefix iterator. The latter matches how the code is structured and requires fewer assumptions for correctness because it's not trying to immediately detect the cases where all remaining locks in the key prefix will be ignored.


craig bot commented Oct 2, 2023

Build succeeded.

@craig craig bot merged commit 422b70f into cockroachdb:master Oct 2, 2023
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/lockTableIterSeek branch October 3, 2023 02:26
THardy98 pushed a commit to THardy98/cockroach that referenced this pull request Oct 6, 2023