storage: implement iterator to interleave separated intents #56535

sumeerbhola · 2020-11-11T01:11:03Z

intentInterleavingIter allows for both physically separate
and physically interleaved intents, and makes both look
like interleaved intents. It uses two underlying iterators,
a MVCCIterator and an EngineIterator, that are kept
synchronized.

Informs #41720

Release note: None

cockroach-teamcity · 2020-11-11T01:11:11Z

This change is

sumeerbhola

The test will fail until pebble is bumped in #56534

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @nvanbenschoten, and @petermattis)

petermattis

Mostly just skimmed this as I'm expecting @nvanbenschoten to scrutinize the logic. Are you planning to add microbenchmarks to compare the performance of iterating over separated vs interleaved intents?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @nvanbenschoten, @petermattis, and @sumeerbhola)

pkg/storage/intent_interleaving_iter.go, line 25 at r1 (raw file):

// intentInterleavingIter makes separated intents appear as interleaved. It
// assumes that there can also be intents that are physically interleaved.
// However, for a particular roachpb.Key there will be at most intent, either

Nit: Something is off with the grammar of at most intent. Perhaps at most one intent.

sumeerbhola

Are you planning to add microbenchmarks to compare the performance of iterating over separated vs interleaved intents?

I've added a TODO for microbenchmarks. Both the randomized test and microbenchmarks will happen in follow-up PRs, but before I integrate this with existing code.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @nvanbenschoten)

pkg/storage/intent_interleaving_iter.go, line 25 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: Something is off with the grammar of at most intent. Perhaps at most one intent.

Done

nvanbenschoten

Reviewed 6 of 8 files at r1, 3 of 3 files at r2, 2 of 2 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @sumeerbhola)

pkg/keys/keys.go, line 401 at r1 (raw file):

// For a scan [start, end) the corresponding lock table scan is
// [LTSK(start), LTSK(end)).
func LockTableSingleKey(key roachpb.Key, buf []byte) (roachpb.Key, []byte) {

We should note somewhere in the function's comment that buf is replaced and not appended to.

pkg/storage/intent_interleaving_iter.go, line 30 at r2 (raw file):

// Additionally, the implementation assumes that the only single key locks
// in the lock table key space are intents.
type intentInterleavingIter struct {

What is the plan for this type once all intents have been moved into the lockTable? It seems like we will still need it for limited scans that do not scan the lockTable ahead of time (#49973), but will other operations that have already scanned the lockTable still use it? Also, do we plan to simplify the implementation once all intents are segregated?

pkg/storage/intent_interleaving_iter.go, line 44 at r2 (raw file):

	// state of the intentKey overrides that state.
	intentKey roachpb.Key
	// - cmp output of (intentKey, current iter key) when both are non-nil.

Could you expand this out a bit more to discuss whether this is taking timestamps into account or not? For instance, what should this be if the intentIter is positioned on an intent and the iter is positioned on the same key but at a committed version instead of the provisional version? I think it will be 0 after reading some of this file. Or is that not possible because we'll move intentIter forward first?

EDIT: I was able to resolve these questions after reading the rest of the code in depth, but it could be made easier to understand.

Maybe this brings up a more general point about being more clear about how the MVCCIterator and the EngineIterator interleave. It might be helpful to think of this as an interleaving between two MVCCIterators, where the first is iter and the second is a transformation of intentIter that a) maps the EngineKey to another EngineKey using DecodeLockTableSingleKey and b) maps the EngineKey to an MVCCKey by giving it a timestamp of 0. If we didn't care about performance, I bet you could even define this type a little more abstractly like that:

 intentInterleavingIter = iter_merge(iter, intentIter.map(|k| DecodeLockTableSingleKey(k)).map(|k| MVCCKey{Key: k, Timestamp: 0}))

pkg/storage/intent_interleaving_iter.go, line 71 at r2 (raw file):

	var intentKeyBuf []byte
	if opts.LowerBound != nil {
		intentOpts.LowerBound, intentKeyBuf = keys.LockTableSingleKey(opts.LowerBound, nil)

Isn't this going to alias intentKeyBuf and intentOpts.LowerBound to the same memory? Is that not unsafe? Can the buffers passed in an IterOptions be reused after a call to NewEngineIterator?

pkg/storage/intent_interleaving_iter.go, line 105 at r2 (raw file):

		// or using SetUpperBound.
		if !i.hasUpperBound && keys.IsLocal(key.Key) && !keys.IsLocalStoreKey(key.Key) {
			i.iter.SetUpperBound(keys.LocalRangeLockTablePrefix)

Should we set i.hasUpperBound = true after this? Or should we be calling i.SetUpperBound?

pkg/storage/intent_interleaving_iter.go, line 109 at r2 (raw file):

	}
	var intentSeekKey roachpb.Key
	if key.Timestamp == (hlc.Timestamp{}) {

.IsEmpty(). We just got rid of all of these in 4fa8a38.

pkg/storage/intent_interleaving_iter.go, line 118 at r2 (raw file):

		// Else seeking to a particular version and using prefix iteration,
		// so don't expect to ever see the intent.
		i.intentKey = nil

nit: consider adding intentSeekKey = nil. I know it's not needed, but it makes the intention and the interaction with the next block more clear.

pkg/storage/intent_interleaving_iter.go, line 155 at r2 (raw file):

func (i *intentInterleavingIter) tryDecodeLockKey(valid bool) error {
	if !valid {
		i.intentKey = nil

Consider mentioning why this doesn't set i.valid = false in a comment. It took me a second to understand how this interacts with computePos.

pkg/storage/intent_interleaving_iter.go, line 206 at r2 (raw file):

			// preceding key or exhausted. So step it forward. It will now point to
			// a key that is the same as the intent key since an intent always has a
			// corresponding provisional value.

" and provisional values must have a higher timestamp than any committed value on a key". We've actually had bugs with this in the past, so it's worth considering where we're making assumptions about this. We should be able to assume it, but it's probably best to call the assumption out explicitly.

pkg/storage/intent_interleaving_iter.go, line 337 at r2 (raw file):

func (i *intentInterleavingIter) isCurAtIntentIter() bool {
	return (i.dir > 0 && i.intentCmp <= 0) || (i.dir < 0 && i.intentCmp > 0)

This might be a good place for a comment and maybe even a diagram demonstrating how the two iterators move and why this is the case. If you're feeling inspired.

pkg/storage/intent_interleaving_iter.go, line 360 at r2 (raw file):

func (i *intentInterleavingIter) Key() MVCCKey {
	unsafeKey := i.UnsafeKey()
	return MVCCKey{Key: append(roachpb.Key(nil), unsafeKey.Key...), Timestamp: unsafeKey.Timestamp}

nit: append(roachpb.Key(nil), unsafeKey.Key... isn't quite the same as keyCopy := make([]byte, len(key.Key)); copy(keyCopy, key.Key). One reason is that the append method leaves some capacity on the slice. In the past, I believe we measured it to be a bit slower. I'd lean towards doing exactly what we do in pebbleIterator.Key.

Also, do we think there might be cases where an Iterator implementation provides a more efficient Key impl? In other words, would it be worth deferring to the underlying MVCCIterator implementation when !i.isCurAtIntentIter().

pkg/storage/intent_interleaving_iter.go, line 392 at r2 (raw file):

			return
		} else {
			intentSeekKey, i.intentKeyBuf = keys.LockTableSingleKey(key.Key.Next(), i.intentKeyBuf)

Could you give this a small comment that calls out why Next is required. It took me a minute to figure that out.

pkg/storage/testdata/ii_iter, line 1 at r2 (raw file):

# Both separated and interleaved intents, and one inline meta.

Nice test!

pkg/storage/testdata/ii_iter, line 47 at r2 (raw file):

next
----
cmd: seek-ge meta k=a ts=20 txn=1

nit: this would be a little easier to follow if it more clearly separated the cmd from the output, and if it included the arguments to cmds that take them, like seek-ge and seek-lt.

pkg/storage/testdata/ii_iter, line 123 at r2 (raw file):

cmd: next .

# Prefix iteration and NextKey. There is no guarantee on what we will

Does this combination make sense to support? Does the "no guarantee" part mean that its behavior is undefined?

pkg/storage/testdata/ii_iter, line 149 at r2 (raw file):

seek-ge k=a ts=25
next
next

Should this also include a case like seek-ge k=a ts=5 where we immediately seek to the next key?

sumeerbhola

TFTRs!

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @nvanbenschoten, and @sumeerbhola)

pkg/keys/keys.go, line 401 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We should note somewhere in the function's comment that buf is replaced and not appended to.

Done

pkg/storage/intent_interleaving_iter.go, line 30 at r2 (raw file):

What is the plan for this type once all intents have been moved into the lockTable? It seems like we will still need it for limited scans that do not scan the lockTable ahead of time (#49973), but will other operations that have already scanned the lockTable still use it?

Yes, that is correct -- it will continue to exist for use in limited scans only.

Also, do we plan to simplify the implementation once all intents are segregated?

Yes, I think there will be (minor) simplifications.

pkg/storage/intent_interleaving_iter.go, line 44 at r2 (raw file):

EDIT: I was able to resolve these questions after reading the rest of the code in depth, but it could be made easier to understand.

I've added a struct-level long comment and another comment here.

It might be helpful to think of this as an interleaving between two MVCCIterators, where the first is iter and the second is a transformation of intentIter that a) maps the EngineKey to another EngineKey using DecodeLockTableSingleKey and b) maps the EngineKey to an MVCCKey by giving it a timestamp of 0. If we didn't care about performance, ...

This is now included in the struct-level comment.

pkg/storage/intent_interleaving_iter.go, line 71 at r2 (raw file):

Isn't this going to alias intentKeyBuf and intentOpts.LowerBound to the same memory? Is that not unsafe? Can the buffers passed in an IterOptions be reused after a call to NewEngineIterator?

Yes to aliasing. No to it being unsafe. Yes, we don't document the copying behavior in pebbleIterator -- I have now added comments in engine.go to this effect.
I will probably need to further optimize this when integrating, since (a) iterator reuse should include reusing already allocated memory for intentInterleavingIter, (b) this intentKeyBuf reuse is quite half-hearted -- I am inclined towards having 2 bufs.

pkg/storage/intent_interleaving_iter.go, line 105 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we set i.hasUpperBound = true after this? Or should we be calling i.SetUpperBound?

I added a comment. This is not a user-specified upper-bound, so we want to keep changing it on every call to SeekGE, which is why we don't set hasUpperBound=true.

I've also added some test cases with local keys, to exercise this code.

pkg/storage/intent_interleaving_iter.go, line 109 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

.IsEmpty(). We just got rid of all of these in 4fa8a38.

Done

pkg/storage/intent_interleaving_iter.go, line 118 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: consider adding intentSeekKey = nil. I know it's not needed, but it makes the intention and the interaction with the next block more clear.

I added a comment instead -- I prefer a code style where we don't set things to their existing value, and instead assert an invariant. Since this invariant is very local (intentSeekKey was declared a few lines before) I figured it wasn't worth asserting. Let me know if you think otherwise.

pkg/storage/intent_interleaving_iter.go, line 155 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider mentioning why this doesn't set i.valid = false in a comment. It took me a second to understand how this interacts with computePos.

Done. I didn't mention computePos here since the caller sometimes uses computePos and sometimes does something more specialized (by relying on an invariant that avoids a key comparison).

pkg/storage/intent_interleaving_iter.go, line 206 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

" and provisional values must have a higher timestamp than any committed value on a key". We've actually had bugs with this in the past, so it's worth considering where we're making assumptions about this. We should be able to assume it, but it's probably best to call the assumption out explicitly.

I've added a longer comment, which mentions that we don't actually rely on it pointing to the provisional value. But we do rely on it pointing to some version of the same key.
And I audited all the other places in the code that rely on some invariant outside the scope of this iterator and added comments for them.

pkg/storage/intent_interleaving_iter.go, line 337 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This might be a good place for a comment and maybe even a diagram demonstrating how the two iterators move and why this is the case. If you're feeling inspired.

I've added a long comment. It also resulted in a couple of code changes elsewhere.

pkg/storage/intent_interleaving_iter.go, line 360 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: append(roachpb.Key(nil), unsafeKey.Key... isn't quite the same as keyCopy := make([]byte, len(key.Key)); copy(keyCopy, key.Key). One reason is that the append method leaves some capacity on the slice. In the past, I believe we measured it to be a bit slower. I'd lean towards doing exactly what we do in pebbleIterator.Key.

Also, do we think there might be cases where an Iterator implementation provides a more efficient Key impl? In other words, would it be worth deferring to the underlying MVCCIterator implementation when !i.isCurAtIntentIter().

Good point. I've duplicated the logic from pebbleIterator.Key.

pkg/storage/intent_interleaving_iter.go, line 392 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you give this a small comment that calls out why Next is required. It took me a minute to figure that out.

Done

pkg/storage/testdata/ii_iter, line 47 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: this would be a little easier to follow if it more clearly separated the cmd from the output, and if it included the arguments to cmds that take them, like seek-ge and seek-lt.

Fixed.

pkg/storage/testdata/ii_iter, line 123 at r2 (raw file):

Does this combination make sense to support? Does the "no guarantee" part mean that its behavior is undefined?

Yes, it is undefined on what happens after we are done with the key that SeekGE was called with. I've adjusted the comment. We don't "support" it, but we don't prevent the user from continuing to iterate either.

pkg/storage/testdata/ii_iter, line 149 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this also include a case like seek-ge k=a ts=5 where we immediately seek to the next key?

Good point. Added

nvanbenschoten

Reviewed 5 of 5 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal)

sumeerbhola · 2020-11-13T14:01:04Z

TFTR!
I've moved one error test case out into separate testdata files for race off and on, since the error is caught at different points of the iteration.

intentInterleavingIter allows for both physically separate and physically interleaved intents, and makes both look like interleaved intents. It uses two underlying iterators, a MVCCIterator and an EngineIterator, that are kept synchronized. Informs cockroachdb#41720 Release note: None

sumeerbhola · 2020-11-13T18:17:22Z

bors r+

craig · 2020-11-13T19:19:07Z

Build succeeded:

GitHub CI (Cockroach)

sumeerbhola requested review from itsbilal, nvanbenschoten and petermattis November 11, 2020 01:11

sumeerbhola commented Nov 11, 2020

View reviewed changes

petermattis reviewed Nov 11, 2020

View reviewed changes

sumeerbhola force-pushed the interleave branch from f56915c to 6316b09 Compare November 11, 2020 17:11

sumeerbhola commented Nov 11, 2020

View reviewed changes

sumeerbhola force-pushed the interleave branch 2 times, most recently from 9748ba1 to ee2332c Compare November 12, 2020 02:20

nvanbenschoten reviewed Nov 12, 2020

View reviewed changes

sumeerbhola force-pushed the interleave branch from ee2332c to 4c0758a Compare November 12, 2020 23:05

sumeerbhola commented Nov 12, 2020

View reviewed changes

nvanbenschoten approved these changes Nov 12, 2020

View reviewed changes

sumeerbhola force-pushed the interleave branch 2 times, most recently from 90009ab to d8243d6 Compare November 13, 2020 13:58

sumeerbhola force-pushed the interleave branch from d8243d6 to d1f80c1 Compare November 13, 2020 15:12

sumeerbhola force-pushed the interleave branch from d1f80c1 to 95b836d Compare November 13, 2020 17:17

craig bot merged commit 707c608 into cockroachdb:master Nov 13, 2020

sumeerbhola deleted the interleave branch November 23, 2020 17:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: implement iterator to interleave separated intents #56535

storage: implement iterator to interleave separated intents #56535

sumeerbhola commented Nov 11, 2020

cockroach-teamcity commented Nov 11, 2020

sumeerbhola left a comment

petermattis left a comment

sumeerbhola left a comment

nvanbenschoten left a comment

sumeerbhola left a comment

nvanbenschoten left a comment

sumeerbhola commented Nov 13, 2020

sumeerbhola commented Nov 13, 2020

craig bot commented Nov 13, 2020

storage: implement iterator to interleave separated intents #56535

storage: implement iterator to interleave separated intents #56535

Conversation

sumeerbhola commented Nov 11, 2020

cockroach-teamcity commented Nov 11, 2020

sumeerbhola left a comment

Choose a reason for hiding this comment

petermattis left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

nvanbenschoten left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

nvanbenschoten left a comment

Choose a reason for hiding this comment

sumeerbhola commented Nov 13, 2020

sumeerbhola commented Nov 13, 2020

craig bot commented Nov 13, 2020