[WIP] storage: intentInterleavingIter support for multiple intents

We allow a key (with multiple versions) to have multiple intents, under the condition that at most one of the intents is uncommitted. To aid this behavior we introduce a commitMap, that maintains a logical set of TxnIDs of transactions that were "simple" in their behavior, and have committed, where simple is defined as all the following conditions: - No savepoint rollbacks: len(intent.IgnoredSeqNums)==0 - Single epoch, i.e., TxnMeta.Epoch==0 - Never pushed, i.e., TxnMeta.MinTimestamp==TxnMeta.WriteTimestamp For such transactions, their provisional values can be considered committed with the current version and local timestamp, i.e., we need no additional information about the txn other than the TxnID. Adding to commitMap: The earliest a txn can be added to the commitMap is when the transaction is in STAGING and has verified all the conditions for commit. That is, this can be done before the transition to COMMITTED is replicated. For the node that has the leaseholder of the range containing the txn record, this requires no external communication. For other nodes with intents for the txn, one could piggyback this information on the RPCs for async intent resolution, and add to the the commitMap before doing the intent resolution -- this piggybacking would incur 2 consensus rounds of contention. If we are willing to send the RPC earlier, it will be contention for 1 consensus round only. Note that these RPCs should also remove locks from the non-persistent lock table data-structure so should send information about the keys (like in a LockUpdate but won't remove the durable lock state). Removal from commitMap: The commitMap can be considered a cache of TxnIDs. It is helpful to have a txn in the cache until its intents have been resolved. Additionally, latchless intent resolution must pin a txn in the map before it releases latches and unpin when the intent resolution has been applied on the leaseholder. This pinning behavior is needed to ensure the correctness of the in-memory concurrency.lockTable, which must maintain the property that the replicated locks known to it are a subset of the persistent replicated locks. We are assuming here that there is no lockTable on followers. Why "simple-committed": - We don't want to have to coordinate intent resolution of these multiple intents, by mandating that the resolution happen in any particular order. - We want to guarantee that even if the commitMap is cleared (since it is a cache), we can maintain the invariant that a caller iterating over a key sees at most one intent. As we will illustrate below, providing this guarantee requires us to limit the commitMap to only contain simple-committed txns. Consider a key with timestamps t5, t4, t3, t2, t1 in decreasing order and intents for t5, t4, t2, with corresponding txns txn5, ... txn1. We consider the disposition of an intent to be either unknown or simple-committed. In this example, the disposition of the intent for t4 and t2 is simple-committed solely based on the fact that there is at least one version (provisional or otherwise) more recent that the timestamp of the intent. That is, at most one intent, the one for t5, has a disposition that needs to rely on knowledge that is not self-contained in the history. For t5, we must rely on the commitMap to decide whether is unknown or simple-committed. It is possible that some user of the intentInterleavingIter saw t5 as simple-committed and a later user sees it as unknown disposition, if the txn5 got removed from the commitMap -- such regression is harmless since the latter user will simply have to do intent resolution. Note that the intent for t5 could get resolved before those for t4 and t2, and that is also fine since the disposition of t4, t2 stays simple-committed. If txn5 is aborted and the intent for t5 removed, and txn4 is no longer in the commitMap, the disposition of t4 could change to unknown. This is also acceptable, since t5 was only serving as a "local promise" that t4 was committed, which is simply an optimization. There is still a less efficient globally available "promise" that t4 is committed, and intent resolution of t4 is how we will enact that promise. Maintaining the above guarantees requires that historical versions must not be garbage collected without resolving intents. This is acceptable since GC is not latency sensitive. This PR only introduces the changes for the intentInterleavingIter. It excludes: - A sophisticated commitMap data-structure. There are code comments sketching out what properties are desirable. - The iterForKeyVersions implementation used for intent resolution, now that we can have multiple intents. This will be straightforward. - The changes for latchless intent resolution. These should be straightforward once we have a commitMap with pinning support and we extend cockroachdb#55461 to use a consistent storage snapshot for intent resolution. - The KV layer changes to send lists of intent keys for simple-committed txns to the various ranges, so that they can add to their commitMap and remove from the in-memory lockTable. Informs cockroachdb#66867 Release note: None
sumeerbhola · May 13, 2022 · a39efee · a39efee
1 parent d05bbbe
commit a39efee
Show file tree

Hide file tree

Showing 6 changed files with 857 additions and 307 deletions.
diff --git a/pkg/storage/intent_interleaving_iter.go b/pkg/storage/intent_interleaving_iter.go
diff --git a/pkg/storage/intent_interleaving_iter_test.go b/pkg/storage/intent_interleaving_iter_test.go
@@ -16,6 +16,7 @@ import (
 	"fmt"
 	"math/rand"
 	"sort"
+	"strconv"
 	"strings"
 	"testing"
 
@@ -332,7 +333,24 @@ func TestIntentInterleavingIter(t *testing.T) {
 				if d.HasArg("prefix") {
 					d.ScanArgs(t, "prefix", &opts.Prefix)
 				}
-				iter := wrapInUnsafeIter(newIntentInterleavingIterator(eng, opts))
+				var committedTxns *commitMap
+				if d.HasArg("commit-map") {
+					var commitTxnsStr string
+					d.ScanArgs(t, "commit-map", &commitTxnsStr)
+					txns := strings.Split(commitTxnsStr, ",")
+					committedTxns = &commitMap{
+						simpleCommits: make(map[uuid.UUID]struct{}),
+					}
+					for i := range txns {
+						txn, err := strconv.Atoi(txns[i])
+						require.NoError(t, err)
+						txnUUID := uuid.FromUint128(uint128.FromInts(0, uint64(txn)))
+						committedTxns.simpleCommits[txnUUID] = struct{}{}
+					}
+				}
+				iiter := newIntentInterleavingIterator(eng, opts)
+				iiter.(*intentInterleavingIter).commitMap = committedTxns
+				iter := wrapInUnsafeIter(iiter)
 				var b strings.Builder
 				defer iter.Close()
 				// pos is the original <file>:<lineno> prefix computed by
@@ -724,6 +742,8 @@ func doOps(t *testing.T, ops []string, eng Engine, interleave bool, out *strings
 	}
 }
 
+// TODO: fix test.
+
 var seedFlag = flag.Int64("seed", -1, "specify seed to use for random number generator")
 
 func TestRandomizedIntentInterleavingIter(t *testing.T) {

diff --git a/pkg/storage/testdata/intent_interleaving_iter/basic b/pkg/storage/testdata/intent_interleaving_iter/basic
@@ -1,11 +1,11 @@
-# Both separated and interleaved intents, and one inline meta.
+# Separated intents, and one inline meta.
 define
 locks
 meta k=a ts=20 txn=1
+meta k=b ts=30 txn=2
 mvcc
 value k=a ts=20 v=a20
 value k=a ts=10 v=a10
-meta k=b ts=30 txn=2
 value k=b ts=30 v=b30
 meta k=c
 value k=d ts=25 v=d25
@@ -18,7 +18,7 @@ value k=d ts=25 v=d25
 #   iterators.
 # - The second stats call below (after some prev steps) shows a higher value
 #   of interface reverse steps compared to internal reverse steps. This is
-#   because the intent iterator is being called with PrevWithLimit ans is
+#   because the intent iterator is being called with PrevWithLimit and is
 #   already at the separated intent "a", so does not have to step the internal
 #   iterator.
 iter lower=a upper=f
@@ -65,7 +65,7 @@ next: output: value k=b ts=30.000000000,0 v=b30
 next: output: meta k=c
 next: output: value k=d ts=25.000000000,0 v=d25
 next: output: .
-stats: (interface (dir, seek, step): (fwd, 2, 7), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 2, 7), (rev, 0, 0))
+stats: (interface (dir, seek, step): (fwd, 2, 9), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 2, 7), (rev, 0, 0))
 prev: output: value k=d ts=25.000000000,0 v=d25
 prev: output: meta k=c
 prev: output: value k=b ts=30.000000000,0 v=b30
@@ -74,17 +74,17 @@ prev: output: value k=a ts=10.000000000,0 v=a10
 prev: output: value k=a ts=20.000000000,0 v=a20
 prev: output: meta k=a ts=20.000000000,0 txn=1
 prev: output: .
-stats: (interface (dir, seek, step): (fwd, 2, 7), (rev, 0, 13)), (internal (dir, seek, step): (fwd, 2, 7), (rev, 2, 7))
+stats: (interface (dir, seek, step): (fwd, 2, 9), (rev, 0, 12)), (internal (dir, seek, step): (fwd, 2, 7), (rev, 2, 7))
 set-upper c
 seek-ge "b"/0,0: output: meta k=b ts=30.000000000,0 txn=2
-stats: (interface (dir, seek, step): (fwd, 4, 7), (rev, 0, 13)), (internal (dir, seek, step): (fwd, 4, 7), (rev, 2, 7))
+stats: (interface (dir, seek, step): (fwd, 4, 10), (rev, 0, 12)), (internal (dir, seek, step): (fwd, 4, 8), (rev, 2, 7))
 next: output: value k=b ts=30.000000000,0 v=b30
 next: output: .
 prev: output: value k=b ts=30.000000000,0 v=b30
 prev: output: meta k=b ts=30.000000000,0 txn=2
 prev: output: value k=a ts=10.000000000,0 v=a10
 seek-lt "b"/0,0: output: value k=a ts=10.000000000,0 v=a10
-stats: (interface (dir, seek, step): (fwd, 4, 9), (rev, 2, 19)), (internal (dir, seek, step): (fwd, 4, 9), (rev, 6, 13))
+stats: (interface (dir, seek, step): (fwd, 4, 11), (rev, 2, 17)), (internal (dir, seek, step): (fwd, 4, 9), (rev, 6, 13))
 next: output: meta k=b ts=30.000000000,0 txn=2
 prev: output: value k=a ts=10.000000000,0 v=a10
 prev: output: value k=a ts=20.000000000,0 v=a20
@@ -160,6 +160,8 @@ next: output: .
 
 # Prefix iteration and NextKey. What we will see after the prefix is
 # exhausted is undefined.
+# seek-ge k=a exhausts the intentIter when extracting the intent, so next or
+# next-key do not see the next intent.
 iter prefix=true
 seek-ge k=d
 next-key
@@ -172,10 +174,10 @@ next-key
 seek-ge "d"/0,0: output: value k=d ts=25.000000000,0 v=d25
 next-key: output: .
 seek-ge "a"/0,0: output: meta k=a ts=20.000000000,0 txn=1
-next-key: output: meta k=b ts=30.000000000,0 txn=2
+next-key: output: value k=b ts=30.000000000,0 v=b30
 seek-ge "a"/0,0: output: meta k=a ts=20.000000000,0 txn=1
 next: output: value k=a ts=20.000000000,0 v=a20
-next-key: output: meta k=b ts=30.000000000,0 txn=2
+next-key: output: value k=b ts=30.000000000,0 v=b30
 
 # Seek to particular timestamp.
 iter lower=a upper=f
@@ -198,13 +200,15 @@ prev
 prev
 next
 seek-lt k=a ts=25
+seek-lt k=a ts=19
 prev
 next
 seek-ge k=a ts=5
 next
 next
 prev
 seek-lt k=b ts=40
+seek-lt k=b ts=19
 prev
 prev
 prev
@@ -228,18 +232,20 @@ seek-lt "a"/15.000000000,0: output: value k=a ts=20.000000000,0 v=a20
 prev: output: meta k=a ts=20.000000000,0 txn=1
 prev: output: .
 next: output: meta k=a ts=20.000000000,0 txn=1
-seek-lt "a"/25.000000000,0: output: meta k=a ts=20.000000000,0 txn=1
-prev: output: .
-next: output: meta k=a ts=20.000000000,0 txn=1
+seek-lt "a"/25.000000000,0: output: err: SeekLT cannot be used to skip the latest version and land on intent
+seek-lt "a"/19.000000000,0: output: value k=a ts=20.000000000,0 v=a20
+prev: output: meta k=a ts=20.000000000,0 txn=1
+next: output: value k=a ts=20.000000000,0 v=a20
 seek-ge "a"/5.000000000,0: output: meta k=b ts=30.000000000,0 txn=2
 next: output: value k=b ts=30.000000000,0 v=b30
 next: output: meta k=c
 prev: output: value k=b ts=30.000000000,0 v=b30
-seek-lt "b"/40.000000000,0: output: meta k=b ts=30.000000000,0 txn=2
+seek-lt "b"/40.000000000,0: output: err: SeekLT cannot be used to skip the latest version and land on intent
+seek-lt "b"/19.000000000,0: output: value k=b ts=30.000000000,0 v=b30
+prev: output: meta k=b ts=30.000000000,0 txn=2
 prev: output: value k=a ts=10.000000000,0 v=a10
 prev: output: value k=a ts=20.000000000,0 v=a20
-prev: output: meta k=a ts=20.000000000,0 txn=1
-next: output: value k=a ts=20.000000000,0 v=a20
+next: output: value k=a ts=10.000000000,0 v=a10
 
 # Seek to particular timestamp and prefix iteration. What we will
 # see after the prefix is exhausted is undefined.
@@ -293,17 +299,17 @@ next-key: output: value k=d ts=25.000000000,0 v=d25
 next-key: output: .
 
 
-# Multiple separated intents and multiple interleaved intents.
+# Multiple separated intents.
 define
 locks
+meta k=a ts=10 txn=1
 meta k=b ts=20 txn=2
+meta k=c ts=30 txn=3
 meta k=d ts=40 txn=4
 meta k=e ts=50 txn=5
 mvcc
-meta k=a ts=10 txn=1
 value k=a ts=10 v=a10
 value k=b ts=20 v=b20
-meta k=c ts=30 txn=3
 value k=c ts=30 v=c30
 value k=d ts=40 v=d40
 value k=e ts=50 v=e50
@@ -386,24 +392,24 @@ prev
 seek-lt k=e
 prev
 ----
-seek-ge "a"/0,0: output: meta k=b ts=20.000000000,0 txn=2
-next: output: err: intentIter at intent, but iter not at provisional value
-seek-lt "e"/0,0: output: meta k=d ts=40.000000000,0 txn=4
-next: output: err: intent has no provisional value
-seek-ge "d"/0,0: output: meta k=d ts=40.000000000,0 txn=4
-next-key: output: err: intentIter at intent, but iter not at provisional value
-seek-ge "d"/0,0: output: meta k=d ts=40.000000000,0 txn=4
-prev: output: err: iter not at provisional value, cmp: -1
-seek-lt "e"/0,0: output: meta k=d ts=40.000000000,0 txn=4
-prev: output: err: reverse iteration discovered intent without provisional value
+seek-ge "a"/0,0: output: err: intent with no version
+next: output: err: intent with no version
+seek-lt "e"/0,0: output: err: SeekLT cannot be used to skip the latest version and land on intent
+next: output: err: SeekLT cannot be used to skip the latest version and land on intent
+seek-ge "d"/0,0: output: err: intent with no version
+next-key: output: err: intent with no version
+seek-ge "d"/0,0: output: err: intent with no version
+prev: output: err: intent with no version
+seek-lt "e"/0,0: output: err: SeekLT cannot be used to skip the latest version and land on intent
+prev: output: err: SeekLT cannot be used to skip the latest version and land on intent
 
 # Local range keys. This exercises local keys having separated locks.
 define
 locks
+meta k=La ts=10 txn=1
 meta k=Lb ts=20 txn=2
 meta k=Lc ts=30 txn=4
 mvcc
-meta k=La ts=10 txn=1
 value k=La ts=10 v=a10
 value k=Lb ts=20 v=b20
 value k=Lc ts=30 v=c30
@@ -523,12 +529,7 @@ seek-ge "Lb"/25.000000000,0: output: value k=Lb ts=20.000000000,0 v=b20
 next: output: .
 seek-ge "Lc"/25.000000000,0: output: .
 
-# Keys with \x00 byte. To exercise the slow-path in UnsafeRawMVCCKey. The keys
-# that are length 8, 16 will exercise the slow-path. The len(key) < 8 does
-# not. DecodeLockTableSingleKey allocates a new slice for all these keys, due
-# to the escaping. However the starting capacity is 8, and the next growth
-# step is 16, which means that when len(key) < 8, the len of the allocated
-# slice is smaller than cap and the first byte beyond len is 0.
+# Keys with \x00 byte.
 define
 locks
 meta k=abcdefg\0 ts=20 txn=1
@@ -593,12 +594,12 @@ prev: output: .
 
 define
 locks
+meta k=La ts=10 txn=1
 meta k=Lb ts=20 txn=2
 meta k=Lc ts=30 txn=4
 meta k=b ts=40 txn=5
 meta k=d ts=50 txn=6
 mvcc
-meta k=La ts=10 txn=1
 value k=La ts=10 v=a10
 value k=Lb ts=20 v=b20
 value k=Lc ts=30 v=c30

diff --git a/pkg/storage/testdata/intent_interleaving_iter/error_race b/pkg/storage/testdata/intent_interleaving_iter/error_race
@@ -3,11 +3,12 @@
 # when RaceEnabled=true. This file contains the output when
 # RaceEnabled=true
 
-# Error case: Multiple intents for a key
+# Error case: Multiple intents for a key but only one has corresponding
+# version.
 define
 locks
 meta k=a ts=10 txn=1
-meta k=b ts=20 txn=2
+meta k=b ts=19 txn=2
 meta k=b ts=20 txn=4
 meta k=c ts=30 txn=4
 mvcc
@@ -16,6 +17,8 @@ value k=b ts=20 v=b20
 value k=c ts=30 v=c30
 ----
 
+# We don't catch the error on forward iteration, since we forget the simple-committed
+# intent.
 iter lower=a upper=d
 seek-ge k=a
 next
@@ -32,11 +35,11 @@ prev
 seek-ge "a"/0,0: output: meta k=a ts=10.000000000,0 txn=1
 next: output: value k=a ts=10.000000000,0 v=a10
 next: output: meta k=b ts=20.000000000,0 txn=4
-next: output: err: intentIter incorrectly positioned, cmp: 0
-next: output: err: intentIter incorrectly positioned, cmp: 0
-next: output: err: intentIter incorrectly positioned, cmp: 0
+next: output: value k=b ts=20.000000000,0 v=b20
+next: output: meta k=c ts=30.000000000,0 txn=4
+next: output: value k=c ts=30.000000000,0 v=c30
 seek-lt "d"/0,0: output: value k=c ts=30.000000000,0 v=c30
 prev: output: meta k=c ts=30.000000000,0 txn=4
 prev: output: value k=b ts=20.000000000,0 v=b20
-prev: output: meta k=b ts=20.000000000,0 txn=2
-prev: output: err: intentIter should not be after iter
+prev: output: meta k=b ts=20.000000000,0 txn=4
+prev: output: value k=a ts=10.000000000,0 v=a10
diff --git a/pkg/storage/testdata/intent_interleaving_iter/error_race_off b/pkg/storage/testdata/intent_interleaving_iter/error_race_off
@@ -3,11 +3,12 @@
 # when RaceEnabled=true. This file contains the output when
 # RaceEnabled=false
 
-# Error case: Multiple intents for a key
+# Error case: Multiple intents for a key but only one has corresponding
+# version.
 define
 locks
 meta k=a ts=10 txn=1
-meta k=b ts=20 txn=2
+meta k=b ts=19 txn=2
 meta k=b ts=20 txn=4
 meta k=c ts=30 txn=4
 mvcc
@@ -16,10 +17,8 @@ value k=b ts=20 v=b20
 value k=c ts=30 v=c30
 ----
 
-# We don't catch the error immediately on forward iteration, since unnecessary
-# key comparisons are elided (except for RaceEnabled=true). Which is why the
-# intent for k=b is returned twice. But continued forward iteration eventually
-# catches it.
+# We don't catch the error on forward iteration, since we forget the simple-committed
+# intent.
 iter lower=a upper=d
 seek-ge k=a
 next
@@ -37,10 +36,12 @@ seek-ge "a"/0,0: output: meta k=a ts=10.000000000,0 txn=1
 next: output: value k=a ts=10.000000000,0 v=a10
 next: output: meta k=b ts=20.000000000,0 txn=4
 next: output: value k=b ts=20.000000000,0 v=b20
-next: output: meta k=b ts=20.000000000,0 txn=2
-next: output: err: intentIter at intent, but iter not at provisional value
+next: output: meta k=c ts=30.000000000,0 txn=4
+next: output: value k=c ts=30.000000000,0 v=c30
 seek-lt "d"/0,0: output: value k=c ts=30.000000000,0 v=c30
 prev: output: meta k=c ts=30.000000000,0 txn=4
 prev: output: value k=b ts=20.000000000,0 v=b20
-prev: output: meta k=b ts=20.000000000,0 txn=2
-prev: output: err: intentIter should not be after iter
+prev: output: meta k=b ts=20.000000000,0 txn=4
+prev: output: value k=a ts=10.000000000,0 v=a10
+
+# TODO: add more tests that differ depending on RaceEnabled value