Skip to content

Commit

Permalink
[WIP] storage: intentInterleavingIter support for multiple intents
Browse files Browse the repository at this point in the history
We allow a key (with multiple versions) to have multiple intents, under the
condition that at most one of the intents is uncommitted.

To aid this behavior we introduce a commitMap, that maintains a logical set
of TxnIDs of transactions that were "simple" in their behavior, and have
committed, where simple is defined as all the following conditions:
- No savepoint rollbacks: len(intent.IgnoredSeqNums)==0
- Single epoch, i.e., TxnMeta.Epoch==0
- Never pushed, i.e., TxnMeta.MinTimestamp==TxnMeta.WriteTimestamp
For such transactions, their provisional values can be considered committed
with the current version and local timestamp, i.e., we need no additional
information about the txn other than the TxnID.

Adding to commitMap:
The earliest a txn can be added to the commitMap is when the transaction is
in STAGING and has verified all the conditions for commit. That is, this
can be done before the transition to COMMITTED is replicated. For the node
that has the leaseholder of the range containing the txn record, this
requires no external communication. For other nodes with intents for the
txn, one could piggyback this information on the RPCs for async intent
resolution, and add to the the commitMap before doing the intent resolution
-- this piggybacking would incur 2 consensus rounds of contention. If we
are willing to send the RPC earlier, it will be contention for 1 consensus
round only. Note that these RPCs should also remove locks from the
non-persistent lock table data-structure so should send information about
the keys (like in a LockUpdate but won't remove the durable lock state).

Removal from commitMap:
The commitMap can be considered a cache of TxnIDs. It is helpful to have a
txn in the cache until its intents have been resolved. Additionally,
latchless intent resolution must pin a txn in the map before it releases
latches and unpin when the intent resolution has been applied on the
leaseholder. This pinning behavior is needed to ensure the correctness of
the in-memory concurrency.lockTable, which must maintain the property that
the replicated locks known to it are a subset of the persistent replicated
locks. We are assuming here that there is no lockTable on followers.

Why "simple-committed":
- We don't want to have to coordinate intent resolution of these multiple
  intents, by mandating that the resolution happen in any particular order.
- We want to guarantee that even if the commitMap is cleared (since it is a
  cache), we can maintain the invariant that a caller iterating over a key
  sees at most one intent. As we will illustrate below, providing this
  guarantee requires us to limit the commitMap to only contain
  simple-committed txns.

Consider a key with timestamps t5, t4, t3, t2, t1 in decreasing order and
intents for t5, t4, t2, with corresponding txns txn5, ... txn1. We consider
the disposition of an intent to be either unknown or simple-committed. In
this example, the disposition of the intent for t4 and t2 is
simple-committed solely based on the fact that there is at least one
version (provisional or otherwise) more recent that the timestamp of the
intent. That is, at most one intent, the one for t5, has a disposition that
needs to rely on knowledge that is not self-contained in the history. For
t5, we must rely on the commitMap to decide whether is unknown or
simple-committed. It is possible that some user of the
intentInterleavingIter saw t5 as simple-committed and a later user sees it
as unknown disposition, if the txn5 got removed from the commitMap -- such
regression is harmless since the latter user will simply have to do intent
resolution. Note that the intent for t5 could get resolved before those for
t4 and t2, and that is also fine since the disposition of t4, t2 stays
simple-committed. If txn5 is aborted and the intent for t5 removed, and
txn4 is no longer in the commitMap, the disposition of t4 could change to
unknown. This is also acceptable, since t5 was only serving as a "local
promise" that t4 was committed, which is simply an optimization. There is
still a less efficient globally available "promise" that t4 is committed,
and intent resolution of t4 is how we will enact that promise.

Maintaining the above guarantees requires that historical versions must not
be garbage collected without resolving intents. This is acceptable since GC
is not latency sensitive.

This PR only introduces the changes for the intentInterleavingIter.
It excludes:
- A sophisticated commitMap data-structure. There are code comments
  sketching out what properties are desirable.
- The iterForKeyVersions implementation used for intent resolution,
  now that we can have multiple intents. This will be straightforward.
- The changes for latchless intent resolution. These should be
  straightforward once we have a commitMap with pinning support and we
  extend cockroachdb#55461 to use a consistent storage snapshot for intent resolution.
- The KV layer changes to send lists of intent keys for simple-committed
  txns to the various ranges, so that they can add to their commitMap and
  remove from the in-memory lockTable.

Informs cockroachdb#66867

Release note: None
  • Loading branch information
sumeerbhola committed May 13, 2022
1 parent d05bbbe commit a39efee
Show file tree
Hide file tree
Showing 6 changed files with 857 additions and 307 deletions.
775 changes: 522 additions & 253 deletions pkg/storage/intent_interleaving_iter.go

Large diffs are not rendered by default.

22 changes: 21 additions & 1 deletion pkg/storage/intent_interleaving_iter_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ import (
"fmt"
"math/rand"
"sort"
"strconv"
"strings"
"testing"

Expand Down Expand Up @@ -332,7 +333,24 @@ func TestIntentInterleavingIter(t *testing.T) {
if d.HasArg("prefix") {
d.ScanArgs(t, "prefix", &opts.Prefix)
}
iter := wrapInUnsafeIter(newIntentInterleavingIterator(eng, opts))
var committedTxns *commitMap
if d.HasArg("commit-map") {
var commitTxnsStr string
d.ScanArgs(t, "commit-map", &commitTxnsStr)
txns := strings.Split(commitTxnsStr, ",")
committedTxns = &commitMap{
simpleCommits: make(map[uuid.UUID]struct{}),
}
for i := range txns {
txn, err := strconv.Atoi(txns[i])
require.NoError(t, err)
txnUUID := uuid.FromUint128(uint128.FromInts(0, uint64(txn)))
committedTxns.simpleCommits[txnUUID] = struct{}{}
}
}
iiter := newIntentInterleavingIterator(eng, opts)
iiter.(*intentInterleavingIter).commitMap = committedTxns
iter := wrapInUnsafeIter(iiter)
var b strings.Builder
defer iter.Close()
// pos is the original <file>:<lineno> prefix computed by
Expand Down Expand Up @@ -724,6 +742,8 @@ func doOps(t *testing.T, ops []string, eng Engine, interleave bool, out *strings
}
}

// TODO: fix test.

var seedFlag = flag.Int64("seed", -1, "specify seed to use for random number generator")

func TestRandomizedIntentInterleavingIter(t *testing.T) {
Expand Down
73 changes: 37 additions & 36 deletions pkg/storage/testdata/intent_interleaving_iter/basic
Original file line number Diff line number Diff line change
@@ -1,11 +1,11 @@
# Both separated and interleaved intents, and one inline meta.
# Separated intents, and one inline meta.
define
locks
meta k=a ts=20 txn=1
meta k=b ts=30 txn=2
mvcc
value k=a ts=20 v=a20
value k=a ts=10 v=a10
meta k=b ts=30 txn=2
value k=b ts=30 v=b30
meta k=c
value k=d ts=25 v=d25
Expand All @@ -18,7 +18,7 @@ value k=d ts=25 v=d25
# iterators.
# - The second stats call below (after some prev steps) shows a higher value
# of interface reverse steps compared to internal reverse steps. This is
# because the intent iterator is being called with PrevWithLimit ans is
# because the intent iterator is being called with PrevWithLimit and is
# already at the separated intent "a", so does not have to step the internal
# iterator.
iter lower=a upper=f
Expand Down Expand Up @@ -65,7 +65,7 @@ next: output: value k=b ts=30.000000000,0 v=b30
next: output: meta k=c
next: output: value k=d ts=25.000000000,0 v=d25
next: output: .
stats: (interface (dir, seek, step): (fwd, 2, 7), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 2, 7), (rev, 0, 0))
stats: (interface (dir, seek, step): (fwd, 2, 9), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 2, 7), (rev, 0, 0))
prev: output: value k=d ts=25.000000000,0 v=d25
prev: output: meta k=c
prev: output: value k=b ts=30.000000000,0 v=b30
Expand All @@ -74,17 +74,17 @@ prev: output: value k=a ts=10.000000000,0 v=a10
prev: output: value k=a ts=20.000000000,0 v=a20
prev: output: meta k=a ts=20.000000000,0 txn=1
prev: output: .
stats: (interface (dir, seek, step): (fwd, 2, 7), (rev, 0, 13)), (internal (dir, seek, step): (fwd, 2, 7), (rev, 2, 7))
stats: (interface (dir, seek, step): (fwd, 2, 9), (rev, 0, 12)), (internal (dir, seek, step): (fwd, 2, 7), (rev, 2, 7))
set-upper c
seek-ge "b"/0,0: output: meta k=b ts=30.000000000,0 txn=2
stats: (interface (dir, seek, step): (fwd, 4, 7), (rev, 0, 13)), (internal (dir, seek, step): (fwd, 4, 7), (rev, 2, 7))
stats: (interface (dir, seek, step): (fwd, 4, 10), (rev, 0, 12)), (internal (dir, seek, step): (fwd, 4, 8), (rev, 2, 7))
next: output: value k=b ts=30.000000000,0 v=b30
next: output: .
prev: output: value k=b ts=30.000000000,0 v=b30
prev: output: meta k=b ts=30.000000000,0 txn=2
prev: output: value k=a ts=10.000000000,0 v=a10
seek-lt "b"/0,0: output: value k=a ts=10.000000000,0 v=a10
stats: (interface (dir, seek, step): (fwd, 4, 9), (rev, 2, 19)), (internal (dir, seek, step): (fwd, 4, 9), (rev, 6, 13))
stats: (interface (dir, seek, step): (fwd, 4, 11), (rev, 2, 17)), (internal (dir, seek, step): (fwd, 4, 9), (rev, 6, 13))
next: output: meta k=b ts=30.000000000,0 txn=2
prev: output: value k=a ts=10.000000000,0 v=a10
prev: output: value k=a ts=20.000000000,0 v=a20
Expand Down Expand Up @@ -160,6 +160,8 @@ next: output: .

# Prefix iteration and NextKey. What we will see after the prefix is
# exhausted is undefined.
# seek-ge k=a exhausts the intentIter when extracting the intent, so next or
# next-key do not see the next intent.
iter prefix=true
seek-ge k=d
next-key
Expand All @@ -172,10 +174,10 @@ next-key
seek-ge "d"/0,0: output: value k=d ts=25.000000000,0 v=d25
next-key: output: .
seek-ge "a"/0,0: output: meta k=a ts=20.000000000,0 txn=1
next-key: output: meta k=b ts=30.000000000,0 txn=2
next-key: output: value k=b ts=30.000000000,0 v=b30
seek-ge "a"/0,0: output: meta k=a ts=20.000000000,0 txn=1
next: output: value k=a ts=20.000000000,0 v=a20
next-key: output: meta k=b ts=30.000000000,0 txn=2
next-key: output: value k=b ts=30.000000000,0 v=b30

# Seek to particular timestamp.
iter lower=a upper=f
Expand All @@ -198,13 +200,15 @@ prev
prev
next
seek-lt k=a ts=25
seek-lt k=a ts=19
prev
next
seek-ge k=a ts=5
next
next
prev
seek-lt k=b ts=40
seek-lt k=b ts=19
prev
prev
prev
Expand All @@ -228,18 +232,20 @@ seek-lt "a"/15.000000000,0: output: value k=a ts=20.000000000,0 v=a20
prev: output: meta k=a ts=20.000000000,0 txn=1
prev: output: .
next: output: meta k=a ts=20.000000000,0 txn=1
seek-lt "a"/25.000000000,0: output: meta k=a ts=20.000000000,0 txn=1
prev: output: .
next: output: meta k=a ts=20.000000000,0 txn=1
seek-lt "a"/25.000000000,0: output: err: SeekLT cannot be used to skip the latest version and land on intent
seek-lt "a"/19.000000000,0: output: value k=a ts=20.000000000,0 v=a20
prev: output: meta k=a ts=20.000000000,0 txn=1
next: output: value k=a ts=20.000000000,0 v=a20
seek-ge "a"/5.000000000,0: output: meta k=b ts=30.000000000,0 txn=2
next: output: value k=b ts=30.000000000,0 v=b30
next: output: meta k=c
prev: output: value k=b ts=30.000000000,0 v=b30
seek-lt "b"/40.000000000,0: output: meta k=b ts=30.000000000,0 txn=2
seek-lt "b"/40.000000000,0: output: err: SeekLT cannot be used to skip the latest version and land on intent
seek-lt "b"/19.000000000,0: output: value k=b ts=30.000000000,0 v=b30
prev: output: meta k=b ts=30.000000000,0 txn=2
prev: output: value k=a ts=10.000000000,0 v=a10
prev: output: value k=a ts=20.000000000,0 v=a20
prev: output: meta k=a ts=20.000000000,0 txn=1
next: output: value k=a ts=20.000000000,0 v=a20
next: output: value k=a ts=10.000000000,0 v=a10

# Seek to particular timestamp and prefix iteration. What we will
# see after the prefix is exhausted is undefined.
Expand Down Expand Up @@ -293,17 +299,17 @@ next-key: output: value k=d ts=25.000000000,0 v=d25
next-key: output: .


# Multiple separated intents and multiple interleaved intents.
# Multiple separated intents.
define
locks
meta k=a ts=10 txn=1
meta k=b ts=20 txn=2
meta k=c ts=30 txn=3
meta k=d ts=40 txn=4
meta k=e ts=50 txn=5
mvcc
meta k=a ts=10 txn=1
value k=a ts=10 v=a10
value k=b ts=20 v=b20
meta k=c ts=30 txn=3
value k=c ts=30 v=c30
value k=d ts=40 v=d40
value k=e ts=50 v=e50
Expand Down Expand Up @@ -386,24 +392,24 @@ prev
seek-lt k=e
prev
----
seek-ge "a"/0,0: output: meta k=b ts=20.000000000,0 txn=2
next: output: err: intentIter at intent, but iter not at provisional value
seek-lt "e"/0,0: output: meta k=d ts=40.000000000,0 txn=4
next: output: err: intent has no provisional value
seek-ge "d"/0,0: output: meta k=d ts=40.000000000,0 txn=4
next-key: output: err: intentIter at intent, but iter not at provisional value
seek-ge "d"/0,0: output: meta k=d ts=40.000000000,0 txn=4
prev: output: err: iter not at provisional value, cmp: -1
seek-lt "e"/0,0: output: meta k=d ts=40.000000000,0 txn=4
prev: output: err: reverse iteration discovered intent without provisional value
seek-ge "a"/0,0: output: err: intent with no version
next: output: err: intent with no version
seek-lt "e"/0,0: output: err: SeekLT cannot be used to skip the latest version and land on intent
next: output: err: SeekLT cannot be used to skip the latest version and land on intent
seek-ge "d"/0,0: output: err: intent with no version
next-key: output: err: intent with no version
seek-ge "d"/0,0: output: err: intent with no version
prev: output: err: intent with no version
seek-lt "e"/0,0: output: err: SeekLT cannot be used to skip the latest version and land on intent
prev: output: err: SeekLT cannot be used to skip the latest version and land on intent

# Local range keys. This exercises local keys having separated locks.
define
locks
meta k=La ts=10 txn=1
meta k=Lb ts=20 txn=2
meta k=Lc ts=30 txn=4
mvcc
meta k=La ts=10 txn=1
value k=La ts=10 v=a10
value k=Lb ts=20 v=b20
value k=Lc ts=30 v=c30
Expand Down Expand Up @@ -523,12 +529,7 @@ seek-ge "Lb"/25.000000000,0: output: value k=Lb ts=20.000000000,0 v=b20
next: output: .
seek-ge "Lc"/25.000000000,0: output: .

# Keys with \x00 byte. To exercise the slow-path in UnsafeRawMVCCKey. The keys
# that are length 8, 16 will exercise the slow-path. The len(key) < 8 does
# not. DecodeLockTableSingleKey allocates a new slice for all these keys, due
# to the escaping. However the starting capacity is 8, and the next growth
# step is 16, which means that when len(key) < 8, the len of the allocated
# slice is smaller than cap and the first byte beyond len is 0.
# Keys with \x00 byte.
define
locks
meta k=abcdefg\0 ts=20 txn=1
Expand Down Expand Up @@ -593,12 +594,12 @@ prev: output: .

define
locks
meta k=La ts=10 txn=1
meta k=Lb ts=20 txn=2
meta k=Lc ts=30 txn=4
meta k=b ts=40 txn=5
meta k=d ts=50 txn=6
mvcc
meta k=La ts=10 txn=1
value k=La ts=10 v=a10
value k=Lb ts=20 v=b20
value k=Lc ts=30 v=c30
Expand Down
17 changes: 10 additions & 7 deletions pkg/storage/testdata/intent_interleaving_iter/error_race
Original file line number Diff line number Diff line change
Expand Up @@ -3,11 +3,12 @@
# when RaceEnabled=true. This file contains the output when
# RaceEnabled=true

# Error case: Multiple intents for a key
# Error case: Multiple intents for a key but only one has corresponding
# version.
define
locks
meta k=a ts=10 txn=1
meta k=b ts=20 txn=2
meta k=b ts=19 txn=2
meta k=b ts=20 txn=4
meta k=c ts=30 txn=4
mvcc
Expand All @@ -16,6 +17,8 @@ value k=b ts=20 v=b20
value k=c ts=30 v=c30
----

# We don't catch the error on forward iteration, since we forget the simple-committed
# intent.
iter lower=a upper=d
seek-ge k=a
next
Expand All @@ -32,11 +35,11 @@ prev
seek-ge "a"/0,0: output: meta k=a ts=10.000000000,0 txn=1
next: output: value k=a ts=10.000000000,0 v=a10
next: output: meta k=b ts=20.000000000,0 txn=4
next: output: err: intentIter incorrectly positioned, cmp: 0
next: output: err: intentIter incorrectly positioned, cmp: 0
next: output: err: intentIter incorrectly positioned, cmp: 0
next: output: value k=b ts=20.000000000,0 v=b20
next: output: meta k=c ts=30.000000000,0 txn=4
next: output: value k=c ts=30.000000000,0 v=c30
seek-lt "d"/0,0: output: value k=c ts=30.000000000,0 v=c30
prev: output: meta k=c ts=30.000000000,0 txn=4
prev: output: value k=b ts=20.000000000,0 v=b20
prev: output: meta k=b ts=20.000000000,0 txn=2
prev: output: err: intentIter should not be after iter
prev: output: meta k=b ts=20.000000000,0 txn=4
prev: output: value k=a ts=10.000000000,0 v=a10
21 changes: 11 additions & 10 deletions pkg/storage/testdata/intent_interleaving_iter/error_race_off
Original file line number Diff line number Diff line change
Expand Up @@ -3,11 +3,12 @@
# when RaceEnabled=true. This file contains the output when
# RaceEnabled=false

# Error case: Multiple intents for a key
# Error case: Multiple intents for a key but only one has corresponding
# version.
define
locks
meta k=a ts=10 txn=1
meta k=b ts=20 txn=2
meta k=b ts=19 txn=2
meta k=b ts=20 txn=4
meta k=c ts=30 txn=4
mvcc
Expand All @@ -16,10 +17,8 @@ value k=b ts=20 v=b20
value k=c ts=30 v=c30
----

# We don't catch the error immediately on forward iteration, since unnecessary
# key comparisons are elided (except for RaceEnabled=true). Which is why the
# intent for k=b is returned twice. But continued forward iteration eventually
# catches it.
# We don't catch the error on forward iteration, since we forget the simple-committed
# intent.
iter lower=a upper=d
seek-ge k=a
next
Expand All @@ -37,10 +36,12 @@ seek-ge "a"/0,0: output: meta k=a ts=10.000000000,0 txn=1
next: output: value k=a ts=10.000000000,0 v=a10
next: output: meta k=b ts=20.000000000,0 txn=4
next: output: value k=b ts=20.000000000,0 v=b20
next: output: meta k=b ts=20.000000000,0 txn=2
next: output: err: intentIter at intent, but iter not at provisional value
next: output: meta k=c ts=30.000000000,0 txn=4
next: output: value k=c ts=30.000000000,0 v=c30
seek-lt "d"/0,0: output: value k=c ts=30.000000000,0 v=c30
prev: output: meta k=c ts=30.000000000,0 txn=4
prev: output: value k=b ts=20.000000000,0 v=b20
prev: output: meta k=b ts=20.000000000,0 txn=2
prev: output: err: intentIter should not be after iter
prev: output: meta k=b ts=20.000000000,0 txn=4
prev: output: value k=a ts=10.000000000,0 v=a10

# TODO: add more tests that differ depending on RaceEnabled value
Loading

0 comments on commit a39efee

Please sign in to comment.