From 6f4942f34ac8cdca48d45e81e4f8f82a43554889 Mon Sep 17 00:00:00 2001 From: Nathan VanBenschoten Date: Thu, 28 Oct 2021 16:12:57 -0400 Subject: [PATCH] kv/storage: disable observed timestamps for async resolved intents that are pushed Fixes #36431. Fixes #49360. This commit fixes the potential for a stale read as detailed in #36431 using the approach detailed in https://github.com/cockroachdb/cockroach/issues/36431#issuecomment-718347165. This bug requires a combination of skewed clocks, multi-key transactions split across ranges whose leaseholders are stored on different nodes, a transaction read refresh, and the use of observed timestamps to avoid an uncertainty restart. With the combination of these four factors, it was possible to construct an ordering of events that violated real-time ordering and allowed a transaction to observe a stale read. Upon the discovery of the bug, we [introduced](https://github.com/cockroachdb/jepsen/pull/19) the `multi-register` test to the Jepsen test suite, and have since observed the test fail when combined with the `strobe-skews` nemesis due to this bug in #49360 (and a few issues linked to that one). This commit stabilizes that test. \### Explanation The combination of all of the factors listed above can lead to the stale read because it breaks one of the invariants that the observed timestamp infrastructure[^1] relied upon for correctness. Specifically, observed timestamps relied on the guarantee that a leaseholder's clock must always be equal to or greater than the timestamp of all writes that it has served. However, this guarantee did not always hold. It does hold for non-transactional writes. It also holds for transactions that perform all of their intent writes at the same timestamp and then commit at this timestamp. However, it does not hold for transactions which move their commit timestamp forward over their lifetime before committing, writing intents at different timestamps along the way and "pulling them up" to the commit timestamp after committing. In violating the invariant, this third case reveals an ambiguity in what it means for a leaseholder to "serve a write at a timestamp". The meaning of this phrase is straightforward for non-transactional writes. However, for an intent write whose original timestamp is provisional and whose eventual commit timestamp is stored indirectly in its transaction record at its time of commit, the meaning is less clear. This reconciliation to move the intent write's timestamp up to its transaction's commit timestamp is asynchronous from the transaction commit (and after it has been externally acknowledged). So even if a leaseholder has only served writes with provisional timestamps up to timestamp 100 (placing a lower bound on its clock of 100), it can be in possession of intents that, when resolved, will carry a timestamp of 200. To uphold the real-time ordering property, this value must be observed by any transaction that begins after the value's transaction committed and was acknowledged. So for observed timestamps to be correct as currently written, we would need a guarantee that this value's leaseholder would never return an observed timestamp < 200 at any point after the transaction commits. But with the transaction commit possibly occurring on another node and with communication to resolve the intent occurring asynchronously, this seems like an impossible guarantee to make. This would appear to undermine observed timestamps to the point where they cannot be used. However, we can claw back some utility by recognizing that only a small fraction[^2] of transactions commit at a different timestamps than the one they used while writing intents. We can also recognize that if we were to compare observed timestamps against the timestamp that a committed value was originally written (its provisional value if it was once an intent) instead of the timestamp that it had been moved to on commit, then the invariant would hold. This commit does not take full advantage of this second observation, because we do not retain "written timestamps" in committed values (though we do in intents). Instead, we do something less optimal but cheaper and more convenient. Any intent whose timestamp is changed during asynchronous intent resolution is marked as "synthetic". Doing so is a compressed way of saying that the value could have originally been written as an intent at any time (even min_timestamp) and so observed timestamps cannot be used to limit uncertainty by pulling a read's uncertainty interval below the value's timestamp. This application of the synthetic bit prevents observed timestamps from being used to avoid uncertainty restarts with the set of committed values whose timestamps do not reflect their original write time and therefore do not make a claim about the clock of their leaseholder at the time that they were committed. It does not change the interaction between observed timestamps and any other committed value. \### Correctness testing I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to cause it to fail, even on master. We've only seen the test fail a handful of times over the past few years, so this isn't much of a surprise. Still, this prevents us from saying anything concrete about an reduced failure rate. However, the commit does add a new test called `TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls manual clocks directly and was able to deterministically reproduce the stale read before this fix in a few different ways. After this fix, the test passes. \### Performance analysis This correctness fix will lead to an increased rate of transaction retries under some workloads. TODO(nvanbenschoten): - observe TPC-C performance -- top-line performance -- uncertainty retry rate -- commit-wait rate (should be zero) - compare YCSB performance ---- Release note (bug fix): fixed a rare race condition that could allow for a transaction to serve a stale read and violate real-time ordering under moderate clock skew. [^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go) for an explanation of the role of observed timestamps in the transaction model. This commit updates that documentation to include this fix. [^2]: see analysis in https://github.com/cockroachdb/cockroach/issues/36431#issuecomment-714221846. --- pkg/ccl/changefeedccl/kvevent/event.go | 10 + pkg/cli/debug.go | 2 +- pkg/kv/kvserver/batch_spanset_test.go | 2 +- .../kvserver/batcheval/cmd_end_transaction.go | 8 +- .../batcheval/cmd_refresh_range_test.go | 2 +- .../kvserver/batcheval/cmd_resolve_intent.go | 5 +- .../batcheval/cmd_resolve_intent_range.go | 5 +- pkg/kv/kvserver/below_raft_protos_test.go | 2 +- pkg/kv/kvserver/client_replica_test.go | 238 ++++++++++++++++++ pkg/kv/kvserver/client_test.go | 9 + pkg/kv/kvserver/observedts/doc.go | 69 ++++- pkg/storage/bench_test.go | 8 +- pkg/storage/metamorphic/operations.go | 4 +- pkg/storage/mvcc.go | 46 +++- pkg/storage/mvcc_history_test.go | 23 +- pkg/storage/mvcc_incremental_iterator_test.go | 12 +- pkg/storage/mvcc_logical_ops_test.go | 8 +- pkg/storage/mvcc_stats_test.go | 18 +- pkg/storage/mvcc_test.go | 146 +++++------ .../testdata/mvcc_histories/ignored_seq_nums | 2 +- .../resolve_intent_with_async_resolution | 157 ++++++++++++ pkg/util/hlc/timestamp.pb.go | 9 + pkg/util/hlc/timestamp.proto | 9 + 23 files changed, 653 insertions(+), 141 deletions(-) create mode 100644 pkg/storage/testdata/mvcc_histories/resolve_intent_with_async_resolution diff --git a/pkg/ccl/changefeedccl/kvevent/event.go b/pkg/ccl/changefeedccl/kvevent/event.go index 52a84282891c..2349c0c249ff 100644 --- a/pkg/ccl/changefeedccl/kvevent/event.go +++ b/pkg/ccl/changefeedccl/kvevent/event.go @@ -187,6 +187,11 @@ func (b *Event) DetachAlloc() Alloc { func MakeResolvedEvent( span roachpb.Span, ts hlc.Timestamp, boundaryType jobspb.ResolvedSpan_BoundaryType, ) Event { + // Strip synthetic bit from CDC events. It can confuse consumers. + // TODO DURING PR: this is needed to fix a few tests, but hints at a larger + // question of whether we like that the synthetic timestamp flag can escape + // outside the system. + ts.Synthetic = false return Event{ resolved: &jobspb.ResolvedSpan{ Span: span, @@ -201,6 +206,11 @@ func MakeResolvedEvent( func MakeKVEvent( kv roachpb.KeyValue, prevVal roachpb.Value, backfillTimestamp hlc.Timestamp, ) Event { + // Strip synthetic bit from CDC events. It can confuse consumers. + // TODO DURING PR: this is needed to fix a few tests, but hints at a larger + // question of whether we like that the synthetic timestamp flag can escape + // outside the system. + kv.Value.Timestamp.Synthetic = false return Event{ kv: kv, prevVal: prevVal, diff --git a/pkg/cli/debug.go b/pkg/cli/debug.go index 7d0bd504f0fc..8023df8af510 100644 --- a/pkg/cli/debug.go +++ b/pkg/cli/debug.go @@ -1299,7 +1299,7 @@ func removeDeadReplicas( Txn: intent.Txn, Status: roachpb.ABORTED, } - if _, err := storage.MVCCResolveWriteIntent(ctx, batch, &ms, update); err != nil { + if _, err := storage.MVCCResolveWriteIntent(ctx, batch, &ms, update, false /* asyncResolution */); err != nil { return nil, err } // With the intent resolved, we can try again. diff --git a/pkg/kv/kvserver/batch_spanset_test.go b/pkg/kv/kvserver/batch_spanset_test.go index a3a375457ae2..8bffb8fce393 100644 --- a/pkg/kv/kvserver/batch_spanset_test.go +++ b/pkg/kv/kvserver/batch_spanset_test.go @@ -596,7 +596,7 @@ func TestSpanSetMVCCResolveWriteIntentRange(t *testing.T) { Status: roachpb.PENDING, } if _, _, err := storage.MVCCResolveWriteIntentRange( - ctx, batch, nil /* ms */, intent, 0, eng.IsSeparatedIntentsEnabledForTesting(ctx), + ctx, batch, nil /* ms */, intent, 0, false, eng.IsSeparatedIntentsEnabledForTesting(ctx), ); err != nil { t.Fatal(err) } diff --git a/pkg/kv/kvserver/batcheval/cmd_end_transaction.go b/pkg/kv/kvserver/batcheval/cmd_end_transaction.go index 6fe8f0e2b17f..b8f7620315c6 100644 --- a/pkg/kv/kvserver/batcheval/cmd_end_transaction.go +++ b/pkg/kv/kvserver/batcheval/cmd_end_transaction.go @@ -451,7 +451,9 @@ func resolveLocalLocks( desc = &mergeTrigger.LeftDesc } - var resolveAllowance int64 = lockResolutionBatchSize + // Any intent resolved here is resolved synchronously with the txn commit. + const asyncResolution = false + resolveAllowance := int64(lockResolutionBatchSize) if args.InternalCommitTrigger != nil { // If this is a system transaction (such as a split or merge), don't enforce the resolve allowance. // These transactions rely on having their locks resolved synchronously. @@ -486,7 +488,7 @@ func resolveLocalLocks( // // Note that the underlying pebbleIterator will still be reused // since readWriter is a pebbleBatch in the typical case. - ok, err := storage.MVCCResolveWriteIntent(ctx, readWriter, ms, update) + ok, err := storage.MVCCResolveWriteIntent(ctx, readWriter, ms, update, asyncResolution) if err != nil { return err } @@ -504,7 +506,7 @@ func resolveLocalLocks( if inSpan != nil { update.Span = *inSpan num, resumeSpan, err := storage.MVCCResolveWriteIntentRange( - ctx, readWriter, ms, update, resolveAllowance, onlySeparatedIntents) + ctx, readWriter, ms, update, resolveAllowance, asyncResolution, onlySeparatedIntents) if err != nil { return err } diff --git a/pkg/kv/kvserver/batcheval/cmd_refresh_range_test.go b/pkg/kv/kvserver/batcheval/cmd_refresh_range_test.go index 6c673c2fea34..7d0ca667eda7 100644 --- a/pkg/kv/kvserver/batcheval/cmd_refresh_range_test.go +++ b/pkg/kv/kvserver/batcheval/cmd_refresh_range_test.go @@ -86,7 +86,7 @@ func TestRefreshRangeTimeBoundIterator(t *testing.T) { // would not have any timestamp bounds and would be selected for every read. intent := roachpb.MakeLockUpdate(txn, roachpb.Span{Key: k}) intent.Status = roachpb.COMMITTED - if _, err := storage.MVCCResolveWriteIntent(ctx, db, nil, intent); err != nil { + if _, err := storage.MVCCResolveWriteIntent(ctx, db, nil, intent, false); err != nil { t.Fatal(err) } if err := storage.MVCCPut(ctx, db, nil, roachpb.Key("unused2"), ts1, v, nil); err != nil { diff --git a/pkg/kv/kvserver/batcheval/cmd_resolve_intent.go b/pkg/kv/kvserver/batcheval/cmd_resolve_intent.go index 63bb90c25fb8..51aa6c1e869f 100644 --- a/pkg/kv/kvserver/batcheval/cmd_resolve_intent.go +++ b/pkg/kv/kvserver/batcheval/cmd_resolve_intent.go @@ -84,7 +84,10 @@ func ResolveIntent( } update := args.AsLockUpdate() - ok, err := storage.MVCCResolveWriteIntent(ctx, readWriter, ms, update) + // This intent resolution operation is asynchronous with its transaction and + // may be occurring after the transaction has already committed. + const asyncResolution = true + ok, err := storage.MVCCResolveWriteIntent(ctx, readWriter, ms, update, asyncResolution) if err != nil { return result.Result{}, err } diff --git a/pkg/kv/kvserver/batcheval/cmd_resolve_intent_range.go b/pkg/kv/kvserver/batcheval/cmd_resolve_intent_range.go index 945046ab9382..f877ae39ddf7 100644 --- a/pkg/kv/kvserver/batcheval/cmd_resolve_intent_range.go +++ b/pkg/kv/kvserver/batcheval/cmd_resolve_intent_range.go @@ -45,11 +45,14 @@ func ResolveIntentRange( update := args.AsLockUpdate() + // This intent resolution operation is asynchronous with its transaction and + // may be occurring after the transaction has already committed. + const asyncResolution = true onlySeparatedIntents := cArgs.EvalCtx.ClusterSettings().Version.ActiveVersionOrEmpty(ctx).IsActive( clusterversion.PostSeparatedIntentsMigration) numKeys, resumeSpan, err := storage.MVCCResolveWriteIntentRange( - ctx, readWriter, ms, update, h.MaxSpanRequestKeys, onlySeparatedIntents) + ctx, readWriter, ms, update, h.MaxSpanRequestKeys, asyncResolution, onlySeparatedIntents) if err != nil { return result.Result{}, err } diff --git a/pkg/kv/kvserver/below_raft_protos_test.go b/pkg/kv/kvserver/below_raft_protos_test.go index 86c91b7fc4bb..8c26ac23603f 100644 --- a/pkg/kv/kvserver/below_raft_protos_test.go +++ b/pkg/kv/kvserver/below_raft_protos_test.go @@ -72,7 +72,7 @@ var belowRaftGoldenProtos = map[reflect.Type]fixture{ return m }, emptySum: 7551962144604783939, - populatedSum: 6784975417727259950, + populatedSum: 4745626903798922387, }, reflect.TypeOf(&enginepb.RangeAppliedState{}): { populatedConstructor: func(r *rand.Rand) protoutil.Message { diff --git a/pkg/kv/kvserver/client_replica_test.go b/pkg/kv/kvserver/client_replica_test.go index 5c010e567ad9..9fd44bee9c3e 100644 --- a/pkg/kv/kvserver/client_replica_test.go +++ b/pkg/kv/kvserver/client_replica_test.go @@ -577,6 +577,244 @@ func TestTxnReadWithinUncertaintyInterval(t *testing.T) { }) } +// TestTxnReadWithinUncertaintyIntervalAfterIntentResolution tests cases where a +// reader transaction observes a committed value that was committed before the +// reader began, but that was resolved after the reader began. The test ensures +// that even if the reader has collected an observed timestamp from the node +// that holds the intent, and even if this observed timestamp is less than the +// timestamp that the intent is eventually committed at, the reader still +// considers the value to be in its uncertainty interval. Not doing so could +// allow for stale read, which would be a violation of linearizability. +// +// This is a regression test for #36431. Before this issue was addressed, +// it was possible for the following series of events to lead to a stale +// read: +// - txn W is coordinated by node B. It lays down an intent on node A (key k) at +// ts 95. +// - txn W gets pushed to ts 105 (taken from B's clock). It refreshes +// successfully and commits at 105. Node A's clock is at, say, 100; this is +// within clock offset bounds. +// - after all this, txn R starts on node A. It gets assigned ts 100. The txn +// has no uncertainty for node A. +// - txn W's async intent resolution comes around and resolves the intent on +// node A, moving the value fwd from ts 95 to 105. +// - txn R reads key k and doesn't see anything. There's a value at 105, but the +// txn have no uncertainty due to an observed timestamp. This is a stale read. +// +// The test's rangedResolution parameter dictates whether the intent is +// asynchronously resolved using point or ranged intent resolution. +// +// The test's movedWhilePending parameter dictates whether the intent is moved +// to a higher timestamp first by a PENDING intent resolution and then COMMITTED +// at that same timestamp, or whether it is moved to a higher timestamp at the +// same time as it is COMMITTED. +// +func TestTxnReadWithinUncertaintyIntervalAfterIntentResolution(t *testing.T) { + defer leaktest.AfterTest(t)() + defer log.Scope(t).Close(t) + + testutils.RunTrueAndFalse(t, "rangedResolution", func(t *testing.T, rangedResolution bool) { + testutils.RunTrueAndFalse(t, "movedWhilePending", func(t *testing.T, movedWhilePending bool) { + testTxnReadWithinUncertaintyIntervalAfterIntentResolution(t, rangedResolution, movedWhilePending) + }) + }) +} + +func testTxnReadWithinUncertaintyIntervalAfterIntentResolution( + t *testing.T, rangedResolution, movedWhilePending bool, +) { + const numNodes = 2 + var manuals []*hlc.HybridManualClock + var clocks []*hlc.Clock + for i := 0; i < numNodes; i++ { + manuals = append(manuals, hlc.NewHybridManualClock()) + } + serverArgs := make(map[int]base.TestServerArgs) + for i := 0; i < numNodes; i++ { + serverArgs[i] = base.TestServerArgs{ + Knobs: base.TestingKnobs{ + Server: &server.TestingKnobs{ + ClockSource: manuals[i].UnixNano, + }, + Store: &kvserver.StoreTestingKnobs{ + IntentResolverKnobs: kvserverbase.IntentResolverTestingKnobs{ + // Disable async intent resolution, so that the test can carefully + // control when intent resolution occurs. + DisableAsyncIntentResolution: true, + }, + }, + }, + } + } + ctx := context.Background() + tc := testcluster.StartTestCluster(t, numNodes, base.TestClusterArgs{ + ReplicationMode: base.ReplicationManual, + ServerArgsPerNode: serverArgs, + }) + defer tc.Stopper().Stop(ctx) + + // Split off two scratch ranges. + keyA, keyB := roachpb.Key("a"), roachpb.Key("b") + tc.SplitRangeOrFatal(t, keyA) + _, keyBDesc := tc.SplitRangeOrFatal(t, keyB) + // Place key A's sole replica on node 1 and key B's sole replica on node 2. + tc.AddVotersOrFatal(t, keyB, tc.Target(1)) + tc.TransferRangeLeaseOrFatal(t, keyBDesc, tc.Target(1)) + tc.RemoveVotersOrFatal(t, keyB, tc.Target(0)) + + // Pause the servers' clocks going forward. + var maxNanos int64 + for i, m := range manuals { + m.Pause() + if cur := m.UnixNano(); cur > maxNanos { + maxNanos = cur + } + clocks = append(clocks, tc.Servers[i].Clock()) + } + // After doing so, perfectly synchronize them. + for _, m := range manuals { + m.Increment(maxNanos - m.UnixNano()) + } + + // Create a new writer transaction. + maxOffset := clocks[0].MaxOffset().Nanoseconds() + require.NotZero(t, maxOffset) + writerTxn := roachpb.MakeTransaction("test_writer", keyA, 1, clocks[0].Now(), maxOffset) + + // Write to key A and key B in the writer transaction. + for _, key := range []roachpb.Key{keyA, keyB} { + put := putArgs(key, []byte("val")) + resp, pErr := kv.SendWrappedWith(ctx, tc.Servers[0].DistSender(), roachpb.Header{Txn: &writerTxn}, put) + require.Nil(t, pErr) + writerTxn.Update(resp.Header().Txn) + } + + // Move the clock on just the first server and bump the transaction commit + // timestamp to this value. The clock on the second server will trail behind. + manuals[0].Increment(100) + require.True(t, writerTxn.WriteTimestamp.Forward(clocks[0].Now())) + + // Refresh the writer transaction's timestamp. + writerTxn.ReadTimestamp.Forward(writerTxn.WriteTimestamp) + + // Commit the writer transaction. Key A will be synchronously resolved because + // it is on the same range as the transaction record. However, key B will be + // handed to the IntentResolver for asynchronous resolution. Because we + // disabled async resolution, it will not be resolved yet. + et, etH := endTxnArgs(&writerTxn, true /* commit */) + et.LockSpans = []roachpb.Span{ + {Key: keyA}, {Key: keyB}, + } + if rangedResolution { + for i := range et.LockSpans { + et.LockSpans[i].EndKey = et.LockSpans[i].Key.Next() + } + } + etResp, pErr := kv.SendWrappedWith(ctx, tc.Servers[0].DistSender(), etH, et) + require.Nil(t, pErr) + writerTxn.Update(etResp.Header().Txn) + + // Create a new reader transaction. The reader uses the second server as a + // gateway, so its initial read timestamp actually trails the commit timestamp + // of the writer transaction due to clock skew between the two servers. This + // is the classic case where the reader's uncertainty interval is needed to + // avoid stale reads. Remember that the reader transaction began after the + // writer transaction committed and received an ack, so it must observe the + // writer's writes if it is to respect real-time ordering. + readerTxn := roachpb.MakeTransaction("test_reader", keyA, 1, clocks[1].Now(), maxOffset) + require.True(t, readerTxn.ReadTimestamp.Less(writerTxn.WriteTimestamp)) + require.False(t, readerTxn.GlobalUncertaintyLimit.Less(writerTxn.WriteTimestamp)) + + // Collect an observed timestamp from each of the nodes. We read the key + // following each of written keys to avoid conflicting with read values. + // + // NOTE: this wasn't even a necessary step to hit #36431, because new + // transactions are always an observed timestamp from their own gateway node. + for i, key := range []roachpb.Key{keyA, keyB} { + get := getArgs(key.Next()) + resp, pErr := kv.SendWrappedWith(ctx, tc.Servers[0].DistSender(), roachpb.Header{Txn: &readerTxn}, get) + require.Nil(t, pErr) + require.Nil(t, resp.(*roachpb.GetResponse).Value) + readerTxn.Update(resp.Header().Txn) + require.Len(t, readerTxn.ObservedTimestamps, i+1) + } + + // Resolve the intent on key B. + { + resolveIntentArgs := func(status roachpb.TransactionStatus) roachpb.Request { + if rangedResolution { + return &roachpb.ResolveIntentRangeRequest{ + RequestHeader: roachpb.RequestHeader{Key: keyB, EndKey: keyB.Next()}, + IntentTxn: writerTxn.TxnMeta, + Status: status, + } + } else { + return &roachpb.ResolveIntentRequest{ + RequestHeader: roachpb.RequestHeader{Key: keyB}, + IntentTxn: writerTxn.TxnMeta, + Status: status, + } + } + } + + if movedWhilePending { + // First change the intent's timestamp without committing it. This + // exercises the case where the intent's timestamp is moved forward by a + // PENDING intent resolution request and kept the same when the intent is + // eventually COMMITTED. This PENDING intent resolution may still be + // evaluated after the transaction commit has been acknowledged in + // real-time, so it still needs to lead to the committed value being + // marked as synthetic. This works by recording the written timestamp into + // the intent when it is originally moved and then consulting this written + // value when later committing the intent. + // + // For instance, consider the following timeline: + // + // 1. txn W writes intent on key A @ time 10 + // 2. txn W writes intent on key B @ time 10 + // 3. high priority reader @ 15 reads key B + // 4. high priority reader pushes txn W to time 15 + // 5. txn W commits @ 15 and resolves key A synchronously + // 6. txn R begins and collects observed timestamp from key B's node @ + // time 11 + // 7. high priority reader moves intent on key B to time 15 + // 8. async intent resolution commits intent on key B, still @ time 15 + // 9. txn R reads key B with read ts 11, observed ts 11, and uncertainty + // interval [11, 21]. If step 7 did not mark the intent's timestamp as + // synthetic when changing its timestamp, step 8 also would not do so + // because it isn't changing the intent's timestamp, so txn R could + // use its observed timestamp to avoid an uncertainty error, leading + // to a stale read. + // + resolve := resolveIntentArgs(roachpb.PENDING) + _, pErr = kv.SendWrapped(ctx, tc.Servers[0].DistSender(), resolve) + require.Nil(t, pErr) + } + + // Resolve the committed value on key B to COMMITTED. + resolve := resolveIntentArgs(roachpb.COMMITTED) + _, pErr = kv.SendWrapped(ctx, tc.Servers[0].DistSender(), resolve) + require.Nil(t, pErr) + } + + // Read key A and B in the reader transaction. Both should produce + // ReadWithinUncertaintyIntervalErrors. + for _, key := range []roachpb.Key{keyA, keyB} { + get := getArgs(key) + _, pErr := kv.SendWrappedWith(ctx, tc.Servers[0].DistSender(), roachpb.Header{Txn: &readerTxn}, get) + require.NotNil(t, pErr) + var rwuiErr *roachpb.ReadWithinUncertaintyIntervalError + require.True(t, errors.As(pErr.GetDetail(), &rwuiErr)) + require.Equal(t, readerTxn.ReadTimestamp, rwuiErr.ReadTimestamp) + require.Equal(t, readerTxn.GlobalUncertaintyLimit, rwuiErr.GlobalUncertaintyLimit) + require.Equal(t, readerTxn.ObservedTimestamps, rwuiErr.ObservedTimestamps) + // Only the value on key B should be marked as synthetic, because only the + // intent on key B was resolved asynchronously from its transaction commit. + expSyn := key.Equal(keyB) + require.Equal(t, writerTxn.WriteTimestamp.WithSynthetic(expSyn), rwuiErr.ExistingTimestamp) + } +} + // TestRangeLookupUseReverse tests whether the results and the results count // are correct when scanning in reverse order. func TestRangeLookupUseReverse(t *testing.T) { diff --git a/pkg/kv/kvserver/client_test.go b/pkg/kv/kvserver/client_test.go index 54405929f758..a300b9dcdc4b 100644 --- a/pkg/kv/kvserver/client_test.go +++ b/pkg/kv/kvserver/client_test.go @@ -102,6 +102,15 @@ func heartbeatArgs( }, roachpb.Header{Txn: txn} } +func endTxnArgs(txn *roachpb.Transaction, commit bool) (*roachpb.EndTxnRequest, roachpb.Header) { + return &roachpb.EndTxnRequest{ + RequestHeader: roachpb.RequestHeader{ + Key: txn.Key, // not allowed when going through TxnCoordSender, but we're not + }, + Commit: commit, + }, roachpb.Header{Txn: txn} +} + func pushTxnArgs( pusher, pushee *roachpb.Transaction, pushType roachpb.PushTxnType, ) *roachpb.PushTxnRequest { diff --git a/pkg/kv/kvserver/observedts/doc.go b/pkg/kv/kvserver/observedts/doc.go index d049499f5938..a62afae75303 100644 --- a/pkg/kv/kvserver/observedts/doc.go +++ b/pkg/kv/kvserver/observedts/doc.go @@ -112,14 +112,7 @@ var D4 = (&roachpb.Transaction{}).UpdateObservedTimestamp // // There are two invariants necessary for this property to hold: // 1. a leaseholder's clock must always be equal to or greater than the timestamp -// of all writes that it has served. This is trivial to enforce for -// non-transactional writes. It is more complicated for transactional writes -// which may move their commit timestamp forward over their lifetime before -// committing, even after writing intents on remote Ranges. To accommodate -// this situation, transactions ensure that at the time of their commit, any -// leaseholder for a Range that contains one of its intent has an HLC clock -// with an equal or greater timestamp than the transaction's commit timestamp. -// TODO(nvanbenschoten): This is violated by txn refreshes. See #36431. +// of all writes that it has served. // 2. a leaseholder's clock must always be equal to or greater than the timestamp // of all writes that previous leaseholders for its Range have served. We // enforce that when a Replica acquires a lease it bumps its node's clock to a @@ -130,6 +123,66 @@ var D4 = (&roachpb.Transaction{}).UpdateObservedTimestamp // invariant holds for all leaseholders, given that a Range's initial // leaseholder assumes responsibility for an empty range with no writes. // +// Unfortunately, this first invariant does not always hold. It does hold for +// non-transactional writes. It also holds for transactions that perform all of +// their intent writes at the same timestamp and then commit at this timestamp. +// However, it does not hold for transactions which move their commit timestamp +// forward over their lifetime before committing, writing intents at different +// timestamps along the way and "pulling them up" to the commit timestamp after +// committing. +// +// In violating invariant 1, this third case reveals an ambiguity in what it +// means for a leaseholder to "serve a write at a timestamp". The meaning of +// this phrase is straightforward for non-transactional writes. However, for an +// intent write whose original timestamp is provisional and whose eventual +// commit timestamp is stored indirectly in its transaction record at its time +// of commit, the meaning is less clear. This reconciliation to move the intent +// write's timestamp up to its transaction's commit timestamp is asynchronous +// from the transaction commit (and after it has been externally acknowledged). +// So even if a leaseholder has only served writes with provisional timestamps +// up to timestamp 100 (placing a lower bound on its clock of 100), it can be in +// possession of intents that, when resolved, will carry a timestamp of 200. To +// uphold the real-time ordering property, this value must be observed by any +// transaction that begins after the value's transaction committed and was +// acknowledged. So for observed timestamps to be correct as currently written, +// we would need a guarantee that this value's leaseholder would never return an +// observed timestamp < 200 at any point after the transaction commits. But with +// the transaction commit possibly occurring on another node and with +// communication to resolve the intent occurring asynchronously, this seems like +// an impossible guarantee to make. +// +// This would appear to undermine observed timestamps to the point where they +// cannot be used. However, we can claw back some utility by recognizing that +// only a small fraction of transactions commit at a different timestamps than +// the one they used while writing intents. We can also recognize that if we +// were to compare observed timestamps against the timestamp that a committed +// value was originally written (its provisional value if it was once an intent) +// instead of the timestamp that it had been moved to on commit, then invariant +// 1 would hold. +// +// We currently do not take full advantage of this second observation, because +// we do not retain "written timestamps" in committed values. Instead, we do +// something less optimal but cheaper and more convenient. Any intent whose +// timestamp is changed during asynchronous intent resolution is marked as +// "synthetic". Doing so is a compressed way of saying that the value could have +// originally been written as an intent at any time (even min_timestamp) and so +// observed timestamps cannot be used to limit uncertainty by pulling a read's +// uncertainty interval below the value's timestamp. +// +// As a result of this refinement, the following property does properly hold: +// +// Any writes that the transaction may later see written by leaseholders on +// this node at higher timestamps than the observed timestamp *and that are +// not marked as synthetic* could not have taken place causally before this +// transaction and can be ignored for the purposes of uncertainty. +// +// This application of the synthetic bit prevents observed timestamps from being +// used to avoid uncertainty restarts with the set of committed values whose +// timestamps do not reflect their original write time and therefore do not make +// a claim about the clock of their leaseholder at the time that they were +// committed. It does not change the interaction between observed timestamps and +// any other committed value. +// // Usage // // The property ensures that when this list holds a corresponding entry for the diff --git a/pkg/storage/bench_test.go b/pkg/storage/bench_test.go index 6954d4588304..37d645188677 100644 --- a/pkg/storage/bench_test.go +++ b/pkg/storage/bench_test.go @@ -261,7 +261,7 @@ func setupKeysWithIntent( // is not one that should be resolved. continue } - found, err := MVCCResolveWriteIntent(context.Background(), batch, nil, lu) + found, err := MVCCResolveWriteIntent(context.Background(), batch, nil, lu, false) require.Equal(b, true, found) require.NoError(b, err) } @@ -494,7 +494,7 @@ func BenchmarkIntentResolution(b *testing.B) { b.StartTimer() } lockUpdate.Key = keys[i%numIntentKeys] - found, err := MVCCResolveWriteIntent(context.Background(), batch, nil, lockUpdate) + found, err := MVCCResolveWriteIntent(context.Background(), batch, nil, lockUpdate, false) if !found || err != nil { b.Fatalf("intent not found or err %s", err) } @@ -559,7 +559,7 @@ func BenchmarkIntentRangeResolution(b *testing.B) { lockUpdate.Key = keys[rangeNum*numKeysPerRange] lockUpdate.EndKey = keys[(rangeNum+1)*numKeysPerRange] resolved, span, err := MVCCResolveWriteIntentRange( - context.Background(), batch, nil, lockUpdate, 1000 /* max */, sep) + context.Background(), batch, nil, lockUpdate, 1000 /* max */, false, sep) if err != nil { b.Fatal(err) } @@ -776,7 +776,7 @@ func setupMVCCData( Span: roachpb.Span{Key: key}, Status: roachpb.COMMITTED, Txn: txnMeta, - }); err != nil { + }, false); err != nil { b.Fatal(err) } } diff --git a/pkg/storage/metamorphic/operations.go b/pkg/storage/metamorphic/operations.go index 6b047945bd88..4686f107ddad 100644 --- a/pkg/storage/metamorphic/operations.go +++ b/pkg/storage/metamorphic/operations.go @@ -432,7 +432,7 @@ func (t txnCommitOp) run(ctx context.Context) string { for _, span := range txn.LockSpans { intent := roachpb.MakeLockUpdate(txn, span) intent.Status = roachpb.COMMITTED - _, err := storage.MVCCResolveWriteIntent(context.TODO(), t.m.engine, nil, intent) + _, err := storage.MVCCResolveWriteIntent(context.TODO(), t.m.engine, nil, intent, false) if err != nil { panic(err) } @@ -455,7 +455,7 @@ func (t txnAbortOp) run(ctx context.Context) string { for _, span := range txn.LockSpans { intent := roachpb.MakeLockUpdate(txn, span) intent.Status = roachpb.ABORTED - _, err := storage.MVCCResolveWriteIntent(context.TODO(), t.m.engine, nil, intent) + _, err := storage.MVCCResolveWriteIntent(context.TODO(), t.m.engine, nil, intent, false /* asyncResolution */) if err != nil { panic(err) } diff --git a/pkg/storage/mvcc.go b/pkg/storage/mvcc.go index f0d8989ad981..3cdd7a719842 100644 --- a/pkg/storage/mvcc.go +++ b/pkg/storage/mvcc.go @@ -2798,8 +2798,15 @@ func MVCCIterate( // MVCCResolveWriteIntent either commits, aborts (rolls back), or moves forward // in time an extant write intent for a given txn according to commit parameter. -// ResolveWriteIntent will skip write intents of other txns. It returns -// whether or not an intent was found to resolve. +// ResolveWriteIntent will skip write intents of other txns. It returns whether +// or not an intent was found to resolve. +// +// The function's asyncResolution parameter indicates whether intent resolution +// is being performed asynchronously from the corresponding transaction's +// commit. It must be true if the resolution could be taking place after the +// transaction has committed and acknowledged that commit to its client. It must +// be false if the resolution is synchronous with the transaction commit. The +// parameter is ignored if the intent is not being committed. // // Transaction epochs deserve a bit of explanation. The epoch for a // transaction is incremented on transaction retries. A transaction @@ -2817,7 +2824,11 @@ func MVCCIterate( // epoch matching the commit epoch), and which intents get aborted, // even if the transaction succeeds. func MVCCResolveWriteIntent( - ctx context.Context, rw ReadWriter, ms *enginepb.MVCCStats, intent roachpb.LockUpdate, + ctx context.Context, + rw ReadWriter, + ms *enginepb.MVCCStats, + intent roachpb.LockUpdate, + asyncResolution bool, ) (bool, error) { if len(intent.Key) == 0 { return false, emptyKeyError() @@ -2828,7 +2839,7 @@ func MVCCResolveWriteIntent( iterAndBuf := GetBufUsingIter(rw.NewMVCCIterator(MVCCKeyAndIntentsIterKind, IterOptions{Prefix: true})) iterAndBuf.iter.SeekIntentGE(intent.Key, intent.Txn.ID) - ok, err := mvccResolveWriteIntent(ctx, rw, iterAndBuf.iter, ms, intent, iterAndBuf.buf) + ok, err := mvccResolveWriteIntent(ctx, rw, iterAndBuf.iter, ms, intent, asyncResolution, iterAndBuf.buf) // Using defer would be more convenient, but it is measurably slower. iterAndBuf.Cleanup() return ok, err @@ -3113,6 +3124,7 @@ func mvccResolveWriteIntent( iter iterForKeyVersions, ms *enginepb.MVCCStats, intent roachpb.LockUpdate, + asyncResolution bool, buf *putBuffer, ) (bool, error) { metaKey := MakeMVCCMetadataKey(intent.Key) @@ -3250,6 +3262,29 @@ func mvccResolveWriteIntent( // getting pushed. newTimestamp := intent.Txn.WriteTimestamp + // If the intent is being resolved asynchronously from its commit and its + // timestamp is either being changed now or has changed due to a push since + // the last time that its transaction wrote it (as indicated by the presence + // of a WrittenTimestamp), then we mark the timestamp of the committed value + // as synthetic. This indicates to readers that the value's timestamp was + // disconnected from the clock on the value's leaseholder and that observed + // timestamps from that leaseholder cannot be used to avoid uncertainty. + // + // This is a form of compression that avoids the need for us to store the + // intent's written timestamp in its committed value (in cases where it + // differs from the commit timestamp). If we were to do that, then we could + // apply observed timestamps to this written timestamp and still avoid + // uncertainty in some cases. However, this would require us to store two + // timestamps in some committed values. To avoid this cost and complexity, + // we instead use the synthetic bit as a blunt but convenient instrument to + // disable all use of observed timestamps with this value. + // + // See commentary in pkg/kv/kvserver/observedts/doc.go for more details. + if commit && asyncResolution && (timestampChanged || meta.WrittenTimestamp != nil) { + newTimestamp.Synthetic = true + timestampChanged = newTimestamp != metaTimestamp // recompute + } + // Assert that the intent timestamp never regresses. The logic above should // not allow this, regardless of the input to this function. if newTimestamp.Less(metaTimestamp) { @@ -3548,6 +3583,7 @@ func MVCCResolveWriteIntentRange( ms *enginepb.MVCCStats, intent roachpb.LockUpdate, max int64, + asyncResolution bool, onlySeparatedIntents bool, ) (int64, *roachpb.Span, error) { if max < 0 { @@ -3663,7 +3699,7 @@ func MVCCResolveWriteIntentRange( if !key.IsValue() { // NB: This if-condition is always true for the sepIter != nil path. intent.Key = key.Key - ok, err = mvccResolveWriteIntent(ctx, rw, iter, ms, intent, putBuf) + ok, err = mvccResolveWriteIntent(ctx, rw, iter, ms, intent, asyncResolution, putBuf) } if err != nil { log.Warningf(ctx, "failed to resolve intent for key %q: %+v", key.Key, err) diff --git a/pkg/storage/mvcc_history_test.go b/pkg/storage/mvcc_history_test.go index b3c3f8d5b914..8b5e456355e4 100644 --- a/pkg/storage/mvcc_history_test.go +++ b/pkg/storage/mvcc_history_test.go @@ -48,7 +48,7 @@ import ( // txn_advance t= ts=[,] // txn_status t= status= // -// resolve_intent t= k= [status=] +// resolve_intent t= k= [status=] [asyncResolution] // check_intent k= [none] // // cput [t=] [ts=[,]] [resolve [status=]] k= v= [raw] [cond=] @@ -576,15 +576,20 @@ func cmdResolveIntent(e *evalCtx) error { txn := e.getTxn(mandatory) key := e.getKey() status := e.getTxnStatus() - return e.resolveIntent(e.tryWrapForIntentPrinting(e.engine), key, txn, status) + asyncResolution := e.hasArg("asyncResolution") + return e.resolveIntent(e.tryWrapForIntentPrinting(e.engine), key, txn, status, asyncResolution) } func (e *evalCtx) resolveIntent( - rw ReadWriter, key roachpb.Key, txn *roachpb.Transaction, resolveStatus roachpb.TransactionStatus, + rw ReadWriter, + key roachpb.Key, + txn *roachpb.Transaction, + resolveStatus roachpb.TransactionStatus, + asyncResolution bool, ) error { intent := roachpb.MakeLockUpdate(txn, roachpb.Span{Key: key}) intent.Status = resolveStatus - _, err := MVCCResolveWriteIntent(e.ctx, rw, nil, intent) + _, err := MVCCResolveWriteIntent(e.ctx, rw, nil, intent, asyncResolution) return err } @@ -640,7 +645,7 @@ func cmdCPut(e *evalCtx) error { return err } if resolve { - return e.resolveIntent(rw, key, txn, resolveStatus) + return e.resolveIntent(rw, key, txn, resolveStatus, false /* asyncResolution */) } return nil }) @@ -656,7 +661,7 @@ func cmdDelete(e *evalCtx) error { return err } if resolve { - return e.resolveIntent(rw, key, txn, resolveStatus) + return e.resolveIntent(rw, key, txn, resolveStatus, false /* asyncResolution */) } return nil }) @@ -687,7 +692,7 @@ func cmdDeleteRange(e *evalCtx) error { } if resolve { - return e.resolveIntent(rw, key, txn, resolveStatus) + return e.resolveIntent(rw, key, txn, resolveStatus, false /* asyncResolution */) } return nil }) @@ -747,7 +752,7 @@ func cmdIncrement(e *evalCtx) error { } e.results.buf.Printf("inc: current value = %d\n", curVal) if resolve { - return e.resolveIntent(rw, key, txn, resolveStatus) + return e.resolveIntent(rw, key, txn, resolveStatus, false /* asyncResolution */) } return nil }) @@ -783,7 +788,7 @@ func cmdPut(e *evalCtx) error { return err } if resolve { - return e.resolveIntent(rw, key, txn, resolveStatus) + return e.resolveIntent(rw, key, txn, resolveStatus, false /* asyncResolution */) } return nil }) diff --git a/pkg/storage/mvcc_incremental_iterator_test.go b/pkg/storage/mvcc_incremental_iterator_test.go index b1a397281a24..f3882d827ab1 100644 --- a/pkg/storage/mvcc_incremental_iterator_test.go +++ b/pkg/storage/mvcc_incremental_iterator_test.go @@ -891,12 +891,12 @@ func TestMVCCIncrementalIterator(t *testing.T) { intent1 := roachpb.MakeLockUpdate(&txn1, roachpb.Span{Key: testKey1}) intent1.Status = roachpb.COMMITTED - if _, err := MVCCResolveWriteIntent(ctx, e, nil, intent1); err != nil { + if _, err := MVCCResolveWriteIntent(ctx, e, nil, intent1, false); err != nil { t.Fatal(err) } intent2 := roachpb.MakeLockUpdate(&txn2, roachpb.Span{Key: testKey2}) intent2.Status = roachpb.ABORTED - if _, err := MVCCResolveWriteIntent(ctx, e, nil, intent2); err != nil { + if _, err := MVCCResolveWriteIntent(ctx, e, nil, intent2, false); err != nil { t.Fatal(err) } t.Run("intents-resolved", assertEqualKVs(e, localMax, keyMax, tsMin, tsMax, latest, kvs(kv1_4_4, kv2_2_2))) @@ -960,12 +960,12 @@ func TestMVCCIncrementalIterator(t *testing.T) { intent1 := roachpb.MakeLockUpdate(&txn1, roachpb.Span{Key: testKey1}) intent1.Status = roachpb.COMMITTED - if _, err := MVCCResolveWriteIntent(ctx, e, nil, intent1); err != nil { + if _, err := MVCCResolveWriteIntent(ctx, e, nil, intent1, false); err != nil { t.Fatal(err) } intent2 := roachpb.MakeLockUpdate(&txn2, roachpb.Span{Key: testKey2}) intent2.Status = roachpb.ABORTED - if _, err := MVCCResolveWriteIntent(ctx, e, nil, intent2); err != nil { + if _, err := MVCCResolveWriteIntent(ctx, e, nil, intent2, false); err != nil { t.Fatal(err) } t.Run("intents-resolved", assertEqualKVs(e, localMax, keyMax, tsMin, tsMax, all, kvs(kv1_4_4, kv1Deleted3, kv1_2_2, kv1_1_1, kv2_2_2))) @@ -1147,9 +1147,9 @@ func TestMVCCIncrementalIteratorIntentDeletion(t *testing.T) { require.NoError(t, MVCCPut(ctx, db, nil, kC, txnC1.ReadTimestamp, vC1, txnC1)) require.NoError(t, db.Flush()) require.NoError(t, db.Compact()) - _, err := MVCCResolveWriteIntent(ctx, db, nil, intent(txnA1)) + _, err := MVCCResolveWriteIntent(ctx, db, nil, intent(txnA1), false) require.NoError(t, err) - _, err = MVCCResolveWriteIntent(ctx, db, nil, intent(txnB1)) + _, err = MVCCResolveWriteIntent(ctx, db, nil, intent(txnB1), false) require.NoError(t, err) require.NoError(t, MVCCPut(ctx, db, nil, kA, ts2, vA2, nil)) require.NoError(t, MVCCPut(ctx, db, nil, kA, txnA3.WriteTimestamp, vA3, txnA3)) diff --git a/pkg/storage/mvcc_logical_ops_test.go b/pkg/storage/mvcc_logical_ops_test.go index 42cc1c096d06..5cc013f7b8f1 100644 --- a/pkg/storage/mvcc_logical_ops_test.go +++ b/pkg/storage/mvcc_logical_ops_test.go @@ -80,14 +80,14 @@ func TestMVCCOpLogWriter(t *testing.T) { roachpb.MakeLockUpdate( &txn1CommitTS, roachpb.Span{Key: testKey1, EndKey: testKey2.Next()}), - math.MaxInt64, onlySeparatedIntents); err != nil { + math.MaxInt64, false, onlySeparatedIntents); err != nil { t.Fatal(err) } if _, _, err := MVCCResolveWriteIntentRange(ctx, ol, nil, roachpb.MakeLockUpdate( &txn1CommitTS, roachpb.Span{Key: localKey, EndKey: localKey.Next()}), - math.MaxInt64, onlySeparatedIntents); err != nil { + math.MaxInt64, false, onlySeparatedIntents); err != nil { t.Fatal(err) } @@ -99,14 +99,14 @@ func TestMVCCOpLogWriter(t *testing.T) { txn2Pushed := *txn2 txn2Pushed.WriteTimestamp = hlc.Timestamp{Logical: 6} if _, err := MVCCResolveWriteIntent(ctx, ol, nil, - roachpb.MakeLockUpdate(&txn2Pushed, roachpb.Span{Key: testKey3}), + roachpb.MakeLockUpdate(&txn2Pushed, roachpb.Span{Key: testKey3}), false, ); err != nil { t.Fatal(err) } txn2Abort := txn2Pushed txn2Abort.Status = roachpb.ABORTED if _, err := MVCCResolveWriteIntent(ctx, ol, nil, - roachpb.MakeLockUpdate(&txn2Abort, roachpb.Span{Key: testKey3}), + roachpb.MakeLockUpdate(&txn2Abort, roachpb.Span{Key: testKey3}), false, ); err != nil { t.Fatal(err) } diff --git a/pkg/storage/mvcc_stats_test.go b/pkg/storage/mvcc_stats_test.go index b457a3a7c437..cfa1dd890991 100644 --- a/pkg/storage/mvcc_stats_test.go +++ b/pkg/storage/mvcc_stats_test.go @@ -132,7 +132,7 @@ func TestMVCCStatsDeleteCommitMovesTimestamp(t *testing.T) { txn.Status = roachpb.COMMITTED txn.WriteTimestamp.Forward(ts4) if _, err := MVCCResolveWriteIntent(ctx, engine, aggMS, - roachpb.MakeLockUpdate(txn, roachpb.Span{Key: key}), + roachpb.MakeLockUpdate(txn, roachpb.Span{Key: key}), false, ); err != nil { t.Fatal(err) } @@ -219,7 +219,7 @@ func TestMVCCStatsPutCommitMovesTimestamp(t *testing.T) { txn.Status = roachpb.COMMITTED txn.WriteTimestamp.Forward(ts4) if _, err := MVCCResolveWriteIntent(ctx, engine, aggMS, - roachpb.MakeLockUpdate(txn, roachpb.Span{Key: key}), + roachpb.MakeLockUpdate(txn, roachpb.Span{Key: key}), false, ); err != nil { t.Fatal(err) } @@ -304,7 +304,7 @@ func TestMVCCStatsPutPushMovesTimestamp(t *testing.T) { ts4 := hlc.Timestamp{WallTime: 4 * 1e9} txn.WriteTimestamp.Forward(ts4) if _, err := MVCCResolveWriteIntent(ctx, engine, aggMS, - roachpb.MakeLockUpdate(txn, roachpb.Span{Key: key}), + roachpb.MakeLockUpdate(txn, roachpb.Span{Key: key}), false, ); err != nil { t.Fatal(err) } @@ -664,7 +664,7 @@ func TestMVCCStatsDelDelCommitMovesTimestamp(t *testing.T) { txnCommit.Status = roachpb.COMMITTED txnCommit.WriteTimestamp.Forward(ts3) if _, err := MVCCResolveWriteIntent(ctx, engine, &aggMS, - roachpb.MakeLockUpdate(txnCommit, roachpb.Span{Key: key}), + roachpb.MakeLockUpdate(txnCommit, roachpb.Span{Key: key}), false, ); err != nil { t.Fatal(err) } @@ -693,7 +693,7 @@ func TestMVCCStatsDelDelCommitMovesTimestamp(t *testing.T) { txnAbort.Status = roachpb.ABORTED txnAbort.WriteTimestamp.Forward(ts3) if _, err := MVCCResolveWriteIntent(ctx, engine, &aggMS, - roachpb.MakeLockUpdate(txnAbort, roachpb.Span{Key: key}), + roachpb.MakeLockUpdate(txnAbort, roachpb.Span{Key: key}), false, ); err != nil { t.Fatal(err) } @@ -830,7 +830,7 @@ func TestMVCCStatsPutDelPutMovesTimestamp(t *testing.T) { txnAbort := txn.Clone() txnAbort.Status = roachpb.ABORTED // doesn't change m2ValSize, fortunately if _, err := MVCCResolveWriteIntent(ctx, engine, &aggMS, - roachpb.MakeLockUpdate(txnAbort, roachpb.Span{Key: key}), + roachpb.MakeLockUpdate(txnAbort, roachpb.Span{Key: key}), false, ); err != nil { t.Fatal(err) } @@ -1316,7 +1316,7 @@ func TestMVCCStatsTxnSysPutAbort(t *testing.T) { // Now abort the intent. txn.Status = roachpb.ABORTED if _, err := MVCCResolveWriteIntent(ctx, engine, aggMS, - roachpb.MakeLockUpdate(txn, roachpb.Span{Key: key}), + roachpb.MakeLockUpdate(txn, roachpb.Span{Key: key}), false, ); err != nil { t.Fatal(err) } @@ -1574,14 +1574,14 @@ func TestMVCCStatsRandomized(t *testing.T) { desc := fmt.Sprintf("ranged=%t", ranged) if s.Txn != nil { if !ranged { - if _, err := MVCCResolveWriteIntent(ctx, s.eng, s.MS, s.intent(status)); err != nil { + if _, err := MVCCResolveWriteIntent(ctx, s.eng, s.MS, s.intent(status), false); err != nil { return desc + ": " + err.Error() } } else { max := s.rng.Int63n(5) desc += fmt.Sprintf(", max=%d", max) if _, _, err := MVCCResolveWriteIntentRange( - ctx, s.eng, s.MS, s.intentRange(status), max, onlySeparatedIntents); err != nil { + ctx, s.eng, s.MS, s.intentRange(status), max, false, onlySeparatedIntents); err != nil { return desc + ": " + err.Error() } } diff --git a/pkg/storage/mvcc_test.go b/pkg/storage/mvcc_test.go index 0e42a433f4ba..46957df31a77 100644 --- a/pkg/storage/mvcc_test.go +++ b/pkg/storage/mvcc_test.go @@ -2599,7 +2599,7 @@ func TestMVCCInitPutWithTxn(t *testing.T) { txnCommit.Status = roachpb.COMMITTED txnCommit.WriteTimestamp = clock.Now().Add(1, 0) if _, err := MVCCResolveWriteIntent(ctx, engine, nil, - roachpb.MakeLockUpdate(&txnCommit, roachpb.Span{Key: testKey1})); err != nil { + roachpb.MakeLockUpdate(&txnCommit, roachpb.Span{Key: testKey1}), false); err != nil { t.Fatal(err) } @@ -2917,7 +2917,7 @@ func TestMVCCResolveTxn(t *testing.T) { // Resolve will write with txn1's timestamp which is 0,1. if _, err := MVCCResolveWriteIntent(ctx, engine, nil, - roachpb.MakeLockUpdate(txn1Commit, roachpb.Span{Key: testKey1})); err != nil { + roachpb.MakeLockUpdate(txn1Commit, roachpb.Span{Key: testKey1}), false); err != nil { t.Fatal(err) } @@ -2960,7 +2960,7 @@ func TestMVCCResolveNewerIntent(t *testing.T) { // Resolve will succeed but should remove the intent. if _, err := MVCCResolveWriteIntent(ctx, engine, nil, - roachpb.MakeLockUpdate(txn1Commit, roachpb.Span{Key: testKey1})); err != nil { + roachpb.MakeLockUpdate(txn1Commit, roachpb.Span{Key: testKey1}), false); err != nil { t.Fatal(err) } @@ -3003,7 +3003,7 @@ func TestMVCCResolveIntentTxnTimestampMismatch(t *testing.T) { // A bug (see #7654) caused intents to just stay where they were instead // of being moved forward in the situation set up above. - if _, err := MVCCResolveWriteIntent(ctx, engine, nil, intent); err != nil { + if _, err := MVCCResolveWriteIntent(ctx, engine, nil, intent, false); err != nil { t.Fatal(err) } @@ -3226,7 +3226,7 @@ func TestMVCCAbortTxn(t *testing.T) { txn1AbortWithTS.WriteTimestamp = hlc.Timestamp{Logical: 1} if _, err := MVCCResolveWriteIntent(ctx, engine, nil, - roachpb.MakeLockUpdate(txn1AbortWithTS, roachpb.Span{Key: testKey1}), + roachpb.MakeLockUpdate(txn1AbortWithTS, roachpb.Span{Key: testKey1}), false, ); err != nil { t.Fatal(err) } @@ -3272,7 +3272,7 @@ func TestMVCCAbortTxnWithPreviousVersion(t *testing.T) { txn1AbortWithTS.WriteTimestamp = hlc.Timestamp{WallTime: 2} if _, err := MVCCResolveWriteIntent(ctx, engine, nil, - roachpb.MakeLockUpdate(txn1AbortWithTS, roachpb.Span{Key: testKey1}), + roachpb.MakeLockUpdate(txn1AbortWithTS, roachpb.Span{Key: testKey1}), false, ); err != nil { t.Fatal(err) } @@ -3341,7 +3341,7 @@ func TestMVCCWriteWithDiffTimestampsAndEpochs(t *testing.T) { txne2Commit.Status = roachpb.COMMITTED txne2Commit.WriteTimestamp = hlc.Timestamp{WallTime: 1} if _, err := MVCCResolveWriteIntent(ctx, engine, nil, - roachpb.MakeLockUpdate(&txne2Commit, roachpb.Span{Key: testKey1})); err != nil { + roachpb.MakeLockUpdate(&txne2Commit, roachpb.Span{Key: testKey1}), false); err != nil { t.Fatal(err) } @@ -3651,7 +3651,7 @@ func TestMVCCGetWithPushedTimestamp(t *testing.T) { // Resolve the intent, pushing its timestamp forward. txn := makeTxn(*txn1, hlc.Timestamp{WallTime: 1}) if _, err := MVCCResolveWriteIntent(ctx, engine, nil, - roachpb.MakeLockUpdate(txn, roachpb.Span{Key: testKey1})); err != nil { + roachpb.MakeLockUpdate(txn, roachpb.Span{Key: testKey1}), false); err != nil { t.Fatal(err) } // Attempt to read using naive txn's previous timestamp. @@ -3683,7 +3683,7 @@ func TestMVCCResolveWithDiffEpochs(t *testing.T) { } num, _, err := MVCCResolveWriteIntentRange(ctx, engine, nil, roachpb.MakeLockUpdate(txn1e2Commit, roachpb.Span{Key: testKey1, EndKey: testKey2.Next()}), - 2, engine.IsSeparatedIntentsEnabledForTesting(ctx)) + 2, false, engine.IsSeparatedIntentsEnabledForTesting(ctx)) if err != nil { t.Fatal(err) } @@ -3718,48 +3718,39 @@ func TestMVCCResolveWithUpdatedTimestamp(t *testing.T) { ctx := context.Background() for _, engineImpl := range mvccEngineImpls { t.Run(engineImpl.name, func(t *testing.T) { - engine := engineImpl.create() - defer engine.Close() + testutils.RunTrueAndFalse(t, "asyncResolution", func(t *testing.T, asyncResolution bool) { + engine := engineImpl.create() + defer engine.Close() - if err := MVCCPut(ctx, engine, nil, testKey1, txn1.ReadTimestamp, value1, txn1); err != nil { - t.Fatal(err) - } + err := MVCCPut(ctx, engine, nil, testKey1, txn1.ReadTimestamp, value1, txn1) + require.NoError(t, err) - value, _, err := MVCCGet(ctx, engine, testKey1, hlc.Timestamp{WallTime: 1}, MVCCGetOptions{ - Txn: txn1, - }) - if err != nil { - t.Fatal(err) - } - if !bytes.Equal(value1.RawBytes, value.RawBytes) { - t.Fatalf("the value %s in get result does not match the value %s in request", - value1.RawBytes, value.RawBytes) - } + value, _, err := MVCCGet(ctx, engine, testKey1, hlc.Timestamp{WallTime: 1}, MVCCGetOptions{ + Txn: txn1, + }) + require.NoError(t, err) + require.Equal(t, value1.RawBytes, value.RawBytes) - // Resolve with a higher commit timestamp -- this should rewrite the - // intent when making it permanent. - txn := makeTxn(*txn1Commit, hlc.Timestamp{WallTime: 1}) - if _, err = MVCCResolveWriteIntent(ctx, engine, nil, - roachpb.MakeLockUpdate(txn, roachpb.Span{Key: testKey1})); err != nil { - t.Fatal(err) - } + // Resolve with a higher commit timestamp -- this should rewrite the + // intent when making it permanent. + txn := makeTxn(*txn1Commit, hlc.Timestamp{WallTime: 1}) + update := roachpb.MakeLockUpdate(txn, roachpb.Span{Key: testKey1}) + _, err = MVCCResolveWriteIntent(ctx, engine, nil, update, asyncResolution) + require.NoError(t, err) - value, _, err = MVCCGet(ctx, engine, testKey1, hlc.Timestamp{Logical: 1}, MVCCGetOptions{}) - if value != nil || err != nil { - t.Fatalf("expected both value and err to be nil: %+v, %v", value, err) - } + value, _, err = MVCCGet(ctx, engine, testKey1, hlc.Timestamp{Logical: 1}, MVCCGetOptions{}) + require.NoError(t, err) + require.Nil(t, value) - value, _, err = MVCCGet(ctx, engine, testKey1, hlc.Timestamp{WallTime: 1}, MVCCGetOptions{}) - if err != nil { - t.Error(err) - } - if expTS := (hlc.Timestamp{WallTime: 1}); value.Timestamp != expTS { - t.Fatalf("expected timestamp %+v == %+v", value.Timestamp, expTS) - } - if !bytes.Equal(value1.RawBytes, value.RawBytes) { - t.Fatalf("the value %s in get result does not match the value %s in request", - value1.RawBytes, value.RawBytes) - } + value, _, err = MVCCGet(ctx, engine, testKey1, hlc.Timestamp{WallTime: 1}, MVCCGetOptions{}) + require.NoError(t, err) + require.Equal(t, value1.RawBytes, value.RawBytes) + // If intent resolution was asynchronous with the transaction commit, + // the value's timestamp should have been marked as synthetic when it + // was moved. + expTS := hlc.Timestamp{WallTime: 1}.WithSynthetic(asyncResolution) + require.Equal(t, expTS, value.Timestamp) + }) }) } } @@ -3774,47 +3765,34 @@ func TestMVCCResolveWithPushedTimestamp(t *testing.T) { engine := engineImpl.create() defer engine.Close() - if err := MVCCPut(ctx, engine, nil, testKey1, txn1.ReadTimestamp, value1, txn1); err != nil { - t.Fatal(err) - } + err := MVCCPut(ctx, engine, nil, testKey1, txn1.ReadTimestamp, value1, txn1) + require.NoError(t, err) + value, _, err := MVCCGet(ctx, engine, testKey1, hlc.Timestamp{WallTime: 1}, MVCCGetOptions{ Txn: txn1, }) - if err != nil { - t.Fatal(err) - } - if !bytes.Equal(value1.RawBytes, value.RawBytes) { - t.Fatalf("the value %s in get result does not match the value %s in request", - value1.RawBytes, value.RawBytes) - } + require.NoError(t, err) + require.Equal(t, value1.RawBytes, value.RawBytes) // Resolve with a higher commit timestamp, but with still-pending transaction. // This represents a straightforward push (i.e. from a read/write conflict). txn := makeTxn(*txn1, hlc.Timestamp{WallTime: 1}) - if _, err = MVCCResolveWriteIntent(ctx, engine, nil, - roachpb.MakeLockUpdate(txn, roachpb.Span{Key: testKey1})); err != nil { - t.Fatal(err) - } + update := roachpb.MakeLockUpdate(txn, roachpb.Span{Key: testKey1}) + _, err = MVCCResolveWriteIntent(ctx, engine, nil, update, false /* asyncResolution */) + require.NoError(t, err) value, _, err = MVCCGet(ctx, engine, testKey1, hlc.Timestamp{WallTime: 1}, MVCCGetOptions{}) - if value != nil || err == nil { - t.Fatalf("expected both value nil and err to be a writeIntentError: %+v", value) - } + require.Nil(t, value) + require.Error(t, err) + require.True(t, errors.HasType(err, &roachpb.WriteIntentError{})) // Can still fetch the value using txn1. value, _, err = MVCCGet(ctx, engine, testKey1, hlc.Timestamp{WallTime: 1}, MVCCGetOptions{ Txn: txn1, }) - if err != nil { - t.Error(err) - } - if expTS := (hlc.Timestamp{WallTime: 1}); value.Timestamp != expTS { - t.Fatalf("expected timestamp %+v == %+v", value.Timestamp, expTS) - } - if !bytes.Equal(value1.RawBytes, value.RawBytes) { - t.Fatalf("the value %s in get result does not match the value %s in request", - value1.RawBytes, value.RawBytes) - } + require.NoError(t, err) + require.Equal(t, value1.RawBytes, value.RawBytes) + require.Equal(t, hlc.Timestamp{WallTime: 1}, value.Timestamp) }) } } @@ -3831,7 +3809,7 @@ func TestMVCCResolveTxnNoOps(t *testing.T) { // Resolve a non existent key; noop. if _, err := MVCCResolveWriteIntent(ctx, engine, nil, - roachpb.MakeLockUpdate(txn1Commit, roachpb.Span{Key: testKey1})); err != nil { + roachpb.MakeLockUpdate(txn1Commit, roachpb.Span{Key: testKey1}), false); err != nil { t.Fatal(err) } @@ -3840,7 +3818,7 @@ func TestMVCCResolveTxnNoOps(t *testing.T) { t.Fatal(err) } if _, err := MVCCResolveWriteIntent(ctx, engine, nil, - roachpb.MakeLockUpdate(txn2Commit, roachpb.Span{Key: testKey1})); err != nil { + roachpb.MakeLockUpdate(txn2Commit, roachpb.Span{Key: testKey1}), false); err != nil { t.Fatal(err) } @@ -3852,7 +3830,7 @@ func TestMVCCResolveTxnNoOps(t *testing.T) { txn1CommitWithTS := txn2Commit.Clone() txn1CommitWithTS.WriteTimestamp = hlc.Timestamp{WallTime: 1} if _, err := MVCCResolveWriteIntent(ctx, engine, nil, - roachpb.MakeLockUpdate(txn1CommitWithTS, roachpb.Span{Key: testKey2})); err != nil { + roachpb.MakeLockUpdate(txn1CommitWithTS, roachpb.Span{Key: testKey2}), false); err != nil { t.Fatal(err) } }) @@ -3884,7 +3862,7 @@ func TestMVCCResolveTxnRange(t *testing.T) { num, resumeSpan, err := MVCCResolveWriteIntentRange(ctx, engine, nil, roachpb.MakeLockUpdate(txn1Commit, roachpb.Span{Key: testKey1, EndKey: testKey4.Next()}), - math.MaxInt64, engine.IsSeparatedIntentsEnabledForTesting(ctx)) + math.MaxInt64, false, engine.IsSeparatedIntentsEnabledForTesting(ctx)) if err != nil { t.Fatal(err) } @@ -3977,7 +3955,7 @@ func TestMVCCResolveTxnRangeResume(t *testing.T) { // Resolve up to 6 intents: the keys are 000, 033, 066, 099, 1212, 1515. num, resumeSpan, err := MVCCResolveWriteIntentRange(ctx, rw, nil, roachpb.MakeLockUpdate(txn1Commit, roachpb.Span{Key: roachpb.Key("00"), EndKey: roachpb.Key("33")}), - 6, engine.IsSeparatedIntentsEnabledForTesting(ctx)) + 6, false, engine.IsSeparatedIntentsEnabledForTesting(ctx)) if err != nil { t.Fatal(err) } @@ -4025,7 +4003,7 @@ func TestMVCCResolveTxnRangeResumeWithManyVersions(t *testing.T) { for { // Resolve up to 20 intents. num, resumeSpan, err := MVCCResolveWriteIntentRange(ctx, engine, nil, lockUpdate, - 20, engine.IsSeparatedIntentsEnabledForTesting(ctx)) + 20, false, engine.IsSeparatedIntentsEnabledForTesting(ctx)) require.NoError(t, err) if resumeSpan == nil { // Last call resolves 0 intents. @@ -4247,7 +4225,7 @@ func TestRandomizedMVCCResolveWriteIntentRange(t *testing.T) { func() { batch := engs[i].eng.NewBatch() defer batch.Close() - _, _, err := MVCCResolveWriteIntentRange(ctx, batch, &engs[i].stats, lu, 0, i == 0) + _, _, err := MVCCResolveWriteIntentRange(ctx, batch, &engs[i].stats, lu, 0, false, i == 0) require.NoError(t, err) require.NoError(t, batch.Commit(false)) }() @@ -4265,7 +4243,7 @@ func TestRandomizedMVCCResolveWriteIntentRange(t *testing.T) { func() { batch := engs[i].eng.NewBatch() defer batch.Close() - _, _, err := MVCCResolveWriteIntentRange(ctx, batch, &engs[i].stats, lu, 0, i == 0) + _, _, err := MVCCResolveWriteIntentRange(ctx, batch, &engs[i].stats, lu, 0, false, i == 0) require.NoError(t, err) require.NoError(t, batch.Commit(false)) }() @@ -4354,7 +4332,7 @@ func TestRandomizedSavepointRollbackAndIntentResolution(t *testing.T) { } // All the writes are ignored, so DEL is written for the intent. These // should be buffered in the memtable. - _, _, err = MVCCResolveWriteIntentRange(ctx, eng, nil, lu, 0, true) + _, _, err = MVCCResolveWriteIntentRange(ctx, eng, nil, lu, 0, false, true) require.NoError(t, err) { iter := eng.NewMVCCIterator(MVCCKeyAndIntentsIterKind, @@ -4386,7 +4364,7 @@ func TestRandomizedSavepointRollbackAndIntentResolution(t *testing.T) { if debug { log.Infof(ctx, "LockUpdate: %s", lu.String()) } - _, _, err = MVCCResolveWriteIntentRange(ctx, eng, nil, lu, 0, false) + _, _, err = MVCCResolveWriteIntentRange(ctx, eng, nil, lu, 0, false, false) require.NoError(t, err) // Compact the engine so that SINGLEDEL consumes the SETWITHDEL, becoming a // DEL. @@ -5439,7 +5417,7 @@ func TestResolveIntentWithLowerEpoch(t *testing.T) { } // Resolve the intent with a low epoch. if _, err := MVCCResolveWriteIntent(ctx, engine, nil, - roachpb.MakeLockUpdate(txn1, roachpb.Span{Key: testKey1})); err != nil { + roachpb.MakeLockUpdate(txn1, roachpb.Span{Key: testKey1}), false); err != nil { t.Fatal(err) } diff --git a/pkg/storage/testdata/mvcc_histories/ignored_seq_nums b/pkg/storage/testdata/mvcc_histories/ignored_seq_nums index c0b2587c2a4a..5fa2f9a63c42 100644 --- a/pkg/storage/testdata/mvcc_histories/ignored_seq_nums +++ b/pkg/storage/testdata/mvcc_histories/ignored_seq_nums @@ -174,7 +174,7 @@ data: "k/20"/11.000000000,0 -> /BYTES/20 meta: "k/30"/0,0 -> txn={id=00000000 key=/Min pri=0.00000000 epo=0 ts=11.000000000,0 min=0,0 seq=30} ts=11.000000000,0 del=false klen=12 vlen=7 txnDidNotUpdateMeta=true data: "k/30"/11.000000000,0 -> /BYTES/30 -# Ensure that the deleted value doens't surface. Instead, if we ignore the +# Ensure that the deleted value doesn't surface. Instead, if we ignore the # now-newest seq, we get the write before it. run ok diff --git a/pkg/storage/testdata/mvcc_histories/resolve_intent_with_async_resolution b/pkg/storage/testdata/mvcc_histories/resolve_intent_with_async_resolution new file mode 100644 index 000000000000..309e45b9fba5 --- /dev/null +++ b/pkg/storage/testdata/mvcc_histories/resolve_intent_with_async_resolution @@ -0,0 +1,157 @@ +# Test cases: +# +# for asyncResolution in (false, true): +# for pushWhilePending in (false, true): +# for pushWhileCommitting in (false, true): +# testCase() +# + +run ok +txn_begin t=A ts=5 +---- +>> at end: +txn: "A" meta={id=00000000 key=/Min pri=0.00000000 epo=0 ts=5.000000000,0 min=0,0 seq=0} lock=true stat=PENDING rts=5.000000000,0 wto=false gul=0,0 + +run ok +with t=A k=k1 + # setup + txn_advance ts=5 + put v=a + # test + resolve_intent status=COMMITTED +---- +>> at end: +txn: "A" meta={id=00000000 key=/Min pri=0.00000000 epo=0 ts=5.000000000,0 min=0,0 seq=0} lock=true stat=PENDING rts=5.000000000,0 wto=false gul=0,0 +data: "k1"/5.000000000,0 -> /BYTES/a + +run ok +with t=A k=k2 + # setup + txn_advance ts=5 + put v=a + # test + txn_advance ts=10 + resolve_intent status=COMMITTED +---- +>> at end: +txn: "A" meta={id=00000000 key=/Min pri=0.00000000 epo=0 ts=10.000000000,0 min=0,0 seq=0} lock=true stat=PENDING rts=5.000000000,0 wto=false gul=0,0 +data: "k1"/5.000000000,0 -> /BYTES/a +data: "k2"/10.000000000,0 -> /BYTES/a + +run ok +with t=A k=k3 + # setup + txn_advance ts=5 + put v=a + # test + txn_advance ts=10 + resolve_intent status=PENDING + check_intent + resolve_intent status=COMMITTED +---- +meta: "k3" -> txn={id=00000000 key=/Min pri=0.00000000 epo=0 ts=10.000000000,0 min=0,0 seq=0} ts=10.000000000,0 del=false klen=12 vlen=6 txnDidNotUpdateMeta=false writtenTs=5.000000000,0 +>> at end: +txn: "A" meta={id=00000000 key=/Min pri=0.00000000 epo=0 ts=10.000000000,0 min=0,0 seq=0} lock=true stat=PENDING rts=5.000000000,0 wto=false gul=0,0 +data: "k1"/5.000000000,0 -> /BYTES/a +data: "k2"/10.000000000,0 -> /BYTES/a +data: "k3"/10.000000000,0 -> /BYTES/a + +run ok +with t=A k=k4 + # setup + txn_advance ts=5 + put v=a + # test + txn_advance ts=10 + resolve_intent status=PENDING + check_intent + txn_advance ts=15 + resolve_intent status=COMMITTED +---- +meta: "k4" -> txn={id=00000000 key=/Min pri=0.00000000 epo=0 ts=10.000000000,0 min=0,0 seq=0} ts=10.000000000,0 del=false klen=12 vlen=6 txnDidNotUpdateMeta=false writtenTs=5.000000000,0 +>> at end: +txn: "A" meta={id=00000000 key=/Min pri=0.00000000 epo=0 ts=15.000000000,0 min=0,0 seq=0} lock=true stat=PENDING rts=5.000000000,0 wto=false gul=0,0 +data: "k1"/5.000000000,0 -> /BYTES/a +data: "k2"/10.000000000,0 -> /BYTES/a +data: "k3"/10.000000000,0 -> /BYTES/a +data: "k4"/15.000000000,0 -> /BYTES/a + +run ok +with t=A k=k5 + # setup + txn_advance ts=5 + put v=a + # test + resolve_intent status=COMMITTED asyncResolution +---- +>> at end: +txn: "A" meta={id=00000000 key=/Min pri=0.00000000 epo=0 ts=5.000000000,0 min=0,0 seq=0} lock=true stat=PENDING rts=5.000000000,0 wto=false gul=0,0 +data: "k1"/5.000000000,0 -> /BYTES/a +data: "k2"/10.000000000,0 -> /BYTES/a +data: "k3"/10.000000000,0 -> /BYTES/a +data: "k4"/15.000000000,0 -> /BYTES/a +data: "k5"/5.000000000,0 -> /BYTES/a + +run ok +with t=A k=k6 + # setup + txn_advance ts=5 + put v=a + # test + txn_advance ts=10 + resolve_intent status=COMMITTED asyncResolution +---- +>> at end: +txn: "A" meta={id=00000000 key=/Min pri=0.00000000 epo=0 ts=10.000000000,0 min=0,0 seq=0} lock=true stat=PENDING rts=5.000000000,0 wto=false gul=0,0 +data: "k1"/5.000000000,0 -> /BYTES/a +data: "k2"/10.000000000,0 -> /BYTES/a +data: "k3"/10.000000000,0 -> /BYTES/a +data: "k4"/15.000000000,0 -> /BYTES/a +data: "k5"/5.000000000,0 -> /BYTES/a +data: "k6"/10.000000000,0? -> /BYTES/a + +run ok +with t=A k=k7 + # setup + txn_advance ts=5 + put v=a + # test + txn_advance ts=10 + resolve_intent status=PENDING + check_intent + resolve_intent status=COMMITTED asyncResolution +---- +meta: "k7" -> txn={id=00000000 key=/Min pri=0.00000000 epo=0 ts=10.000000000,0 min=0,0 seq=0} ts=10.000000000,0 del=false klen=12 vlen=6 txnDidNotUpdateMeta=false writtenTs=5.000000000,0 +>> at end: +txn: "A" meta={id=00000000 key=/Min pri=0.00000000 epo=0 ts=10.000000000,0 min=0,0 seq=0} lock=true stat=PENDING rts=5.000000000,0 wto=false gul=0,0 +data: "k1"/5.000000000,0 -> /BYTES/a +data: "k2"/10.000000000,0 -> /BYTES/a +data: "k3"/10.000000000,0 -> /BYTES/a +data: "k4"/15.000000000,0 -> /BYTES/a +data: "k5"/5.000000000,0 -> /BYTES/a +data: "k6"/10.000000000,0? -> /BYTES/a +data: "k7"/10.000000000,0? -> /BYTES/a + +run ok +with t=A k=k8 + # setup + txn_advance ts=5 + put v=a + # test + txn_advance ts=10 + resolve_intent status=PENDING + check_intent + txn_advance ts=15 + resolve_intent status=COMMITTED asyncResolution +---- +meta: "k8" -> txn={id=00000000 key=/Min pri=0.00000000 epo=0 ts=10.000000000,0 min=0,0 seq=0} ts=10.000000000,0 del=false klen=12 vlen=6 txnDidNotUpdateMeta=false writtenTs=5.000000000,0 +>> at end: +txn: "A" meta={id=00000000 key=/Min pri=0.00000000 epo=0 ts=15.000000000,0 min=0,0 seq=0} lock=true stat=PENDING rts=5.000000000,0 wto=false gul=0,0 +data: "k1"/5.000000000,0 -> /BYTES/a +data: "k2"/10.000000000,0 -> /BYTES/a +data: "k3"/10.000000000,0 -> /BYTES/a +data: "k4"/15.000000000,0 -> /BYTES/a +data: "k5"/5.000000000,0 -> /BYTES/a +data: "k6"/10.000000000,0? -> /BYTES/a +data: "k7"/10.000000000,0? -> /BYTES/a +data: "k8"/15.000000000,0? -> /BYTES/a diff --git a/pkg/util/hlc/timestamp.pb.go b/pkg/util/hlc/timestamp.pb.go index 087a9ec0a838..2bc454b8275d 100644 --- a/pkg/util/hlc/timestamp.pb.go +++ b/pkg/util/hlc/timestamp.pb.go @@ -59,6 +59,15 @@ type Timestamp struct { // can tighten the uncertainty interval that is applied to MVCC versions // with clock timestamp. // + // This property of disconnecting committed MVCC versions from observed + // timestamps is also used during asynchronous intent resolution in + // cases where an intent is being committed at a later time than it was + // synchronously written at (i.e. before committing) by its transaction. + // In such cases, a clock reading from the leaseholder of the committed + // value does not place an upper bound on the timestamp that the value + // may eventually be resolved to, so observed timestamps cannot be used + // to limit uncertainty when reading the value. + // // This flag does not affect the sort order of Timestamps. However, it // is considered when performing structural equality checks (e.g. using // the == operator). Consider use of the EqOrdering method when testing diff --git a/pkg/util/hlc/timestamp.proto b/pkg/util/hlc/timestamp.proto index 0ee60edcaba4..fc9e0e298422 100644 --- a/pkg/util/hlc/timestamp.proto +++ b/pkg/util/hlc/timestamp.proto @@ -55,6 +55,15 @@ message Timestamp { // can tighten the uncertainty interval that is applied to MVCC versions // with clock timestamp. // + // This property of disconnecting committed MVCC versions from observed + // timestamps is also used during asynchronous intent resolution in + // cases where an intent is being committed at a later time than it was + // synchronously written at (i.e. before committing) by its transaction. + // In such cases, a clock reading from the leaseholder of the committed + // value does not place an upper bound on the timestamp that the value + // may eventually be resolved to, so observed timestamps cannot be used + // to limit uncertainty when reading the value. + // // This flag does not affect the sort order of Timestamps. However, it // is considered when performing structural equality checks (e.g. using // the == operator). Consider use of the EqOrdering method when testing