Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/storage: introduce local timestamps for MVCC versions in MVCCKey #77342

Conversation

nvanbenschoten
Copy link
Member

@nvanbenschoten nvanbenschoten commented Mar 3, 2022

Fixes #36431.
Fixes #49360.
Replaces #72121.

This commit fixes the potential for a stale read as detailed in #36431 using the "remember when intents were written" approach described in #36431 (comment) and later expanded on in #72121 (comment).

type MVCCKey struct {
	Key            roachpb.Key
	Timestamp      hlc.Timestamp
+	LocalTimestamp hlc.ClockTimestamp
}

This bug requires a combination of skewed clocks, multi-key transactions split across ranges whose leaseholders are stored on different nodes, a transaction read refresh, and the use of observed timestamps to avoid an uncertainty restart. With the combination of these four factors, it was possible to construct an ordering of events that violated real-time ordering and allowed a transaction to observe a stale read. Upon the discovery of the bug, we introduced the multi-register test to the Jepsen test suite, and have since observed the test fail when combined with the strobe-skews nemesis due to this bug in #49360 (and a few issues linked to that one). This commit stabilizes that test.

Explanation

The combination of all of the factors listed above can lead to the stale read because it breaks one of the invariants that the observed timestamp infrastructure1 relied upon for correctness. Specifically, observed timestamps relied on the guarantee that a leaseholder's clock must always be equal to or greater than the version timestamp of all writes that it has served. However, this guarantee did not always hold. It does hold for non-transactional writes. It also holds for transactions that perform all of their intent writes at the same timestamp and then commit at this timestamp. However, it does not hold for transactions which move their commit timestamp forward over their lifetime before committing, writing intents at different timestamps along the way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it means for a leaseholder to "serve a write at a timestamp". The meaning of this phrase is straightforward for non-transactional writes. However, for an intent write whose original timestamp is provisional and whose eventual commit timestamp is stored indirectly in its transaction record at its time of commit, the meaning is less clear. This reconciliation to move the intent write's timestamp up to its transaction's commit timestamp is asynchronous from the transaction commit (and after it has been externally acknowledged). So even if a leaseholder has only served writes with provisional timestamps up to timestamp 100 (placing a lower bound on its clock of 100), it can be in possession of intents that, when resolved, will carry a timestamp of 200. To uphold the real-time ordering property, this value must be observed by any transaction that begins after the value's transaction committed and was acknowledged. So for observed timestamps to be correct as currently written, we would need a guarantee that this value's leaseholder would never return an observed timestamp < 200 at any point after the transaction commits. But with the transaction commit possibly occurring on another node and with communication to resolve the intent occurring asynchronously, this seems like an impossible guarantee to make.

This would appear to undermine observed timestamps to the point where they cannot be used. However, we can claw back correctness without sacrificing performance by recognizing that only a small fraction2 of transactions commit at a different timestamps than the one they used while writing intents. We can also recognize that if we were to compare observed timestamps against the timestamp that a committed value was originally written (its provisional value if it was once an intent) instead of the timestamp that it had been moved to on commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to each MVCC key called the "local timestamp". The MVCC key's existing version timestamp dictates the key's visibility to readers and is tied to the writer's commit timestamp. The local clock timestamp records the value of the local HLC clock on the leaseholder when the key was originally written. It is used to make claims about the relative real time ordering of the key's writer and readers when comparing a reader's uncertainty interval (and observed timestamps) to the key. Ignoring edge cases, readers with an observed timestamp from the key's leaseholder that is greater than the local clock timestamp stored in the key cannot make claims about real time ordering and must consider it possible that the key's write occurred before the read began. However, readers with an observed timestamp from the key's leaseholder that is less than the clock timestamp can claim that the reader captured that observed timestamp before the key was written and therefore can consider the key's write to have been concurrent with the read. In doing so, the reader can avoid an uncertainty restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to the MVCC key encoding, the commit optimizes the clock timestamp away in the common case where it leads the version timestamp. Only in the rare cases where the local timestamp trails the version timestamp (e.g. future-time writes, async intent resolution with a new commit timestamp) does the local timestamp need to be explicitly represented in the key encoding. This is possible because it is safe for the local clock timestamp to be rounded down, as this will simply lead to additional uncertainty restarts. However, it is not safe for the local clock timestamp to be rounded up, as this could lead to stale reads.

Future improvements

As noted in #72121 (comment), this commit paves a path towards the complete removal of synthetic timestamps, which were originally introduced in support of non-blocking transactions and GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for ClockTimestamps is no longer necessary now that we never need to "push" transaction-domain timestamps into HLC clocks. Instead, the invariant that underpins observed timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC clock.

The synthetic bit's second role of disabling observed timestamps is replaced by the generalization provided by "local timestamps". Local timestamps precisely track when an MVCC version was written in the leaseholder's clock timestamp domain. This establishes a total ordering across clock observations (local timestamp assignment for writers and observed timestamps for readers) and establish a partial ordering between writer and reader transactions. As a result, the use of observed timestamps during uncertainty checking becomes a comparison between two ClockTimestamps, the version's local timestamp and the reader's observed timestamp.

Correctness testing

I was not able to stress jepsen/multi-register/strobe-skews hard enough to cause it to fail, even on master. We've only seen the test fail a handful of times over the past few years, so this isn't much of a surprise. Still, this prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called TestTxnReadWithinUncertaintyIntervalAfterIntentResolution which controls manual clocks directly and was able to deterministically reproduce the stale read before this fix in a few different ways. After this fix, the test passes.

Performance analysis

This correctness fix will lead to an increased rate of transaction retries under some workloads.

TODO(nvanbenschoten):

  • observe TPC-C performance
    -- top-line performance
    -- uncertainty retry rate
    -- commit-wait rate (should be zero)
  • compare YCSB performance
  • compare KV performance

Release note (bug fix): fixed a rare race condition that could allow for a transaction to serve a stale read and violate real-time ordering under moderate clock skew.

Release justification: None. This will not be merged before the v22.1 release branch is cut.

Jira issue: CRDB-14815

Footnotes

  1. see pkg/kv/kvserver/observedts/doc.go for an explanation of the role of observed timestamps in the transaction model. This commit updates that documentation to include this fix.

  2. see analysis in https://github.com/cockroachdb/cockroach/issues/36431#issuecomment-714221846.

@nvanbenschoten nvanbenschoten requested review from a team as code owners March 3, 2022 18:45
@nvanbenschoten nvanbenschoten requested a review from a team March 3, 2022 18:45
@nvanbenschoten nvanbenschoten requested review from a team as code owners March 3, 2022 18:45
@nvanbenschoten nvanbenschoten requested review from stevendanna and removed request for a team March 3, 2022 18:45
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/localTimestamp branch 9 times, most recently from ccceff5 to 3f67d42 Compare March 5, 2022 20:06
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(read a bit of this -- flushing comments)

Reviewed 4 of 7 files at r1, 5 of 14 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @stevendanna, and @sumeerbhola)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 648 at r1 (raw file):

	if linearizable && !readOnly {
		waitUntil = waitUntil.Add(tc.clock.MaxOffset().Nanoseconds(), 0)
	}

Would be nice to have a comment here on what kinds of transactions won't need to wait.


pkg/kv/kvserver/replica_send.go, line 268 at r1 (raw file):

	// evaluation.
	if txn.WriteTimestamp.LessEq(r.Clock().Now()) {
		return nil // no wait fast-path

same comment here


pkg/kv/kvserver/replica_tscache.go, line 62 at r1 (raw file):

	// property this is asserting is no longer quite as important, so we can
	// disable the check. However, it would still be nice to track down how we
	// can hit this.

was this TODO stale and has been fixed?


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 573 at r2 (raw file):

		// resolve the intent to timestamp 11 but leave its local timestamp at 8
		// then the reader would consider the value "uncertain" upon re-evaluation.
		// However, if the reader also updates the value's local timestamp to 16

In this example will this mean that we don't need to write the local timestamp since it is >= txn commit timestamp?


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 589 at r2 (raw file):

		// commits and acknowledges its client. This could not lead to the pusher
		// serving a stale read, but it could lead to other transactions serving
		// stale reads.

I see what this is saying, but a small example here would make it super clear for a reader.


pkg/roachpb/api.proto, line 1142 at r2 (raw file):

  // from a different node than the node which evaluates the ResolveIntent
  // request, it will be ignored and the intent's local timestamp will not be
  // changed.

This could use a short example of the hazard in the absence of this ignore behavior.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 9, 2022
This commit fixes the bug revealed in the previous commit and sets the
stage for generalized MVCC version normalization during key comparison,
which will be needed for cockroachdb#77342.

To do so, the commit adds a normalization step to EngineKeyCompare, the
custom `Compare` function we provide to Pebble. This normalization pass
currently strips the synthetic bit from version timestamps, which fixes
the bug revealed in the previous commit. The normalization pass also
strips zero-valued logical components, which are typically not encoded
into MVCCKeys today, but may be in the future (for instance, see
cockroachdb/pebble#1314). In cockroachdb#77342, we can then
extend this to strip the encoded logical timestamp, if present.

In addition to updating the existing custom key comparator function
passed to Pebble, the commit also introduces a new custom key equality
function. This new function, called EngineKeyEqual, is provided to
Pebble as its `Equal` function, replacing the default key equality
function of `bytes.Equal`. EngineKeyEqual uses the same version
normalization rules to strip portions of the key's version that should
not affect ordering.

The relationship between the different comparators is explored in a new
property based unit test called `TestMVCCKeyCompareRandom`. The change
allows us to say that for any two `MVCCKeys` `a` and `b`, the following
identities hold:
```
a.Compare(b) = EngineKeyCompare(encode(a), encode(b))

a.Equal(b)   = EngineKeyEqual(encode(a), encode(b))

(a.Compare(b) == 0) = a.Equal(b)

(a.Compare(b) < 0) = a.Less(b)

(a.Compare(b) > 0) = b.Less(a)
```

Care was taken to minimize the cost of this version normalization. With
EngineKeyCompare, the normalization amounts to four new branches that
should all be easy for a branch predictor to learn.

With EngineKeyEqual, there is more of a concern that this change will
regress performance because we switch from a direct call to `bytes.Equal`
to a custom comparator. To minimize this cost, the change adds a
fast-path to quickly defer to `bytes.Equal` when version normalization
is not needed. Benchmarks show that with this fast-path and with an
expected distribution of keys, the custom key equality function is about
2.5ns more expensive per call. This seems reasonable.

```
name               time/op
MVCCKeyCompare-10  12.2ns ± 1%
MVCCKeyEqual-10    7.10ns ± 6%
BytesEqual-10      4.72ns ± 2%
```

Release note: None.

Release justification: None. Not intended for v22.1.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/localTimestamp branch from 3f67d42 to 2690007 Compare March 9, 2022 20:54
@nvanbenschoten nvanbenschoten requested a review from a team as a code owner March 9, 2022 20:54
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 23, 2022
This commit fixes the bug revealed in the previous commit and sets the
stage for generalized MVCC version normalization during key comparison,
which will be needed for cockroachdb#77342.

To do so, the commit adds a normalization step to EngineKeyCompare, the
custom `Compare` function we provide to Pebble. This normalization pass
currently strips the synthetic bit from version timestamps, which fixes
the bug revealed in the previous commit. The normalization pass also
strips zero-valued logical components, which are typically not encoded
into MVCCKeys today, but may be in the future (for instance, see
cockroachdb/pebble#1314). In cockroachdb#77342, we can then
extend this to strip the encoded logical timestamp, if present.

In addition to updating the existing custom key comparator function
passed to Pebble, the commit also introduces a new custom key equality
function. This new function, called EngineKeyEqual, is provided to
Pebble as its `Equal` function, replacing the default key equality
function of `bytes.Equal`. EngineKeyEqual uses the same version
normalization rules to strip portions of the key's version that should
not affect ordering.

The relationship between the different comparators is explored in a new
property based unit test called `TestMVCCKeyCompareRandom`. The change
allows us to say that for any two `MVCCKeys` `a` and `b`, the following
identities hold:
```
a.Compare(b) = EngineKeyCompare(encode(a), encode(b))

a.Equal(b)   = EngineKeyEqual(encode(a), encode(b))

(a.Compare(b) == 0) = a.Equal(b)

(a.Compare(b) < 0) = a.Less(b)

(a.Compare(b) > 0) = b.Less(a)
```

Care was taken to minimize the cost of this version normalization. With
EngineKeyCompare, the normalization amounts to four new branches that
should all be easy for a branch predictor to learn.

With EngineKeyEqual, there is more of a concern that this change will
regress performance because we switch from a direct call to `bytes.Equal`
to a custom comparator. To minimize this cost, the change adds a
fast-path to quickly defer to `bytes.Equal` when version normalization
is not needed. Benchmarks show that with this fast-path and with an
expected distribution of keys, the custom key equality function is about
2.5ns more expensive per call. This seems reasonable.

```
name               time/op
MVCCKeyCompare-10  12.2ns ± 1%
MVCCKeyEqual-10    7.10ns ± 6%
BytesEqual-10      4.72ns ± 2%
```

Release note: None.

Release justification: None. Not intended for v22.1.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/localTimestamp branch from 2690007 to 6f9b94c Compare March 25, 2022 20:38
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 29, 2022
This commit fixes the bug revealed in the previous commit and sets the
stage for generalized MVCC version normalization during key comparison,
which will be needed for cockroachdb#77342.

To do so, the commit adds a normalization step to EngineKeyCompare, the
custom `Compare` function we provide to Pebble. This normalization pass
currently strips the synthetic bit from version timestamps, which fixes
the bug revealed in the previous commit. The normalization pass also
strips zero-valued logical components, which are typically not encoded
into MVCCKeys today, but may be in the future (for instance, see
cockroachdb/pebble#1314). In cockroachdb#77342, we can then
extend this to strip the encoded logical timestamp, if present.

In addition to updating the existing custom key comparator function
passed to Pebble, the commit also introduces a new custom key equality
function. This new function, called EngineKeyEqual, is provided to
Pebble as its `Equal` function, replacing the default key equality
function of `bytes.Equal`. EngineKeyEqual uses the same version
normalization rules to strip portions of the key's version that should
not affect ordering.

The relationship between the different comparators is explored in a new
property based unit test called `TestMVCCKeyCompareRandom`. The change
allows us to say that for any two `MVCCKeys` `a` and `b`, the following
identities hold:
```
a.Compare(b) = EngineKeyCompare(encode(a), encode(b))

a.Equal(b)   = EngineKeyEqual(encode(a), encode(b))

(a.Compare(b) == 0) = a.Equal(b)

(a.Compare(b) < 0) = a.Less(b)

(a.Compare(b) > 0) = b.Less(a)
```

Care was taken to minimize the cost of this version normalization. With
EngineKeyCompare, the normalization amounts to four new branches that
should all be easy for a branch predictor to learn.

With EngineKeyEqual, there is more of a concern that this change will
regress performance because we switch from a direct call to `bytes.Equal`
to a custom comparator. To minimize this cost, the change adds a
fast-path to quickly defer to `bytes.Equal` when version normalization
is not needed. Benchmarks show that with this fast-path and with an
expected distribution of keys, the custom key equality function is about
2.5ns more expensive per call. This seems reasonable.

```
name               time/op
MVCCKeyCompare-10  12.2ns ± 1%
MVCCKeyEqual-10    7.10ns ± 6%
BytesEqual-10      4.72ns ± 2%
```

Release note: None.

Release justification: None. Not intended for v22.1.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 30, 2022
This commit fixes the bug revealed in the previous commit and sets the
stage for generalized MVCC version normalization during key comparison,
which will be needed for cockroachdb#77342.

To do so, the commit adds a normalization step to EngineKeyCompare, the
custom `Compare` function we provide to Pebble. This normalization pass
currently strips the synthetic bit from version timestamps, which fixes
the bug revealed in the previous commit. The normalization pass also
strips zero-valued logical components, which are typically not encoded
into MVCCKeys today, but may be in the future (for instance, see
cockroachdb/pebble#1314). In cockroachdb#77342, we can then
extend this to strip the encoded logical timestamp, if present.

In addition to updating the existing custom key comparator function
passed to Pebble, the commit also introduces a new custom key equality
function. This new function, called EngineKeyEqual, is provided to
Pebble as its `Equal` function, replacing the default key equality
function of `bytes.Equal`. EngineKeyEqual uses the same version
normalization rules to strip portions of the key's version that should
not affect ordering.

The relationship between the different comparators is explored in a new
property based unit test called `TestMVCCKeyCompareRandom`. The change
allows us to say that for any two `MVCCKeys` `a` and `b`, the following
identities hold:
```
a.Compare(b) = EngineKeyCompare(encode(a), encode(b))

a.Equal(b)   = EngineKeyEqual(encode(a), encode(b))

(a.Compare(b) == 0) = a.Equal(b)

(a.Compare(b) < 0) = a.Less(b)

(a.Compare(b) > 0) = b.Less(a)
```

Care was taken to minimize the cost of this version normalization. With
EngineKeyCompare, the normalization amounts to four new branches that
should all be easy for a branch predictor to learn.

With EngineKeyEqual, there is more of a concern that this change will
regress performance because we switch from a direct call to `bytes.Equal`
to a custom comparator. To minimize this cost, the change adds a
fast-path to quickly defer to `bytes.Equal` when version normalization
is not needed. Benchmarks show that with this fast-path and with an
expected distribution of keys, the custom key equality function is about
2.5ns more expensive per call. This seems reasonable.

```
name               time/op
MVCCKeyCompare-10  12.2ns ± 1%
MVCCKeyEqual-10    7.10ns ± 6%
BytesEqual-10      4.72ns ± 2%
```

Release note: None.

Release justification: None. Not intended for v22.1.
craig bot pushed a commit that referenced this pull request Mar 30, 2022
77520: storage: normalize MVCC version timestamps during key comparison r=nvanbenschoten a=nvanbenschoten

Related to #77342.

This commit fixes the bug revealed in the previous commit and sets the stage for generalized MVCC version normalization during key comparison, which will be needed for #77342.

To do so, the commit adds a normalization step to EngineKeyCompare, the custom `Compare` function we provide to Pebble. This normalization pass currently strips the synthetic bit from version timestamps, which fixes the bug revealed in the previous commit. The normalization pass also strips zero-valued logical components, which are typically not encoded into MVCCKeys today, but may be in the future (for instance, see cockroachdb/pebble#1314). In #77342, we can then extend this to strip the encoded logical timestamp, if present.

In addition to updating the existing custom key comparator function passed to Pebble, the commit also introduces a new custom key equality function. This new function, called EngineKeyEqual, is provided to Pebble as its `Equal` function, replacing the default key equality function of `bytes.Equal`. EngineKeyEqual uses the same version normalization rules to strip portions of the key's version that should not affect ordering.

The relationship between the different comparators is explored in a new property based unit test called `TestMVCCKeyCompareRandom`. The change allows us to say that for any two `MVCCKeys` `a` and `b`, the following identities hold:
```
a.Compare(b) = EngineKeyCompare(encode(a), encode(b))

a.Equal(b)   = EngineKeyEqual(encode(a), encode(b))

(a.Compare(b) == 0) = a.Equal(b)

(a.Compare(b) < 0) = a.Less(b)

(a.Compare(b) > 0) = b.Less(a)
```

Care was taken to minimize the cost of this version normalization. With EngineKeyCompare, the normalization amounts to four new branches that should all be easy for a branch predictor to learn.

With EngineKeyEqual, there is more of a concern that this change will regress performance because we switch from a direct call to `bytes.Equal` to a custom comparator. To minimize this cost, the change adds a fast-path to quickly defer to `bytes.Equal` when version normalization is not needed. Benchmarks show that with this fast-path and with an expected distribution of keys, the custom key equality function is about 2.5ns more expensive per call. This seems reasonable. 

```
name               time/op
MVCCKeyCompare-10  12.2ns ± 1%
MVCCKeyEqual-10    7.10ns ± 6%
BytesEqual-10      4.72ns ± 2%
```

Release note: None.

Release justification: None. Not intended for v22.1.

78350: backupccl: refactor getBackupDetailsAndManifest r=benbardin a=benbardin

This should be a complete no-op from a functionality perspective.
The smaller, more encapsulated pieces should be easier to reason about.

Specifically, getBackupDetailsAndManifest has now become a scaffolding function, calling the new updateBackupDetails, createBackupManifest, and validateBackupDetailsAndManifest in turn. These functions have a narrower focus and are a bit easier to follow.

It's also now a bit clearer where one would go to, say, add more validation conditions.

Release note: None

79091: bazel: upgrade to bazel 5.1.0 r=mari-crl a=rickystewart

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Ben Bardin <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/localTimestamp branch from 6f9b94c to d28b7ed Compare April 1, 2022 04:40
This commit adds a cluster version gate and a cluster setting for
local timestamps, to assist with their migration into an existing
cluster.

This fixes mixed-version clusters' interaction with local timestamps.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/localTimestamp branch from 3e85d95 to fb24b1a Compare April 15, 2022 18:00
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't look like MVCCStats computation has been updated to account for the extra timestamp.

The MVCCStats computation deliberately avoids taking the variable-length timestamp component into account and assumes all engine key versions are 12 bytes large, which can be an overestimate or an underestimate. This has always been the case.
However, with this change, the degree to which we can underestimate gets 12 bytes worse. Still, I'm inclined to not change this for two reasons. First, it will be difficult to make this precise. Doing so will require lifting and leaking the decision about when to omit the local timestamp from the encoding routines up into most MVCC logic. Second, I think it might be wise to keep the door open to LSM compaction filters stripping this local timestamp once it is no longer needed. As a strawman, we could do this for key-values once they fall below their range's GC threshold.

I see. The reason I ask is that I've implemented MVCC stats for range keys in #78085 using the actual encoded size of timestamps, with the hope that we would eventually fix this for point keys too (we have TODOs for it). Since range tombstones would presumably also need local timestamps, these would contribute their actual encoded size to the statistics.

Today with the way that KV, MVCC, and Storage are layered, the MVCC and storage layers operate on decoded engine/mvcc keys but encoded (opaque) values. That makes manipulation of keys more natural than manipulation of values in the layer that owns this "local timestamp" concept.

Sort of. MVCCPut takes a roachpb.Value, and MVCCGet and MVCCScan both return roachpb.Value, which would store the local timestamp -- but it's true that this encoding/decoding happens near the MVCC API boundaries, and in the case of scans returning encoded binary KV pairs it might get prohibitively expensive to decode these on the server-side during scanning. Similarly for intent rewrites. So it's quite likely that the performance gains here are worth it, despite the unpleasant aesthetics.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @stevendanna and @sumeerbhola)

Copy link
Contributor

@HonoreDB HonoreDB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I'm here to look at the cdc impact: have you considered storing this as an offset or xor from the hlc.Timestamp with a smaller or variable-length encoding?

Reviewed 11 of 137 files at r21.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @stevendanna and @sumeerbhola)

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 134 files at r15, 1 of 132 files at r20, 2 of 137 files at r21.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @stevendanna, and @sumeerbhola)


pkg/storage/mvcc.go, line 1285 at r21 (raw file):

// assigned an explicit local timestamp. The effect of this is that
// readers treat the local timestamp as being equal to the version
// timestamp.

When is it safe for the the putter to say "assume local timestamp == version timestamp"?
Is it always safe for non-blocking txns (including a txn rewriting its own intent) because this is a put and the use of a version timestamp implies that the local clock has been advanced? Or is it not true even in this case because "we never need to "push" transaction-domain timestamps into HLC clocks"?
I would like the comment to also define the safety requirement for using hlc.ClockTimestamp{}

@nvanbenschoten
Copy link
Member Author

Since range tombstones would presumably also need local timestamps, ...

This is a good point. We'll want to make sure that whatever solution we come to here works well for range tombstones. My reading of #77762 and of cockroachdb/pebble@5d99779 is that the timestamp suffix that we typically append to the key for point KeyKinds is stored in the value of range keys (InternalKeyKindRangeKeySet, InternalKeyKindRangeKeyUnset, InternalKeyKindRangeKeyDelete). On the surface, that would seem to indicate that the choice of where to store the local timestamp for point keys does not impact range tombstones (suffix vs. value).

However, the semantics of RangeKeyUnset make this more nuanced. Currently, RangeKeyUnset says that "only range keys with suffixes that exactly match the unset suffix" will be removed. One way to read this is that anything that does not affect key equality cannot end up in the range key suffix and that range key equality is defined using byte equality of the suffix. Another way to read this is that range key equality should be using the pebble.Comparer functions for range keys to define ordering and equality of range keys like it does for point keys, but that it is not doing so yet.

I'd like to make sure that any choice we make here will be consistently applied to range keys and that I'm not landing a large hurdle in the path of range keys. @erikgrinaker do you mind if I set up a quick chat for when you're back to bottom out on a few of these questions?

MVCCPut takes a roachpb.Value, and MVCCGet and MVCCScan both return roachpb.Value, which would store the local timestamp -- but it's true that this encoding/decoding happens near the MVCC API boundaries

In many cases, the encoding/decoding is actually far away from the MVCC API boundaries. It's often all the way up in the SQL layer on the client. In the common-path of a SQL INSERT, the roachpb.Value (which is little more than a wrapper around a custom encoded byte slice) is encoded in SQL and treated as opaque below this level. That makes extending roachpb.Value with additional metadata unpleasant and also difficult to do without a meaningful performance impact (i.e. due to re-allocating and re-encoding the value in MVCC, then decoding and stripping the metadata during performance-sensitive scans).

An alternative that I've begun to explore is introducing a wrapper value in the MVCC layer that carries a roachpb.Value along with some optional metadata. Imagine a storage.MVCCValue struct which wraps a roachpb.Value in much the same way that a storage.MVCCKey wraps a roachpb.Key. There would be some challenges to a solution like this, including backwards compatibility and avoiding the encoding/decoding performance hit in the common case, but we may be able to rely on local_timestamps being rare and (ab)use the roachpb.Value tag to indicate that the value is wrapped or not, allowing for an extended encoding scheme only in the cases where it is.

While I'm here to look at the cdc impact: have you considered storing this as an offset or xor from the hlc.Timestamp with a smaller or variable-length encoding?

This change should have little impact to CDC. Local timestamps will be stripped from the key below Raft before they end up in a rangefeed and will never make it up to a changefeed.

I considered storing the local timestamp as an offset from the hlc.Timestamp, but that doesn't get us anything without a variable-length encoding. A variable-length encoding itself is difficult because we rely on known key suffix sizes when decoding MVCCKeys. Given how rarely we'll need to include this field in practice (< 0.3% of writes in TPC-C), it didn't seem worth the complexity beyond making the encoding variable-length with respect to the presence or absence of a logical timestamp component.

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Apr 19, 2022

Sure @nvanbenschoten, let's have a chat when I'm back.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Apr 23, 2022
This commit fixes a latent bug where `updateStatsOnResolve`'s attempt to
discount the effect of the previous intent's key and value size on the
stats' `LiveBytes` field used the new value's size instead of the old
value's size. I've tracked this back to 92cad17.

This is currently harmless because intent value sizes never change
during intent resolution. However, this may not be the case in the
future. For instance, this will not be the case if we store local
timestamps in an `MVCCValue` wrapper object, as is being explored
in cockroachdb#77342.
craig bot pushed a commit that referenced this pull request Apr 23, 2022
80443: storage: fix accounting for orig LiveBytes in updateStatsOnResolve r=nvanbenschoten a=nvanbenschoten

This commit fixes a latent bug where `updateStatsOnResolve`'s attempt to
discount the effect of the previous intent's key and value size on the
stats' `LiveBytes` field used the new value's size instead of the old
value's size. I've tracked this back to 92cad17.

This is currently harmless because intent value sizes never change
during intent resolution. However, this may not be the case in the
future. For instance, this will not be the case if we store local
timestamps in an `MVCCValue` wrapper object, as is being explored
in #77342.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@nvanbenschoten nvanbenschoten changed the title kv/storage: introduce and track local timestamps for MVCC versions kv/storage: introduce and track local timestamps for MVCC versions in MVCCKey Apr 28, 2022
@nvanbenschoten nvanbenschoten changed the title kv/storage: introduce and track local timestamps for MVCC versions in MVCCKey kv/storage: introduce local timestamps for MVCC versions in MVCCKey Apr 28, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Apr 28, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.
Replaces cockroachdb#77342.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key-value version called the "local timestamp". The existing version
timestamp dictates the key-value's visibility to readers and is tied to the
writer's commit timestamp. The local clock timestamp records the value of the
local HLC clock on the leaseholder when the key was originally written. It is
used to make claims about the relative real time ordering of the key's writer
and readers when comparing a reader's uncertainty interval (and observed
timestamps) to the key. Ignoring edge cases, readers with an observed timestamp
from the key's leaseholder that is greater than the local clock timestamp stored
in the key cannot make claims about real time ordering and must consider it
possible that the key's write occurred before the read began. However, readers
with an observed timestamp from the key's leaseholder that is less than the
clock timestamp can claim that the reader captured that observed timestamp
before the key was written and therefore can consider the key's write to have
been concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each
key-value pair, the commit optimizes the clock timestamp away in the common case
where it leads the version timestamp. Only in the rare cases where the local
timestamp trails the version timestamp (e.g. future-time writes, async intent
resolution with a new commit timestamp) does the local timestamp need to be
explicitly represented in the key encoding. This is possible because it is safe
for the local clock timestamp to be rounded down, as this will simply lead to
additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### MVCCValue

To store the local timestamp, the commit introduces a new MVCCValue type to
parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with
MVCC-level metadata which is stored in an enginepb.MVCCValueHeader struct. To
this point, the MVCC layer has treated versioned values as opaque blobs of bytes
and has not enforced any structure on them. Now that MVCC will use the value to
store metadata, it needs to enforce more structure on the values provided to it.
This is the cause of some testing churn, but is otherwise not a problem, as all
production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended"
encoding scheme, depending on whether the value's header is empty or not. If the
value's header is empty, it is omitted in the encoding and the mvcc value's
encoding is identical to that of roachpb.Value. This provided backwards
compatibility and ensures that the MVCCValue optimizes away in the common case.
If the value's header is not empty, it is prepended to the roachpb.Value
encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which
is either the roachpb.Value tag (which has many possible values) or a sentinel
tag not used by the roachpb.Value encoding which indicates the extended
encoding scheme.

Care was taken to ensure that encoding and decoding routines for the "simple"
encoding are fast by avoiding heap allocations, memory copies, or function calls
by exploiting mid-stack inlining.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- microbenchmarks
- single-process benchmarks
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 4, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.
Replaces cockroachdb#77342.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key-value version called the "local timestamp". The existing version
timestamp dictates the key-value's visibility to readers and is tied to the
writer's commit timestamp. The local clock timestamp records the value of the
local HLC clock on the leaseholder when the key was originally written. It is
used to make claims about the relative real time ordering of the key's writer
and readers when comparing a reader's uncertainty interval (and observed
timestamps) to the key. Ignoring edge cases, readers with an observed timestamp
from the key's leaseholder that is greater than the local clock timestamp stored
in the key cannot make claims about real time ordering and must consider it
possible that the key's write occurred before the read began. However, readers
with an observed timestamp from the key's leaseholder that is less than the
clock timestamp can claim that the reader captured that observed timestamp
before the key was written and therefore can consider the key's write to have
been concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each
key-value pair, the commit optimizes the clock timestamp away in the common case
where it leads the version timestamp. Only in the rare cases where the local
timestamp trails the version timestamp (e.g. future-time writes, async intent
resolution with a new commit timestamp) does the local timestamp need to be
explicitly represented in the key encoding. This is possible because it is safe
for the local clock timestamp to be rounded down, as this will simply lead to
additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### MVCCValue

To store the local timestamp, the commit introduces a new MVCCValue type to
parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with
MVCC-level metadata which is stored in an enginepb.MVCCValueHeader struct. To
this point, the MVCC layer has treated versioned values as opaque blobs of bytes
and has not enforced any structure on them. Now that MVCC will use the value to
store metadata, it needs to enforce more structure on the values provided to it.
This is the cause of some testing churn, but is otherwise not a problem, as all
production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended"
encoding scheme, depending on whether the value's header is empty or not. If the
value's header is empty, it is omitted in the encoding and the mvcc value's
encoding is identical to that of roachpb.Value. This provided backwards
compatibility and ensures that the MVCCValue optimizes away in the common case.
If the value's header is not empty, it is prepended to the roachpb.Value
encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which
is either the roachpb.Value tag (which has many possible values) or a sentinel
tag not used by the roachpb.Value encoding which indicates the extended
encoding scheme.

Care was taken to ensure that encoding and decoding routines for the "simple"
encoding are fast by avoiding heap allocations, memory copies, or function calls
by exploiting mid-stack inlining.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- microbenchmarks
- single-process benchmarks
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 9, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.
Replaces cockroachdb#77342.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key-value version called the "local timestamp". The existing version
timestamp dictates the key-value's visibility to readers and is tied to the
writer's commit timestamp. The local clock timestamp records the value of the
local HLC clock on the leaseholder when the key was originally written. It is
used to make claims about the relative real time ordering of the key's writer
and readers when comparing a reader's uncertainty interval (and observed
timestamps) to the key. Ignoring edge cases, readers with an observed timestamp
from the key's leaseholder that is greater than the local clock timestamp stored
in the key cannot make claims about real time ordering and must consider it
possible that the key's write occurred before the read began. However, readers
with an observed timestamp from the key's leaseholder that is less than the
clock timestamp can claim that the reader captured that observed timestamp
before the key was written and therefore can consider the key's write to have
been concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each
key-value pair, the commit optimizes the clock timestamp away in the common case
where it leads the version timestamp. Only in the rare cases where the local
timestamp trails the version timestamp (e.g. future-time writes, async intent
resolution with a new commit timestamp) does the local timestamp need to be
explicitly represented in the key encoding. This is possible because it is safe
for the local clock timestamp to be rounded down, as this will simply lead to
additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### MVCCValue

To store the local timestamp, the commit introduces a new MVCCValue type to
parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with
MVCC-level metadata which is stored in an enginepb.MVCCValueHeader struct. To
this point, the MVCC layer has treated versioned values as opaque blobs of bytes
and has not enforced any structure on them. Now that MVCC will use the value to
store metadata, it needs to enforce more structure on the values provided to it.
This is the cause of some testing churn, but is otherwise not a problem, as all
production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended"
encoding scheme, depending on whether the value's header is empty or not. If the
value's header is empty, it is omitted in the encoding and the mvcc value's
encoding is identical to that of roachpb.Value. This provided backwards
compatibility and ensures that the MVCCValue optimizes away in the common case.
If the value's header is not empty, it is prepended to the roachpb.Value
encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which
is either the roachpb.Value tag (which has many possible values) or a sentinel
tag not used by the roachpb.Value encoding which indicates the extended
encoding scheme.

Care was taken to ensure that encoding and decoding routines for the "simple"
encoding are fast by avoiding heap allocations, memory copies, or function calls
by exploiting mid-stack inlining.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- microbenchmarks
- single-process benchmarks
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 9, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.
Replaces cockroachdb#77342.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key-value version called the "local timestamp". The existing version
timestamp dictates the key-value's visibility to readers and is tied to the
writer's commit timestamp. The local clock timestamp records the value of the
local HLC clock on the leaseholder when the key was originally written. It is
used to make claims about the relative real time ordering of the key's writer
and readers when comparing a reader's uncertainty interval (and observed
timestamps) to the key. Ignoring edge cases, readers with an observed timestamp
from the key's leaseholder that is greater than the local clock timestamp stored
in the key cannot make claims about real time ordering and must consider it
possible that the key's write occurred before the read began. However, readers
with an observed timestamp from the key's leaseholder that is less than the
clock timestamp can claim that the reader captured that observed timestamp
before the key was written and therefore can consider the key's write to have
been concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each
key-value pair, the commit optimizes the clock timestamp away in the common case
where it leads the version timestamp. Only in the rare cases where the local
timestamp trails the version timestamp (e.g. future-time writes, async intent
resolution with a new commit timestamp) does the local timestamp need to be
explicitly represented in the key encoding. This is possible because it is safe
for the local clock timestamp to be rounded down, as this will simply lead to
additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### MVCCValue

To store the local timestamp, the commit introduces a new MVCCValue type to
parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with
MVCC-level metadata which is stored in an enginepb.MVCCValueHeader struct. To
this point, the MVCC layer has treated versioned values as opaque blobs of bytes
and has not enforced any structure on them. Now that MVCC will use the value to
store metadata, it needs to enforce more structure on the values provided to it.
This is the cause of some testing churn, but is otherwise not a problem, as all
production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended"
encoding scheme, depending on whether the value's header is empty or not. If the
value's header is empty, it is omitted in the encoding and the mvcc value's
encoding is identical to that of roachpb.Value. This provided backwards
compatibility and ensures that the MVCCValue optimizes away in the common case.
If the value's header is not empty, it is prepended to the roachpb.Value
encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which
is either the roachpb.Value tag (which has many possible values) or a sentinel
tag not used by the roachpb.Value encoding which indicates the extended
encoding scheme.

Care was taken to ensure that encoding and decoding routines for the "simple"
encoding are fast by avoiding heap allocations, memory copies, or function calls
by exploiting mid-stack inlining.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- microbenchmarks
- single-process benchmarks
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 9, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.
Replaces cockroachdb#77342.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key-value version called the "local timestamp". The existing version
timestamp dictates the key-value's visibility to readers and is tied to the
writer's commit timestamp. The local clock timestamp records the value of the
local HLC clock on the leaseholder when the key was originally written. It is
used to make claims about the relative real time ordering of the key's writer
and readers when comparing a reader's uncertainty interval (and observed
timestamps) to the key. Ignoring edge cases, readers with an observed timestamp
from the key's leaseholder that is greater than the local clock timestamp stored
in the key cannot make claims about real time ordering and must consider it
possible that the key's write occurred before the read began. However, readers
with an observed timestamp from the key's leaseholder that is less than the
clock timestamp can claim that the reader captured that observed timestamp
before the key was written and therefore can consider the key's write to have
been concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each
key-value pair, the commit optimizes the clock timestamp away in the common case
where it leads the version timestamp. Only in the rare cases where the local
timestamp trails the version timestamp (e.g. future-time writes, async intent
resolution with a new commit timestamp) does the local timestamp need to be
explicitly represented in the key encoding. This is possible because it is safe
for the local clock timestamp to be rounded down, as this will simply lead to
additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### MVCCValue

To store the local timestamp, the commit introduces a new MVCCValue type to
parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with
MVCC-level metadata which is stored in an enginepb.MVCCValueHeader struct. To
this point, the MVCC layer has treated versioned values as opaque blobs of bytes
and has not enforced any structure on them. Now that MVCC will use the value to
store metadata, it needs to enforce more structure on the values provided to it.
This is the cause of some testing churn, but is otherwise not a problem, as all
production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended"
encoding scheme, depending on whether the value's header is empty or not. If the
value's header is empty, it is omitted in the encoding and the mvcc value's
encoding is identical to that of roachpb.Value. This provided backwards
compatibility and ensures that the MVCCValue optimizes away in the common case.
If the value's header is not empty, it is prepended to the roachpb.Value
encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which
is either the roachpb.Value tag (which has many possible values) or a sentinel
tag not used by the roachpb.Value encoding which indicates the extended
encoding scheme.

Care was taken to ensure that encoding and decoding routines for the "simple"
encoding are fast by avoiding heap allocations, memory copies, or function calls
by exploiting mid-stack inlining.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- microbenchmarks
- single-process benchmarks
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 9, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.
Replaces cockroachdb#77342.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key-value version called the "local timestamp". The existing version
timestamp dictates the key-value's visibility to readers and is tied to the
writer's commit timestamp. The local clock timestamp records the value of the
local HLC clock on the leaseholder when the key was originally written. It is
used to make claims about the relative real time ordering of the key's writer
and readers when comparing a reader's uncertainty interval (and observed
timestamps) to the key. Ignoring edge cases, readers with an observed timestamp
from the key's leaseholder that is greater than the local clock timestamp stored
in the key cannot make claims about real time ordering and must consider it
possible that the key's write occurred before the read began. However, readers
with an observed timestamp from the key's leaseholder that is less than the
clock timestamp can claim that the reader captured that observed timestamp
before the key was written and therefore can consider the key's write to have
been concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each
key-value pair, the commit optimizes the clock timestamp away in the common case
where it leads the version timestamp. Only in the rare cases where the local
timestamp trails the version timestamp (e.g. future-time writes, async intent
resolution with a new commit timestamp) does the local timestamp need to be
explicitly represented in the key encoding. This is possible because it is safe
for the local clock timestamp to be rounded down, as this will simply lead to
additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### MVCCValue

To store the local timestamp, the commit introduces a new MVCCValue type to
parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with
MVCC-level metadata which is stored in an enginepb.MVCCValueHeader struct. To
this point, the MVCC layer has treated versioned values as opaque blobs of bytes
and has not enforced any structure on them. Now that MVCC will use the value to
store metadata, it needs to enforce more structure on the values provided to it.
This is the cause of some testing churn, but is otherwise not a problem, as all
production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended"
encoding scheme, depending on whether the value's header is empty or not. If the
value's header is empty, it is omitted in the encoding and the mvcc value's
encoding is identical to that of roachpb.Value. This provided backwards
compatibility and ensures that the MVCCValue optimizes away in the common case.
If the value's header is not empty, it is prepended to the roachpb.Value
encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which
is either the roachpb.Value tag (which has many possible values) or a sentinel
tag not used by the roachpb.Value encoding which indicates the extended
encoding scheme.

Care was taken to ensure that encoding and decoding routines for the "simple"
encoding are fast by avoiding heap allocations, memory copies, or function calls
by exploiting mid-stack inlining.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- microbenchmarks
- single-process benchmarks
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
craig bot pushed a commit that referenced this pull request May 9, 2022
80706: kv/storage: introduce local timestamps for MVCC versions in MVCCValue r=sumeerbhola a=nvanbenschoten

Fixes #36431.
Fixes #49360.
Replaces #72121.
Replaces #77342.

**NOTE: this is an alternative to #77342 that stores the new LocalTimestamp in a key-value's value instead of its key.**

This commit fixes the potential for a stale read as detailed in #36431 using the "remember when intents were written" approach described in #36431 (comment) and later expanded on in #72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split across ranges whose leaseholders are stored on different nodes, a transaction read refresh, and the use of observed timestamps to avoid an uncertainty restart. With the combination of these four factors, it was possible to construct an ordering of events that violated real-time ordering and allowed a transaction to observe a stale read. Upon the discovery of the bug, we [introduced](cockroachdb/jepsen#19) the `multi-register` test to the Jepsen test suite, and have since observed the test fail when combined with the `strobe-skews` nemesis due to this bug in #49360 (and a few issues linked to that one). This commit stabilizes that test.

### Explanation

The combination of all of the factors listed above can lead to the stale read because it breaks one of the invariants that the observed timestamp infrastructure[^1] relied upon for correctness. Specifically, observed timestamps relied on the guarantee that a leaseholder's clock must always be equal to or greater than the version timestamp of all writes that it has served. However, this guarantee did not always hold. It does hold for non-transactional writes. It also holds for transactions that perform all of their intent writes at the same timestamp and then commit at this timestamp. However, it does not hold for transactions which move their commit timestamp forward over their lifetime before committing, writing intents at different timestamps along the way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it means for a leaseholder to "serve a write at a timestamp". The meaning of this phrase is straightforward for non-transactional writes. However, for an intent write whose original timestamp is provisional and whose eventual commit timestamp is stored indirectly in its transaction record at its time of commit, the meaning is less clear. This reconciliation to move the intent write's timestamp up to its transaction's commit timestamp is asynchronous from the transaction commit (and after it has been externally acknowledged). So even if a leaseholder has only served writes with provisional timestamps up to timestamp 100 (placing a lower bound on its clock of 100), it can be in possession of intents that, when resolved, will carry a timestamp of 200. To uphold the real-time ordering property, this value must be observed by any transaction that begins after the value's transaction committed and was acknowledged. So for observed timestamps to be correct as currently written, we would need a guarantee that this value's leaseholder would never return an observed timestamp < 200 at any point after the transaction commits. But with the transaction commit possibly occurring on another node and with communication to resolve the intent occurring asynchronously, this seems like an impossible guarantee to make.

This would appear to undermine observed timestamps to the point where they cannot be used. However, we can claw back correctness without sacrificing performance by recognizing that only a small fraction[^2] of transactions commit at a different timestamps than the one they used while writing intents. We can also recognize that if we were to compare observed timestamps against the timestamp that a committed value was originally written (its provisional value if it was once an intent) instead of the timestamp that it had been moved to on commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to each MVCC key-value version called the "local timestamp". The existing version timestamp dictates the key-value's visibility to readers and is tied to the writer's commit timestamp. The local clock timestamp records the value of the local HLC clock on the leaseholder when the key was originally written. It is used to make claims about the relative real time ordering of the key's writer and readers when comparing a reader's uncertainty interval (and observed timestamps) to the key. Ignoring edge cases, readers with an observed timestamp from the key's leaseholder that is greater than the local clock timestamp stored in the key cannot make claims about real time ordering and must consider it possible that the key's write occurred before the read began. However, readers with an observed timestamp from the key's leaseholder that is less than the clock timestamp can claim that the reader captured that observed timestamp before the key was written and therefore can consider the key's write to have been concurrent with the read. In doing so, the reader can avoid an uncertainty restart.

For more, see the updates made in this commit to `pkg/kv/kvserver/observedts/doc.go`.

To avoid the bulk of the performance hit from adding this new timestamp to each key-value pair, the commit optimizes the clock timestamp away in the common case where it leads the version timestamp. Only in the rare cases where the local timestamp trails the version timestamp (e.g. future-time writes, async intent resolution with a new commit timestamp) does the local timestamp need to be explicitly represented in the key encoding. This is possible because it is safe for the local clock timestamp to be rounded down, as this will simply lead to additional uncertainty restarts. However, it is not safe for the local clock timestamp to be rounded up, as this could lead to stale reads.

### MVCCValue

```go
type MVCCValue struct {
    MVCCValueHeader
    Value roachpb.Value
}
```

To store the local timestamp, the commit introduces a new `MVCCValue` type to parallel the `MVCCKey` type. `MVCCValue` wraps a `roachpb.Value` and extends it with MVCC-level metadata which is stored in an `enginepb.MVCCValueHeader` protobuf struct. To this point, the MVCC layer has treated versioned values as opaque blobs of bytes and has not enforced any structure on them. Now that MVCC will use the value to store metadata, it needs to enforce more structure on the values provided to it. This is the cause of some testing churn, but is otherwise not a problem, as all production code paths were already passing values in the `roachpb.Value` encoding.

To further avoid any performance hit, `MVCCValue` has a "simple" and an "extended" encoding scheme, depending on whether the value's header is empty or not. If the value's header is empty, it is omitted in the encoding and the mvcc value's encoding is identical to that of `roachpb.Value`. This provided backwards compatibility and ensures that the `MVCCValue` optimizes away in the common case. If the value's header is not empty, it is prepended to the roachpb.Value encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which is either the `roachpb.Value` tag (which has many possible values) or a sentinel tag not used by the `roachpb.Value` encoding which indicates the extended encoding scheme.

Care was taken to ensure that encoding and decoding routines for the "simple" encoding are fast by avoiding heap allocations, memory copies, or function calls by exploiting mid-stack inlining. See microbenchmarks below.

### Future improvements

As noted in #72121 (comment), this commit paves a path towards the complete removal of synthetic timestamps, which were originally introduced in support of non-blocking transactions and GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps` is no longer necessary now that we never need to "push" transaction-domain timestamps into HLC clocks. Instead, the invariant that underpins observed timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC clock.

The synthetic bit's second role of disabling observed timestamps is replaced by the generalization provided by "local timestamps". Local timestamps precisely track when an MVCC version was written in the leaseholder's clock timestamp domain. This establishes a total ordering across clock observations (local timestamp assignment for writers and observed timestamps for readers) and establish a partial ordering between writer and reader transactions. As a result, the use of observed timestamps during uncertainty checking becomes a comparison between two `ClockTimestamps`, the version's local timestamp and the reader's observed timestamp.

### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to cause it to fail, even on master. We've only seen the test fail a handful of times over the past few years, so this isn't much of a surprise. Still, this prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called `TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls manual clocks directly and was able to deterministically reproduce the stale read before this fix in a few different ways. After this fix, the test passes.

### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under some workloads.

<details>
<summary><b>MVCCValue Encoding and Decoding Microbenchmarks</b></summary>
  
```
name                                                                   time/op
EncodeMVCCValue/header=empty/value=tombstone-10                        3.11ns ± 0%
EncodeMVCCValue/header=empty/value=short-10                            3.11ns ± 0%
EncodeMVCCValue/header=empty/value=long-10                             3.10ns ± 0%
EncodeMVCCValue/header=local_walltime/value=tombstone-10               38.9ns ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                   42.1ns ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                     533ns ± 3%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10       40.5ns ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10           42.9ns ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10             541ns ± 4%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10           7.81ns ± 1%
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10            0.93ns ± 0%
DecodeMVCCValue/header=empty/value=short/inline=false-10               8.39ns ± 0%
DecodeMVCCValue/header=empty/value=short/inline=true-10                1.55ns ± 0%
DecodeMVCCValue/header=empty/value=long/inline=false-10                8.38ns ± 0%
DecodeMVCCValue/header=empty/value=long/inline=true-10                 1.55ns ± 0%
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10       32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10        22.7ns ± 0%
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10  32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10   22.7ns ± 0%
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10      32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10       22.7ns ± 0%

name                                                                   alloc/op
EncodeMVCCValue/header=empty/value=tombstone-10                         0.00B
EncodeMVCCValue/header=empty/value=short-10                             0.00B
EncodeMVCCValue/header=empty/value=long-10                              0.00B
EncodeMVCCValue/header=local_walltime/value=tombstone-10                24.0B ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                    32.0B ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                    4.86kB ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10        24.0B ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10            32.0B ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10            4.86kB ± 0%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10            0.00B
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10             0.00B
DecodeMVCCValue/header=empty/value=short/inline=false-10                0.00B
DecodeMVCCValue/header=empty/value=short/inline=true-10                 0.00B
DecodeMVCCValue/header=empty/value=long/inline=false-10                 0.00B
DecodeMVCCValue/header=empty/value=long/inline=true-10                  0.00B
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10        0.00B
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10         0.00B
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10   0.00B
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10    0.00B
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10       0.00B
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10        0.00B

name                                                                   allocs/op
EncodeMVCCValue/header=empty/value=tombstone-10                          0.00
EncodeMVCCValue/header=empty/value=short-10                              0.00
EncodeMVCCValue/header=empty/value=long-10                               0.00
EncodeMVCCValue/header=local_walltime/value=tombstone-10                 1.00 ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                     1.00 ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                      1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10         1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10             1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10              1.00 ± 0%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10             0.00
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10              0.00
DecodeMVCCValue/header=empty/value=short/inline=false-10                 0.00
DecodeMVCCValue/header=empty/value=short/inline=true-10                  0.00
DecodeMVCCValue/header=empty/value=long/inline=false-10                  0.00
DecodeMVCCValue/header=empty/value=long/inline=true-10                   0.00
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10         0.00
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10          0.00
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10    0.00
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10     0.00
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10        0.00
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10         0.00
```
</details>

<details>
<summary><b>pkg/sql/tests End-To-End Microbenchmarks</b></summary>
  
```
name                                old time/op    new time/op    delta
KV/Delete/Native/rows=1-30             106µs ± 2%     104µs ± 2%  -1.38%  (p=0.012 n=8+10)
KV/Insert/SQL/rows=100-30             1.26ms ± 2%    1.24ms ± 2%  -1.25%  (p=0.029 n=10+10)
KV/Scan/SQL/rows=1-30                  281µs ± 2%     277µs ± 2%  -1.17%  (p=0.009 n=10+10)
KV/Insert/Native/rows=1-30             107µs ± 2%     107µs ± 2%    ~     (p=0.353 n=10+10)
KV/Insert/Native/rows=10-30            156µs ± 2%     155µs ± 4%    ~     (p=0.481 n=10+10)
KV/Insert/Native/rows=100-30           570µs ± 1%     576µs ± 2%    ~     (p=0.075 n=10+10)
KV/Insert/Native/rows=1000-30         4.18ms ± 2%    4.23ms ± 2%    ~     (p=0.143 n=10+10)
KV/Insert/Native/rows=10000-30        43.9ms ± 2%    44.1ms ± 2%    ~     (p=0.393 n=10+10)
KV/Insert/SQL/rows=1-30                412µs ± 2%     410µs ± 2%    ~     (p=0.280 n=10+10)
KV/Insert/SQL/rows=10-30               511µs ± 2%     508µs ± 1%    ~     (p=0.436 n=10+10)
KV/Insert/SQL/rows=1000-30            8.18ms ± 1%    8.11ms ± 2%    ~     (p=0.165 n=10+10)
KV/Insert/SQL/rows=10000-30           85.2ms ± 2%    84.7ms ± 2%    ~     (p=0.481 n=10+10)
KV/Update/Native/rows=1-30             163µs ± 2%     162µs ± 2%    ~     (p=0.436 n=10+10)
KV/Update/Native/rows=10-30            365µs ± 1%     365µs ± 1%    ~     (p=0.739 n=10+10)
KV/Update/Native/rows=10000-30         143ms ± 1%     144ms ± 2%    ~     (p=0.075 n=10+10)
KV/Update/SQL/rows=1-30                525µs ± 2%     522µs ± 3%    ~     (p=0.190 n=10+10)
KV/Update/SQL/rows=10-30               815µs ± 1%     810µs ± 1%    ~     (p=0.190 n=10+10)
KV/Update/SQL/rows=100-30             2.47ms ± 1%    2.48ms ± 1%    ~     (p=0.356 n=9+10)
KV/Update/SQL/rows=1000-30            16.6ms ± 1%    16.7ms ± 1%    ~     (p=0.065 n=9+10)
KV/Update/SQL/rows=10000-30            193ms ± 2%     195ms ± 3%    ~     (p=0.315 n=10+10)
KV/Delete/Native/rows=10-30            173µs ± 2%     171µs ± 3%    ~     (p=0.247 n=10+10)
KV/Delete/Native/rows=100-30           677µs ± 1%     674µs ± 0%    ~     (p=0.475 n=10+7)
KV/Delete/Native/rows=1000-30         5.20ms ± 2%    5.19ms ± 1%    ~     (p=0.853 n=10+10)
KV/Delete/Native/rows=10000-30        53.7ms ± 1%    53.9ms ± 1%    ~     (p=0.684 n=10+10)
KV/Delete/SQL/rows=1-30                377µs ± 3%     375µs ± 1%    ~     (p=0.305 n=10+9)
KV/Delete/SQL/rows=10-30               503µs ± 1%     500µs ± 2%    ~     (p=0.123 n=10+10)
KV/Delete/SQL/rows=100-30             1.52ms ± 4%    1.51ms ± 2%    ~     (p=0.529 n=10+10)
KV/Delete/SQL/rows=1000-30            2.53ms ± 3%    2.53ms ± 3%    ~     (p=1.000 n=10+10)
KV/Delete/SQL/rows=10000-30           21.9ms ± 1%    21.8ms ± 1%    ~     (p=0.315 n=9+10)
KV/Scan/Native/rows=1-30              29.6µs ± 2%    29.8µs ± 1%    ~     (p=0.143 n=10+10)
KV/Scan/Native/rows=100-30            54.6µs ± 1%    55.0µs ± 2%    ~     (p=0.052 n=10+10)
KV/Scan/SQL/rows=10-30                 292µs ± 1%     290µs ± 1%    ~     (p=0.190 n=10+10)
KV/Scan/SQL/rows=100-30                364µs ± 2%     363µs ± 1%    ~     (p=0.684 n=10+10)
KV/Scan/SQL/rows=10000-30             5.34ms ± 2%    5.32ms ± 1%    ~     (p=0.631 n=10+10)
KVAndStorageUsingSQL/versions=2-30    2.59ms ± 1%    2.59ms ± 2%    ~     (p=0.842 n=9+10)
KVAndStorageUsingSQL/versions=4-30    2.57ms ± 3%    2.56ms ± 2%    ~     (p=0.684 n=10+10)
KVAndStorageUsingSQL/versions=8-30    2.60ms ± 3%    2.59ms ± 2%    ~     (p=0.315 n=10+10)
KV/Update/Native/rows=100-30          2.02ms ± 1%    2.04ms ± 1%  +0.95%  (p=0.015 n=10+10)
KV/Update/Native/rows=1000-30         16.2ms ± 2%    16.5ms ± 2%  +1.30%  (p=0.001 n=10+9)
KV/Scan/Native/rows=10-30             32.6µs ± 2%    33.0µs ± 3%  +1.39%  (p=0.019 n=10+10)
KV/Scan/Native/rows=1000-30            266µs ± 1%     270µs ± 1%  +1.51%  (p=0.000 n=9+10)
KV/Scan/SQL/rows=1000-30               982µs ± 1%     997µs ± 1%  +1.60%  (p=0.000 n=10+10)
KV/Scan/Native/rows=10000-30          2.18ms ± 1%    2.23ms ± 1%  +2.55%  (p=0.000 n=10+10)

name                                old alloc/op   new alloc/op   delta
KV/Insert/Native/rows=1-30            15.6kB ± 0%    15.6kB ± 0%    ~     (p=0.631 n=10+10)
KV/Insert/Native/rows=10000-30        34.7MB ± 2%    35.0MB ± 2%    ~     (p=0.089 n=10+10)
KV/Insert/SQL/rows=1-30               41.3kB ± 1%    41.4kB ± 1%    ~     (p=0.353 n=10+10)
KV/Insert/SQL/rows=10-30              90.8kB ± 1%    90.8kB ± 0%    ~     (p=0.945 n=10+8)
KV/Insert/SQL/rows=1000-30            5.69MB ± 1%    5.71MB ± 1%    ~     (p=0.436 n=10+10)
KV/Insert/SQL/rows=10000-30           69.6MB ± 1%    69.6MB ± 1%    ~     (p=0.853 n=10+10)
KV/Update/Native/rows=1-30            21.4kB ± 1%    21.4kB ± 0%    ~     (p=0.915 n=9+9)
KV/Update/Native/rows=10000-30        67.8MB ± 1%    68.1MB ± 2%    ~     (p=0.315 n=10+10)
KV/Update/SQL/rows=1-30               47.2kB ± 1%    47.3kB ± 1%    ~     (p=0.280 n=10+10)
KV/Update/SQL/rows=10-30               116kB ± 1%     117kB ± 1%    ~     (p=0.353 n=10+10)
KV/Update/SQL/rows=1000-30            5.20MB ± 2%    5.18MB ± 1%    ~     (p=0.278 n=10+9)
KV/Delete/Native/rows=10000-30        30.5MB ± 2%    30.5MB ± 1%    ~     (p=0.631 n=10+10)
KV/Delete/SQL/rows=1-30               42.9kB ± 0%    43.0kB ± 1%    ~     (p=0.393 n=10+10)
KV/Delete/SQL/rows=10-30              66.9kB ± 3%    66.5kB ± 1%    ~     (p=0.853 n=10+10)
KV/Delete/SQL/rows=1000-30            1.19MB ± 0%    1.19MB ± 0%    ~     (p=0.447 n=9+10)
KV/Delete/SQL/rows=10000-30           14.3MB ± 1%    14.3MB ± 0%    ~     (p=0.353 n=10+10)
KV/Scan/Native/rows=1-30              7.17kB ± 0%    7.18kB ± 0%    ~     (p=0.061 n=10+9)
KV/Scan/Native/rows=10-30             8.59kB ± 0%    8.59kB ± 0%    ~     (p=0.926 n=10+10)
KV/Scan/Native/rows=100-30            21.2kB ± 0%    21.2kB ± 0%    ~     (p=0.781 n=10+10)
KV/Scan/Native/rows=1000-30            172kB ± 0%     172kB ± 0%    ~     (p=0.264 n=10+8)
KV/Scan/Native/rows=10000-30          1.51MB ± 0%    1.51MB ± 0%    ~     (p=0.968 n=9+10)
KV/Scan/SQL/rows=1-30                 19.3kB ± 0%    19.3kB ± 1%    ~     (p=0.968 n=9+10)
KV/Scan/SQL/rows=10-30                20.6kB ± 1%    20.7kB ± 0%    ~     (p=0.897 n=10+8)
KV/Scan/SQL/rows=100-30               32.8kB ± 1%    32.9kB ± 0%    ~     (p=0.400 n=10+9)
KV/Scan/SQL/rows=1000-30               185kB ± 0%     185kB ± 0%    ~     (p=0.247 n=10+10)
KV/Scan/SQL/rows=10000-30             1.10MB ± 0%    1.10MB ± 0%    ~     (p=0.631 n=10+10)
KVAndStorageUsingSQL/versions=2-30     793kB ± 3%     793kB ± 2%    ~     (p=0.796 n=10+10)
KVAndStorageUsingSQL/versions=4-30     788kB ± 3%     783kB ± 3%    ~     (p=0.353 n=10+10)
KVAndStorageUsingSQL/versions=8-30     787kB ± 3%     785kB ± 2%    ~     (p=0.853 n=10+10)
KV/Delete/Native/rows=1-30            15.1kB ± 0%    15.2kB ± 0%  +0.26%  (p=0.029 n=10+10)
KV/Update/Native/rows=10-30           66.8kB ± 0%    67.0kB ± 0%  +0.38%  (p=0.029 n=10+10)
KV/Insert/SQL/rows=100-30              550kB ± 1%     553kB ± 1%  +0.44%  (p=0.043 n=10+10)
KV/Update/SQL/rows=100-30              578kB ± 1%     580kB ± 1%  +0.46%  (p=0.019 n=10+10)
KV/Update/Native/rows=100-30           503kB ± 0%     506kB ± 0%  +0.59%  (p=0.000 n=10+10)
KV/Update/SQL/rows=10000-30            153MB ± 1%     154MB ± 1%  +0.62%  (p=0.006 n=9+10)
KV/Delete/SQL/rows=100-30              306kB ± 0%     308kB ± 0%  +0.67%  (p=0.000 n=10+8)
KV/Delete/Native/rows=10-30           36.1kB ± 1%    36.4kB ± 1%  +0.82%  (p=0.001 n=10+9)
KV/Update/Native/rows=1000-30         4.75MB ± 1%    4.79MB ± 1%  +0.83%  (p=0.002 n=10+10)
KV/Insert/Native/rows=10-30           39.7kB ± 1%    40.1kB ± 1%  +0.83%  (p=0.023 n=10+10)
KV/Insert/Native/rows=1000-30         2.56MB ± 1%    2.58MB ± 1%  +1.00%  (p=0.000 n=9+10)
KV/Insert/Native/rows=100-30           279kB ± 1%     282kB ± 1%  +1.02%  (p=0.007 n=10+10)
KV/Delete/Native/rows=100-30           243kB ± 1%     246kB ± 1%  +1.07%  (p=0.000 n=10+10)
KV/Delete/Native/rows=1000-30         2.23MB ± 1%    2.26MB ± 1%  +1.33%  (p=0.000 n=10+9)

name                                old allocs/op  new allocs/op  delta
KV/Update/SQL/rows=1000-30             58.6k ± 0%     58.4k ± 0%  -0.28%  (p=0.000 n=9+9)
KV/Scan/SQL/rows=10-30                   277 ± 0%       276 ± 0%  -0.25%  (p=0.001 n=8+10)
KV/Update/SQL/rows=10000-30             740k ± 0%      739k ± 0%  -0.12%  (p=0.040 n=8+7)
KV/Insert/Native/rows=1-30               129 ± 0%       129 ± 0%    ~     (all equal)
KV/Insert/Native/rows=10-30              272 ± 0%       272 ± 0%    ~     (all equal)
KV/Insert/Native/rows=1000-30          14.3k ± 0%     14.3k ± 0%    ~     (p=0.221 n=10+10)
KV/Insert/Native/rows=10000-30          141k ± 0%      141k ± 0%    ~     (p=0.356 n=10+9)
KV/Insert/SQL/rows=1-30                  349 ± 0%       348 ± 0%    ~     (p=0.776 n=9+10)
KV/Insert/SQL/rows=10-30                 625 ± 0%       625 ± 0%    ~     (p=0.068 n=9+8)
KV/Insert/SQL/rows=100-30              3.12k ± 0%     3.12k ± 0%    ~     (p=0.351 n=10+9)
KV/Insert/SQL/rows=1000-30             29.3k ± 0%     29.3k ± 0%    ~     (p=0.644 n=9+10)
KV/Insert/SQL/rows=10000-30             294k ± 0%      293k ± 0%    ~     (p=0.134 n=9+7)
KV/Update/Native/rows=1-30               181 ± 0%       181 ± 0%    ~     (all equal)
KV/Update/Native/rows=10-30              458 ± 0%       458 ± 0%    ~     (all equal)
KV/Update/Native/rows=1000-30          26.9k ± 0%     27.0k ± 0%    ~     (p=0.232 n=9+10)
KV/Update/Native/rows=10000-30          273k ± 0%      274k ± 0%    ~     (p=0.315 n=9+10)
KV/Update/SQL/rows=1-30                  503 ± 0%       503 ± 0%    ~     (p=1.000 n=10+10)
KV/Update/SQL/rows=10-30                 904 ± 1%       905 ± 0%    ~     (p=0.223 n=10+10)
KV/Update/SQL/rows=100-30              6.79k ± 0%     6.79k ± 0%    ~     (p=0.669 n=10+10)
KV/Delete/Native/rows=1-30               128 ± 0%       128 ± 0%    ~     (all equal)
KV/Delete/Native/rows=10-30              248 ± 0%       248 ± 0%    ~     (all equal)
KV/Delete/Native/rows=10000-30          112k ± 0%      112k ± 0%    ~     (p=0.825 n=10+9)
KV/Delete/SQL/rows=1-30                  331 ± 0%       331 ± 0%    ~     (all equal)
KV/Delete/SQL/rows=10-30                 512 ± 0%       512 ± 0%    ~     (p=0.406 n=8+10)
KV/Delete/SQL/rows=100-30              2.01k ± 0%     2.01k ± 0%    ~     (p=0.947 n=10+8)
KV/Delete/SQL/rows=1000-30             7.51k ± 0%     7.51k ± 0%    ~     (p=0.204 n=9+10)
KV/Delete/SQL/rows=10000-30            71.2k ± 0%     71.2k ± 0%    ~     (p=0.986 n=9+10)
KV/Scan/Native/rows=1-30                54.0 ± 0%      54.0 ± 0%    ~     (all equal)
KV/Scan/Native/rows=10-30               58.0 ± 0%      58.0 ± 0%    ~     (all equal)
KV/Scan/Native/rows=100-30              62.3 ± 1%      62.3 ± 1%    ~     (p=1.000 n=10+10)
KV/Scan/Native/rows=1000-30             72.3 ± 1%      72.0 ± 0%    ~     (p=0.137 n=10+8)
KV/Scan/Native/rows=10000-30             115 ± 1%       115 ± 1%    ~     (p=0.193 n=9+10)
KV/Scan/SQL/rows=1-30                    242 ± 0%       242 ± 0%    ~     (p=0.828 n=9+10)
KV/Scan/SQL/rows=100-30                  648 ± 0%       648 ± 0%    ~     (all equal)
KV/Scan/SQL/rows=1000-30               5.04k ± 0%     5.04k ± 0%    ~     (p=1.000 n=10+10)
KV/Scan/SQL/rows=10000-30              50.3k ± 0%     50.3k ± 0%    ~     (p=0.208 n=10+10)
KVAndStorageUsingSQL/versions=2-30     7.61k ± 0%     7.61k ± 0%    ~     (p=0.223 n=8+10)
KVAndStorageUsingSQL/versions=4-30     7.61k ± 0%     7.60k ± 0%    ~     (p=0.643 n=10+10)
KVAndStorageUsingSQL/versions=8-30     7.61k ± 0%     7.61k ± 0%    ~     (p=0.888 n=10+9)
KV/Update/Native/rows=100-30           2.76k ± 0%     2.76k ± 0%  +0.03%  (p=0.044 n=10+10)
KV/Delete/Native/rows=1000-30          11.3k ± 0%     11.3k ± 0%  +0.03%  (p=0.006 n=10+9)
KV/Insert/Native/rows=100-30           1.57k ± 0%     1.57k ± 0%  +0.06%  (p=0.000 n=8+7)
KV/Delete/Native/rows=100-30           1.27k ± 0%     1.28k ± 0%  +0.08%  (p=0.000 n=8+8)
```
</details>

<details>
<summary><b>YCSB benchmark suite</b></summary>
  
```
name                   old ops/s    new ops/s    delta
ycsb/A/nodes=3          15.2k ± 6%   15.4k ± 3%    ~     (p=0.579 n=10+10)
ycsb/B/nodes=3          28.7k ± 4%   28.9k ± 1%    ~     (p=0.780 n=10+9)
ycsb/C/nodes=3          41.3k ± 4%   41.4k ± 2%    ~     (p=0.529 n=10+10)
ycsb/D/nodes=3          33.9k ± 2%   33.5k ± 2%  -1.37%  (p=0.003 n=9+9)
ycsb/E/nodes=3          2.10k ± 2%   2.10k ± 1%    ~     (p=0.813 n=10+7)
ycsb/F/nodes=3          7.06k ± 5%   7.05k ± 4%    ~     (p=0.853 n=10+10)
ycsb/A/nodes=3/cpu=32   30.0k ± 9%   31.4k ± 3%    ~     (p=0.095 n=10+9)
ycsb/B/nodes=3/cpu=32   97.6k ± 2%   98.1k ± 2%    ~     (p=0.237 n=8+10)
ycsb/C/nodes=3/cpu=32    129k ± 2%    130k ± 1%    ~     (p=0.243 n=10+9)
ycsb/D/nodes=3/cpu=32    105k ± 2%    106k ± 1%  +1.06%  (p=0.034 n=10+8)
ycsb/E/nodes=3/cpu=32   3.33k ± 1%   3.33k ± 1%    ~     (p=0.951 n=10+9)
ycsb/F/nodes=3/cpu=32   15.5k ±10%   16.4k ± 5%  +5.35%  (p=0.029 n=10+10)

name                   old avg(ms)  new avg(ms)  delta
ycsb/A/nodes=3           6.32 ± 6%    6.23 ± 3%    ~     (p=0.635 n=10+10)
ycsb/B/nodes=3           5.02 ± 4%    4.97 ± 3%    ~     (p=0.542 n=10+10)
ycsb/C/nodes=3           3.49 ± 3%    3.50 ± 0%    ~     (p=0.443 n=10+9)
ycsb/D/nodes=3           2.80 ± 0%    2.87 ± 2%  +2.50%  (p=0.008 n=8+10)
ycsb/E/nodes=3           45.8 ± 1%    45.8 ± 1%    ~     (p=0.718 n=10+7)
ycsb/F/nodes=3           13.5 ± 3%    13.6 ± 4%    ~     (p=0.509 n=9+10)
ycsb/A/nodes=3/cpu=32    4.82 ±10%    4.62 ± 6%    ~     (p=0.092 n=10+10)
ycsb/B/nodes=3/cpu=32    2.00 ± 0%    1.96 ± 3%    ~     (p=0.065 n=9+10)
ycsb/C/nodes=3/cpu=32    1.50 ± 0%    1.50 ± 0%    ~     (all equal)
ycsb/D/nodes=3/cpu=32    1.40 ± 0%    1.36 ± 4%    ~     (p=0.065 n=9+10)
ycsb/E/nodes=3/cpu=32    43.3 ± 1%    43.3 ± 0%    ~     (p=0.848 n=10+9)
ycsb/F/nodes=3/cpu=32    9.30 ±11%    8.79 ± 5%  -5.48%  (p=0.022 n=10+10)

name                   old p99(ms)  new p99(ms)  delta
ycsb/A/nodes=3           52.0 ±17%    48.4 ±18%    ~     (p=0.379 n=10+10)
ycsb/B/nodes=3           32.2 ±11%    32.2 ± 9%    ~     (p=0.675 n=10+10)
ycsb/C/nodes=3           13.8 ± 6%    13.8 ± 3%    ~     (p=1.000 n=10+10)
ycsb/D/nodes=3           12.6 ± 0%    12.9 ± 2%  +2.38%  (p=0.023 n=8+10)
ycsb/E/nodes=3            200 ± 8%     197 ± 6%    ~     (p=0.375 n=10+8)
ycsb/F/nodes=3            124 ±19%     125 ±21%    ~     (p=0.856 n=9+10)
ycsb/A/nodes=3/cpu=32    68.1 ±17%    63.9 ± 5%    ~     (p=0.211 n=10+8)
ycsb/B/nodes=3/cpu=32    15.7 ± 0%    15.5 ± 2%    ~     (p=0.071 n=6+10)
ycsb/C/nodes=3/cpu=32    10.5 ± 0%    10.2 ± 3%  -2.86%  (p=0.003 n=8+10)
ycsb/D/nodes=3/cpu=32    8.70 ± 3%    8.40 ± 0%  -3.45%  (p=0.003 n=10+8)
ycsb/E/nodes=3/cpu=32     151 ± 0%     151 ± 0%    ~     (all equal)
ycsb/F/nodes=3/cpu=32     148 ±15%     145 ±10%    ~     (p=0.503 n=9+10)
```
</details>

----

Release note (bug fix): fixed a rare race condition that could allow for a transaction to serve a stale read and violate real-time ordering under moderate clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go) for an explanation of the role of observed timestamps in the transaction model. This commit updates that documentation to include this fix.

[^2]: see analysis in #36431 (comment).

Co-authored-by: Nathan VanBenschoten <[email protected]>
@nvanbenschoten
Copy link
Member Author

Replaced by #80706.

@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/localTimestamp branch May 9, 2022 19:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants