Skip to content

Commit

Permalink
kv/storage: introduce local timestamps for MVCC versions in MVCCValue
Browse files Browse the repository at this point in the history
Fixes #36431.
Fixes #49360.
Replaces #72121.
Replaces #77342.

This commit fixes the potential for a stale read as detailed in #36431 using the
"remember when intents were written" approach described in
#36431 (comment) and
later expanded on in
#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in #49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key-value version called the "local timestamp". The existing version
timestamp dictates the key-value's visibility to readers and is tied to the
writer's commit timestamp. The local clock timestamp records the value of the
local HLC clock on the leaseholder when the key was originally written. It is
used to make claims about the relative real time ordering of the key's writer
and readers when comparing a reader's uncertainty interval (and observed
timestamps) to the key. Ignoring edge cases, readers with an observed timestamp
from the key's leaseholder that is greater than the local clock timestamp stored
in the key cannot make claims about real time ordering and must consider it
possible that the key's write occurred before the read began. However, readers
with an observed timestamp from the key's leaseholder that is less than the
clock timestamp can claim that the reader captured that observed timestamp
before the key was written and therefore can consider the key's write to have
been concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each
key-value pair, the commit optimizes the clock timestamp away in the common case
where it leads the version timestamp. Only in the rare cases where the local
timestamp trails the version timestamp (e.g. future-time writes, async intent
resolution with a new commit timestamp) does the local timestamp need to be
explicitly represented in the key encoding. This is possible because it is safe
for the local clock timestamp to be rounded down, as this will simply lead to
additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### MVCCValue

To store the local timestamp, the commit introduces a new MVCCValue type to
parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with
MVCC-level metadata which is stored in an enginepb.MVCCValueHeader struct. To
this point, the MVCC layer has treated versioned values as opaque blobs of bytes
and has not enforced any structure on them. Now that MVCC will use the value to
store metadata, it needs to enforce more structure on the values provided to it.
This is the cause of some testing churn, but is otherwise not a problem, as all
production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended"
encoding scheme, depending on whether the value's header is empty or not. If the
value's header is empty, it is omitted in the encoding and the mvcc value's
encoding is identical to that of roachpb.Value. This provided backwards
compatibility and ensures that the MVCCValue optimizes away in the common case.
If the value's header is not empty, it is prepended to the roachpb.Value
encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which
is either the roachpb.Value tag (which has many possible values) or a sentinel
tag not used by the roachpb.Value encoding which indicates the extended
encoding scheme.

Care was taken to ensure that encoding and decoding routines for the "simple"
encoding are fast by avoiding heap allocations, memory copies, or function calls
by exploiting mid-stack inlining.

\### Future improvements

As noted in #72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- microbenchmarks
- single-process benchmarks
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in #36431 (comment).
  • Loading branch information
nvanbenschoten committed May 9, 2022
1 parent d5b3fe7 commit 24c56df
Show file tree
Hide file tree
Showing 61 changed files with 5,499 additions and 920 deletions.
8 changes: 4 additions & 4 deletions pkg/ccl/backupccl/backup_metadata.go
Original file line number Diff line number Diff line change
Expand Up @@ -187,7 +187,7 @@ func writeDescsToMetadata(ctx context.Context, sst storage.SSTWriter, m *BackupM
b = bytes
}
}
if err := sst.PutMVCC(storage.MVCCKey{Key: k, Timestamp: i.Time}, b); err != nil {
if err := sst.PutRawMVCC(storage.MVCCKey{Key: k, Timestamp: i.Time}, b); err != nil {
return err
}

Expand All @@ -214,7 +214,7 @@ func writeDescsToMetadata(ctx context.Context, sst storage.SSTWriter, m *BackupM
return err
}
} else {
if err := sst.PutMVCC(storage.MVCCKey{Key: k, Timestamp: m.StartTime}, b); err != nil {
if err := sst.PutRawMVCC(storage.MVCCKey{Key: k, Timestamp: m.StartTime}, b); err != nil {
return err
}
}
Expand Down Expand Up @@ -340,7 +340,7 @@ func writeNamesToMetadata(ctx context.Context, sst storage.SSTWriter, m *BackupM
}
k := encodeNameSSTKey(rev.parent, rev.parentSchema, rev.name)
v := encoding.EncodeUvarintAscending(nil, uint64(rev.id))
if err := sst.PutMVCC(storage.MVCCKey{Key: k, Timestamp: rev.ts}, v); err != nil {
if err := sst.PutRawMVCC(storage.MVCCKey{Key: k, Timestamp: rev.ts}, v); err != nil {
return err
}
}
Expand Down Expand Up @@ -384,7 +384,7 @@ func writeSpansToMetadata(ctx context.Context, sst storage.SSTWriter, m *BackupM
}
} else {
k := storage.MVCCKey{Key: encodeSpanSSTKey(sp), Timestamp: ts}
if err := sst.PutMVCC(k, nil); err != nil {
if err := sst.PutRawMVCC(k, nil); err != nil {
return err
}
}
Expand Down
2 changes: 1 addition & 1 deletion pkg/ccl/backupccl/backup_processor.go
Original file line number Diff line number Diff line change
Expand Up @@ -846,7 +846,7 @@ func (s *sstSink) write(ctx context.Context, resp returnedSST) error {
return err
}
} else {
if err := s.sst.PutMVCC(sst.UnsafeKey(), sst.UnsafeValue()); err != nil {
if err := s.sst.PutRawMVCC(sst.UnsafeKey(), sst.UnsafeValue()); err != nil {
return err
}
}
Expand Down
5 changes: 3 additions & 2 deletions pkg/ccl/backupccl/restore_data_processor.go
Original file line number Diff line number Diff line change
Expand Up @@ -476,9 +476,10 @@ func (rd *restoreDataProcessor) processRestoreSpanEntry(
continue
}

keyScratch = append(keyScratch[:0], iter.UnsafeKey().Key...)
key := iter.UnsafeKey()
keyScratch = append(keyScratch[:0], key.Key...)
key.Key = keyScratch
valueScratch = append(valueScratch[:0], iter.UnsafeValue()...)
key := storage.MVCCKey{Key: keyScratch, Timestamp: iter.UnsafeKey().Timestamp}
value := roachpb.Value{RawBytes: valueScratch}
iter.NextKey()

Expand Down
25 changes: 14 additions & 11 deletions pkg/ccl/backupccl/restore_data_processor_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -81,20 +81,23 @@ func slurpSSTablesLatestKey(
if !sst.UnsafeKey().Less(end) {
break
}
var ok bool
var newKv storage.MVCCKeyValue
key := sst.UnsafeKey()
newKv.Value = append(newKv.Value, sst.UnsafeValue()...)
newKv.Key.Key = append(newKv.Key.Key, key.Key...)
newKv.Key.Timestamp = key.Timestamp
newKv.Key.Key, ok = kr.rewriteKey(newKv.Key.Key)
value, err := storage.DecodeMVCCValue(sst.UnsafeValue())
if err != nil {
t.Fatal(err)
}
newKey := key
newKey.Key = append([]byte(nil), newKey.Key...)
var ok bool
newKey.Key, ok = kr.rewriteKey(newKey.Key)
if !ok {
t.Fatalf("could not rewrite key: %s", newKv.Key.Key)
t.Fatalf("could not rewrite key: %s", newKey.Key)
}
v := roachpb.Value{RawBytes: newKv.Value}
v.ClearChecksum()
v.InitChecksum(newKv.Key.Key)
if err := batch.PutMVCC(newKv.Key, v.RawBytes); err != nil {
newValue := value
newValue.Value.RawBytes = append([]byte(nil), newValue.Value.RawBytes...)
newValue.Value.ClearChecksum()
newValue.Value.InitChecksum(newKey.Key)
if err := batch.PutMVCC(newKey, newValue); err != nil {
t.Fatal(err)
}
sst.Next()
Expand Down
2 changes: 1 addition & 1 deletion pkg/cli/debug_check_store_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -75,7 +75,7 @@ func TestDebugCheckStore(t *testing.T) {
// Should not error out randomly.
for _, dir := range storePaths {
out, err := check(dir)
require.NoError(t, err, dir)
require.NoError(t, err, "dir=%s\nout=%s\n", dir, out)
require.Contains(t, out, "total stats", dir)
}

Expand Down
1 change: 1 addition & 0 deletions pkg/cmd/roachtest/tests/acceptance.go
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,7 @@ func registerAcceptance(r registry.Registry) {
},
{
name: "version-upgrade",
skip: "WIP: unskip when version checks are added to local_timestamp writes",
fn: func(ctx context.Context, t test.Test, c cluster.Cluster) {
runVersionUpgrade(ctx, t, c)
},
Expand Down
23 changes: 17 additions & 6 deletions pkg/kv/kvnemesis/engine.go
Original file line number Diff line number Diff line change
Expand Up @@ -67,12 +67,18 @@ func (e *Engine) Get(key roachpb.Key, ts hlc.Timestamp) roachpb.Value {
if !mvccKey.Key.Equal(key) {
return roachpb.Value{}
}
if len(iter.Value()) == 0 {
return roachpb.Value{}
}
var valCopy []byte
e.b, valCopy = e.b.Copy(iter.Value(), 0 /* extraCap */)
return roachpb.Value{RawBytes: valCopy, Timestamp: mvccKey.Timestamp}
mvccVal, err := storage.DecodeMVCCValue(valCopy)
if err != nil {
panic(err)
}
if mvccVal.IsTombstone() {
return roachpb.Value{}
}
val := mvccVal.Value
val.Timestamp = mvccKey.Timestamp
return val
}

// Put inserts a key/value/timestamp tuple. If an exact key/timestamp pair is
Expand Down Expand Up @@ -124,8 +130,13 @@ func (e *Engine) DebugPrint(indent string) string {
if err != nil {
fmt.Fprintf(&buf, "(err:%s)", err)
} else {
fmt.Fprintf(&buf, "%s%s %s -> %s",
indent, key.Key, key.Timestamp, roachpb.Value{RawBytes: value}.PrettyPrint())
v, err := storage.DecodeMVCCValue(value)
if err != nil {
fmt.Fprintf(&buf, "(err:%s)", err)
} else {
fmt.Fprintf(&buf, "%s%s %s -> %s",
indent, key.Key, key.Timestamp, v.Value.PrettyPrint())
}
}
})
return buf.String()
Expand Down
20 changes: 14 additions & 6 deletions pkg/kv/kvnemesis/validator.go
Original file line number Diff line number Diff line change
Expand Up @@ -273,8 +273,12 @@ func makeValidator(kvs *Engine) (*validator, error) {
err = errors.CombineErrors(err, iterErr)
return
}
v := roachpb.Value{RawBytes: value}
if v.GetTag() != roachpb.ValueType_UNKNOWN {
v, decodeErr := storage.DecodeMVCCValue(value)
if err != nil {
err = errors.CombineErrors(err, decodeErr)
return
}
if v.Value.GetTag() != roachpb.ValueType_UNKNOWN {
valueStr := mustGetStringValue(value)
if existing, ok := kvByValue[valueStr]; ok {
// TODO(dan): This may be too strict. Some operations (db.Run on a
Expand All @@ -287,7 +291,7 @@ func makeValidator(kvs *Engine) (*validator, error) {
// globally over a run, so there's a 1:1 relationship between a value that
// was written and the operation that wrote it.
kvByValue[valueStr] = storage.MVCCKeyValue{Key: key, Value: value}
} else if len(value) == 0 {
} else if !v.Value.IsPresent() {
rawKey := string(key.Key)
if _, ok := tombstonesForKey[rawKey]; !ok {
tombstonesForKey[rawKey] = make(map[hlc.Timestamp]bool)
Expand Down Expand Up @@ -914,14 +918,18 @@ func resultIsErrorStr(r Result, msgRE string) bool {
}

func mustGetStringValue(value []byte) string {
if len(value) == 0 {
v, err := storage.DecodeMVCCValue(value)
if err != nil {
panic(errors.Wrapf(err, "decoding %x", value))
}
if v.IsTombstone() {
return `<nil>`
}
v, err := roachpb.Value{RawBytes: value}.GetBytes()
b, err := v.Value.GetBytes()
if err != nil {
panic(errors.Wrapf(err, "decoding %x", value))
}
return string(v)
return string(b)
}

func validReadTimes(
Expand Down
10 changes: 7 additions & 3 deletions pkg/kv/kvserver/batcheval/cmd_add_sstable.go
Original file line number Diff line number Diff line change
Expand Up @@ -335,7 +335,7 @@ func EvalAddSSTable(
return result.Result{}, err
}
} else {
if err := readWriter.PutMVCC(k, sstIter.UnsafeValue()); err != nil {
if err := readWriter.PutRawMVCC(k, sstIter.UnsafeValue()); err != nil {
return result.Result{}, err
}
}
Expand Down Expand Up @@ -401,11 +401,15 @@ func assertSSTContents(sst []byte, sstTimestamp hlc.Timestamp, stats *enginepb.M
break
}

key, value := iter.UnsafeKey(), iter.UnsafeValue()
key, valueRaw := iter.UnsafeKey(), iter.UnsafeValue()
value, err := storage.DecodeMVCCValue(valueRaw)
if err != nil {
return err
}
if key.Timestamp.IsEmpty() {
return errors.AssertionFailedf("SST contains inline value or intent for key %s", key)
}
if len(value) == 0 {
if value.IsTombstone() {
return errors.AssertionFailedf("SST contains tombstone for key %s", key)
}
if sstTimestamp.IsSet() && key.Timestamp != sstTimestamp {
Expand Down
6 changes: 3 additions & 3 deletions pkg/kv/kvserver/batcheval/cmd_add_sstable_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -1117,7 +1117,7 @@ func TestAddSSTableMVCCStats(t *testing.T) {
{"e", 1, "e"},
{"z", 2, "zzzzzz"},
} {
require.NoError(t, engine.PutMVCC(kv.MVCCKey(), kv.ValueBytes()))
require.NoError(t, engine.PutMVCC(kv.MVCCKey(), kv.MVCCValue()))
}

sst, start, end := sstutil.MakeSST(t, st, []sstutil.KV{
Expand Down Expand Up @@ -1228,7 +1228,7 @@ func TestAddSSTableMVCCStatsDisallowShadowing(t *testing.T) {
{"y", 5, "yyy"},
{"z", 2, "zz"},
} {
require.NoError(t, engine.PutMVCC(kv.MVCCKey(), kv.ValueBytes()))
require.NoError(t, engine.PutMVCC(kv.MVCCKey(), kv.MVCCValue()))
}

// This test ensures accuracy of MVCCStats in the situation that successive
Expand Down Expand Up @@ -1270,7 +1270,7 @@ func TestAddSSTableMVCCStatsDisallowShadowing(t *testing.T) {
// ingesting the perfectly shadowing KVs (same ts and same value) in the
// second SST.
for _, kv := range kvs {
require.NoError(t, engine.PutMVCC(kv.MVCCKey(), kv.ValueBytes()))
require.NoError(t, engine.PutMVCC(kv.MVCCKey(), kv.MVCCValue()))
}

// Evaluate the second SST. Both the KVs are perfectly shadowing and should
Expand Down
Loading

0 comments on commit 24c56df

Please sign in to comment.