Skip to content

Commit

Permalink
storage: add MVCC range tombstone handling in scans and gets
Browse files Browse the repository at this point in the history
This patch adds MVCC range tombstone handling for scans and gets. In the
basic case, this simply means that point keys below an MVCC range
tombstone are not visible.

When tombstones are requested by the caller, the MVCC range tombstones
themselves are never exposed, to avoid having to explicitly handle these
throughout the codebase. Instead, synthetic MVCC point tombstones are
emitted at the start of MVCC range tombstones and wherever they overlap
a point key (above and below). Additionally, point gets return synthetic
point tombstones if they overlap an MVCC range tombstone even if no
existing point key exists. This is based on `pointSynthesizingIter`,
which avoids additional logic in `pebbleMVCCScanner`.

Synthetic MVCC point tombstones emitted for MVCC range tombstones are
not stable, nor are they fully deterministic. For example, the start key
will be truncated by iterator bounds, so an `MVCCScan` over a given key
span may see a synthetic point tombstone at its start (if it overlaps an
MVCC range tombstone), but this will not be emitted if a broader span is
used (a different point tombstone will be emitted instead). Similarly, a
CRDB range split/merge will split/merge MVCC range tombstones, changing
which point tombstones are emitted. Furthermore, `MVCCGet` will
synthesize an MVCC point tombstone if it overlaps an MVCC range
tombstone and there is no existing point key there, while an `MVCCScan`
will not emit these. Callers must take care not to rely on such
semantics for MVCC tombstones. Existing callers have been audited to
ensure they are not affected.

Point tombstone synthesis must be enabled even when the caller has not
requested tombstones, because they must always be taken into account for
conflict/uncertainty checks. However, in these cases we enable range key
masking below the read timestamp, omitting any covered points since
these are no longer needed.

Release note: None
  • Loading branch information
erikgrinaker committed Jun 17, 2022
1 parent 57c79ae commit 8504e23
Show file tree
Hide file tree
Showing 6 changed files with 718 additions and 54 deletions.
2 changes: 2 additions & 0 deletions pkg/kv/kvserver/batcheval/cmd_refresh_range.go
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,8 @@ func refreshRange(
// Construct an incremental iterator with the desired time bounds. Incremental
// iterators will emit MVCC tombstones by default and will emit intents when
// configured to do so (see IntentPolicy).
//
// TODO(erikgrinaker): This needs to handle MVCC range tombstones.
iter := storage.NewMVCCIncrementalIterator(reader, storage.MVCCIncrementalIterOptions{
EndKey: span.EndKey,
StartTime: refreshFrom, // exclusive
Expand Down
71 changes: 45 additions & 26 deletions pkg/storage/mvcc.go
Original file line number Diff line number Diff line change
Expand Up @@ -738,7 +738,14 @@ func newMVCCIterator(
//
// In tombstones mode, if the most recent value is a deletion tombstone, the
// result will be a non-nil roachpb.Value whose RawBytes field is nil.
// Otherwise, a deletion tombstone results in a nil roachpb.Value.
// Otherwise, a deletion tombstone results in a nil roachpb.Value. MVCC range
// tombstones will be emitted as synthetic point tombstones, regardless of whether
// there is an existing point key.
//
// NB: Synthetic tombstones generated for MVCC range tombstones may not be
// visible to an MVCCScan if there is no existing point key at the key, since
// MVCCScan only synthesizes them at the MVCC range tombstone's start key and
// around existing keys. Callers must not rely on such semantics.
//
// In inconsistent mode, if an intent is encountered, it will be placed in the
// dedicated return parameter. By contrast, in consistent mode, an intent will
Expand All @@ -759,7 +766,10 @@ func newMVCCIterator(
func MVCCGet(
ctx context.Context, reader Reader, key roachpb.Key, timestamp hlc.Timestamp, opts MVCCGetOptions,
) (*roachpb.Value, *roachpb.Intent, error) {
iter := newMVCCIterator(reader, timestamp, !opts.Tombstones, IterOptions{Prefix: true})
iter := newMVCCIterator(reader, timestamp, !opts.Tombstones, IterOptions{
KeyTypes: IterKeyTypePointsAndRanges,
Prefix: true,
})
defer iter.Close()
value, intent, err := mvccGet(ctx, iter, key, timestamp, opts)
return value.ToPointer(), intent, err
Expand All @@ -785,11 +795,14 @@ func mvccGet(
mvccScanner := pebbleMVCCScannerPool.Get().(*pebbleMVCCScanner)
defer mvccScanner.release()

pointIter := newPointSynthesizingIter(iter, true /* emitOnSeekGE */)
defer pointIter.release()

// MVCCGet is implemented as an MVCCScan where we retrieve a single key. We
// specify an empty key for the end key which will ensure we don't retrieve a
// key different than the start key. This is a bit of a hack.
*mvccScanner = pebbleMVCCScanner{
parent: iter,
parent: pointIter,
memAccount: opts.MemoryAccount,
start: key,
ts: timestamp,
Expand Down Expand Up @@ -840,20 +853,6 @@ func mvccGet(
return value, intent, nil
}

// TODO(erikgrinaker): This is temporary until mvccGet always uses point
// synthesis (which requires read-path optimizations).
func mvccGetWithPointSynthesis(
ctx context.Context,
iter MVCCIterator,
key roachpb.Key,
timestamp hlc.Timestamp,
opts MVCCGetOptions,
) (value optionalValue, intent *roachpb.Intent, err error) {
pointIter := newPointSynthesizingIter(iter, true /* emitOnSeekGE */)
defer pointIter.release() // NB: not Close(), since we don't own iter
return mvccGet(ctx, pointIter, key, timestamp, opts)
}

// MVCCGetAsTxn constructs a temporary transaction from the given transaction
// metadata and calls MVCCGet as that transaction. This method is required
// only for reading intents of a transaction when only its metadata is known
Expand Down Expand Up @@ -1194,9 +1193,7 @@ func maybeGetValue(
var exVal optionalValue
if exists {
var err error
exVal, _, err = mvccGetWithPointSynthesis(ctx, iter, key, readTimestamp, MVCCGetOptions{
Tombstones: true,
})
exVal, _, err = mvccGet(ctx, iter, key, readTimestamp, MVCCGetOptions{Tombstones: true})
if err != nil {
return roachpb.Value{}, err
}
Expand Down Expand Up @@ -1264,7 +1261,7 @@ func replayTransactionalWrite(
// This is a special case. This is when the intent hasn't made it
// to the intent history yet. We must now assert the value written
// in the intent to the value we're trying to write.
writtenValue, _, err = mvccGetWithPointSynthesis(ctx, iter, key, timestamp, MVCCGetOptions{
writtenValue, _, err = mvccGet(ctx, iter, key, timestamp, MVCCGetOptions{
Txn: txn,
Tombstones: true,
})
Expand Down Expand Up @@ -1319,7 +1316,7 @@ func replayTransactionalWrite(
// last committed value on the key. Since we want the last committed
// value on the key, we must make an inconsistent read so we ignore
// our previous intents here.
exVal, _, err = mvccGetWithPointSynthesis(ctx, iter, key, timestamp, MVCCGetOptions{
exVal, _, err = mvccGet(ctx, iter, key, timestamp, MVCCGetOptions{
Inconsistent: true,
Tombstones: true,
})
Expand Down Expand Up @@ -1560,7 +1557,7 @@ func mvccPutInternal(
//
// Since we want the last committed value on the key, we must make
// an inconsistent read so we ignore our previous intents here.
exVal, _, err = mvccGetWithPointSynthesis(ctx, iter, key, readTimestamp, MVCCGetOptions{
exVal, _, err = mvccGet(ctx, iter, key, readTimestamp, MVCCGetOptions{
Inconsistent: true,
Tombstones: true,
})
Expand Down Expand Up @@ -2528,8 +2525,11 @@ func mvccScanToBytes(
mvccScanner := pebbleMVCCScannerPool.Get().(*pebbleMVCCScanner)
defer mvccScanner.release()

pointIter := newPointSynthesizingIter(iter, false /* emitOnSeekGE */)
defer pointIter.release()

*mvccScanner = pebbleMVCCScanner{
parent: iter,
parent: pointIter,
memAccount: opts.MemoryAccount,
reverse: opts.Reverse,
start: key,
Expand Down Expand Up @@ -2738,7 +2738,21 @@ type MVCCScanResult struct {
// In tombstones mode, if the most recent value for a key is a deletion
// tombstone, the scan result will contain a roachpb.KeyValue for that key whose
// RawBytes field is nil. Otherwise, the key-value pair will be omitted from the
// result entirely.
// result entirely. For MVCC range tombstones, synthetic MVCC point tombstones
// will be emitted at the start of the range tombstone and where they overlap a
// point key.
//
// NB: Synthetic MVCC point tombstones emitted for MVCC range tombstones are
// not stable, nor are they fully deterministic. For example, the start key will
// be truncated by iterator bounds, so an MVCCScan over a given key span may see
// a synthetic point tombstone at its start (if it overlaps an MVCC range
// tombstone), but this will not be emitted if a broader span is used (a
// different point tombstone will be emitted instead). Similarly, a CRDB range
// split/merge will split/merge MVCC range tombstones, changing which point
// tombstones are emitted. Furthermore, MVCCGet will synthesize an MVCC point
// tombstone if it overlaps an MVCC range tombstone and there is no existing
// point key there, while an MVCCScan will not emit these. Callers must take
// care not to rely on such semantics for MVCC tombstones.
//
// When scanning inconsistently, any encountered intents will be placed in the
// dedicated result parameter. By contrast, when scanning consistently, any
Expand All @@ -2762,6 +2776,7 @@ func MVCCScan(
opts MVCCScanOptions,
) (MVCCScanResult, error) {
iter := newMVCCIterator(reader, timestamp, !opts.Tombstones, IterOptions{
KeyTypes: IterKeyTypePointsAndRanges,
LowerBound: key,
UpperBound: endKey,
})
Expand All @@ -2778,6 +2793,7 @@ func MVCCScanToBytes(
opts MVCCScanOptions,
) (MVCCScanResult, error) {
iter := newMVCCIterator(reader, timestamp, !opts.Tombstones, IterOptions{
KeyTypes: IterKeyTypePointsAndRanges,
LowerBound: key,
UpperBound: endKey,
})
Expand Down Expand Up @@ -2814,7 +2830,9 @@ func MVCCScanAsTxn(
// the reverse flag is set, the iterator will be moved in reverse order. If the
// scan options specify an inconsistent scan, all "ignored" intents will be
// returned. In consistent mode, intents are only ever returned as part of a
// WriteIntentError.
// WriteIntentError. In Tombstones mode, MVCC range tombstones are emitted as
// synthetic point tombstones at their start key and around overlapping point
// keys.
func MVCCIterate(
ctx context.Context,
reader Reader,
Expand All @@ -2824,6 +2842,7 @@ func MVCCIterate(
f func(roachpb.KeyValue) error,
) ([]roachpb.Intent, error) {
iter := newMVCCIterator(reader, timestamp, !opts.Tombstones, IterOptions{
KeyTypes: IterKeyTypePointsAndRanges,
LowerBound: key,
UpperBound: endKey,
})
Expand Down
16 changes: 14 additions & 2 deletions pkg/storage/pebble_mvcc_scanner.go
Original file line number Diff line number Diff line change
Expand Up @@ -281,8 +281,11 @@ func extractResultKey(repr []byte) roachpb.Key {
return key.Key
}

// Go port of mvccScanner in libroach/mvcc.h. Stores all variables relating to
// one MVCCGet / MVCCScan call.
// pebbleMVCCScanner performs an MVCCGet / MVCCScan.
//
// It does not support MVCC range tombstones, and will error if encountered.
// The caller is expected to use a pointSynthesizingIter to synthesize MVCC
// point tombstones for these.
type pebbleMVCCScanner struct {
parent MVCCIterator
// memAccount is used to account for the size of the scan results.
Expand Down Expand Up @@ -1124,6 +1127,15 @@ func (p *pebbleMVCCScanner) iterValid() bool {
}
return false
}
// We don't support range keys (i.e. MVCC range tombstones), and expect the
// caller to use a pointSynthesizingIter instead. Assert that we never see one
// of these, but only in race builds to avoid a performance hit.
if util.RaceEnabled {
if _, hasRange := p.parent.HasPointAndRange(); hasRange {
p.err = errors.AssertionFailedf("unexpected range key at %s", p.parent.UnsafeKey())
return false
}
}
return true
}

Expand Down
37 changes: 11 additions & 26 deletions pkg/storage/testdata/mvcc_histories/range_tombstone_conflicts
Original file line number Diff line number Diff line change
Expand Up @@ -75,17 +75,12 @@ data: "j"/1.000000000,0 -> /INT/1
error: (*withstack.withStack:) "b"/0,0: put is inline=true, but existing value is inline=false

# DeleteRange at ts=5 should error with WriteTooOldError.
#
# TODO(erikgrinaker): This should error on c rather than d, but won't do so
# until MVCCScan respects MVCC range tombstones. Until it does, the
# put will actually do a write at the new timestamp.
run error
del_range k=a end=f ts=5
----
>> at end:
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "d"/5.000000000,1 -> /<empty>
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/2.000000000,0 -> /<empty>
data: "e"/1.000000000,0 -> /BYTES/e1
Expand All @@ -95,7 +90,7 @@ data: "g"/7.000000000,0 -> /BYTES/7
data: "g"/1.000000000,0 -> /BYTES/g1
data: "i"/1.000000000,0 -> /INT/1
data: "j"/1.000000000,0 -> /INT/1
error: (*roachpb.WriteTooOldError:) WriteTooOldError: write for key "d" at timestamp 5.000000000,0 too old; wrote at 5.000000000,1
error: (*roachpb.WriteTooOldError:) WriteTooOldError: write for key "c" at timestamp 5.000000000,0 too old; wrote at 5.000000000,1

# Point key below range tombstones should error, but is written anyway at a
# higher timestamp.
Expand All @@ -110,7 +105,6 @@ put k=c ts=3 v=c3
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/2.000000000,0 -> /<empty>
data: "e"/1.000000000,0 -> /BYTES/e1
Expand All @@ -129,8 +123,7 @@ put k=d ts=3 v=d3
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/2.000000000,0 -> /<empty>
data: "e"/1.000000000,0 -> /BYTES/e1
Expand All @@ -140,7 +133,7 @@ data: "g"/7.000000000,0 -> /BYTES/7
data: "g"/1.000000000,0 -> /BYTES/g1
data: "i"/1.000000000,0 -> /INT/1
data: "j"/1.000000000,0 -> /INT/1
error: (*roachpb.WriteTooOldError:) WriteTooOldError: write for key "d" at timestamp 3.000000000,0 too old; wrote at 5.000000000,2
error: (*roachpb.WriteTooOldError:) WriteTooOldError: write for key "d" at timestamp 3.000000000,0 too old; wrote at 5.000000000,1

run error
put k=e ts=3 v=e3
Expand All @@ -149,8 +142,7 @@ put k=e ts=3 v=e3
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/5.000000000,1 -> /BYTES/e3
data: "e"/2.000000000,0 -> /<empty>
Expand All @@ -171,8 +163,7 @@ cput k=f ts=7 v=f7 cond=f1
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/5.000000000,1 -> /BYTES/e3
data: "e"/2.000000000,0 -> /<empty>
Expand All @@ -195,8 +186,7 @@ with t=A ts=7
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/5.000000000,1 -> /BYTES/e3
data: "e"/2.000000000,0 -> /<empty>
Expand All @@ -222,8 +212,7 @@ txn: "A" meta={id=00000000 key=/Min pri=0.00000000 epo=0 ts=7.000000000,0 min=0,
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/5.000000000,1 -> /BYTES/e3
data: "e"/2.000000000,0 -> /<empty>
Expand All @@ -244,8 +233,7 @@ initput k=f ts=7 v=f7 failOnTombstones
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/5.000000000,1 -> /BYTES/e3
data: "e"/2.000000000,0 -> /<empty>
Expand All @@ -269,8 +257,7 @@ initput k=f ts=7 v=f7
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/5.000000000,1 -> /BYTES/e3
data: "e"/2.000000000,0 -> /<empty>
Expand All @@ -292,8 +279,7 @@ increment k=i ts=2
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/5.000000000,1 -> /BYTES/e3
data: "e"/2.000000000,0 -> /<empty>
Expand All @@ -317,8 +303,7 @@ inc: current value = 1
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/5.000000000,1 -> /BYTES/e3
data: "e"/2.000000000,0 -> /<empty>
Expand Down
Loading

0 comments on commit 8504e23

Please sign in to comment.