Skip to content

Commit

Permalink
Merge #82045
Browse files Browse the repository at this point in the history
82045: storage: add MVCC range tombstone handling in scans and gets r=jbowens a=erikgrinaker

This patch adds MVCC range tombstone handling for scans and gets. In the
basic case, this simply means that point keys below an MVCC range
tombstone are not visible.

When tombstones are requested by the caller, the MVCC range tombstones
themselves are never exposed, to avoid having to explicitly handle these
throughout the codebase. Instead, synthetic MVCC point tombstones are
emitted at the start of MVCC range tombstones and wherever they overlap
a point key (above and below). Additionally, point gets return synthetic
point tombstones if they overlap an MVCC range tombstone even if no
existing point key exists. This is based on `pointSynthesizingIter`,
which avoids additional logic in `pebbleMVCCScanner`.

Synthetic MVCC point tombstones emitted for MVCC range tombstones are
not stable, nor are they fully deterministic. For example, the start key
will be truncated by iterator bounds, so an `MVCCScan` over a given key
span may see a synthetic point tombstone at its start (if it overlaps an
MVCC range tombstone), but this will not be emitted if a broader span is
used (a different point tombstone will be emitted instead). Similarly, a
CRDB range split/merge will split/merge MVCC range tombstones, changing
which point tombstones are emitted. Furthermore, `MVCCGet` will
synthesize an MVCC point tombstone if it overlaps an MVCC range
tombstone and there is no existing point key there, while an `MVCCScan`
will not emit these. Callers must take care not to rely on such
semantics for MVCC tombstones. Existing callers have been audited to
ensure they are not affected.

Point tombstone synthesis must be enabled even when the caller has not
requested tombstones, because they must always be taken into account for
conflict/uncertainty checks. However, in these cases we enable range key
masking below the read timestamp, omitting any covered points since
these are no longer needed.

Touches #70412.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
  • Loading branch information
craig[bot] and erikgrinaker committed Jun 26, 2022
2 parents b47dc37 + 6733f95 commit 460ce6a
Show file tree
Hide file tree
Showing 6 changed files with 954 additions and 53 deletions.
61 changes: 37 additions & 24 deletions pkg/storage/mvcc.go
Original file line number Diff line number Diff line change
Expand Up @@ -738,7 +738,14 @@ func newMVCCIterator(
//
// In tombstones mode, if the most recent value is a deletion tombstone, the
// result will be a non-nil roachpb.Value whose RawBytes field is nil.
// Otherwise, a deletion tombstone results in a nil roachpb.Value.
// Otherwise, a deletion tombstone results in a nil roachpb.Value. MVCC range
// tombstones will be emitted as synthetic point tombstones, regardless of whether
// there is an existing point key.
//
// NB: Synthetic tombstones generated for MVCC range tombstones may not be
// visible to an MVCCScan if there is no existing point key at the key, since
// MVCCScan only synthesizes them at the MVCC range tombstone's start key and
// around existing keys. Callers must not rely on such semantics.
//
// In inconsistent mode, if an intent is encountered, it will be placed in the
// dedicated return parameter. By contrast, in consistent mode, an intent will
Expand All @@ -759,7 +766,10 @@ func newMVCCIterator(
func MVCCGet(
ctx context.Context, reader Reader, key roachpb.Key, timestamp hlc.Timestamp, opts MVCCGetOptions,
) (*roachpb.Value, *roachpb.Intent, error) {
iter := newMVCCIterator(reader, timestamp, !opts.Tombstones, IterOptions{Prefix: true})
iter := newMVCCIterator(reader, timestamp, !opts.Tombstones, IterOptions{
KeyTypes: IterKeyTypePointsAndRanges,
Prefix: true,
})
defer iter.Close()
value, intent, err := mvccGet(ctx, iter, key, timestamp, opts)
return value.ToPointer(), intent, err
Expand Down Expand Up @@ -840,20 +850,6 @@ func mvccGet(
return value, intent, nil
}

// TODO(erikgrinaker): This is temporary until mvccGet always uses point
// synthesis (which requires read-path optimizations).
func mvccGetWithPointSynthesis(
ctx context.Context,
iter MVCCIterator,
key roachpb.Key,
timestamp hlc.Timestamp,
opts MVCCGetOptions,
) (value optionalValue, intent *roachpb.Intent, err error) {
pointIter := newPointSynthesizingIter(iter, true /* emitOnSeekGE */)
defer pointIter.release() // NB: not Close(), since we don't own iter
return mvccGet(ctx, pointIter, key, timestamp, opts)
}

// MVCCGetAsTxn constructs a temporary transaction from the given transaction
// metadata and calls MVCCGet as that transaction. This method is required
// only for reading intents of a transaction when only its metadata is known
Expand Down Expand Up @@ -1194,9 +1190,7 @@ func maybeGetValue(
var exVal optionalValue
if exists {
var err error
exVal, _, err = mvccGetWithPointSynthesis(ctx, iter, key, readTimestamp, MVCCGetOptions{
Tombstones: true,
})
exVal, _, err = mvccGet(ctx, iter, key, readTimestamp, MVCCGetOptions{Tombstones: true})
if err != nil {
return roachpb.Value{}, err
}
Expand Down Expand Up @@ -1264,7 +1258,7 @@ func replayTransactionalWrite(
// This is a special case. This is when the intent hasn't made it
// to the intent history yet. We must now assert the value written
// in the intent to the value we're trying to write.
writtenValue, _, err = mvccGetWithPointSynthesis(ctx, iter, key, timestamp, MVCCGetOptions{
writtenValue, _, err = mvccGet(ctx, iter, key, timestamp, MVCCGetOptions{
Txn: txn,
Tombstones: true,
})
Expand Down Expand Up @@ -1319,7 +1313,7 @@ func replayTransactionalWrite(
// last committed value on the key. Since we want the last committed
// value on the key, we must make an inconsistent read so we ignore
// our previous intents here.
exVal, _, err = mvccGetWithPointSynthesis(ctx, iter, key, timestamp, MVCCGetOptions{
exVal, _, err = mvccGet(ctx, iter, key, timestamp, MVCCGetOptions{
Inconsistent: true,
Tombstones: true,
})
Expand Down Expand Up @@ -1560,7 +1554,7 @@ func mvccPutInternal(
//
// Since we want the last committed value on the key, we must make
// an inconsistent read so we ignore our previous intents here.
exVal, _, err = mvccGetWithPointSynthesis(ctx, iter, key, readTimestamp, MVCCGetOptions{
exVal, _, err = mvccGet(ctx, iter, key, readTimestamp, MVCCGetOptions{
Inconsistent: true,
Tombstones: true,
})
Expand Down Expand Up @@ -2738,7 +2732,21 @@ type MVCCScanResult struct {
// In tombstones mode, if the most recent value for a key is a deletion
// tombstone, the scan result will contain a roachpb.KeyValue for that key whose
// RawBytes field is nil. Otherwise, the key-value pair will be omitted from the
// result entirely.
// result entirely. For MVCC range tombstones, synthetic MVCC point tombstones
// will be emitted at the start of the range tombstone and where they overlap a
// point key.
//
// NB: Synthetic MVCC point tombstones emitted for MVCC range tombstones are
// not stable, nor are they fully deterministic. For example, the start key will
// be truncated by iterator bounds, so an MVCCScan over a given key span may see
// a synthetic point tombstone at its start (if it overlaps an MVCC range
// tombstone), but this will not be emitted if a broader span is used (a
// different point tombstone will be emitted instead). Similarly, a CRDB range
// split/merge will split/merge MVCC range tombstones, changing which point
// tombstones are emitted. Furthermore, MVCCGet will synthesize an MVCC point
// tombstone if it overlaps an MVCC range tombstone and there is no existing
// point key there, while an MVCCScan will not emit these. Callers must take
// care not to rely on such semantics for MVCC tombstones.
//
// When scanning inconsistently, any encountered intents will be placed in the
// dedicated result parameter. By contrast, when scanning consistently, any
Expand All @@ -2762,6 +2770,7 @@ func MVCCScan(
opts MVCCScanOptions,
) (MVCCScanResult, error) {
iter := newMVCCIterator(reader, timestamp, !opts.Tombstones, IterOptions{
KeyTypes: IterKeyTypePointsAndRanges,
LowerBound: key,
UpperBound: endKey,
})
Expand All @@ -2778,6 +2787,7 @@ func MVCCScanToBytes(
opts MVCCScanOptions,
) (MVCCScanResult, error) {
iter := newMVCCIterator(reader, timestamp, !opts.Tombstones, IterOptions{
KeyTypes: IterKeyTypePointsAndRanges,
LowerBound: key,
UpperBound: endKey,
})
Expand Down Expand Up @@ -2814,7 +2824,9 @@ func MVCCScanAsTxn(
// the reverse flag is set, the iterator will be moved in reverse order. If the
// scan options specify an inconsistent scan, all "ignored" intents will be
// returned. In consistent mode, intents are only ever returned as part of a
// WriteIntentError.
// WriteIntentError. In Tombstones mode, MVCC range tombstones are emitted as
// synthetic point tombstones at their start key and around overlapping point
// keys.
func MVCCIterate(
ctx context.Context,
reader Reader,
Expand All @@ -2824,6 +2836,7 @@ func MVCCIterate(
f func(roachpb.KeyValue) error,
) ([]roachpb.Intent, error) {
iter := newMVCCIterator(reader, timestamp, !opts.Tombstones, IterOptions{
KeyTypes: IterKeyTypePointsAndRanges,
LowerBound: key,
UpperBound: endKey,
})
Expand Down
2 changes: 1 addition & 1 deletion pkg/storage/mvcc_history_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -1193,7 +1193,7 @@ func cmdScan(e *evalCtx) error {
// ascertain no result is populated in the intents when an error
// occurs.
for _, intent := range res.Intents {
e.results.buf.Printf("scan: %v -> intent {%s}\n", key, intent.Txn)
e.results.buf.Printf("scan: intent %v {%s}\n", intent.Intent_SingleKeySpan.Key, intent.Txn)
}
for _, val := range res.KVs {
e.results.buf.Printf("scan: %v -> %v @%v\n", val.Key, val.Value.PrettyPrint(), val.Value.Timestamp)
Expand Down
46 changes: 44 additions & 2 deletions pkg/storage/pebble_mvcc_scanner.go
Original file line number Diff line number Diff line change
Expand Up @@ -281,10 +281,19 @@ func extractResultKey(repr []byte) roachpb.Key {
return key.Key
}

// Go port of mvccScanner in libroach/mvcc.h. Stores all variables relating to
// one MVCCGet / MVCCScan call.
// pebbleMVCCScanner handles MVCCScan / MVCCGet using a Pebble iterator. If any
// MVCC range tombstones are encountered, it synthesizes MVCC point tombstones
// by switching to a pointSynthesizingIter.
type pebbleMVCCScanner struct {
parent MVCCIterator
// pointIter is a point synthesizing iterator that wraps and replaces parent
// when an MVCC range tombstone is encountered. A separate reference to it is
// kept in order to release it back to its pool when the scanner is done.
//
// TODO(erikgrinaker): pooling and reusing this together with the
// pebbleMVCCScanner would be more efficient, but only if MVCC range
// tombstones are frequent. We expect them to be rare, so we don't for now.
pointIter *pointSynthesizingIter
// memAccount is used to account for the size of the scan results.
memAccount *mon.BoundAccount
reverse bool
Expand Down Expand Up @@ -373,6 +382,9 @@ var pebbleMVCCScannerPool = sync.Pool{
}

func (p *pebbleMVCCScanner) release() {
if p.pointIter != nil {
p.pointIter.release()
}
// Discard most memory references before placing in pool.
*p = pebbleMVCCScanner{
keyBuf: p.keyBuf,
Expand Down Expand Up @@ -1070,6 +1082,9 @@ func (p *pebbleMVCCScanner) updateCurrent() bool {
if !p.iterValid() {
return false
}
if !p.maybeEnablePointSynthesis() {
return false
}

p.curRawKey = p.parent.UnsafeRawMVCCKey()

Expand All @@ -1089,6 +1104,30 @@ func (p *pebbleMVCCScanner) updateCurrent() bool {
return true
}

// maybeEnablePointSynthesis checks if p.parent is on an MVCC range tombstone.
// If it is, it wraps and replaces p.parent with a pointSynthesizingIter, which
// synthesizes MVCC point tombstones for MVCC range tombstones and never emits
// range keys itself.
//
// Returns the iterator validity after a switch, and otherwise assumes the
// iterator was valid when called and returns true if there is no change.
func (p *pebbleMVCCScanner) maybeEnablePointSynthesis() bool {
if _, hasRange := p.parent.HasPointAndRange(); hasRange {
// TODO(erikgrinaker): We have to seek to the current iterator position to
// correctly initialize pointSynthesizingIter, but we could just load the
// underlying iterator state instead. We'll optimize this later.
//
// NB: The seek must use a cloned key, because it repositions the underlying
// iterator and then uses the given seek key for comparisons, which is not
// safe with UnsafeKey().
p.pointIter = newPointSynthesizingIter(p.parent, p.isGet)
p.pointIter.SeekGE(p.parent.Key())
p.parent = p.pointIter
return p.iterValid()
}
return true
}

func (p *pebbleMVCCScanner) decodeCurrentMetadata() bool {
if len(p.curRawValue) == 0 {
p.err = errors.Errorf("zero-length mvcc metadata")
Expand Down Expand Up @@ -1213,6 +1252,9 @@ func (p *pebbleMVCCScanner) iterPeekPrev() ([]byte, bool) {
// iterator at the first entry, and in the latter iteration will be done.
return nil, false
}
if !p.maybeEnablePointSynthesis() {
return nil, false
}
} else if !p.iterValid() {
return nil, false
}
Expand Down
37 changes: 11 additions & 26 deletions pkg/storage/testdata/mvcc_histories/range_tombstone_conflicts
Original file line number Diff line number Diff line change
Expand Up @@ -75,17 +75,12 @@ data: "j"/1.000000000,0 -> /INT/1
error: (*withstack.withStack:) "b"/0,0: put is inline=true, but existing value is inline=false

# DeleteRange at ts=5 should error with WriteTooOldError.
#
# TODO(erikgrinaker): This should error on c rather than d, but won't do so
# until MVCCScan respects MVCC range tombstones. Until it does, the
# put will actually do a write at the new timestamp.
run error
del_range k=a end=f ts=5
----
>> at end:
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "d"/5.000000000,1 -> /<empty>
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/2.000000000,0 -> /<empty>
data: "e"/1.000000000,0 -> /BYTES/e1
Expand All @@ -95,7 +90,7 @@ data: "g"/7.000000000,0 -> /BYTES/7
data: "g"/1.000000000,0 -> /BYTES/g1
data: "i"/1.000000000,0 -> /INT/1
data: "j"/1.000000000,0 -> /INT/1
error: (*roachpb.WriteTooOldError:) WriteTooOldError: write for key "d" at timestamp 5.000000000,0 too old; wrote at 5.000000000,1
error: (*roachpb.WriteTooOldError:) WriteTooOldError: write for key "c" at timestamp 5.000000000,0 too old; wrote at 5.000000000,1

# Point key below range tombstones should error, but is written anyway at a
# higher timestamp.
Expand All @@ -110,7 +105,6 @@ put k=c ts=3 v=c3
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/2.000000000,0 -> /<empty>
data: "e"/1.000000000,0 -> /BYTES/e1
Expand All @@ -129,8 +123,7 @@ put k=d ts=3 v=d3
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/2.000000000,0 -> /<empty>
data: "e"/1.000000000,0 -> /BYTES/e1
Expand All @@ -140,7 +133,7 @@ data: "g"/7.000000000,0 -> /BYTES/7
data: "g"/1.000000000,0 -> /BYTES/g1
data: "i"/1.000000000,0 -> /INT/1
data: "j"/1.000000000,0 -> /INT/1
error: (*roachpb.WriteTooOldError:) WriteTooOldError: write for key "d" at timestamp 3.000000000,0 too old; wrote at 5.000000000,2
error: (*roachpb.WriteTooOldError:) WriteTooOldError: write for key "d" at timestamp 3.000000000,0 too old; wrote at 5.000000000,1

run error
put k=e ts=3 v=e3
Expand All @@ -149,8 +142,7 @@ put k=e ts=3 v=e3
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/5.000000000,1 -> /BYTES/e3
data: "e"/2.000000000,0 -> /<empty>
Expand All @@ -171,8 +163,7 @@ cput k=f ts=7 v=f7 cond=f1
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/5.000000000,1 -> /BYTES/e3
data: "e"/2.000000000,0 -> /<empty>
Expand All @@ -195,8 +186,7 @@ with t=A ts=7
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/5.000000000,1 -> /BYTES/e3
data: "e"/2.000000000,0 -> /<empty>
Expand All @@ -222,8 +212,7 @@ txn: "A" meta={id=00000000 key=/Min pri=0.00000000 epo=0 ts=7.000000000,0 min=0,
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/5.000000000,1 -> /BYTES/e3
data: "e"/2.000000000,0 -> /<empty>
Expand All @@ -244,8 +233,7 @@ initput k=f ts=7 v=f7 failOnTombstones
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/5.000000000,1 -> /BYTES/e3
data: "e"/2.000000000,0 -> /<empty>
Expand All @@ -269,8 +257,7 @@ initput k=f ts=7 v=f7
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/5.000000000,1 -> /BYTES/e3
data: "e"/2.000000000,0 -> /<empty>
Expand All @@ -292,8 +279,7 @@ increment k=i ts=2
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/5.000000000,1 -> /BYTES/e3
data: "e"/2.000000000,0 -> /<empty>
Expand All @@ -317,8 +303,7 @@ inc: current value = 1
rangekey: {a-c}/[3.000000000,0=/<empty>]
rangekey: {c-k}/[5.000000000,0=/<empty> 3.000000000,0=/<empty>]
data: "c"/5.000000000,1 -> /BYTES/c3
data: "d"/5.000000000,2 -> /BYTES/d3
data: "d"/5.000000000,1 -> /<empty>
data: "d"/5.000000000,1 -> /BYTES/d3
data: "d"/1.000000000,0 -> /BYTES/d1
data: "e"/5.000000000,1 -> /BYTES/e3
data: "e"/2.000000000,0 -> /<empty>
Expand Down
Loading

0 comments on commit 460ce6a

Please sign in to comment.