Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: add MVCC range tombstone handling for scans and gets #78946

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Mar 29, 2022

This is for the mvcc-range-tombstones branch. Only the last commit should be reviewed here -- the former are from #78085 and its ancestors.

This will need performance optimizations both in MVCC and Pebble before being merged, as it's been seen to have a severe performance penalty (up to 40%). For now, focus on correctness.


This patch adds MVCC range tombstone handling for scans and gets. In the
basic case, this simply means that point keys below an MVCC range
tombstone are not visible.

When tombstones are requested by the caller, the MVCC range tombstones
themselves are never exposed, to avoid having to explicitly handle these
throughout the codebase. Instead, synthetic MVCC point tombstones are
emitted at the start of MVCC range tombstones and wherever they overlap
a point key (above and below). Additionally, point gets return synthetic
point tombstones even if no existing point key exists.

The point tombstone synthesis is implemented by a new
pointSynthesizingIterator, an MVCCIterator implementation that wraps
an existing MVCCIterator. This allows e.g. pebbleMVCCScanner to
remain unchanged, and automatically take MVCC range tombstones into
account for e.g. conflict/uncertainty checks.

Point tombstone synthesis must be enabled even when the caller has not
requested tombstones, because they must always be taken into account for
conflict/uncertainty checks. However, in these cases we enable range key
masking below the read timestamp, omitting any covered points since
these are no longer needed.

Touches #70412.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan-point-synthesis branch from 4e3e1c9 to 855437f Compare March 29, 2022 13:03
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones branch 3 times, most recently from 34e6ab0 to 4798ae2 Compare April 4, 2022 13:43
Copy link
Collaborator

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)


pkg/storage/testdata/mvcc_histories/range_tombstone_scans, line 6 at r8 (raw file):

#
#  T
#  6                 [e6]     

nit: looks like there's some extra whitespace here.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones branch 5 times, most recently from 2097ac5 to 3ad25e4 Compare April 11, 2022 20:36
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan-point-synthesis branch from 855437f to ed3987c Compare April 12, 2022 09:49
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, and @nicktrav)


pkg/storage/point_synthesizing_iter.go, line 21 at r16 (raw file):

// pointSynthesizingIter wraps an MVCCIterator, and synthesizes point tombstones
// for range tombstones above/below existing point keys. It does not emit range

is "range tombstones" the same as "MVCC range tombstone" accomplished via a RANGEKEYSET?
Just calling it "range tombstone" will confuse folks who are thinking of Pebble range tombstones.
I don't have good naming ideas.


pkg/storage/point_synthesizing_iter.go, line 23 at r16 (raw file):

// for range tombstones above/below existing point keys. It does not emit range
// keys at all, since these would appear to conflict with the synthesized point
// keys.

how does this honor scans that set FailOnMoreRecent or want to do Uncertainty checking? They need to see a range key even if there is no overlapping point, yes?

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the general approach of synthesizing point tombstones.

When the Tombstones option is disabled, range key masking is enabled
to omit point keys below older range tombstones during iteration.
However, point synthesis must still be enabled in this case to account
for future range tombstones in uncertainty checks.

Does this mean point tombstones are synthesized at the start key of the range tombstone in this case?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @nicktrav)

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones branch from 3ad25e4 to 6a1756c Compare April 15, 2022 09:41
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean point tombstones are synthesized at the start key of the range tombstone in this case?

Not currently -- they're synthesized around point keys that are above the masking range tombstone. Consider e.g. the case where we're masking @2, and have b@1, [a-c)@2, b@3, [a-c)@4. In this case, b@1 is masked, but we'll synthesize point tombstones for b@2 and b@4.

However, we should likely synthesize points at the start of each range key fragment too -- this handles both FailOnMoreRecent for bare range tombstones and tombstone synthesis for point reads where there is no existing point (replaces emitOnSeek). Not sure yet whether this on-start behavior should always be enabled, or if we should make it optional -- will probably start off with always-on.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @nicktrav, and @sumeerbhola)


pkg/storage/point_synthesizing_iter.go, line 21 at r16 (raw file):

Previously, sumeerbhola wrote…

is "range tombstones" the same as "MVCC range tombstone" accomplished via a RANGEKEYSET?
Just calling it "range tombstone" will confuse folks who are thinking of Pebble range tombstones.
I don't have good naming ideas.

Yes, MVCC range tombstones. I don't think it's too confusing -- we already have the same issue with point tombstones and I don't find it too hard to distinguish between Pebble and MVCC point tombstones from context. But I'll be more consistent in specifying MVCC/Pebble tombstones.


pkg/storage/point_synthesizing_iter.go, line 23 at r16 (raw file):

Previously, sumeerbhola wrote…

how does this honor scans that set FailOnMoreRecent or want to do Uncertainty checking? They need to see a range key even if there is no overlapping point, yes?

Good point. We should synthesize points at the start/seek key of range tombstones too, which handles both FailOnMoreRecent and Uncertainty for both scans and gets without point keys. This would replace emitOnSeek.


pkg/storage/testdata/mvcc_histories/range_tombstone_scans, line 6 at r8 (raw file):

Previously, nicktrav (Nick Travers) wrote…

nit: looks like there's some extra whitespace here.

Thanks, fixed.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan-point-synthesis branch 5 times, most recently from b2559d9 to 06c7731 Compare April 16, 2022 18:38
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones branch from 6a1756c to 3bd8605 Compare April 22, 2022 13:06
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan-point-synthesis branch 4 times, most recently from 489161a to 66791c3 Compare April 23, 2022 14:24
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones branch from 2c06ce4 to cb77b2e Compare May 1, 2022 12:31
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan-point-synthesis branch 3 times, most recently from cc54ef9 to b79bdae Compare May 1, 2022 16:11
@erikgrinaker erikgrinaker marked this pull request as ready for review May 1, 2022 16:12
@erikgrinaker
Copy link
Contributor Author

Ok, this should be ready for a high-level review now.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones branch from cb77b2e to 49aacac Compare May 9, 2022 08:36
@erikgrinaker erikgrinaker requested a review from a team as a May 9, 2022 08:36
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan-point-synthesis branch from b79bdae to 35b1535 Compare May 9, 2022 08:49
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones branch from 49aacac to e5bd560 Compare May 9, 2022 09:20
This is necessary to avoid mutating the bounds in use by
`pebble.Iterator`, which can interfere with seek optimizations.

Release note: None
Since we now use dual buffers for bounds encoding, we can fully generate
the new `pebble.IterOptions` during `setOptions()` and compare them
directly with the existing options, which simplifies the code.

Release note: None
This patch adds initial experimental primitives for MVCC range keys,
which will be the foundation for MVCC range tombstones. They are based
on experimental Pebble range keys.

* Data structures:
  * `MVCCRangeKey`
  * internal `nil` value for range tombstones (as with point tombstones)

* `Engine` methods for mutating range keys:
  * `ExperimentalClearMVCCRangeKey()`
  * `ExperimentalClearAllMVCCRangeKeys()`
  * `ExperimentalPutMVCCRangeKey()`
  * `SupportsRangeKeys()`

* `SimpleMVCCIterator` methods for accessing range keys:
  * `HasPointAndRange()`
  * `RangeBounds()`
  * `RangeKeys()`

Range keys do not have a distinct identity, and should instead be
considered a key continuum: they will merge with abutting keys of the
same value, can be partially cleared, can split or merge along with
ranges, and so on. Bounded scans will truncate them to the scan bounds.

Range key support is implemented in `pebbleIterator` and
`intentInterleavingIter`, but not in the rest of the MVCC or KV APIs.
They are not persisted to disk either. Subsequent pull requests will
extend their functionality and integrate them with other components.

Release note: None
This patch adds a parameter `UseExperimentalRangeTombstone` for
`DeleteRange`, which deletes the span using an MVCC range tombstone,
backed by an `ExperimentalDeleteRangeUsingTombstone` MVCC function. It
also adds an `ExperimentalMVCCRangeTombstones` version gate which must
be checked before using it.

This is an experimental implementation to allow writing range keys via
the KV API for testing and development purposes. It has significant
shortcomings, and will be fleshed out at a later time.

Release note: None
This patch adds a `run` keyword `stats` which will compute and output
stats after a test. If `trace` is also enabled, it additionally outputs
the stats delta for each command.

Release note: None
This patch adds `Key.Prevish()`, which returns a previous key in
lexicographical sort order. This is needed to expand a latch span
leftwards to peek at any left-adjacent range keys.

It is impossible to find the exact immediate predecessor of a key,
because it can have an infinite number of `0xff` bytes at the end, so
this returns the nearest previous key right-padded with `0xff` up to the
given length. It is still possible for an infinite number of keys to
exist between `Key` and `Key.Prevish()`, as keys have unbounded length.

Release note: None
This patch adds an iterator option `RangeKeyMaskingBelow`, which will
cause range keys at or below the given timestamp to mask any point
keys below them.

Release note: None
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan-point-synthesis branch 4 times, most recently from ec77db7 to 022f7ba Compare May 10, 2022 17:44
This patch adds `MVCCStats` tracking for range keys. Two new fields are
added to `MVCCStats`:

* `RangeKeyCount`: the number of (fragmented) range keys, not counting
  historical versions.

* `RangeKeyBytes`: the logical encoded byte size of all range keys.
  The latest version contributes the encoded key bounds, and all
  versions contribute encoded timestamps. Unlike point keys, which for
  historical reasons use a fixed-size timestamp contribution, this uses
  the actual variable-length timestamp size.

`ComputeStatsForRange()` has been extended to calculate the above
quantities, and additionally account for range tombstones themselves in
`GCBytesAge` along with their effect on point keys. All relevant call
sites have been updated to surface range keys for the MVCC iterators
passed to `ComputeStatsForRange()`.

Rudimentary range tombstone checks have been added during MVCC point
writes and intent resolution to correctly account for them in MVCC
statistics. Any further integration between point writes and range
tombstones (in particular, conflict handling) will be properly dealt
with later.

Range key stats are also adjusted during range splits and merges, which
will split and merge any range keys that straddle the split key. This
requires a single range key seek to the left and right of the split key
during these operations.

Release note: None
This patch adds MVCC range tombstone handling for scans and gets. In the
basic case, this simply means that point keys below an MVCC range
tombstone are not visible.

When tombstones are requested by the caller, the MVCC range tombstones
themselves are never exposed, to avoid having to explicitly handle these
throughout the codebase. Instead, synthetic MVCC point tombstones are
emitted at the start of MVCC range tombstones and wherever they overlap
a point key (above and below). Additionally, point gets return synthetic
point tombstones even if no existing point key exists.

The point tombstone synthesis is implemented by a new
`pointSynthesizingIterator`, an `MVCCIterator` implementation that wraps
an existing `MVCCIterator`. This allows e.g. `pebbleMVCCScanner` to
remain unchanged, and automatically take MVCC range tombstones into
account for e.g. conflict/uncertainty checks.

Point tombstone synthesis must be enabled even when the caller has not
requested tombstones, because they must always be taken into account for
conflict/uncertainty checks. However, in these cases we enable range key
masking below the read timestamp, omitting any covered points since
these are no longer needed.

Release note: None
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan-point-synthesis branch from 022f7ba to 2eebe21 Compare May 11, 2022 12:55
@erikgrinaker erikgrinaker marked this pull request as draft May 28, 2022 16:53
@erikgrinaker
Copy link
Contributor Author

Replaced by #82045.

@erikgrinaker erikgrinaker deleted the mvcc-range-tombstones-scan-point-synthesis branch May 29, 2022 17:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants