-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: add MVCC range tombstone handling for scans and gets #78946
storage: add MVCC range tombstone handling for scans and gets #78946
Conversation
4e3e1c9
to
855437f
Compare
34e6ab0
to
4798ae2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 7 files at r8, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)
pkg/storage/testdata/mvcc_histories/range_tombstone_scans, line 6 at r8 (raw file):
# # T # 6 [e6]
nit: looks like there's some extra whitespace here.
2097ac5
to
3ad25e4
Compare
855437f
to
ed3987c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, and @nicktrav)
pkg/storage/point_synthesizing_iter.go, line 21 at r16 (raw file):
// pointSynthesizingIter wraps an MVCCIterator, and synthesizes point tombstones // for range tombstones above/below existing point keys. It does not emit range
is "range tombstones" the same as "MVCC range tombstone" accomplished via a RANGEKEYSET?
Just calling it "range tombstone" will confuse folks who are thinking of Pebble range tombstones.
I don't have good naming ideas.
pkg/storage/point_synthesizing_iter.go, line 23 at r16 (raw file):
// for range tombstones above/below existing point keys. It does not emit range // keys at all, since these would appear to conflict with the synthesized point // keys.
how does this honor scans that set FailOnMoreRecent or want to do Uncertainty checking? They need to see a range key even if there is no overlapping point, yes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the general approach of synthesizing point tombstones.
When the Tombstones option is disabled, range key masking is enabled
to omit point keys below older range tombstones during iteration.
However, point synthesis must still be enabled in this case to account
for future range tombstones in uncertainty checks.
Does this mean point tombstones are synthesized at the start key of the range tombstone in this case?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @nicktrav)
3ad25e4
to
6a1756c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean point tombstones are synthesized at the start key of the range tombstone in this case?
Not currently -- they're synthesized around point keys that are above the masking range tombstone. Consider e.g. the case where we're masking @2, and have b@1, [a-c)@2, b@3, [a-c)@4. In this case, b@1 is masked, but we'll synthesize point tombstones for b@2 and b@4.
However, we should likely synthesize points at the start of each range key fragment too -- this handles both FailOnMoreRecent for bare range tombstones and tombstone synthesis for point reads where there is no existing point (replaces emitOnSeek
). Not sure yet whether this on-start behavior should always be enabled, or if we should make it optional -- will probably start off with always-on.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @nicktrav, and @sumeerbhola)
pkg/storage/point_synthesizing_iter.go, line 21 at r16 (raw file):
Previously, sumeerbhola wrote…
is "range tombstones" the same as "MVCC range tombstone" accomplished via a RANGEKEYSET?
Just calling it "range tombstone" will confuse folks who are thinking of Pebble range tombstones.
I don't have good naming ideas.
Yes, MVCC range tombstones. I don't think it's too confusing -- we already have the same issue with point tombstones and I don't find it too hard to distinguish between Pebble and MVCC point tombstones from context. But I'll be more consistent in specifying MVCC/Pebble tombstones.
pkg/storage/point_synthesizing_iter.go, line 23 at r16 (raw file):
Previously, sumeerbhola wrote…
how does this honor scans that set FailOnMoreRecent or want to do Uncertainty checking? They need to see a range key even if there is no overlapping point, yes?
Good point. We should synthesize points at the start/seek key of range tombstones too, which handles both FailOnMoreRecent and Uncertainty for both scans and gets without point keys. This would replace emitOnSeek
.
pkg/storage/testdata/mvcc_histories/range_tombstone_scans, line 6 at r8 (raw file):
Previously, nicktrav (Nick Travers) wrote…
nit: looks like there's some extra whitespace here.
Thanks, fixed.
b2559d9
to
06c7731
Compare
6a1756c
to
3bd8605
Compare
489161a
to
66791c3
Compare
2c06ce4
to
cb77b2e
Compare
cc54ef9
to
b79bdae
Compare
Ok, this should be ready for a high-level review now. |
cb77b2e
to
49aacac
Compare
b79bdae
to
35b1535
Compare
49aacac
to
e5bd560
Compare
This is necessary to avoid mutating the bounds in use by `pebble.Iterator`, which can interfere with seek optimizations. Release note: None
Since we now use dual buffers for bounds encoding, we can fully generate the new `pebble.IterOptions` during `setOptions()` and compare them directly with the existing options, which simplifies the code. Release note: None
This patch adds initial experimental primitives for MVCC range keys, which will be the foundation for MVCC range tombstones. They are based on experimental Pebble range keys. * Data structures: * `MVCCRangeKey` * internal `nil` value for range tombstones (as with point tombstones) * `Engine` methods for mutating range keys: * `ExperimentalClearMVCCRangeKey()` * `ExperimentalClearAllMVCCRangeKeys()` * `ExperimentalPutMVCCRangeKey()` * `SupportsRangeKeys()` * `SimpleMVCCIterator` methods for accessing range keys: * `HasPointAndRange()` * `RangeBounds()` * `RangeKeys()` Range keys do not have a distinct identity, and should instead be considered a key continuum: they will merge with abutting keys of the same value, can be partially cleared, can split or merge along with ranges, and so on. Bounded scans will truncate them to the scan bounds. Range key support is implemented in `pebbleIterator` and `intentInterleavingIter`, but not in the rest of the MVCC or KV APIs. They are not persisted to disk either. Subsequent pull requests will extend their functionality and integrate them with other components. Release note: None
This patch adds a parameter `UseExperimentalRangeTombstone` for `DeleteRange`, which deletes the span using an MVCC range tombstone, backed by an `ExperimentalDeleteRangeUsingTombstone` MVCC function. It also adds an `ExperimentalMVCCRangeTombstones` version gate which must be checked before using it. This is an experimental implementation to allow writing range keys via the KV API for testing and development purposes. It has significant shortcomings, and will be fleshed out at a later time. Release note: None
Release note: None
This patch adds a `run` keyword `stats` which will compute and output stats after a test. If `trace` is also enabled, it additionally outputs the stats delta for each command. Release note: None
This patch adds `Key.Prevish()`, which returns a previous key in lexicographical sort order. This is needed to expand a latch span leftwards to peek at any left-adjacent range keys. It is impossible to find the exact immediate predecessor of a key, because it can have an infinite number of `0xff` bytes at the end, so this returns the nearest previous key right-padded with `0xff` up to the given length. It is still possible for an infinite number of keys to exist between `Key` and `Key.Prevish()`, as keys have unbounded length. Release note: None
This patch adds an iterator option `RangeKeyMaskingBelow`, which will cause range keys at or below the given timestamp to mask any point keys below them. Release note: None
ec77db7
to
022f7ba
Compare
This patch adds `MVCCStats` tracking for range keys. Two new fields are added to `MVCCStats`: * `RangeKeyCount`: the number of (fragmented) range keys, not counting historical versions. * `RangeKeyBytes`: the logical encoded byte size of all range keys. The latest version contributes the encoded key bounds, and all versions contribute encoded timestamps. Unlike point keys, which for historical reasons use a fixed-size timestamp contribution, this uses the actual variable-length timestamp size. `ComputeStatsForRange()` has been extended to calculate the above quantities, and additionally account for range tombstones themselves in `GCBytesAge` along with their effect on point keys. All relevant call sites have been updated to surface range keys for the MVCC iterators passed to `ComputeStatsForRange()`. Rudimentary range tombstone checks have been added during MVCC point writes and intent resolution to correctly account for them in MVCC statistics. Any further integration between point writes and range tombstones (in particular, conflict handling) will be properly dealt with later. Range key stats are also adjusted during range splits and merges, which will split and merge any range keys that straddle the split key. This requires a single range key seek to the left and right of the split key during these operations. Release note: None
This patch adds MVCC range tombstone handling for scans and gets. In the basic case, this simply means that point keys below an MVCC range tombstone are not visible. When tombstones are requested by the caller, the MVCC range tombstones themselves are never exposed, to avoid having to explicitly handle these throughout the codebase. Instead, synthetic MVCC point tombstones are emitted at the start of MVCC range tombstones and wherever they overlap a point key (above and below). Additionally, point gets return synthetic point tombstones even if no existing point key exists. The point tombstone synthesis is implemented by a new `pointSynthesizingIterator`, an `MVCCIterator` implementation that wraps an existing `MVCCIterator`. This allows e.g. `pebbleMVCCScanner` to remain unchanged, and automatically take MVCC range tombstones into account for e.g. conflict/uncertainty checks. Point tombstone synthesis must be enabled even when the caller has not requested tombstones, because they must always be taken into account for conflict/uncertainty checks. However, in these cases we enable range key masking below the read timestamp, omitting any covered points since these are no longer needed. Release note: None
022f7ba
to
2eebe21
Compare
Replaced by #82045. |
This is for the
mvcc-range-tombstones
branch. Only the last commit should be reviewed here -- the former are from #78085 and its ancestors.This will need performance optimizations both in MVCC and Pebble before being merged, as it's been seen to have a severe performance penalty (up to 40%). For now, focus on correctness.
This patch adds MVCC range tombstone handling for scans and gets. In the
basic case, this simply means that point keys below an MVCC range
tombstone are not visible.
When tombstones are requested by the caller, the MVCC range tombstones
themselves are never exposed, to avoid having to explicitly handle these
throughout the codebase. Instead, synthetic MVCC point tombstones are
emitted at the start of MVCC range tombstones and wherever they overlap
a point key (above and below). Additionally, point gets return synthetic
point tombstones even if no existing point key exists.
The point tombstone synthesis is implemented by a new
pointSynthesizingIterator
, anMVCCIterator
implementation that wrapsan existing
MVCCIterator
. This allows e.g.pebbleMVCCScanner
toremain unchanged, and automatically take MVCC range tombstones into
account for e.g. conflict/uncertainty checks.
Point tombstone synthesis must be enabled even when the caller has not
requested tombstones, because they must always be taken into account for
conflict/uncertainty checks. However, in these cases we enable range key
masking below the read timestamp, omitting any covered points since
these are no longer needed.
Touches #70412.
Release note: None