storage: MVCC range deletion tombstones #70412
May be implemented in Pebble as generalized range key support: cockroachdb/pebble#1339
I've been prototyping MVCC range tombstones based on Pebble range keys. I'm converging on the following approach, and would appreciate a quick look by @sumeerbhola and @jbowens.

**Data structure**

Pebble range keys will use regular user keys as the start and end keys, and MVCC timestamps as the suffix. This allows easy integration with point keys. To allow multiple range key kinds, the value will be a Protobuf union type, currently with only a range tombstone kind:

```protobuf
message MVCCRangeValue {
  option (gogoproto.onlyone) = true;
  MVCCRangeTombstone tombstone = 1;
}

message MVCCRangeTombstone {
  // NB: can't be roachpb.Key due to import cycle.
  bytes start_key = 1;
  bytes end_key = 2;
}
```

The tombstone value includes the start and end keys to disambiguate individual range tombstones from their constituent Pebble range key fragments. This also allows individual range tombstones to be identified across ranges, although we may want to clamp the returned bounds to the request bounds where appropriate. It also serves as a performance optimization, avoiding the need to decode every fragment for comparison during forward and backward iteration. However, these values will be duplicated for each range key fragment, so there will be some storage overhead.

One disadvantage of the key/suffix scheme is that range values of different kinds (e.g. range tombstones and range intents) cannot be iterated over separately. Since the number of range tombstones is expected to be small, this will likely not present a problem. However, it will prevent differentiated range key masking by value kind, although this is not expected to be a problem for the currently anticipated range key types (tombstones and intents).

In the MVCC API, range tombstones will be represented by a new type:

```go
type MVCCRangeKey struct {
  StartKey  roachpb.Key
  EndKey    roachpb.Key
  Timestamp hlc.Timestamp
}
```
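To make the mapping concrete, here is a minimal, self-contained Go sketch of how such a range key could be handed to Pebble: bounds as plain user keys, the MVCC timestamp encoded as the suffix. The helper name and the byte layout of the suffix are illustrative assumptions, not the actual storage-package encoding.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Timestamp is a stand-in for hlc.Timestamp in this sketch.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// MVCCRangeKey mirrors the proposed MVCC API type.
type MVCCRangeKey struct {
	StartKey, EndKey []byte
	Timestamp        Timestamp
}

// pebbleRangeKeyParts (hypothetical) splits an MVCCRangeKey into the three
// parts a Pebble range key needs: start bound, end bound, and a suffix
// encoding the timestamp. The suffix layout below is illustrative only.
func pebbleRangeKeyParts(k MVCCRangeKey) (start, end, suffix []byte) {
	suffix = make([]byte, 12)
	binary.BigEndian.PutUint64(suffix[:8], uint64(k.Timestamp.WallTime))
	binary.BigEndian.PutUint32(suffix[8:], uint32(k.Timestamp.Logical))
	return k.StartKey, k.EndKey, suffix
}

func main() {
	k := MVCCRangeKey{StartKey: []byte("a"), EndKey: []byte("z"),
		Timestamp: Timestamp{WallTime: 100}}
	start, end, suffix := pebbleRangeKeyParts(k)
	fmt.Printf("RANGEKEYSET [%s, %s) suffix=%x\n", start, end, suffix)
}
```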
Should this be the case for any
Probably not. Range tombstones can, because the key/timespan fully describes the tombstone, but this won't be the case for e.g. ranged intents that will have lots of metadata.
One caveat is uncertainty restarts and other conflict handling, which must explicitly check for range tombstones at a non-existing point key. But these exceptions can likely be layered on top of the point tombstone synthesis, or achieved via synthesis settings.
Range keys provide two benefits:
Both benefits apply to range tombstones since they are physically part of the same key space as point values. Intents, on the other hand, are already in a separate part of the key space for points, and it probably makes sense to continue that for future range intents. In which case interleaving (when needed) will need to be explicit in CockroachDB (like we do now with
I'm thinking about the case where the caller wants masking, and sees a set of points P (post-masking).
Indeed, as I mentioned later in that paragraph, I don't see this being a problem for the anticipated range key types. But I thought it would be worth calling out anyway, in case we should ever come up with some other range key type across the normal user key space.
Right, didn't consider that. Either way, I think the main point here is the concept of synthesizing point tombstones below
For non-tombstone reads, you're right. But if the caller has requested reads with tombstones, we need to determine whether the range tombstone across the empty span is visible or not. And for writes, we need to conflict with point writes at the same timestamp at least -- we may also need logic to order the range tombstone write and point write with respect to each other, but it's possible that locking and the timestamp cache take care of this already (I'll have to give it some thought). In any case, I think it should be sufficient to layer this on top in the few cases where it's needed.
It may be worthwhile enumerating the various cases in a doc so we can poke at it and revise. For instance
Yeah, I think that's a good idea. I'll write up either an RFC or a tech note to nail down the final MVCC range tombstone implementation and document it. I'm going to put a pin in these scan details for now and revisit.
I was thinking the same thing. This would also handle an
Doesn't seem to be that many. It's mostly rangefeeds and txn refreshes, as you point out, as well as testing. Exports and GC scans also care about range tombstones, but these have specialized APIs and will read the range tombstones separately.
This would've been somewhat helpful for scans, but as we've touched on previously, I don't think we can/should think of a range tombstone as a single logical entity with fixed start/end bounds. The main reason is that mutations are non-atomic. For example, a range delete across multiple ranges may succeed on some ranges but fail on others. Similarly, range GC happens for different ranges at different times, so a tombstone spanning ranges would have "holes" in it. And splits/merges would require tombstone rewrites. This is somewhat retreading old ground, but I'm reaching the same conclusion again: let's drop stored start/end bounds, but defragment during iteration for convenience.
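To illustrate the defragmentation idea, here is a minimal Go sketch with simplified stand-in types (not the real iterator) that merges abutting range key fragments whose version stacks are identical back into one logical range key. The actual iterator would do this incrementally; this batch version just shows the merge rule.

```go
package main

import (
	"bytes"
	"fmt"
)

// fragment is a simplified stand-in for a Pebble range key fragment.
type fragment struct {
	start, end []byte
	versions   string // stand-in for the encoded timestamp/value stack
}

// defragment merges abutting fragments with identical version stacks,
// reconstructing the logical range keys a caller sees during iteration.
func defragment(frags []fragment) []fragment {
	var out []fragment
	for _, f := range frags {
		if n := len(out); n > 0 &&
			bytes.Equal(out[n-1].end, f.start) &&
			out[n-1].versions == f.versions {
			out[n-1].end = f.end // extend the previous logical range key
			continue
		}
		out = append(out, f)
	}
	return out
}

func main() {
	frags := []fragment{
		{[]byte("a"), []byte("c"), "@5"},
		{[]byte("c"), []byte("f"), "@5"}, // abuts and matches: merged
		{[]byte("g"), []byte("k"), "@5"}, // gap: stays a separate range key
	}
	for _, f := range defragment(frags) {
		fmt.Printf("[%s, %s) %s\n", f.start, f.end, f.versions)
	}
}
```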
77417: storage: add experimental MVCC range key primitives r=sumeerbhola,jbowens,nicktrav a=erikgrinaker

This patch adds initial experimental primitives for MVCC range keys, which will be the foundation for MVCC range tombstones. They are based on experimental Pebble range keys.

* Data structures:
  * `MVCCRangeKey`
  * `MVCCRangeKeyValue`
* `Engine` methods for mutating range keys:
  * `ExperimentalClearMVCCRangeKey()`
  * `ExperimentalClearAllMVCCRangeKeys()`
  * `ExperimentalPutMVCCRangeKey()`
  * `SupportsRangeKeys()`
* `SimpleMVCCIterator` methods for accessing range keys:
  * `HasPointAndRange()`
  * `RangeBounds()`
  * `RangeKeys()`

Range keys do not have a distinct identity, and should instead be considered a key continuum: they will merge with abutting keys of the same value, can be partially cleared, can split or merge along with ranges, and so on. Bounded scans will truncate them to the scan bounds.

Only MVCC range tombstones are currently supported, with an empty value. Attempts to write a non-tombstone value will error, since these would need additional logic throughout the MVCC APIs, and their semantics are still unclear.

Range key support is implemented in `pebbleIterator` and `intentInterleavingIter`, but not in the rest of the MVCC or KV APIs. They are not persisted to disk either. Subsequent pull requests will extend their functionality and integrate them with other components.

Touches #70412.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
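As a sketch of how a caller might consume this new iterator surface, the following Go snippet defines a trimmed stand-in interface with the methods named above and walks both point and range keys. Only the method names come from the PR; the signatures are simplified assumptions (the real ones use `MVCCKey`, `roachpb.Span`, etc.).

```go
package sketch

// SimpleMVCCIterator is a trimmed, hypothetical stand-in for the storage
// interface, limited to the methods this sketch needs.
type SimpleMVCCIterator interface {
	SeekGE(key []byte)
	Valid() (bool, error)
	Next()
	HasPointAndRange() (hasPoint, hasRange bool)
	RangeBounds() (start, end []byte)
}

// scanWithRangeKeys shows how a caller distinguishes point keys from range
// key fragments at each iterator position.
func scanWithRangeKeys(iter SimpleMVCCIterator, start []byte) error {
	for iter.SeekGE(start); ; iter.Next() {
		if ok, err := iter.Valid(); err != nil || !ok {
			return err
		}
		hasPoint, hasRange := iter.HasPointAndRange()
		if hasRange {
			// Bounds are truncated to the scan bounds, per the PR above.
			_, _ = iter.RangeBounds()
		}
		if hasPoint {
			// Handle the point key exactly as before range keys existed.
		}
	}
}
```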
82017: storage: add `IterOptions.RangeKeyMaskingBelow` r=jbowens a=erikgrinaker

This patch adds an iterator option `RangeKeyMaskingBelow`, which will cause range keys at or below the given timestamp to mask any point keys below them. This can be used to optimize callers that do not care about deleted point keys, since Pebble will not emit these.

Touches #70412.

Release note: None

82088: roachtest: extract version map into separate file r=tbg a=rail

Previously, we had to patch the code in order to update the version upgrade test's version mapping. This made automation a bit complicated, thus it was done manually. Additionally, when we upgraded to a new major version, we would need to patch the same test in order to generate fixtures. This patch extracts the version mapping into a separate file, which will be easier to patch using automation. Additionally, it extracts the fixture generation code into a separate test which is skipped by CI, but can be run manually.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
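A hedged sketch of how a read path might opt into this masking: the `RangeKeyMaskingBelow` field name is from the PR, while the surrounding constructor and option names follow the existing storage API as I understand it, so treat the details as assumptions.

```go
package sketch

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/storage"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// newMaskedIter returns an iterator that lets Pebble elide point keys covered
// by an MVCC range tombstone at or below readTS, for callers that don't care
// about deleted point keys.
func newMaskedIter(
	r storage.Reader, start, end roachpb.Key, readTS hlc.Timestamp,
) storage.MVCCIterator {
	return r.NewMVCCIterator(storage.MVCCKeyIterKind, storage.IterOptions{
		LowerBound:           start,
		UpperBound:           end,
		RangeKeyMaskingBelow: readTS, // mask points under range keys at/below readTS
	})
}
```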
79748: streamingccl: add a builtin function that retrieves stream ingestion statistics r=gh-casper a=gh-casper

Support `crdb_internal.stream_ingestion_stats(ingestion_job_id)` to report the running progress of a stream ingestion job.

Release note (sql change): Support `crdb_internal.stream_ingestion_stats(ingestion_job_id)` to report the running progress of a stream ingestion job.

Release Justification: Cat 4.

Closes: #57430

81598: obsservice: initial commit r=andreimatei a=andreimatei

These are the humble beginnings of an Observability Service for CRDB. In time, this is meant to take over a lot of the observability functionality, extracting it from the database itself. For now, all there is here is an HTTP reverse proxy, able to route HTTP requests to a CRDB node (or cluster, through a load balancer). At least for the transitioning period, if not forever, the Obs Service will route HTTP paths that it doesn't understand to the Cockroach cluster being monitored.

The HTTP proxy accepts its own certificates and can serve HTTPS independently of CRDB (which might be running in --insecure mode and not serving HTTPS). However, if CRDB is serving HTTPS, then the Obs Service must also be configured with a certificate so it can also serve HTTPS.

The code of the Obs Service is split between a binary (minimal shell) and a library. The idea is to keep the door open for other repos to extend the Obs Service with extra functionality (e.g. the CC managed service might want to add cloud-specific observability features). For such purposes, I've considered making a dedicated Go module for the Obs Service so that source-level imports of the Obs Service would be as ergonomic as Go wants them to be. I've put a decent amount of work into playing with this but, ultimately, I've decided against it, at least for now. The problem is that the Obs Service wants to take minor code dependencies on random CRDB libraries (syncutil, log, etc.), and the cockroach Go module is fairly broken (our git tags look like we do semantic versioning, but we don't actually do it). The Go tooling has a terrible time with the cockroach module. Also, our code is generally not `go build`-able. If the Obs Service were a dedicated module, it would need to take a dependency on the cockroach module, which would negate the win for people who want to import it (in fact, it'd make matters worse for importers because the nasty cockroach dependency would be more hidden). An alternative I've considered was to start creating modules for all of the dependencies of the Obs Service. But if CRDB would then need to depend on these modules, there's all sorts of problems.

Release note: None

82041: storage: clear range keys in `Writer.Clear*Range` methods r=jbowens a=erikgrinaker

This patch clears range keys in the `Writer` methods `ClearRawRange`, `ClearMVCCRange`, and `ClearMVCCIteratorRange`, as well as in the `ClearRangeWithHeuristic` helper. Range keys are not cleared in `ClearMVCCVersions`, since this method is specifically for clearing MVCC point key versions, and it is not possible to clear range keys between versions of the same point key.

Touches #70412.

Release note: None

82377: sql: use declarative schema changer for add column with unique r=fqazi a=fqazi

Previously, the declarative schema changer was disabled when adding a column with a unique constraint, since we didn't have a complete set of elements / logic for adding secondary indexes. This was inadequate because operations would not be ordered correctly and rollbacks could potentially break. To address this, this patch enables support and addresses the missing pieces to set up the secondary indexes correctly, such as adding suffix columns, ordering operations correctly, and returning appropriate errors.

Release note: None

82380: util/log: rework the bufferSink r=andreimatei a=andreimatei

This is a rewrite of bufferSink (now called bufferedSink). As opposed to before, the code is smaller, more efficient, and, most importantly, simpler. I couldn't understand the previous code well, and I think it made new additions to it (in particular, adding a shutdown sequence) hard to analyze. The semantics of this buffer are now better and simpler.

Before, a log message traveled through a sequence of two buffers before being eventually delivered to the wrapped sink:

1. The message was first placed into an accumulator buffer.
2. When that buffer grew too old or too large, the accumulator buffer was placed into a queue to be flushed later, and the accumulator was reset.
3. There was a limit on the size of this queue of batches to be flushed. When the queue was full, new messages were dropped.

There were two goroutines managing these two buffers, which is one too many. This behavior of queuing multiple, distinct flushes was peculiar at best. There's generally no reason not to coalesce all the flushes into one, thus speeding them up and reducing the risk of message drops. The bufferSink's memory limit was not explicit, and difficult enough to infer: it was set indirectly from the combination of limits on the two buffers, expressed in different units.

Now, the sequence of two buffers (the accumulator and the flush queue) is gone. There's a single buffer accumulating new messages, plus at most one in-flight flush which captures its buffer. As soon as a flush is triggered, the buffer is reset. The buffer is now configured with a memory limit and, when that limit is reached, the _oldest_ messages in the buffer are dropped, instead of the newest. This, combined with the blocking semantics of the forceSync option, means that a forceSync message is never dropped (unless its size is, by itself, larger than the buffer limit). The number of goroutines running for the bufferSink drops from two to one.

Release note: None

82436: server, ui: handle null plan gist in getStatementDetailsPerPlanHash r=ericharmeling a=ericharmeling

This PR adds a new function that handles null plan gists in `getStatementDetailsPerPlanHash`. The PR also adds some logic to hide null plan gists in the SqlBox of the Statement Details page. Here's a screenshot of the Statement Details page when the plan gist is null:

![Statement Details page with a null plan gist](https://user-images.githubusercontent.com/27286675/171966978-6a444297-70d4-4ef3-8afa-5068e3f35d9e.png)

Fixes #82095. We probably want to figure out why there are null values in [`planner.instrumentation.planGist`](https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/executor_statement_metrics.go#L212) in the first place. `getStatementDetailsPerPlanHash` pulls from the system table populated with the plan gist from `planner.instrumentation.planGist`. CC `@cucaroach`

Release note: None

82471: dev: allow `--rewrite` when testing many targets with `/...` r=rail a=rickystewart

Closes #82053.

Release note: None

82479: keys: resolve subtle non-bug by exporting RaftLogKeyFromPrefix r=tbg,erikgrinaker a=nvanbenschoten

This commit resolves a subtle interaction that came tantalizingly close to a bug in `StateLoader.LoadLastIndex`. The method uses the `StateLoader`'s underlying `keys.RangeIDPrefixBuf` to generate two keys (`RaftLogPrefix` and `RaftLogKey`) that are in use at the same time. `RangeIDPrefixBuf` avoids heap allocations by sharing a single byte slice across all keys that it generates. This is why the type has the comment:

> The generated keys are only valid until the next call to one of the key generation methods.

As would be expected, given this comment, the second use of the `RangeIDPrefixBuf` overwrote the buffer and invalidated the first key. However, it happened to get lucky and generate a new key with the same prefix as the old key. As a result, the contents of the first key did not change.

To make this aliasing more explicit and avoid this becoming a real bug in the future, we introduce a new `RaftLogKeyFromPrefix` function that callers can use to generate raft log entry keys from a raft log prefix. We then use this new function to avoid some redundant encoding work elsewhere due to repeated calls to `RaftLogKey`.

82520: cli,sql: fix reported result columns for EXPLAIN ANALYZE r=yuzefovich a=rafiss

Fixes #82502

This fixes an issue where EXPLAIN ANALYZE would report a RowDescription for both the EXPLAIN and for the statement being explained. If the statement had a different number of result columns, this would confuse the CLI. No release note, since this bug was not released.

Release note: None

82521: roachtest: fix a minor bug with tpch_concurrency roachtest r=yuzefovich a=yuzefovich

`tpch_concurrency` pushes the cluster to its limits, so it is expected that OOMs occur. Previously, in a recently added code path, we would fail the test if an error occurred when running a debugging statement, yet that error is expected and should be returned.

Fixes: #82510.

Release note: None

Co-authored-by: Casper <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
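The aliasing hazard described in 82479 is easy to reproduce in isolation. This self-contained Go sketch (with made-up key names, not the real `keys` package) shows how a shared backing buffer silently invalidates a previously generated key:

```go
package main

import "fmt"

func main() {
	// A single shared backing buffer, as in keys.RangeIDPrefixBuf.
	buf := make([]byte, 0, 16)

	// Each "key generation method" appends into the same backing array.
	k1 := append(buf, "raft-prefix"...) // first generated key
	k2 := append(buf, "raft-keyXYZ"...) // second call reuses the array

	// k1 now reads "raft-keyXYZ": the second append overwrote its bytes.
	// LoadLastIndex got lucky only because both keys shared a prefix.
	fmt.Println(string(k1), string(k2))
}
```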
82013: storage: add MVCC point synthesizing iterator r=jbowens a=erikgrinaker

This patch adds `pointSynthesizingIter`, an MVCC iterator which wraps an arbitrary MVCC iterator and synthesizes point keys for range keys at their start key and where they overlap point keys. It can optionally synthesize around the SeekGE seek key too, which is useful for point operations like `MVCCGet` where we may want to return a synthetic tombstone for an MVCC range tombstone if there is no existing point key. This will primarily be used to handle MVCC range tombstones in MVCC scans and gets, as well as during MVCC conflict checks, which allows much of this logic to remain unchanged and simplified (in particular, `pebbleMVCCScanner` will not need any changes). However, this patch does not make use of the iterator yet, since both it and Pebble will need further performance optimizations for use in hot paths. For now, correctness is sufficient, and only basic attempts at performance optimization have been made.

Touches #70412.

Release note: None

82547: opt: fix inverted key column type in testcat r=mgartner a=mgartner

Previously, inverted key columns in the test catalog were incorrectly given the type of their source column. Now they have the correct type, `types.EncodedKey`.

Release note: None

82635: bazel: unlist unsupported os+arch combos r=irfansharif a=irfansharif

We don't support these for CRDB builds. Even Windows is experimental, but let's keep that around anyway since we publish binaries for it. +cc #82625.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
81189: storage: add conflict handling for MVCC range tombstones r=jbowens a=erikgrinaker

This patch takes MVCC range tombstones into account for all MVCC write operations, i.e. for conflict checks and conditional writes. MVCC stats updates are not covered here, but will be addressed in a subsequent PR. This also adds more exhaustive tests, especially for the intent handling in `mvccPutInternal()`.

Touches #70412.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
82045: storage: add MVCC range tombstone handling in scans and gets r=jbowens a=erikgrinaker

This patch adds MVCC range tombstone handling for scans and gets. In the basic case, this simply means that point keys below an MVCC range tombstone are not visible. When tombstones are requested by the caller, the MVCC range tombstones themselves are never exposed, to avoid having to explicitly handle these throughout the codebase. Instead, synthetic MVCC point tombstones are emitted at the start of MVCC range tombstones and wherever they overlap a point key (above and below). Additionally, point gets return synthetic point tombstones if they overlap an MVCC range tombstone even if no existing point key exists. This is based on `pointSynthesizingIter`, which avoids additional logic in `pebbleMVCCScanner`.

Synthetic MVCC point tombstones emitted for MVCC range tombstones are not stable, nor are they fully deterministic. For example, the start key will be truncated by iterator bounds, so an `MVCCScan` over a given key span may see a synthetic point tombstone at its start (if it overlaps an MVCC range tombstone), but this will not be emitted if a broader span is used (a different point tombstone will be emitted instead). Similarly, a CRDB range split/merge will split/merge MVCC range tombstones, changing which point tombstones are emitted. Furthermore, `MVCCGet` will synthesize an MVCC point tombstone if it overlaps an MVCC range tombstone and there is no existing point key there, while an `MVCCScan` will not emit these. Callers must take care not to rely on such semantics for MVCC tombstones. Existing callers have been audited to ensure they are not affected.

Point tombstone synthesis must be enabled even when the caller has not requested tombstones, because they must always be taken into account for conflict/uncertainty checks. However, in these cases we enable range key masking below the read timestamp, omitting any covered points since these are no longer needed.

Touches #70412.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
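To make the synthesis rule concrete, here is a small self-contained Go sketch (simplified stand-in types, not the real `pointSynthesizingIter`) of where synthetic point tombstones appear for a range tombstone `[a, z) @ 5` over point keys `b` and `d`: at the (bounds-truncated) start key, and at every overlapping point key.

```go
package main

import "fmt"

// rangeTombstone is a simplified stand-in for an MVCC range tombstone.
type rangeTombstone struct {
	start, end string
	ts         int
}

// syntheticPoints lists the positions where point tombstones are synthesized
// for rt: at its start key, and at every point key it overlaps. This mirrors
// the rule stated in the PR text, not the real iterator logic.
func syntheticPoints(rt rangeTombstone, pointKeys []string) []string {
	out := []string{fmt.Sprintf("%s@%d", rt.start, rt.ts)}
	for _, k := range pointKeys {
		if k > rt.start && k < rt.end { // overlapping point key
			out = append(out, fmt.Sprintf("%s@%d", k, rt.ts))
		}
	}
	return out
}

func main() {
	rt := rangeTombstone{start: "a", end: "z", ts: 5}
	// Prints [a@5 b@5 d@5]: one at the start key, one per covered point key.
	fmt.Println(syntheticPoints(rt, []string{"b", "d"}))
}
```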
78085: storage: add `MVCCStats` for range keys r=jbowens a=erikgrinaker

**storage: export some MVCC key encoding functions**

Release note: None

**roachpb: add `Key.Prevish()` to find a previous key**

This patch adds `Key.Prevish()`, which returns a previous key in lexicographical sort order. This is needed to expand a latch span leftwards to peek at any left-adjacent range keys. It is impossible to find the exact immediate predecessor of a key, because it can have an infinite number of `0xff` bytes at the end, so this returns the nearest previous key right-padded with `0xff` up to the given length. It is still possible for an infinite number of keys to exist between `Key` and `Key.Prevish()`, as keys have unbounded length. (A sketch of this construction follows below.)

Release note: None

**storage: add `MVCCStats` for range keys**

This patch adds `MVCCStats` tracking for range keys. Four new fields are added to `MVCCStats`:

* `RangeKeyCount`: the number of (fragmented) range keys, not counting historical versions.
* `RangeKeyBytes`: the logical encoded byte size of all range keys. The latest version contributes the encoded key bounds, and all versions contribute encoded timestamps. Unlike point keys, which for historical reasons use a fixed-size timestamp contribution, this uses the actual variable-length timestamp size.
* `RangeValCount`: the number of (fragmented) range key versions.
* `RangeValBytes`: the encoded size of all range key values across all versions. The same value can be stored in multiple range keys due to fragmentation, which will be counted separately. Even though all range keys are currently MVCC range tombstones with no value, the `MVCCValueHeader` contribution can be non-zero due to e.g. a local timestamp.

`ComputeStatsForRange()` has been extended to calculate the above quantities, and additionally account for range tombstones themselves in `GCBytesAge` along with their effect on point keys. All relevant call sites have been updated to surface range keys for the MVCC iterators passed to `ComputeStatsForRange()`. Most MVCC operations have been updated to correctly account for MVCC range tombstones, e.g. during point key writes and intent resolution. KV APIs are not yet updated; this will be addressed later.

Range key stats are also adjusted during range splits and merges, which will split and merge any range keys that straddle the split key. This requires a single range key seek to the left and right of the split key during these operations.

Touches #70412.

Release note: None

82384: sql: reuse the slice of RequestUnion objects between fetches r=yuzefovich a=yuzefovich

This commit teaches `txnKVFetcher` and `txnKVStreamer` to reuse the same slice of `RequestUnion` objects between different fetches. It is now extremely easy to do given the recent refactor. We do perform memory accounting for this slice (against a memory account bound to an unlimited memory monitor). Additionally, a similar optimization is applied to how resume requests are populated by the Streamer.

Addresses: #82160.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
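Here is a hedged Go sketch of the `Prevish` construction described above: decrement the last byte (or drop a trailing zero byte, whose exact predecessor is the key without it) and right-pad with `0xff` up to the given length. This follows the description in the PR text; the real `roachpb.Key.Prevish` may differ in details.

```go
package main

import "fmt"

// prevish returns a previous key in lexicographic sort order, per the rule
// described above. It is not necessarily the immediate predecessor.
func prevish(k []byte, length int) []byte {
	// The empty key is the minimum key; it has no predecessor.
	if len(k) == 0 {
		return nil
	}
	// A key ending in 0x00 has an exact predecessor: the key without it.
	if k[len(k)-1] == 0 {
		return append([]byte(nil), k[:len(k)-1]...)
	}
	// Otherwise, truncate to at most `length` bytes and drop any trailing
	// 0x00 bytes exposed by the truncation (they can't be decremented).
	n := len(k)
	if n > length {
		n = length
	}
	b := append([]byte(nil), k[:n]...)
	for len(b) > 0 && b[len(b)-1] == 0 {
		b = b[:len(b)-1]
	}
	if len(b) == 0 {
		return b // only 0x00 bytes within `length`; "" is a predecessor
	}
	// Decrement the last byte and right-pad with 0xff. An infinite number
	// of keys may still exist between k and the result.
	b[len(b)-1]--
	for len(b) < length {
		b = append(b, 0xff)
	}
	return b
}

func main() {
	fmt.Printf("%q\n", prevish([]byte("bar"), 4)) // "baq\xff"
}
```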
80840: batcheval: handle MVCC range tombstones in `ClearRange` r=aliher1911 a=erikgrinaker

This patch makes `ClearRange` account for MVCC range tombstones when updating MVCC stats.

Touches #70412.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
We currently expose two main MVCC methods for deleting key spans, i.e. `[key0, key1)`:

* `MVCCDeleteRange`: iterates over visible keys and writes an MVCC tombstone for each one, in O(n).
* `Writer.ClearMVCCRangeAndIntents`: removes all MVCC keys and versions via `Pebble.DeleteRange`, without any MVCC record, in <O(n).

However, for reasons outlined in #69380, we need an MVCC-compliant mechanism to delete a key span in <O(n). By this, we mean a mechanism that leaves an MVCC record of the range deletion and allows readers to read below it at an appropriate timestamp. To satisfy this, we will implement MVCC range tombstones, written either in `MVCCDeleteRange()` or as a separate method. (A sketch contrasting the two existing paths follows below.)

For more details, see the (currently internal) design document and tech note.
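For illustration, here is a minimal Go sketch of the two existing deletion paths, using hypothetical stand-in interfaces rather than the real `storage` package, to show why one is O(n) but MVCC-visible while the other is sublinear but leaves no MVCC record:

```go
package sketch

// Writer is a hypothetical stand-in for the storage engine's write surface.
type Writer interface {
	PutMVCCTombstone(key []byte, ts int64) // one MVCC point tombstone
	EngineDeleteRange(start, end []byte)   // Pebble.DeleteRange, no MVCC record
}

// mvccDeleteRange mirrors MVCCDeleteRange: O(n) point tombstones, each an
// MVCC record that readers at earlier timestamps can read below.
func mvccDeleteRange(w Writer, visibleKeys [][]byte, ts int64) {
	for _, k := range visibleKeys {
		w.PutMVCCTombstone(k, ts)
	}
}

// clearRange mirrors Writer.ClearMVCCRangeAndIntents: a single engine-level
// range deletion in <O(n), but it destroys history and leaves no MVCC record.
func clearRange(w Writer, start, end []byte) {
	w.EngineDeleteRange(start, end)
}
```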
Epic CRDB-2624
Jira issue: CRDB-10062