
storage: MVCC range deletion tombstones #70412

Closed
6 tasks
erikgrinaker opened this issue Sep 20, 2021 · 10 comments
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@erikgrinaker
Contributor

erikgrinaker commented Sep 20, 2021

We currently expose two main MVCC methods for deleting key spans, i.e. [key0, key1):

  • MVCCDeleteRange: iterates over visible keys and writes an MVCC tombstone for each one, in O(n).
  • Writer.ClearMVCCRangeAndIntents: removes all MVCC keys and versions via Pebble.DeleteRange, without any MVCC record, in <O(n).

However, for reasons outlined in #69380, we need an MVCC-compliant mechanism to delete a key span in < O(n). By this, we mean a mechanism that leaves an MVCC record of the range deletion and allows readers to read before it using an appropriate timestamp. To satisfy this, we will implement MVCC range tombstones, written either in MVCCDeleteRange() or as a separate method.

  • Primitives (data structures, mutation, iteration)
  • MVCC stats (both for garbage below and the tombstones themselves, with full-range fast path)
  • Conflict handling (WriteTooOld, collisions, uncertainty, etc)
  • Scans and gets (point tombstone synthesis, GC threshold)
  • MVCC iterator validity/error checking
  • Benchmarks

For more details, see the (currently internal) design document and tech note.

Epic CRDB-2624

Jira issue: CRDB-10062

@erikgrinaker
Contributor Author

May be implemented in Pebble as generalized range key support: cockroachdb/pebble#1339

@erikgrinaker erikgrinaker assigned erikgrinaker and unassigned jbowens Dec 6, 2021
@erikgrinaker erikgrinaker added T-kv-replication and removed T-storage Storage Team labels Dec 20, 2021
@erikgrinaker
Contributor Author

I've been prototyping MVCC range tombstones based on Pebble range keys. I'm converging on the following approach, and would appreciate a quick look from @sumeerbhola and @jbowens.

Data structure

Pebble range keys will use regular user keys as the start and end keys, and MVCC timestamps as suffix. This allows easy integration with point keys. To allow multiple range key kinds, the value will be a Protobuf union type, currently only with a range tombstone kind:

message MVCCRangeValue {
  option (gogoproto.onlyone) = true;

  MVCCRangeTombstone tombstone = 1;
}

message MVCCRangeTombstone {
  // NB: can't be roachpb.Key due to import cycle.
  bytes start_key = 1;
  bytes end_key   = 2;
}

The tombstone value includes the start and end keys to disambiguate individual range tombstones from constituent Pebble range key fragments. This also allows individual range tombstones to be identified across ranges, although we may want to clamp the returned bounds to the request bounds where appropriate. It also serves as a performance optimization, to avoid the need to decode every fragment for comparison during forwards and backwards iteration. However, these values will be duplicated for each range key fragment, so there will be some storage overhead.

One disadvantage of the key/suffix scheme is that range values of different kinds (e.g. range tombstones and range intents) cannot be iterated over separately. Since the number of range tombstones is expected to be small, this will likely not present a problem. However, it will also prevent differentiated range key masking by value kind, although this is not expected to matter for the currently anticipated range key types (tombstones and intents).

In the MVCC API, range tombstones will be represented by a new MVCCRangeKey type:

type MVCCRangeKey struct {
	StartKey  roachpb.Key
	EndKey    roachpb.Key
	Timestamp hlc.Timestamp
}
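To make the shape concrete, here is a minimal, self-contained Go sketch of such a type with a validity check and a bounds-clamping helper. The int64 timestamps, byte-slice keys, and the `Valid`/`Clamp` helpers are illustrative simplifications of mine, not the actual CRDB types:

```go
package main

import (
	"bytes"
	"fmt"
)

// RangeKey is a simplified stand-in for the proposed MVCCRangeKey: a
// [StartKey, EndKey) span at a single timestamp (timestamps reduced to int64).
type RangeKey struct {
	StartKey, EndKey []byte
	Timestamp        int64
}

// Valid reports whether the range key is well-formed: non-empty bounds,
// StartKey < EndKey, and a non-zero timestamp.
func (k RangeKey) Valid() bool {
	return len(k.StartKey) > 0 && len(k.EndKey) > 0 &&
		bytes.Compare(k.StartKey, k.EndKey) < 0 && k.Timestamp != 0
}

// Clamp truncates the range key's bounds to [start, end), as a reader might
// when clamping returned bounds to the request bounds.
func (k RangeKey) Clamp(start, end []byte) RangeKey {
	c := k
	if bytes.Compare(start, c.StartKey) > 0 {
		c.StartKey = start
	}
	if bytes.Compare(end, c.EndKey) < 0 {
		c.EndKey = end
	}
	return c
}

func main() {
	k := RangeKey{StartKey: []byte("b"), EndKey: []byte("f"), Timestamp: 3}
	fmt.Println(k.Valid()) // true
	c := k.Clamp([]byte("c"), []byte("e"))
	fmt.Printf("[%s, %s)\n", c.StartKey, c.EndKey) // [c, e)
}
```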

MVCCDeleteRangeUsingTombstone

Range tombstones are written via storage.MVCCDeleteRangeUsingTombstone:

func MVCCDeleteRangeUsingTombstone(
	ctx context.Context,
	rw ReadWriter,
	ms *enginepb.MVCCStats,
	key, endKey roachpb.Key,
	timestamp hlc.Timestamp,
	maxIntents int64,
) error

This will scan for any separated intents in the span, returning a WriteIntentError, and then write the Pebble range key. It is not transactional, and thus cannot be atomic across ranges. Integration with MVCC statistics is currently unspecified.

It relies on the following lower-level Engine method:

type Engine interface {
	DeleteMVCCRange(MVCCRangeKey) error
}

MVCCIterator

Most MVCC read logic is built on top of MVCCIterator. To retain compatibility with current callers, this will expose range tombstones as synthesized point tombstones above/below existing point keys. This will not require any API changes, and will allow e.g. MVCCScan and MVCCGet to automatically handle range tombstones.

This will be implemented by a new storage.pebblePointSynthesizingIterator, which sits between storage.pebbleIterator and pebble.Iterator. Currently it will only synthesize point tombstones, but it could be extended to support e.g. ranged intents and other range key types. It will iterate across point keys, decode any range keys above/below them, and emit synthesized point keys in the appropriate order.
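The interleaving such a synthesizing iterator performs can be illustrated with a toy sketch (string keys, int64 timestamps, and the function name are hypothetical simplifications, not the proposed implementation): for a single user key, the point versions and the timestamps of covering range tombstones are merged in descending timestamp order, with a nil-valued synthetic tombstone emitted for each range tombstone.

```go
package main

import (
	"fmt"
	"sort"
)

// Version is a simplified MVCC version: a timestamp and a value, where a nil
// value denotes a tombstone.
type Version struct {
	TS    int64
	Value []byte
}

// synthesizePoints merges the point versions of one user key with the
// timestamps of range tombstones covering that key, emitting a synthetic
// point tombstone (nil value) for each covering range tombstone, ordered by
// descending timestamp as an MVCC iterator would emit them.
func synthesizePoints(points []Version, rangeTombstoneTSs []int64) []Version {
	out := append([]Version(nil), points...)
	for _, ts := range rangeTombstoneTSs {
		out = append(out, Version{TS: ts}) // synthetic tombstone
	}
	sort.Slice(out, func(i, j int) bool { return out[i].TS > out[j].TS })
	return out
}

func main() {
	points := []Version{{TS: 5, Value: []byte("a5")}, {TS: 1, Value: []byte("a1")}}
	merged := synthesizePoints(points, []int64{3})
	for _, v := range merged {
		fmt.Println(v.TS, v.Value == nil)
	}
	// Emits versions at ts=5, ts=3 (synthetic tombstone), ts=1.
}
```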

As a performance optimization, callers can request masking of point keys below range keys via IterOptions.RangeTombstoneMaskingAt. This gives a timestamp (typically the caller's read timestamp) below which any range keys will mask point keys and prevent them from being emitted. This cannot be used when the MVCC caller has requested tombstones to be emitted, since we must synthesize point tombstones for the point keys that would be masked.
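The masking rule can be sketched as a simple filter (a hedged illustration with int64 timestamps and a single covering range tombstone; the real masking happens inside Pebble, not in caller code like this): a point version is hidden when it sits below a range tombstone that is itself at or below the mask timestamp.

```go
package main

import "fmt"

// maskedPoints drops point versions covered by a range tombstone when masking
// is enabled: a point at timestamp ts is masked if the range tombstone at
// timestamp rangeTombstoneTS covers it (ts < rangeTombstoneTS) and the
// tombstone is at or below maskBelow (typically the caller's read timestamp).
func maskedPoints(points []int64, rangeTombstoneTS, maskBelow int64) []int64 {
	var out []int64
	for _, ts := range points {
		if rangeTombstoneTS <= maskBelow && ts < rangeTombstoneTS {
			continue // masked: deleted below the read timestamp, never emitted
		}
		out = append(out, ts)
	}
	return out
}

func main() {
	// Range tombstone at ts=3, reader at ts=4: points below 3 are masked.
	fmt.Println(maskedPoints([]int64{5, 2, 1}, 3, 4)) // [5]
	// Reader at ts=2 does not see the tombstone, so nothing is masked.
	fmt.Println(maskedPoints([]int64{5, 2, 1}, 3, 2)) // [5 2 1]
}
```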

Callers who need special handling of range tombstones (e.g. exports or rangefeeds) can disable tombstone synthesis by setting IterOptions.DisableRangeTombstoneSynthesis and access range tombstones via MVCCRangeTombstoneIterator (see below).

MVCCRangeTombstoneIterator

MVCCRangeTombstoneIterator iterates over MVCC range tombstones and emits them as MVCCRangeKey. A minimal interface would be e.g.:

type MVCCRangeTombstoneIterator interface {
	Close()
	Next()
	SeekGE(MVCCKey)
	UnsafeKey() MVCCRangeKey
	Valid() (bool, error)
}

This is exposed via a new method on Engine, which can either take the existing IterOptions or, more likely, a new option struct:

type Engine interface {
  NewMVCCRangeTombstoneIterator(opts IterOptions) MVCCRangeTombstoneIterator
}
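To illustrate how a caller might drive the proposed interface, here is a toy, slice-backed implementation (string keys and int64 timestamps stand in for the real `MVCCKey`/`MVCCRangeKey` types; none of this is actual CRDB code):

```go
package main

import (
	"fmt"
	"sort"
)

// RangeKey mirrors the proposed MVCCRangeKey shape (timestamps simplified).
type RangeKey struct {
	StartKey, EndKey string
	Timestamp        int64
}

// sliceRangeTombstoneIter is a toy, slice-backed stand-in for the proposed
// MVCCRangeTombstoneIterator surface, for illustration only. It assumes its
// keys are sorted by end key.
type sliceRangeTombstoneIter struct {
	keys []RangeKey
	pos  int
}

func (i *sliceRangeTombstoneIter) Close()               {}
func (i *sliceRangeTombstoneIter) Next()                { i.pos++ }
func (i *sliceRangeTombstoneIter) Valid() (bool, error) { return i.pos < len(i.keys), nil }
func (i *sliceRangeTombstoneIter) UnsafeKey() RangeKey  { return i.keys[i.pos] }

// SeekGE positions the iterator at the first range key whose end key is
// greater than the seek key, i.e. the first one that can overlap it.
func (i *sliceRangeTombstoneIter) SeekGE(key string) {
	i.pos = sort.Search(len(i.keys), func(j int) bool { return i.keys[j].EndKey > key })
}

func main() {
	it := &sliceRangeTombstoneIter{keys: []RangeKey{
		{StartKey: "a", EndKey: "c", Timestamp: 1},
		{StartKey: "d", EndKey: "f", Timestamp: 2},
	}}
	defer it.Close()
	for it.SeekGE("c"); ; it.Next() {
		if ok, err := it.Valid(); err != nil || !ok {
			break
		}
		k := it.UnsafeKey()
		fmt.Printf("[%s, %s) @%d\n", k.StartKey, k.EndKey, k.Timestamp)
	}
	// Prints only the tombstone [d, f) @2.
}
```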

@aliher1911
Contributor

In the MVCC API, range tombstones will be represented by a new MVCCRangeKey type:

Should this be the case for any MVCCRangeValue, not only for the one containing MVCCRangeTombstone?

@erikgrinaker
Contributor Author

In the MVCC API, range tombstones will be represented by a new MVCCRangeKey type:

Should this be the case for any MVCCRangeValue, not only for the one containing MVCCRangeTombstone?

Probably not. Range tombstones can, because the key/timespan fully describes the tombstone, but this won't be the case for e.g. ranged intents that will have lots of metadata.

@erikgrinaker
Contributor Author

erikgrinaker commented Jan 27, 2022

expose range tombstones as synthesized point tombstones above/below existing point keys. This will not require any API changes, and will allow e.g. MVCCScan and MVCCGet to automatically handle range tombstones.

One caveat is uncertainty restarts and other conflict handling, which must explicitly check for range tombstones at a non-existing point key. But these exceptions can likely be layered on top of the point tombstone synthesis, or achieved via synthesis settings.

@sumeerbhola
Collaborator

One disadvantage of the key/suffix scheme is that range values of different kinds (e.g. range tombstones and range intents) cannot be iterated over separately.

Range keys provide two benefits:

  • simple handling of CockroachDB range splits/merges/rebalancing.
  • in some cases, the benefit of native interleaving support from Pebble.

Both benefits apply to range tombstones since they are physically part of the same key space as point values. Intents, on the other hand, are already in a separate part of the key space for points, and it probably makes sense to continue that for future range intents. In which case interleaving (when needed) will need to be explicit in CockroachDB (like we do now with intentInterleavingIter). So I don't think this is an actual weakness.

This will be implemented by a new storage.pebblePointSynthesizingIterator, which sits between storage.pebbleIterator and pebble.Iterator.

storage.pebbleIterator implements MVCCIterator and EngineIterator. The latter is often used for lock table iteration for which one doesn't need synthesizing of points for range keys. Is the idea here to define a common interface that both pebble.Iterator and storage.pebblePointSynthesizingIterator implement?

This cannot be used when the MVCC caller has requested tombstones to be emitted, since we must synthesize point tombstones for the point keys that would be masked.

I'm thinking about the case where the caller wants masking, and sees a set of points P (post-masking).

  • It still wants synthesized point tombstones that are more recent than P to do the uncertainty check.
  • If it was scanning a range R, I don't think it cares about the fact that there are range tombstones within the uncertainty interval in R-P, since either way there is no value there. If this is correct, I think we don't need to worry about "One caveat is uncertainty restarts and other conflict handling, which must explicitly check for range tombstones at a non-existing point key."

@erikgrinaker
Contributor Author

Both benefits apply to range tombstones since they are physically part of the same key space as point values. Intents, on the other hand, are already in a separate part of the key space for points, and it probably makes sense to continue that for future range intents. In which case interleaving (when needed) will need to be explicit in CockroachDB (like we do now with intentInterleavingIter). So I don't think this is an actual weakness.

Indeed, as I mentioned later in that paragraph, I don't see this being a problem for the anticipated range key types. But I thought it would be worth calling out anyway, in case we should ever come up with some other range key type across the normal user key space.

storage.pebbleIterator implements MVCCIterator and EngineIterator. The latter is often used for lock table iteration for which one doesn't need synthesizing of points for range keys. Is the idea here to define a common interface that both pebble.Iterator and storage.pebblePointSynthesizingIterator implement?

Right, didn't consider EngineIterator. I was hoping to avoid another interface here. It's possible for callers to disable range tombstone synthesis, and in that case we'll set up the pebble.Iterator with IterKeyTypePointsOnly instead of IterKeyTypePointsAndRanges. That should make the synthesis and range key logic basically a noop. Perhaps that will be sufficient?

Either way, I think the main point here is the concept of synthesizing point tombstones below MVCCIterator for easy integration and compatibility with existing callers. I'll look into the specifics of how to implement that, which will also depend on benchmarks.

  • If it was scanning a range R, I don't think it cares about the fact that there are range tombstones within the uncertainty interval in R-P, since either way there is no value there. If this is correct, I think we don't need to worry about "One caveat is uncertainty restarts and other conflict handling, which must explicitly check for range tombstones at a non-existing point key."

For non-tombstone reads, you're right. But if the caller has requested reads with tombstones, we need to determine whether the range tombstone across the empty span is visible or not. And for writes, we need to conflict with point writes at the same timestamp at least -- we may also need logic to order the range tombstone write and point write with respect to each other, but it's possible that locking and the timestamp cache takes care of this already (I'll have to give it some thought). In any case, I think it should be sufficient to layer this on top in the few cases that it's needed.

@sumeerbhola
Collaborator

For non-tombstone reads, you're right. But if the caller has requested reads with tombstones, we need to determine whether the range tombstone across the empty span is visible or not. And for writes ...

It may be worthwhile enumerating the various cases in a doc so we can poke at it and revise. For instance

  • I see that we may need something to be visible for writes, but I am not exactly clear how that should be made visible -- the write is writing specific points so if we knew those keys we could also synthesize point keys.
  • For reads, I don't know the full list of cases where scan or get options is setting Tombstones=true. Rangefeeds and txn refreshes do, but they shouldn't care about uncertainty.

@erikgrinaker
Contributor Author

It may be worthwhile enumerating the various cases in a doc so we can poke at it and revise.

Yeah, I think that's a good idea. I'll write up either an RFC or a tech note to nail down the final MVCC range tombstone implementation and document it. I'm going to put a pin in these scan details for now and revisit.

the write is writing specific points so if we knew those keys we could also synthesize point keys

I was thinking the same thing. This would also handle an MVCCGet of a missing key.

I don't know the full list of cases where scan or get options is setting Tombstones=true

Doesn't seem to be that many. It's mostly rangefeeds and txn refreshes, as you point out, as well as testing. Exports and GC scans also care about range tombstones, but these have specialized APIs and will read the range tombstones separately.

The tombstone value includes the start and end keys to disambiguate individual range tombstones from constituent Pebble range key fragments. This also allows individual range tombstones to be identified across ranges, although we may want to clamp the returned bounds to the request bounds where appropriate. It also serves as a performance optimization, to avoid the need to decode every fragment for comparison during forwards and backwards iteration.

This would've been somewhat helpful for scans, but as we've touched on previously, I don't think we can/should think of a range tombstone as a single logical entity with fixed start/end bounds. The main reason is that mutations are non-atomic. For example, a range delete across multiple ranges may succeed on some ranges but fail on others. Similarly, range GC happens for different ranges at different times, so a tombstone spanning ranges would have "holes" in it. And splits/merges would require tombstone rewrites.

This is somewhat retreading old ground, but I'm reaching the same conclusion again: let's drop stored start/end bounds, but defragment during iteration for convenience.
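Defragmentation during iteration can be sketched as merging abutting fragments with equal timestamps. This is a simplified illustration of mine (string keys, int64 timestamps), not the eventual implementation:

```go
package main

import "fmt"

// fragment is a range key fragment with string bounds and a timestamp.
type fragment struct {
	start, end string
	ts         int64
}

// defragment merges abutting fragments at the same timestamp, assuming the
// input is sorted by start key (as fragments are emitted during forward
// iteration).
func defragment(frags []fragment) []fragment {
	var out []fragment
	for _, f := range frags {
		if n := len(out); n > 0 && out[n-1].end == f.start && out[n-1].ts == f.ts {
			out[n-1].end = f.end // extend the previous fragment
			continue
		}
		out = append(out, f)
	}
	return out
}

func main() {
	frags := []fragment{
		{"a", "c", 3}, {"c", "e", 3}, // abutting, same timestamp: merged
		{"e", "g", 5},                // different timestamp: kept separate
	}
	for _, f := range defragment(frags) {
		fmt.Printf("[%s, %s) @%d\n", f.start, f.end, f.ts)
	}
}
```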

craig bot pushed a commit that referenced this issue Jun 4, 2022
77417: storage: add experimental MVCC range key primitives r=sumeerbhola,jbowens,nicktrav a=erikgrinaker

This patch adds initial experimental primitives for MVCC range keys,
which will be the foundation for MVCC range tombstones. They are based
on experimental Pebble range keys.

* Data structures:
  * `MVCCRangeKey`
  * `MVCCRangeKeyValue`

* `Engine` methods for mutating range keys:
  * `ExperimentalClearMVCCRangeKey()`
  * `ExperimentalClearAllMVCCRangeKeys()`
  * `ExperimentalPutMVCCRangeKey()`
  * `SupportsRangeKeys()`

* `SimpleMVCCIterator` methods for accessing range keys:
  * `HasPointAndRange()`
  * `RangeBounds()`
  * `RangeKeys()`

Range keys do not have a distinct identity, and should instead be
considered a key continuum: they will merge with abutting keys of the
same value, can be partially cleared, can split or merge along with
ranges, and so on. Bounded scans will truncate them to the scan bounds.

Only MVCC range tombstones are currently supported, with an empty value.
Attempts to write a non-tombstone value will error, since these would
need additional logic throughout the MVCC APIs, and their semantics are
still unclear.

Range key support is implemented in `pebbleIterator` and
`intentInterleavingIter`, but not in the rest of the MVCC or KV APIs.
They are not persisted to disk either. Subsequent pull requests will
extend their functionality and integrate them with other components.

Touches #70412.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
craig bot pushed a commit that referenced this issue Jun 7, 2022
82017: storage: add `IterOptions.RangeKeyMaskingBelow` r=jbowens a=erikgrinaker

This patch adds an iterator option `RangeKeyMaskingBelow`, which will
cause range keys at or below the given timestamp to mask any point keys
below them. This can be used to optimize callers that do not care about
deleted point keys, since Pebble will not emit these.

Touches #70412.

Release note: None

82088: roachtest: extract version map into separate file r=tbg a=rail

Previously, we had to patch the code in order to update the version
upgrade test's version mapping. This made automation a bit complicated,
thus it was done manually.

Additionally, when we upgraded to a new major version, we would need to
patch the same test in order to generate fixtures.

This patch extracts the version mapping into a separate file, which will
be easier to patch using automation. Additionally, it extracts the fixture
generation code into a separate test which is skipped by CI, but can be
run manually.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
craig bot pushed a commit that referenced this issue Jun 7, 2022
…82521

79748: streamingccl: add a builtin function that retrieves stream ingestion statistics r=gh-casper a=gh-casper

Support crdb_internal.stream_ingestion_stats(ingestion_job_id) to report
running progress for a stream ingestion job.

Release note (sql change): Support crdb_internal.stream_ingestion_stats
(ingestion_job_id) to report running progress for a stream ingestion job.

Release Justification: Cat 4.

Closes: #57430

81598: obsservice: initial commit r=andreimatei a=andreimatei

These are the humble beginnings of an Observability Service for CRDB. In
time, this is meant to take over a lot of the observability
functionality, extracting it from the database itself. For now, all
there is here is an HTTP reverse proxy, able to route HTTP requests to a
CRDB node (or cluster, through a load balancer). At least for the
transitioning period, if not forever, the Obs Service will route HTTP
paths that it doesn't understand to the Cockroach cluster being
monitored.

The HTTP proxy accepts its own certificates and can serve HTTPS
independently of CRDB (which might be running in --insecure mode and not
serving HTTPS). However, if CRDB is serving HTTPS, then the Obs Service
must also be configured with a certificate so it can also serve HTTPS.

The code of the Obs Service is split in between a binary (minimal
shell), and a library. The idea is to keep the door open for other repos
to extend the Obs Service with extra functionality (e.g. the CC managed
service might want to add cloud-specific observability features). For
such purposes, I've considered making a dedicated Go module for the Obs
Service so that source-level imports of the Obs Service would be as
ergonomic as Go wants them to be. I've put a decent amount of work in
playing with this but, ultimately, I've decided against it, at least for
now. The problem is that the Obs Service wants to take minor code
dependencies on random CRDB libraries (syncutil, log, etc.) and the
cockroach Go module is fairly broken (our git tags look like we do
semantic versioning, but we don't actually do it). The go tooling has a
terrible time with the cockroach module. Also, our code is generally not
`go build`-able. If the obs service was a dedicated module, it would
need to take a dependency on the cockroach module, which would negate
the win for people who want to import it (in fact, it'd make matters
worse for importers because the nasty cockroach dependency would be more
hidden). An alternative I've considered was to start creating modules
for all of the dependencies of the Obs Service. But, if CRDB would then
need to depend on these modules, there's all sorts of problems.

Release note: None

82041: storage: clear range keys in `Writer.Clear*Range` methods r=jbowens a=erikgrinaker

This patch clears range keys in the `Writer` methods `ClearRawRange`,
`ClearMVCCRange`, and `ClearMVCCIteratorRange`, as well as in the
`ClearRangeWithHeuristic` helper.

Range keys are not cleared in `ClearMVCCVersions`, since this method is
specifically for clearing MVCC point key versions, and it is not
possible to clear range keys between versions of the same point key.

Touches #70412.

Release note: None

82377: sql: use declarative schema changer for add column with unique r=fqazi a=fqazi

Previously, the declarative schema changer was disabled
when adding a column with unique constraint, since we
didn't have a complete set of elements / logic for adding
secondary indexes. This was inadequate because operations
would not ordered correctly and rollbacks could potentially
break. To address this, this patch enables support and
addresses the missing pieces to setup the secondary indexes
correctly, such as adding suffix columns, ordering operations
correctly, and returning appropriate errors.

Release note: None

82380: util/log: rework the bufferSink r=andreimatei a=andreimatei

This is a rewrite of bufferSink (now called bufferedSink). As opposed to
before, the code is smaller, more efficient, and, most importantly,
simpler. I couldn't understand the previous code well, and I think it
made new additions to it (in particular, adding a shutdown sequence)
hard to analyze. The semantics of this buffer are now better and simpler.

Before, a log message was traveling through a sequence of two buffers
before being eventually delivered to the wrapped sink:
1. The message was first placed into an accumulator buffer.
2. Then, when that buffer grows too old or too large, the accumulator
   buffer is placed into a queue to be flushed later, and the
   accumulator was reset.
3. There was a limit on the size of this queue of batches to
   be flushed. When the queue was full, new messages were dropped.
There were two goroutines managing these two buffers, which is one too
many.

This behavior of queuing multiple, distinct flushes, was peculiar at
best. There's generally no reason to not coalesce all the flushes into
one, thus speeding them up and reducing the risk of message drops. The
bufferSink's memory limit was not explicit, and difficult enough to
infer: it was set indirectly from the combination of limits on the two
buffers, expressed in different units.

Now, the sequence of two buffers (the accumulator and the flush queue)
is gone. There's a single buffer accumulating new messages, plus at most
one in-flight flush which captures its buffer. As soon as a flush is
triggered, the buffer is reset. The buffer is now configured with a
memory limit and, when that limit is reached, the _oldest_ messages in
the buffer are dropped, instead of the newest. This, combined with the
blocking semantics of the forceSync option, means that a forceSync
message is never dropped (unless its size is, by itself, larger than the
buffer limit). The number of goroutines running for the bufferSink drops
from two to one.

Release note: None

82436: server, ui: handle null plan gist in getStatementDetailsPerPlanHash r=ericharmeling a=ericharmeling

This PR adds a new function that handles null plan gists in `getStatementDetailsPerPlanHash`. The PR also adds some logic to hide null plan gists in the SqlBox of the Statement Details page.

Here's a screenshot of the Statement Details page when the plan gist is null:

[Screenshot: Statement Details page showing a null plan gist]


Fixes #82095.

We probably want to figure out why there are null values in [`planner.instrumentation.planGist`](https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/executor_statement_metrics.go#L212) in the first place. `getStatementDetailsPerPlanHash` is pulling from the system table populated with the plan gist from `planner.instrumentation.planGist`. CC `@cucaroach`

Release note: None

82471: dev: allow `--rewrite` when testing many targets with `/...` r=rail a=rickystewart

Closes #82053.

Release note: None

82479: keys: resolve subtle non-bug by exporting RaftLogKeyFromPrefix r=tbg,erikgrinaker a=nvanbenschoten

This commit resolves a subtle interaction that came tantalizingly close to a bug in `StateLoader.LoadLastIndex`. The method uses the `StateLoader`'s underlying `keys.RangeIDPrefixBuf` to generate two keys (`RaftLogPrefix` and `RaftLogKey`) that are in use at the same time.

`RangeIDPrefixBuf` avoids heap allocations by sharing a single byte slice across all keys that it generates. This is why the type has the comment:
> The generated keys are only valid until the next call to one of the key generation methods.

As would be expected, given this comment, the second use of the `RangeIDPrefixBuf` overwrote the buffer and invalidated the first key. However, it happened to get lucky and generate a new key with the same prefix as the old key. As a result, the contents of the first key did not change.

To make this aliasing more explicit and avoid this becoming a real bug in the future, we introduce a new `RaftLogKeyFromPrefix` function that callers can use to generate raft log entry keys from a raft log prefix.

We then use this new function to avoid some redundant encoding work elsewhere due to repeated calls to `RaftLogKey`.

82520: cli,sql: fix reported result columns for EXPLAIN ANALYZE r=yuzefovich a=rafiss

Fixes #82502

This fixes an issue where EXPLAIN ANALYZE would report a RowDescription
for both the EXPLAIN and for the statement being explained. If the
statement had a different number of result columns, this would confuse
the CLI.

No release note, since this bug was not released.

Release note: None

82521: roachtest: fix a minor bug with tpch_concurrency roachtest r=yuzefovich a=yuzefovich

`tpch_concurrency` pushes the cluster to its limits, so it is expected
that OOMs occur. Previously, in a recently added code path, we would
fail the test if an error occurred when running a debugging statement,
yet that error is expected and should be returned.

Fixes: #82510.

Release note: None

Co-authored-by: Casper <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot pushed a commit that referenced this issue Jun 8, 2022
82013: storage: add MVCC point synthesizing iterator r=jbowens a=erikgrinaker

This patch adds `pointSynthesizingIter`, an MVCC iterator which wraps an
arbitrary MVCC iterator and synthesizes point keys for range keys at
their start key and where they overlap point keys. It can optionally
synthesize around the SeekGE seek key too, which is useful for point
operations like `MVCCGet` where we may want to return a synthetic
tombstone for an MVCC range tombstone if there is no existing point key.

This will primarily be used to handle MVCC range tombstones in MVCC
scans and gets, as well as during MVCC conflict checks, which allows
much of this logic to remain unchanged and simplified (in particular,
`pebbleMVCCScanner` will not need any changes).

However, this patch does not make use of the iterator yet, since both it
and Pebble will need further performance optimizations for use in hot
paths. For now, correctness is sufficient, and only basic attempts at
performance optimization have been made.

Touches #70412.

Release note: None

82547: opt: fix inverted key column type in testcat r=mgartner a=mgartner

Previously, inverted key columns in the test catalog were incorrectly
given the type of their source column. Now they have the correct type,
`types.EncodedKey`.

Release note: None

82635: bazel: unlist unsupported os+arch combos r=irfansharif a=irfansharif

We don't support these for CRDB builds. Even windows is experimental,
but let's keep that around anyway since we publish binaries for it.

+cc #82625.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
craig bot pushed a commit that referenced this issue Jun 15, 2022
81189: storage: add conflict handling for MVCC range tombstones r=jbowens a=erikgrinaker

This patch takes MVCC range tombstones into account for all MVCC write
operations, i.e. for conflict checks and conditional writes.

MVCC stats updates are not covered here, but will be addressed in a
subsequent PR. This also adds more exhaustive tests, especially for the
intent handling in `mvccPutInternal()`.

Touches #70412.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
craig bot pushed a commit that referenced this issue Jun 26, 2022
82045: storage: add MVCC range tombstone handling in scans and gets r=jbowens a=erikgrinaker

This patch adds MVCC range tombstone handling for scans and gets. In the
basic case, this simply means that point keys below an MVCC range
tombstone are not visible.

When tombstones are requested by the caller, the MVCC range tombstones
themselves are never exposed, to avoid having to explicitly handle these
throughout the codebase. Instead, synthetic MVCC point tombstones are
emitted at the start of MVCC range tombstones and wherever they overlap
a point key (above and below). Additionally, point gets return synthetic
point tombstones if they overlap an MVCC range tombstone even if no
existing point key exists. This is based on `pointSynthesizingIter`,
which avoids additional logic in `pebbleMVCCScanner`.

Synthetic MVCC point tombstones emitted for MVCC range tombstones are
not stable, nor are they fully deterministic. For example, the start key
will be truncated by iterator bounds, so an `MVCCScan` over a given key
span may see a synthetic point tombstone at its start (if it overlaps an
MVCC range tombstone), but this will not be emitted if a broader span is
used (a different point tombstone will be emitted instead). Similarly, a
CRDB range split/merge will split/merge MVCC range tombstones, changing
which point tombstones are emitted. Furthermore, `MVCCGet` will
synthesize an MVCC point tombstone if it overlaps an MVCC range
tombstone and there is no existing point key there, while an `MVCCScan`
will not emit these. Callers must take care not to rely on such
semantics for MVCC tombstones. Existing callers have been audited to
ensure they are not affected.

Point tombstone synthesis must be enabled even when the caller has not
requested tombstones, because they must always be taken into account for
conflict/uncertainty checks. However, in these cases we enable range key
masking below the read timestamp, omitting any covered points since
these are no longer needed.

Touches #70412.

Release note: None
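The visibility rule above can be illustrated with a toy model (this is an invented sketch, not the actual `pebbleMVCCScanner`/`pointSynthesizingIter` code; all type and function names here are hypothetical):

```go
package main

import "fmt"

// version is a point-key version; an empty value models a point tombstone.
type version struct {
	ts    int
	value string
}

// rangeTombstone models a single MVCC range tombstone over [start, end).
type rangeTombstone struct {
	start, end string
	ts         int
}

// get returns the value visible at readTS and whether the result is a
// (possibly synthetic) tombstone. When the range tombstone is the latest
// visible write for the key, a point tombstone is synthesized even if no
// point key exists there, mirroring the MVCCGet behavior described above.
func get(points map[string][]version, rt rangeTombstone, key string, readTS int) (string, bool) {
	var pt *version
	for i := range points[key] { // versions sorted by descending timestamp
		if points[key][i].ts <= readTS {
			pt = &points[key][i]
			break
		}
	}
	covered := key >= rt.start && key < rt.end && rt.ts <= readTS
	if covered && (pt == nil || rt.ts > pt.ts) {
		return "", true // synthetic point tombstone
	}
	if pt == nil {
		return "", false // no visible version at all
	}
	return pt.value, pt.value == ""
}

func main() {
	points := map[string][]version{"b": {{ts: 3, value: "v1"}}}
	rt := rangeTombstone{start: "a", end: "e", ts: 5}

	v, tomb := get(points, rt, "b", 7) // range tombstone covers b@3
	fmt.Println(v, tomb)               // "" true
	v, tomb = get(points, rt, "b", 4) // read below the range tombstone
	fmt.Println(v, tomb)              // "v1" false
}
```

Note how a get of a key with no point version at all (e.g. `"c"` at timestamp 7 here) still yields a synthetic tombstone, whereas in the real code an `MVCCScan` would not emit one for such a key.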

Co-authored-by: Erik Grinaker <[email protected]>
craig bot pushed a commit that referenced this issue Jun 28, 2022
78085: storage: add `MVCCStats` for range keys r=jbowens a=erikgrinaker

**storage: export some MVCC key encoding functions**

Release note: None

**roachpb: add `Key.Prevish()` to find a previous key**

This patch adds `Key.Prevish()`, which returns a previous key in
lexicographical sort order. This is needed to expand a latch span
leftwards to peek at any left-adjacent range keys.

It is impossible to find the exact immediate predecessor of a key,
because a predecessor can have an arbitrary number of trailing `0xff`
bytes, so this returns the nearest previous key right-padded with `0xff`
up to the given length. It is still possible for an infinite number of
keys to exist between `Key` and `Key.Prevish()`, as keys have unbounded
length.

Release note: None
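The padding behavior described above can be sketched roughly as follows (an illustrative approximation, not the real `roachpb`/`encoding` implementation; `prevish` is an invented name):

```go
package main

import "fmt"

// prevish returns a previous key in lexicographic order. If the key ends
// in 0x00, the exact predecessor is the key with that byte dropped:
// nothing sorts strictly between them. Otherwise the last byte is
// decremented and the result right-padded with 0xff up to length; keys
// longer than length can still sort between the result and the input,
// hence only "previous-ish".
func prevish(b []byte, length int) []byte {
	if len(b) == 0 {
		return b // the empty key is already the minimum
	}
	if b[len(b)-1] == 0x00 {
		return b[:len(b)-1]
	}
	prev := append([]byte(nil), b...)
	prev[len(prev)-1]--
	for len(prev) < length {
		prev = append(prev, 0xff)
	}
	return prev
}

func main() {
	fmt.Printf("%q\n", prevish([]byte("b"), 3))     // "a\xff\xff"
	fmt.Printf("%q\n", prevish([]byte("a\x00"), 3)) // "a"
}
```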

**storage: add `MVCCStats` for range keys**

This patch adds `MVCCStats` tracking for range keys. Four new fields are
added to `MVCCStats`:

* `RangeKeyCount`: the number of (fragmented) range keys, not counting
  historical versions.

* `RangeKeyBytes`: the logical encoded byte size of all range keys.
  The latest version contributes the encoded key bounds, and all
  versions contribute encoded timestamps. Unlike point keys, which for
  historical reasons use a fixed-size timestamp contribution, this uses
  the actual variable-length timestamp size.

* `RangeValCount`: the number of (fragmented) range key versions.

* `RangeValBytes`: the encoded size of all range key values across
  all versions. The same value can be stored in multiple range keys
  due to fragmentation, which will be counted separately. Even though
  all range keys are currently MVCC range tombstones with no value, the
  `MVCCValueHeader` contribution can be non-zero due to e.g. a local
  timestamp.
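The `RangeKeyBytes` rule can be sketched as a small helper (illustrative only; the function name and byte lengths below are invented, not the storage package's API):

```go
package main

import "fmt"

// rangeKeyBytes computes the logical size of one fragmented range key
// stack: the latest version contributes the encoded start/end key bounds,
// and every version contributes its actual encoded timestamp size
// (unlike point keys, which use a fixed-size timestamp contribution).
func rangeKeyBytes(startKeyLen, endKeyLen int, encodedTSLens []int) int {
	n := startKeyLen + endKeyLen
	for _, l := range encodedTSLens {
		n += l
	}
	return n
}

func main() {
	// A range key [a, c) with two versions whose encoded timestamps are
	// 9 and 13 bytes (e.g. wall time only vs. wall time plus logical).
	fmt.Println(rangeKeyBytes(2, 2, []int{9, 13})) // 26
}
```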

`ComputeStatsForRange()` has been extended to calculate the above
quantities, and additionally account for range tombstones themselves in
`GCBytesAge` along with their effect on point keys. All relevant call
sites have been updated to surface range keys for the MVCC iterators
passed to `ComputeStatsForRange()`.

Most MVCC operations have been updated to correctly account for MVCC
range tombstones, e.g. during point key writes and intent resolution. KV
APIs have not yet been updated; this will be addressed separately.

Range key stats are also adjusted during range splits and merges, which
will split and merge any range keys that straddle the split key. This
requires a single range key seek to the left and right of the split key
during these operations.
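The split behavior can be illustrated with a toy helper (a hypothetical sketch, not the actual split/merge or stats code):

```go
package main

import "fmt"

// splitRangeKey cuts a range key straddling a split point into a left and
// right fragment; merging is the inverse. A split point outside (start,
// end) leaves the range key unchanged.
func splitRangeKey(start, end, splitKey string) [][2]string {
	if splitKey <= start || splitKey >= end {
		return [][2]string{{start, end}}
	}
	return [][2]string{{start, splitKey}, {splitKey, end}}
}

func main() {
	fmt.Println(splitRangeKey("a", "e", "c")) // [[a c] [c e]]
	fmt.Println(splitRangeKey("a", "e", "f")) // [[a e]]
}
```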

Touches #70412.

Release note: None

82384: sql: reuse the slice of RequestUnion objects between fetches r=yuzefovich a=yuzefovich

This commit teaches `txnKVFetcher` and `txnKVStreamer` to reuse the same
slice of `RequestUnion` objects between different fetches. It is now
extremely easy to do given the recent refactor. We do perform memory
accounting for this slice (against a memory account bound to an
unlimited memory monitor). Additionally, a similar optimization is
applied to how resume requests are populated by the Streamer.

Addresses: #82160.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot pushed a commit that referenced this issue Jul 1, 2022
80840: batcheval: handle MVCC range tombstones in `ClearRange` r=aliher1911 a=erikgrinaker

This patch makes `ClearRange` account for MVCC range tombstones when
updating MVCC stats.

Touches #70412.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
@erikgrinaker

erikgrinaker commented Jul 5, 2022

MVCC range tombstones are now functional, except for garbage collection which is tracked separately in #70414. For details, see tech note in #83762.

There is a large amount of optimization and test work remaining, but this is tracked separately. See CRDB-2624 for details (internal).
