storage: MVCC range deletion tombstones #70412
May be implemented in Pebble as generalized range key support: cockroachdb/pebble#1339
I've been prototyping MVCC range tombstones based on Pebble range keys. I'm converging on the following approach, and would appreciate a quick look by @sumeerbhola and @jbowens.

**Data structure**

Pebble range keys will use regular user keys as the start and end keys, and MVCC timestamps as the suffix. This allows easy integration with point keys. To allow multiple range key kinds, the value will be a Protobuf union type, currently with only a range tombstone kind:

```protobuf
message MVCCRangeValue {
  option (gogoproto.onlyone) = true;
  MVCCRangeTombstone tombstone = 1;
}

message MVCCRangeTombstone {
  // NB: can't be roachpb.Key due to import cycle.
  bytes start_key = 1;
  bytes end_key = 2;
}
```

The tombstone value includes the start and end keys to disambiguate individual range tombstones from their constituent Pebble range key fragments. This also allows individual range tombstones to be identified across ranges, although we may want to clamp the returned bounds to the request bounds where appropriate. It also serves as a performance optimization, avoiding the need to decode every fragment for comparison during forward and backward iteration. However, these values will be duplicated for each range key fragment, so there will be some storage overhead.

One disadvantage of the key/suffix scheme is that range values of different kinds (e.g. range tombstones and range intents) cannot be iterated over separately. Since the number of range tombstones is expected to be small, this will likely not present a problem. However, it will prevent differentiated range key masking by value kind, although this is not expected to be a problem for the currently anticipated range key types (tombstones and intents).

In the MVCC API, range tombstones will be represented by a new type:

```go
type MVCCRangeKey struct {
  StartKey  roachpb.Key
  EndKey    roachpb.Key
  Timestamp hlc.Timestamp
}
```
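To make the mapping concrete, here is a minimal, self-contained Go sketch of how such a range key could be handed to Pebble: bounds as plain user keys, the MVCC timestamp encoded as the suffix. The helper name and the byte layout of the suffix are illustrative assumptions, not the actual storage-package encoding.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Timestamp is a stand-in for hlc.Timestamp in this sketch.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// MVCCRangeKey mirrors the proposed MVCC API type.
type MVCCRangeKey struct {
	StartKey, EndKey []byte
	Timestamp        Timestamp
}

// pebbleRangeKeyParts (hypothetical) splits an MVCCRangeKey into the three
// parts a Pebble range key needs: start bound, end bound, and a suffix
// encoding the timestamp. The suffix layout below is illustrative only.
func pebbleRangeKeyParts(k MVCCRangeKey) (start, end, suffix []byte) {
	suffix = make([]byte, 12)
	binary.BigEndian.PutUint64(suffix[:8], uint64(k.Timestamp.WallTime))
	binary.BigEndian.PutUint32(suffix[8:], uint32(k.Timestamp.Logical))
	return k.StartKey, k.EndKey, suffix
}

func main() {
	k := MVCCRangeKey{StartKey: []byte("a"), EndKey: []byte("z"),
		Timestamp: Timestamp{WallTime: 100}}
	start, end, suffix := pebbleRangeKeyParts(k)
	fmt.Printf("RANGEKEYSET [%s, %s) suffix=%x\n", start, end, suffix)
}
```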
Should this be the case for any
Probably not. Range tombstones can, because the key/timespan fully describes the tombstone, but this won't be the case for e.g. ranged intents that will have lots of metadata.
One caveat is uncertainty restarts and other conflict handling, which must explicitly check for range tombstones at a non-existing point key. But these exceptions can likely be layered on top of the point tombstone synthesis, or achieved via synthesis settings.
Range keys provide two benefits:
Both benefits apply to range tombstones since they are physically part of the same key space as point values. Intents, on the other hand, are already in a separate part of the key space for points, and it probably makes sense to continue that for future range intents. In which case interleaving (when needed) will need to be explicit in CockroachDB (like we do now with
I'm thinking about the case where the caller wants masking, and sees a set of points P (post-masking).
Indeed, as I mentioned later in that paragraph, I don't see this being a problem for the anticipated range key types. But I thought it would be worth calling out anyway, in case we should ever come up with some other range key type across the normal user key space.
Right, didn't consider that. Either way, I think the main point here is the concept of synthesizing point tombstones below
For non-tombstone reads, you're right. But if the caller has requested reads with tombstones, we need to determine whether the range tombstone across the empty span is visible or not. And for writes, we need to conflict with point writes at the same timestamp at least -- we may also need logic to order the range tombstone write and point write with respect to each other, but it's possible that locking and the timestamp cache take care of this already (I'll have to give it some thought). In any case, I think it should be sufficient to layer this on top in the few cases where it's needed.
It may be worthwhile enumerating the various cases in a doc so we can poke at it and revise. For instance
Yeah, I think that's a good idea. I'll write up either an RFC or a tech note to nail down the final MVCC range tombstone implementation and document it. I'm going to put a pin in these scan details for now and revisit.
I was thinking the same thing. This would also handle an
Doesn't seem to be that many. It's mostly rangefeeds and txn refreshes, as you point out, as well as testing. Exports and GC scans also care about range tombstones, but these have specialized APIs and will read the range tombstones separately.
This would've been somewhat helpful for scans, but as we've touched on previously, I don't think we can/should think of a range tombstone as a single logical entity with fixed start/end bounds. The main reason is that mutations are non-atomic. For example, a range delete across multiple ranges may succeed on some ranges but fail on others. Similarly, range GC happens for different ranges at different times, so a tombstone spanning ranges would have "holes" in it. And splits/merges would require tombstone rewrites. This is somewhat retreading old ground, but I'm reaching the same conclusion again: let's drop stored start/end bounds, but defragment during iteration for convenience.
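To illustrate the defragmentation idea, here is a minimal Go sketch with simplified stand-in types (not the real iterator) that merges abutting range key fragments whose version stacks are identical back into one logical range key. The actual iterator would do this incrementally; this batch version just shows the merge rule.

```go
package main

import (
	"bytes"
	"fmt"
)

// fragment is a simplified stand-in for a Pebble range key fragment.
type fragment struct {
	start, end []byte
	versions   string // stand-in for the encoded timestamp/value stack
}

// defragment merges abutting fragments with identical version stacks,
// reconstructing the logical range keys a caller sees during iteration.
func defragment(frags []fragment) []fragment {
	var out []fragment
	for _, f := range frags {
		if n := len(out); n > 0 &&
			bytes.Equal(out[n-1].end, f.start) &&
			out[n-1].versions == f.versions {
			out[n-1].end = f.end // extend the previous logical range key
			continue
		}
		out = append(out, f)
	}
	return out
}

func main() {
	frags := []fragment{
		{[]byte("a"), []byte("c"), "@5"},
		{[]byte("c"), []byte("f"), "@5"}, // abuts and matches: merged
		{[]byte("g"), []byte("k"), "@5"}, // gap: stays a separate range key
	}
	for _, f := range defragment(frags) {
		fmt.Printf("[%s, %s) %s\n", f.start, f.end, f.versions)
	}
}
```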
77417: storage: add experimental MVCC range key primitives r=sumeerbhola,jbowens,nicktrav a=erikgrinaker

This patch adds initial experimental primitives for MVCC range keys, which will be the foundation for MVCC range tombstones. They are based on experimental Pebble range keys.

* Data structures:
  * `MVCCRangeKey`
  * `MVCCRangeKeyValue`
* `Engine` methods for mutating range keys:
  * `ExperimentalClearMVCCRangeKey()`
  * `ExperimentalClearAllMVCCRangeKeys()`
  * `ExperimentalPutMVCCRangeKey()`
  * `SupportsRangeKeys()`
* `SimpleMVCCIterator` methods for accessing range keys:
  * `HasPointAndRange()`
  * `RangeBounds()`
  * `RangeKeys()`

Range keys do not have a distinct identity, and should instead be considered a key continuum: they will merge with abutting keys of the same value, can be partially cleared, can split or merge along with ranges, and so on. Bounded scans will truncate them to the scan bounds.

Only MVCC range tombstones are currently supported, with an empty value. Attempts to write a non-tombstone value will error, since these would need additional logic throughout the MVCC APIs, and their semantics are still unclear.

Range key support is implemented in `pebbleIterator` and `intentInterleavingIter`, but not in the rest of the MVCC or KV APIs. They are not persisted to disk either. Subsequent pull requests will extend their functionality and integrate them with other components.

Touches #70412.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
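As a sketch of how a caller might consume this new iterator surface, the following Go snippet defines a trimmed stand-in interface with the methods named above and walks both point and range keys. Only the method names come from the PR; the signatures are simplified assumptions (the real ones use `MVCCKey`, `roachpb.Span`, etc.).

```go
package sketch

// SimpleMVCCIterator is a trimmed, hypothetical stand-in for the storage
// interface, limited to the methods this sketch needs.
type SimpleMVCCIterator interface {
	SeekGE(key []byte)
	Valid() (bool, error)
	Next()
	HasPointAndRange() (hasPoint, hasRange bool)
	RangeBounds() (start, end []byte)
}

// scanWithRangeKeys shows how a caller distinguishes point keys from range
// key fragments at each iterator position.
func scanWithRangeKeys(iter SimpleMVCCIterator, start []byte) error {
	for iter.SeekGE(start); ; iter.Next() {
		if ok, err := iter.Valid(); err != nil || !ok {
			return err
		}
		hasPoint, hasRange := iter.HasPointAndRange()
		if hasRange {
			// Bounds are truncated to the scan bounds, per the PR above.
			_, _ = iter.RangeBounds()
		}
		if hasPoint {
			// Handle the point key exactly as before range keys existed.
		}
	}
}
```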
82017: storage: add `IterOptions.RangeKeyMaskingBelow` r=jbowens a=erikgrinaker

This patch adds an iterator option `RangeKeyMaskingBelow`, which will cause range keys at or below the given timestamp to mask any point keys below them. This can be used to optimize callers that do not care about deleted point keys, since Pebble will not emit these.

Touches #70412.

Release note: None

82088: roachtest: extract version map into separate file r=tbg a=rail

Previously, we had to patch the code in order to update the version upgrade test's version mapping. This made automation a bit complicated, thus it was done manually. Additionally, when we upgraded to a new major version, we would need to patch the same test in order to generate fixtures. This patch extracts the version mapping into a separate file, which will be easier to patch using automation. Additionally, it extracts the fixture generation code into a separate test which is skipped by CI, but can be run manually.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
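A hedged sketch of how a read path might opt into this masking: the `RangeKeyMaskingBelow` field name is from the PR, while the surrounding constructor and option names follow the existing storage API as I understand it, so treat the details as assumptions.

```go
package sketch

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/storage"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// newMaskedIter returns an iterator that lets Pebble elide point keys covered
// by an MVCC range tombstone at or below readTS, for callers that don't care
// about deleted point keys.
func newMaskedIter(
	r storage.Reader, start, end roachpb.Key, readTS hlc.Timestamp,
) storage.MVCCIterator {
	return r.NewMVCCIterator(storage.MVCCKeyIterKind, storage.IterOptions{
		LowerBound:           start,
		UpperBound:           end,
		RangeKeyMaskingBelow: readTS, // mask points under range keys at/below readTS
	})
}
```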
79748: streamingccl: add a builtin function that retrieves stream ingestion statistics r=gh-casper a=gh-casper

Support `crdb_internal.stream_ingestion_stats(ingestion_job_id)` to report the running progress of a stream ingestion job.

Release note (sql change): Support `crdb_internal.stream_ingestion_stats(ingestion_job_id)` to report the running progress of a stream ingestion job.

Release Justification: Cat 4.

Closes: #57430

81598: obsservice: initial commit r=andreimatei a=andreimatei

These are the humble beginnings of an Observability Service for CRDB. In time, this is meant to take over a lot of the observability functionality, extracting it from the database itself. For now, all there is here is an HTTP reverse proxy, able to route HTTP requests to a CRDB node (or cluster, through a load balancer). At least for the transitioning period, if not forever, the Obs Service will route HTTP paths that it doesn't understand to the Cockroach cluster being monitored.

The HTTP proxy accepts its own certificates and can serve HTTPS independently of CRDB (which might be running in --insecure mode and not serving HTTPS). However, if CRDB is serving HTTPS, then the Obs Service must also be configured with a certificate so it can also serve HTTPS.

The code of the Obs Service is split between a binary (minimal shell) and a library. The idea is to keep the door open for other repos to extend the Obs Service with extra functionality (e.g. the CC managed service might want to add cloud-specific observability features). For such purposes, I've considered making a dedicated Go module for the Obs Service so that source-level imports of the Obs Service would be as ergonomic as Go wants them to be. I've put a decent amount of work into playing with this but, ultimately, I've decided against it, at least for now. The problem is that the Obs Service wants to take minor code dependencies on random CRDB libraries (syncutil, log, etc.), and the cockroach Go module is fairly broken (our git tags look like we do semantic versioning, but we don't actually do it). The Go tooling has a terrible time with the cockroach module. Also, our code is generally not `go build`-able. If the Obs Service were a dedicated module, it would need to take a dependency on the cockroach module, which would negate the win for people who want to import it (in fact, it'd make matters worse for importers because the nasty cockroach dependency would be more hidden). An alternative I've considered was to start creating modules for all of the dependencies of the Obs Service. But if CRDB would then need to depend on these modules, there's all sorts of problems.

Release note: None

82041: storage: clear range keys in `Writer.Clear*Range` methods r=jbowens a=erikgrinaker

This patch clears range keys in the `Writer` methods `ClearRawRange`, `ClearMVCCRange`, and `ClearMVCCIteratorRange`, as well as in the `ClearRangeWithHeuristic` helper. Range keys are not cleared in `ClearMVCCVersions`, since this method is specifically for clearing MVCC point key versions, and it is not possible to clear range keys between versions of the same point key.

Touches #70412.

Release note: None

82377: sql: use declarative schema changer for add column with unique r=fqazi a=fqazi

Previously, the declarative schema changer was disabled when adding a column with a unique constraint, since we didn't have a complete set of elements / logic for adding secondary indexes. This was inadequate because operations would not be ordered correctly and rollbacks could potentially break. To address this, this patch enables support and addresses the missing pieces to set up the secondary indexes correctly, such as adding suffix columns, ordering operations correctly, and returning appropriate errors.

Release note: None

82380: util/log: rework the bufferSink r=andreimatei a=andreimatei

This is a rewrite of bufferSink (now called bufferedSink). As opposed to before, the code is smaller, more efficient, and, most importantly, simpler. I couldn't understand the previous code well, and I think it made new additions to it (in particular, adding a shutdown sequence) hard to analyze. The semantics of this buffer are now better and simpler.

Before, a log message traveled through a sequence of two buffers before being eventually delivered to the wrapped sink:

1. The message was first placed into an accumulator buffer.
2. When that buffer grew too old or too large, the accumulator buffer was placed into a queue to be flushed later, and the accumulator was reset.
3. There was a limit on the size of this queue of batches to be flushed. When the queue was full, new messages were dropped.

There were two goroutines managing these two buffers, which is one too many. This behavior of queuing multiple, distinct flushes was peculiar at best. There's generally no reason not to coalesce all the flushes into one, thus speeding them up and reducing the risk of message drops. The bufferSink's memory limit was not explicit, and difficult enough to infer: it was set indirectly from the combination of limits on the two buffers, expressed in different units.

Now, the sequence of two buffers (the accumulator and the flush queue) is gone. There's a single buffer accumulating new messages, plus at most one in-flight flush which captures its buffer. As soon as a flush is triggered, the buffer is reset. The buffer is now configured with a memory limit and, when that limit is reached, the _oldest_ messages in the buffer are dropped, instead of the newest. This, combined with the blocking semantics of the forceSync option, means that a forceSync message is never dropped (unless its size is, by itself, larger than the buffer limit). The number of goroutines running for the bufferSink drops from two to one.

Release note: None

82436: server, ui: handle null plan gist in getStatementDetailsPerPlanHash r=ericharmeling a=ericharmeling

This PR adds a new function that handles null plan gists in `getStatementDetailsPerPlanHash`. The PR also adds some logic to hide null plan gists in the SqlBox of the Statement Details page. Here's a screenshot of the Statement Details page when the plan gist is null:

![Statement Details page with a null plan gist](https://user-images.githubusercontent.com/27286675/171966978-6a444297-70d4-4ef3-8afa-5068e3f35d9e.png)

Fixes #82095. We probably want to figure out why there are null values in [`planner.instrumentation.planGist`](https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/executor_statement_metrics.go#L212) in the first place. `getStatementDetailsPerPlanHash` pulls from the system table populated with the plan gist from `planner.instrumentation.planGist`. CC `@cucaroach`

Release note: None

82471: dev: allow `--rewrite` when testing many targets with `/...` r=rail a=rickystewart

Closes #82053.

Release note: None

82479: keys: resolve subtle non-bug by exporting RaftLogKeyFromPrefix r=tbg,erikgrinaker a=nvanbenschoten

This commit resolves a subtle interaction that came tantalizingly close to a bug in `StateLoader.LoadLastIndex`. The method uses the `StateLoader`'s underlying `keys.RangeIDPrefixBuf` to generate two keys (`RaftLogPrefix` and `RaftLogKey`) that are in use at the same time. `RangeIDPrefixBuf` avoids heap allocations by sharing a single byte slice across all keys that it generates. This is why the type has the comment:

> The generated keys are only valid until the next call to one of the key generation methods.

As would be expected, given this comment, the second use of the `RangeIDPrefixBuf` overwrote the buffer and invalidated the first key. However, it happened to get lucky and generate a new key with the same prefix as the old key. As a result, the contents of the first key did not change.

To make this aliasing more explicit and avoid this becoming a real bug in the future, we introduce a new `RaftLogKeyFromPrefix` function that callers can use to generate raft log entry keys from a raft log prefix. We then use this new function to avoid some redundant encoding work elsewhere due to repeated calls to `RaftLogKey`.

82520: cli,sql: fix reported result columns for EXPLAIN ANALYZE r=yuzefovich a=rafiss

Fixes #82502

This fixes an issue where EXPLAIN ANALYZE would report a RowDescription for both the EXPLAIN and for the statement being explained. If the statement had a different number of result columns, this would confuse the CLI. No release note, since this bug was not released.

Release note: None

82521: roachtest: fix a minor bug with tpch_concurrency roachtest r=yuzefovich a=yuzefovich

`tpch_concurrency` pushes the cluster to its limits, so it is expected that OOMs occur. Previously, in a recently added code path, we would fail the test if an error occurred when running a debugging statement, yet that error is expected and should be returned.

Fixes: #82510.

Release note: None

Co-authored-by: Casper <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
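The aliasing hazard described in 82479 is easy to reproduce in isolation. This self-contained Go sketch (with made-up key names, not the real `keys` package) shows how a shared backing buffer silently invalidates a previously generated key:

```go
package main

import "fmt"

func main() {
	// A single shared backing buffer, as in keys.RangeIDPrefixBuf.
	buf := make([]byte, 0, 16)

	// Each "key generation method" appends into the same backing array.
	k1 := append(buf, "raft-prefix"...) // first generated key
	k2 := append(buf, "raft-keyXYZ"...) // second call reuses the array

	// k1 now reads "raft-keyXYZ": the second append overwrote its bytes.
	// LoadLastIndex got lucky only because both keys shared a prefix.
	fmt.Println(string(k1), string(k2))
}
```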
82013: storage: add MVCC point synthesizing iterator r=jbowens a=erikgrinaker

This patch adds `pointSynthesizingIter`, an MVCC iterator which wraps an arbitrary MVCC iterator and synthesizes point keys for range keys at their start key and where they overlap point keys. It can optionally synthesize around the SeekGE seek key too, which is useful for point operations like `MVCCGet` where we may want to return a synthetic tombstone for an MVCC range tombstone if there is no existing point key. This will primarily be used to handle MVCC range tombstones in MVCC scans and gets, as well as during MVCC conflict checks, which allows much of this logic to remain unchanged and simplified (in particular, `pebbleMVCCScanner` will not need any changes). However, this patch does not make use of the iterator yet, since both it and Pebble will need further performance optimizations for use in hot paths. For now, correctness is sufficient, and only basic attempts at performance optimization have been made.

Touches #70412.

Release note: None

82547: opt: fix inverted key column type in testcat r=mgartner a=mgartner

Previously, inverted key columns in the test catalog were incorrectly given the type of their source column. Now they have the correct type, `types.EncodedKey`.

Release note: None

82635: bazel: unlist unsupported os+arch combos r=irfansharif a=irfansharif

We don't support these for CRDB builds. Even Windows is experimental, but let's keep that around anyway since we publish binaries for it. +cc #82625.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
81189: storage: add conflict handling for MVCC range tombstones r=jbowens a=erikgrinaker

This patch takes MVCC range tombstones into account for all MVCC write operations, i.e. for conflict checks and conditional writes. MVCC stats updates are not covered here, but will be addressed in a subsequent PR. This also adds more exhaustive tests, especially for the intent handling in `mvccPutInternal()`.

Touches #70412.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
82045: storage: add MVCC range tombstone handling in scans and gets r=jbowens a=erikgrinaker

This patch adds MVCC range tombstone handling for scans and gets. In the basic case, this simply means that point keys below an MVCC range tombstone are not visible. When tombstones are requested by the caller, the MVCC range tombstones themselves are never exposed, to avoid having to explicitly handle these throughout the codebase. Instead, synthetic MVCC point tombstones are emitted at the start of MVCC range tombstones and wherever they overlap a point key (above and below). Additionally, point gets return synthetic point tombstones if they overlap an MVCC range tombstone even if no existing point key exists. This is based on `pointSynthesizingIter`, which avoids additional logic in `pebbleMVCCScanner`.

Synthetic MVCC point tombstones emitted for MVCC range tombstones are not stable, nor are they fully deterministic. For example, the start key will be truncated by iterator bounds, so an `MVCCScan` over a given key span may see a synthetic point tombstone at its start (if it overlaps an MVCC range tombstone), but this will not be emitted if a broader span is used (a different point tombstone will be emitted instead). Similarly, a CRDB range split/merge will split/merge MVCC range tombstones, changing which point tombstones are emitted. Furthermore, `MVCCGet` will synthesize an MVCC point tombstone if it overlaps an MVCC range tombstone and there is no existing point key there, while an `MVCCScan` will not emit these. Callers must take care not to rely on such semantics for MVCC tombstones. Existing callers have been audited to ensure they are not affected.

Point tombstone synthesis must be enabled even when the caller has not requested tombstones, because they must always be taken into account for conflict/uncertainty checks. However, in these cases we enable range key masking below the read timestamp, omitting any covered points since these are no longer needed.

Touches #70412.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
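To make the synthesis rule concrete, here is a small self-contained Go sketch (simplified stand-in types, not the real `pointSynthesizingIter`) of where synthetic point tombstones appear for a range tombstone `[a, z) @ 5` over point keys `b` and `d`: at the (bounds-truncated) start key, and at every overlapping point key.

```go
package main

import "fmt"

// rangeTombstone is a simplified stand-in for an MVCC range tombstone.
type rangeTombstone struct {
	start, end string
	ts         int
}

// syntheticPoints lists the positions where point tombstones are synthesized
// for rt: at its start key, and at every point key it overlaps. This mirrors
// the rule stated in the PR text, not the real iterator logic.
func syntheticPoints(rt rangeTombstone, pointKeys []string) []string {
	out := []string{fmt.Sprintf("%s@%d", rt.start, rt.ts)}
	for _, k := range pointKeys {
		if k > rt.start && k < rt.end { // overlapping point key
			out = append(out, fmt.Sprintf("%s@%d", k, rt.ts))
		}
	}
	return out
}

func main() {
	rt := rangeTombstone{start: "a", end: "z", ts: 5}
	// Prints [a@5 b@5 d@5]: one at the start key, one per covered point key.
	fmt.Println(syntheticPoints(rt, []string{"b", "d"}))
}
```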
78085: storage: add `MVCCStats` for range keys r=jbowens a=erikgrinaker

**storage: export some MVCC key encoding functions**

Release note: None

**roachpb: add `Key.Prevish()` to find a previous key**

This patch adds `Key.Prevish()`, which returns a previous key in lexicographical sort order. This is needed to expand a latch span leftwards to peek at any left-adjacent range keys. It is impossible to find the exact immediate predecessor of a key, because it can have an infinite number of `0xff` bytes at the end, so this returns the nearest previous key right-padded with `0xff` up to the given length. It is still possible for an infinite number of keys to exist between `Key` and `Key.Prevish()`, as keys have unbounded length. (A sketch of this construction follows below.)

Release note: None

**storage: add `MVCCStats` for range keys**

This patch adds `MVCCStats` tracking for range keys. Four new fields are added to `MVCCStats`:

* `RangeKeyCount`: the number of (fragmented) range keys, not counting historical versions.
* `RangeKeyBytes`: the logical encoded byte size of all range keys. The latest version contributes the encoded key bounds, and all versions contribute encoded timestamps. Unlike point keys, which for historical reasons use a fixed-size timestamp contribution, this uses the actual variable-length timestamp size.
* `RangeValCount`: the number of (fragmented) range key versions.
* `RangeValBytes`: the encoded size of all range key values across all versions. The same value can be stored in multiple range keys due to fragmentation, which will be counted separately. Even though all range keys are currently MVCC range tombstones with no value, the `MVCCValueHeader` contribution can be non-zero due to e.g. a local timestamp.

`ComputeStatsForRange()` has been extended to calculate the above quantities, and additionally account for range tombstones themselves in `GCBytesAge` along with their effect on point keys. All relevant call sites have been updated to surface range keys for the MVCC iterators passed to `ComputeStatsForRange()`. Most MVCC operations have been updated to correctly account for MVCC range tombstones, e.g. during point key writes and intent resolution. KV APIs are not yet updated; this will be addressed later.

Range key stats are also adjusted during range splits and merges, which will split and merge any range keys that straddle the split key. This requires a single range key seek to the left and right of the split key during these operations.

Touches #70412.

Release note: None

82384: sql: reuse the slice of RequestUnion objects between fetches r=yuzefovich a=yuzefovich

This commit teaches `txnKVFetcher` and `txnKVStreamer` to reuse the same slice of `RequestUnion` objects between different fetches. It is now extremely easy to do given the recent refactor. We do perform memory accounting for this slice (against a memory account bound to an unlimited memory monitor). Additionally, a similar optimization is applied to how resume requests are populated by the Streamer.

Addresses: #82160.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
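Here is a hedged Go sketch of the `Prevish` construction described above: decrement the last byte (or drop a trailing zero byte, whose exact predecessor is the key without it) and right-pad with `0xff` up to the given length. This follows the description in the PR text; the real `roachpb.Key.Prevish` may differ in details.

```go
package main

import "fmt"

// prevish returns a previous key in lexicographic sort order, per the rule
// described above. It is not necessarily the immediate predecessor.
func prevish(k []byte, length int) []byte {
	// The empty key is the minimum key; it has no predecessor.
	if len(k) == 0 {
		return nil
	}
	// A key ending in 0x00 has an exact predecessor: the key without it.
	if k[len(k)-1] == 0 {
		return append([]byte(nil), k[:len(k)-1]...)
	}
	// Otherwise, truncate to at most `length` bytes and drop any trailing
	// 0x00 bytes exposed by the truncation (they can't be decremented).
	n := len(k)
	if n > length {
		n = length
	}
	b := append([]byte(nil), k[:n]...)
	for len(b) > 0 && b[len(b)-1] == 0 {
		b = b[:len(b)-1]
	}
	if len(b) == 0 {
		return b // only 0x00 bytes within `length`; "" is a predecessor
	}
	// Decrement the last byte and right-pad with 0xff. An infinite number
	// of keys may still exist between k and the result.
	b[len(b)-1]--
	for len(b) < length {
		b = append(b, 0xff)
	}
	return b
}

func main() {
	fmt.Printf("%q\n", prevish([]byte("bar"), 4)) // "baq\xff"
}
```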
80840: batcheval: handle MVCC range tombstones in `ClearRange` r=aliher1911 a=erikgrinaker

This patch makes `ClearRange` account for MVCC range tombstones when updating MVCC stats.

Touches #70412.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
We currently expose two main MVCC methods for deleting key spans, i.e. `[key0, key1)`:

* `MVCCDeleteRange`: iterates over visible keys and writes an MVCC tombstone for each one, in O(n).
* `Writer.ClearMVCCRangeAndIntents`: removes all MVCC keys and versions via `Pebble.DeleteRange`, without any MVCC record, in <O(n).

However, for reasons outlined in #69380, we need an MVCC-compliant mechanism to delete a key span in <O(n). By this, we mean a mechanism that leaves an MVCC record of the range deletion and allows readers to read below it at an appropriate timestamp. To satisfy this, we will implement MVCC range tombstones, written either in `MVCCDeleteRange()` or as a separate method. (A sketch contrasting the two existing paths follows below.)

For more details, see the (currently internal) design document and tech note.
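For illustration, here is a minimal Go sketch of the two existing deletion paths, using hypothetical stand-in interfaces rather than the real `storage` package, to show why one is O(n) but MVCC-visible while the other is sublinear but leaves no MVCC record:

```go
package sketch

// Writer is a hypothetical stand-in for the storage engine's write surface.
type Writer interface {
	PutMVCCTombstone(key []byte, ts int64) // one MVCC point tombstone
	EngineDeleteRange(start, end []byte)   // Pebble.DeleteRange, no MVCC record
}

// mvccDeleteRange mirrors MVCCDeleteRange: O(n) point tombstones, each an
// MVCC record that readers at earlier timestamps can read below.
func mvccDeleteRange(w Writer, visibleKeys [][]byte, ts int64) {
	for _, k := range visibleKeys {
		w.PutMVCCTombstone(k, ts)
	}
}

// clearRange mirrors Writer.ClearMVCCRangeAndIntents: a single engine-level
// range deletion in <O(n), but it destroys history and leaves no MVCC record.
func clearRange(w Writer, start, end []byte) {
	w.EngineDeleteRange(start, end)
}
```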
Epic CRDB-2624
Jira issue: CRDB-10062