kvserver: Update snapshot strategy for disaggregated ingests #103028
Comments
As mentioned in cockroachdb/pebble#2538, we should call the new
Do we have a list of all the users of Pebble snapshots and how they will be affected by excise removing things under them? Copying some text from cockroachdb/pebble#2424 where we started listing these and also discussing whether the use of Pebble snapshots was necessary.
This change updates the ScanInternal function signature to not use the internal keyspan.Key type. Instead we use the rangekey.Key type alias that exports it. Also update the pointCollapsingIter to just call SeekGE in NextPrefix instead of panicking. Necessary for cockroachdb/cockroach#103028.
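For context on the NextPrefix change mentioned above: when an iterator has no specialized next-prefix step, it can fall back to computing an upper bound for the current prefix and seeking past it. Below is a minimal Go sketch of that pattern; the `internalIterator` interface, `split` function, and `prefixSuccessor` helper are illustrative stand-ins, not Pebble's actual types.

```go
package iterutil

// internalIterator is an illustrative stand-in for an iterator that supports
// SeekGE but has no native NextPrefix optimization; it is not Pebble's
// internalIterator interface.
type internalIterator interface {
	// SeekGE positions the iterator at the first key >= the given key and
	// reports whether such a key exists.
	SeekGE(key []byte) ([]byte, bool)
	// Key returns the current key.
	Key() []byte
}

// prefixSuccessor returns the smallest key that sorts after every key
// sharing the given prefix (a PrefixEnd-style increment), or nil if no such
// key exists (the prefix is all 0xff bytes).
func prefixSuccessor(prefix []byte) []byte {
	end := append([]byte(nil), prefix...)
	for i := len(end) - 1; i >= 0; i-- {
		if end[i] != 0xff {
			end[i]++
			return end[:i+1]
		}
	}
	return nil
}

// nextPrefixViaSeekGE advances to the first key of the next prefix by
// seeking, rather than panicking, when the iterator has no specialized
// NextPrefix. split returns the length of a key's prefix, in the spirit of
// a Comparer.Split function.
func nextPrefixViaSeekGE(it internalIterator, split func([]byte) int) ([]byte, bool) {
	cur := it.Key()
	succ := prefixSuccessor(cur[:split(cur)])
	if succ == nil {
		return nil, false
	}
	return it.SeekGE(succ)
}
```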
@sumeerbhola Thanks for linking that! Looks like the TS maintenance queue no longer uses a snapshot. The use of a snapshot to send a KV snapshot is likely okay for the purposes of this issue as we are guaranteed that we won't be clearing the replica until that replication is complete. The only other use of a snapshot I can see is the GC queue. I couldn't find anything else from an audit of the code.
This change adds the ability to select for just the replicated span in rditer.Select and rditer.IterateReplicaKeySpans. Also adds a new rditer.IterateReplicaKeySpansShared that does a ScanInternal on just the user key span, to be able to collect metadata of shared sstables as well as any internal keys above them. We only use skip-shared iteration for the replicated user key span of a range, and in practice, only if it's a non-system range. Part of cockroachdb#103028. Epic: none Release note: None
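As a rough picture of the replicated-only filtering described above, here is a self-contained sketch; the option struct and the span catalog are simplified stand-ins and do not match the real rditer API.

```go
package rditersketch

// span is a simplified stand-in for roachpb.Span.
type span struct {
	Key, EndKey []byte
}

// selectOpts mirrors the idea of opting into only the replicated spans of a
// range; the real rditer options are richer than this.
type selectOpts struct {
	ReplicatedOnly bool
}

// rangeSpans is an illustrative catalog of the key spans belonging to one
// range, split by whether their contents are replicated through Raft.
type rangeSpans struct {
	replicatedRangeID []span // replicated range-ID local keys
	rangeLocal        []span // range descriptor, transaction records, etc.
	lockTable         []span // lock table keys for the range
	user              []span // the range's user key span
	unreplicated      []span // raft log, HardState, and other unreplicated state
}

// selectSpans returns the spans a caller should iterate over. With
// ReplicatedOnly set, the unreplicated spans are excluded.
func selectSpans(rs rangeSpans, opts selectOpts) []span {
	out := append([]span(nil), rs.replicatedRangeID...)
	out = append(out, rs.rangeLocal...)
	out = append(out, rs.lockTable...)
	out = append(out, rs.user...)
	if !opts.ReplicatedOnly {
		out = append(out, rs.unreplicated...)
	}
	return out
}
```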
This change updates pkg/storage interfaces and implementations to allow the use of ScanInternal in skip-shared iteration mode as well as writing/reading of internal point keys, range dels and range keys. Replication / snapshot code will soon rely on these changes to be able to replicate internal keys in higher levels plus metadata of shared sstables in lower levels, as opposed to just observed user keys. Part of cockroachdb#103028 Epic: none Release note: None
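A hedged sketch of the shape such an interface extension could take — a reader that can surface internal point keys, range dels, range keys, and shared-file metadata rather than only user keys. All type and method names below are hypothetical, not the actual pkg/storage or Pebble API.

```go
package storagesketch

// InternalKey and SharedSSTMeta are simplified stand-ins for the types a
// real engine would expose for internal-key iteration and shared files.
type InternalKey struct {
	UserKey []byte
	Trailer uint64 // encodes sequence number and kind (SET, DEL, RANGEDEL, ...)
}

type SharedSSTMeta struct {
	Backing string // shared-storage object the sstable lives in
	Level   int
}

// Iterator is the usual user-key-facing iterator surface.
type Iterator interface {
	SeekGE(key []byte) bool
	Valid() bool
	Key() []byte
	Value() []byte
	Close() error
}

// Reader is the usual user-key-facing read interface.
type Reader interface {
	NewIter(lower, upper []byte) (Iterator, error)
}

// InternalReader is the kind of addition described above: a reader that can
// run a skip-shared internal scan, surfacing point keys, range dels, and
// range keys above the shared levels, plus metadata for shared sstables in
// the levels below.
type InternalReader interface {
	Reader
	ScanInternal(
		lower, upper []byte,
		visitPointKey func(key InternalKey, value []byte) error,
		visitRangeDel func(start, end []byte, seqNum uint64) error,
		visitRangeKey func(start, end []byte, keys [][]byte) error,
		visitSharedFile func(meta SharedSSTMeta) error,
	) error
}
```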
107297: storage,kvserver: Foundational changes for disaggregated ingestions r=sumeerbhola a=itsbilal

This change contains two commits (split off from the original mega-PR, #105839). The first is a pkg/storage change to add new interface methods to call pebble's db.ScanInternal as well as implement related helper methods in sstable writers/batch readers/writers to be able to do disaggregated snapshot ingestion. The second is a kvserver/rditer change to allow finer-grained control on what replicated spans we iterate on, as well as to be able to specifically opt into skip-shared iteration over the user key span through the use of `ScanInternal`.

---

**storage: Update Engine/Reader/Writer interfaces for ScanInternal**

This change updates pkg/storage interfaces and implementations to allow the use of ScanInternal in skip-shared iteration mode as well as writing/reading of internal point keys, range dels and range keys. Replication / snapshot code will soon rely on these changes to be able to replicate internal keys in higher levels plus metadata of shared sstables in lower levels, as opposed to just observed user keys.

Part of #103028
Epic: none
Release note: None

**kvserver: Add ability to filter replicated spans in Select/Iterate**

This change adds the ability to select for just the replicated span in rditer.Select and rditer.IterateReplicaKeySpans. Also adds a new rditer.IterateReplicaKeySpansShared that does a ScanInternal on just the user key span, to be able to collect metadata of shared sstables as well as any internal keys above them. We only use skip-shared iteration for the replicated user key span of a range, and in practice, only if it's a non-system range.

Part of #103028.
Epic: none
Release note: None

108336: sql: retry more distributed errors as local r=yuzefovich a=yuzefovich

This PR contains a couple of commits that increase the allow-list of errors that are retried locally. In particular, it allows us to hide some issues we have around using DistSQL and shutting down SQL pods.

Fixes: #106537.
Fixes: #108152.
Fixes: #108271.
Release note: None

108406: server,testutils: remove complexity r=yuzefovich,herkolategan a=knz

There is a saying (paraphrasing) that it always takes more work removing unwanted complexity than it takes to add it. This is an example of that.

Prior to this commit, there was an "interesting" propagation of the flag that decides whether or not to define a test tenant for test servers and clusters. In a nutshell, we had:

- an "input" flag in `base.TestServerArgs`, which remained mostly immutable
- a boolean decided once by `ShouldStartDefaultTestTenant()` either in:
  - `serverutils.StartServerOnlyE`
  - or `testcluster.Start`
- that boolean choice was then propagated to `server.testServer` via _another_ boolean config flag in `server.BaseConfig`
- both the 2nd boolean and the original input flag were then again checked when the time came to do the work (in `maybeStartDefaultTestTenant`).

Additional complexity was then incurred by the need of `TestCluster` to make the determination just once (and not once per server).

This commit cuts through all the layers of complexity by simply propagating the choice of `ShouldStartDefaultTestTenant()` back into the `TestServerArgs` and only ever reading from that subsequently.

Release note: None
Epic: CRDB-18499

108465: cloudccl: allow external connection tests to be run in parallel r=rhu713 a=rhu713

Currently external connection tests read and write to the same path in cloud storage. Add a random uint64 as part of the path so that test runs have unique paths and can be run in parallel.

Fixes: #107407
Release note: None

108481: acceptance: stabilize start-single-node in tcl test r=santamaura a=dhartunian

We've continued to see flakes on this test which contain messages of throttled stores on node startup. The hypothesis is that these are due to leftover data directories from prior startups during the same test. This change clears the `logs/db` data directory for those invocations and also adds the sql memory flag which the common tcl function also uses.

Resolves #108405
Epic: None
Release note: None

108496: kv: unit test `PrepareTransactionForRetry` and `TransactionRefreshTimestamp` r=miraradeva a=nvanbenschoten

Informs #104233.

This commit adds a pair of new unit tests to verify the behavior of `PrepareTransactionForRetry` and `TransactionRefreshTimestamp`. These functions will be getting more complex for #104233, so it will be helpful to have these tests in place. The tests also serve as good documentation.

Release note: None

Co-authored-by: Bilal Akhtar <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
If the sender node was created with a SharedStorage, switch to fast ingestion where we ScanInternal() the keys not in shared levels, and just share the metadata for files in shared levels. The sender of the snapshot specifies in the Header that it is using this ability, and the receiver rejects the snapshot if it cannot accept shared snapshots. If ScanInternal() returns an `ErrInvalidSkipSharedIteration`, we switch back to old-style snapshots where the entirety of the range is sent over the stream as SnapshotRequests. Future changes will add better support for detection of when different nodes point to different blob storage buckets / shared storage locations, and incorporate that in rebalancing. Fixes cockroachdb#103028. Release note (general change): Takes advantage of new CLI option, `--experimental-shared-storage` to rebalance faster from node to node.
105839: kvserver,storage: Update snapshot strategy to use shared storage r=sumeerbhola a=itsbilal

If the sender node was created with a SharedStorage, switch to fast ingestion where we ScanInternal() the keys not in shared levels, and just share the metadata for files in shared levels. The sender of the snapshot specifies in the Header that it is using this ability, and the receiver rejects the snapshot if it cannot accept shared snapshots.

If ScanInternal() returns an `ErrInvalidSkipSharedIteration`, we switch back to old-style snapshots where the entirety of the range is sent over the stream as SnapshotRequests.

Future changes will add better support for detection of when different nodes point to different blob storage buckets / shared storage locations, and incorporate that in rebalancing.

Fixes #103028.

Co-authored-by: Bilal Akhtar <[email protected]>
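To make the fallback path concrete, here is a minimal sender-side sketch under the assumptions above; the sentinel error variable, header struct, and function names are illustrative stand-ins rather than the actual kvserver code.

```go
package snapsketch

import "errors"

// errInvalidSkipSharedIteration stands in for the error Pebble's
// ScanInternal returns when skip-shared iteration cannot be used for the
// requested span.
var errInvalidSkipSharedIteration = errors.New("invalid skip-shared iteration")

// header mirrors the idea of the sender advertising shared-snapshot use so
// the receiver can reject the snapshot up front if it cannot ingest shared
// files.
type header struct {
	SharedReplicate bool
}

// sendSnapshot tries the fast, shared-sstable path first and transparently
// falls back to streaming the whole range as ordinary SnapshotRequests.
func sendSnapshot(
	senderHasSharedStorage bool,
	sendShared func(h header) error, // ScanInternal-based path
	sendRegular func(h header) error, // classic full-stream path
) error {
	if senderHasSharedStorage {
		err := sendShared(header{SharedReplicate: true})
		if err == nil {
			return nil
		}
		if !errors.Is(err, errInvalidSkipSharedIteration) {
			return err
		}
		// Shared levels could not be skipped for this range; fall through
		// to the old-style snapshot.
	}
	return sendRegular(header{SharedReplicate: false})
}
```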
The kvBatchSnapshotStrategy needs to be updated to recognize when the sending/recipient nodes are both using the same shared.Storage and, if so, use the new skip-shared iteration mode in `db.ScanInternal` to produce a list of shared sstable metadatas as well as some flat non-shared sstables that sit on top of them. On the recipient side, these shared sstable metadata structs can be ingested into Pebble using the new `IngestAndExcise` command (see cockroachdb/pebble#2520, which this issue has a dependency on).

Jira issue: CRDB-27800
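On the receive side, the ingest-and-excise step can be pictured roughly as: drop whatever currently sits under the range's user key span and link in the incoming local sstables plus references to the shared ones, as one atomic operation. The sketch below is purely illustrative; the types and the three-step decomposition are stand-ins, not the actual Pebble IngestAndExcise API.

```go
package ingestsketch

// SharedSSTMeta describes a shared sstable by reference rather than by
// copying its contents (illustrative stand-in).
type SharedSSTMeta struct {
	Backing string // object name in shared/blob storage
	Level   int
}

// exciseSpan is the user key span of the range being replaced.
type exciseSpan struct {
	Start, End []byte
}

// db is a toy stand-in for the storage engine on the receiving node.
type db interface {
	// Excise drops any existing data within the span.
	Excise(span exciseSpan) error
	// AttachShared records references to remote sstables so reads can be
	// served directly from shared storage.
	AttachShared(metas []SharedSSTMeta) error
	// Ingest links new local sstables into the LSM.
	Ingest(paths []string) error
}

// ingestAndExcise sketches the receiver-side sequence; a real engine would
// perform these steps as a single atomic operation rather than three calls.
func ingestAndExcise(d db, localPaths []string, shared []SharedSSTMeta, span exciseSpan) error {
	if err := d.Excise(span); err != nil {
		return err
	}
	if err := d.AttachShared(shared); err != nil {
		return err
	}
	return d.Ingest(localPaths)
}
```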