kvserver: Update snapshot strategy for disaggregated ingests #103028
Comments
As mentioned in cockroachdb/pebble#2538, we should call the new
Do we have a list of all the users of Pebble snapshots and how they will be affected by excise removing things under them? Copying some text from cockroachdb/pebble#2424 where we started listing these and also discussing whether the use of Pebble snapshots was necessary.
This change updates the ScanInternal function signature to not use the internal keyspan.Key type. Instead we use the rangekey.Key type alias that exports it. Also update the pointCollapsingIter to just call SeekGE in NextPrefix instead of panicking. Necessary for cockroachdb/cockroach#103028.
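For context on the NextPrefix change mentioned above: when an iterator has no specialized next-prefix step, it can fall back to computing an upper bound for the current prefix and seeking past it. Below is a minimal Go sketch of that pattern; the `internalIterator` interface, `split` function, and `prefixSuccessor` helper are illustrative stand-ins, not Pebble's actual types.

```go
package iterutil

// internalIterator is an illustrative stand-in for an iterator that supports
// SeekGE but has no native NextPrefix optimization; it is not Pebble's
// internalIterator interface.
type internalIterator interface {
	// SeekGE positions the iterator at the first key >= the given key and
	// reports whether such a key exists.
	SeekGE(key []byte) ([]byte, bool)
	// Key returns the current key.
	Key() []byte
}

// prefixSuccessor returns the smallest key that sorts after every key
// sharing the given prefix (a PrefixEnd-style increment), or nil if no such
// key exists (the prefix is all 0xff bytes).
func prefixSuccessor(prefix []byte) []byte {
	end := append([]byte(nil), prefix...)
	for i := len(end) - 1; i >= 0; i-- {
		if end[i] != 0xff {
			end[i]++
			return end[:i+1]
		}
	}
	return nil
}

// nextPrefixViaSeekGE advances to the first key of the next prefix by
// seeking, rather than panicking, when the iterator has no specialized
// NextPrefix. split returns the length of a key's prefix, in the spirit of
// a Comparer.Split function.
func nextPrefixViaSeekGE(it internalIterator, split func([]byte) int) ([]byte, bool) {
	cur := it.Key()
	succ := prefixSuccessor(cur[:split(cur)])
	if succ == nil {
		return nil, false
	}
	return it.SeekGE(succ)
}
```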
@sumeerbhola Thanks for linking that! Looks like the TS maintenance queue no longer uses a snapshot. The use of a snapshot to send a KV snapshot is likely okay for the purposes of this issue as we are guaranteed that we won't be clearing the replica until that replication is complete. The only other use of a snapshot I can see is the GC queue. I couldn't find anything else from an audit of the code.
This change adds the ability to select for just the replicated span in rditer.Select and rditer.IterateReplicaKeySpans. Also adds a new rditer.IterateReplicaKeySpansShared that does a ScanInternal on just the user key span, to be able to collect metadata of shared sstables as well as any internal keys above them. We only use skip-shared iteration for the replicated user key span of a range, and in practice, only if it's a non-system range. Part of cockroachdb#103028. Epic: none Release note: None
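As a rough picture of the replicated-only filtering described above, here is a self-contained sketch; the option struct and the span catalog are simplified stand-ins and do not match the real rditer API.

```go
package rditersketch

// span is a simplified stand-in for roachpb.Span.
type span struct {
	Key, EndKey []byte
}

// selectOpts mirrors the idea of opting into only the replicated spans of a
// range; the real rditer options are richer than this.
type selectOpts struct {
	ReplicatedOnly bool
}

// rangeSpans is an illustrative catalog of the key spans belonging to one
// range, split by whether their contents are replicated through Raft.
type rangeSpans struct {
	replicatedRangeID []span // replicated range-ID local keys
	rangeLocal        []span // range descriptor, transaction records, etc.
	lockTable         []span // lock table keys for the range
	user              []span // the range's user key span
	unreplicated      []span // raft log, HardState, and other unreplicated state
}

// selectSpans returns the spans a caller should iterate over. With
// ReplicatedOnly set, the unreplicated spans are excluded.
func selectSpans(rs rangeSpans, opts selectOpts) []span {
	out := append([]span(nil), rs.replicatedRangeID...)
	out = append(out, rs.rangeLocal...)
	out = append(out, rs.lockTable...)
	out = append(out, rs.user...)
	if !opts.ReplicatedOnly {
		out = append(out, rs.unreplicated...)
	}
	return out
}
```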
This change updates pkg/storage interfaces and implementations to allow the use of ScanInternal in skip-shared iteration mode as well as writing/reading of internal point keys, range dels and range keys. Replication / snapshot code will soon rely on these changes to be able to replicate internal keys in higher levels plus metadata of shared sstables in lower levels, as opposed to just observed user keys. Part of cockroachdb#103028 Epic: none Release note: None
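A hedged sketch of the shape such an interface extension could take — a reader that can surface internal point keys, range dels, range keys, and shared-file metadata rather than only user keys. All type and method names below are hypothetical, not the actual pkg/storage or Pebble API.

```go
package storagesketch

// InternalKey and SharedSSTMeta are simplified stand-ins for the types a
// real engine would expose for internal-key iteration and shared files.
type InternalKey struct {
	UserKey []byte
	Trailer uint64 // encodes sequence number and kind (SET, DEL, RANGEDEL, ...)
}

type SharedSSTMeta struct {
	Backing string // shared-storage object the sstable lives in
	Level   int
}

// Iterator is the usual user-key-facing iterator surface.
type Iterator interface {
	SeekGE(key []byte) bool
	Valid() bool
	Key() []byte
	Value() []byte
	Close() error
}

// Reader is the usual user-key-facing read interface.
type Reader interface {
	NewIter(lower, upper []byte) (Iterator, error)
}

// InternalReader is the kind of addition described above: a reader that can
// run a skip-shared internal scan, surfacing point keys, range dels, and
// range keys above the shared levels, plus metadata for shared sstables in
// the levels below.
type InternalReader interface {
	Reader
	ScanInternal(
		lower, upper []byte,
		visitPointKey func(key InternalKey, value []byte) error,
		visitRangeDel func(start, end []byte, seqNum uint64) error,
		visitRangeKey func(start, end []byte, keys [][]byte) error,
		visitSharedFile func(meta SharedSSTMeta) error,
	) error
}
```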
107297: storage,kvserver: Foundational changes for disaggregated ingestions r=sumeerbhola a=itsbilal

This change contains two commits (split off from the original mega-PR, #105839). The first is a pkg/storage change to add new interface methods to call pebble's db.ScanInternal as well as implement related helper methods in sstable writers/batch readers/writers to be able to do disaggregated snapshot ingestion. The second is a kvserver/rditer change to allow finer-grained control on what replicated spans we iterate on, as well as to be able to specifically opt into skip-shared iteration over the user key span through the use of `ScanInternal`.

---

**storage: Update Engine/Reader/Writer interfaces for ScanInternal**

This change updates pkg/storage interfaces and implementations to allow the use of ScanInternal in skip-shared iteration mode as well as writing/reading of internal point keys, range dels and range keys. Replication / snapshot code will soon rely on these changes to be able to replicate internal keys in higher levels plus metadata of shared sstables in lower levels, as opposed to just observed user keys.

Part of #103028
Epic: none
Release note: None

**kvserver: Add ability to filter replicated spans in Select/Iterate**

This change adds the ability to select for just the replicated span in rditer.Select and rditer.IterateReplicaKeySpans. Also adds a new rditer.IterateReplicaKeySpansShared that does a ScanInternal on just the user key span, to be able to collect metadata of shared sstables as well as any internal keys above them. We only use skip-shared iteration for the replicated user key span of a range, and in practice, only if it's a non-system range.

Part of #103028.
Epic: none
Release note: None

108336: sql: retry more distributed errors as local r=yuzefovich a=yuzefovich

This PR contains a couple of commits that increase the allow-list of errors that are retried locally. In particular, it allows us to hide some issues we have around using DistSQL and shutting down SQL pods.

Fixes: #106537.
Fixes: #108152.
Fixes: #108271.
Release note: None

108406: server,testutils: remove complexity r=yuzefovich,herkolategan a=knz

There is a saying (paraphrasing) that it always takes more work removing unwanted complexity than it takes to add it. This is an example of that.

Prior to this commit, there was an "interesting" propagation of the flag that decides whether or not to define a test tenant for test servers and clusters. In a nutshell, we had:

- an "input" flag in `base.TestServerArgs`, which remained mostly immutable
- a boolean decided once by `ShouldStartDefaultTestTenant()` either in:
  - `serverutils.StartServerOnlyE`
  - or `testcluster.Start`
- that boolean choice was then propagated to `server.testServer` via _another_ boolean config flag in `server.BaseConfig`
- both the 2nd boolean and the original input flag were then again checked when the time came to do the work (in `maybeStartDefaultTestTenant`).

Additional complexity was then incurred by the need of `TestCluster` to make the determination just once (and not once per server).

This commit cuts through all the layers of complexity by simply propagating the choice of `ShouldStartDefaultTestTenant()` back into the `TestServerArgs` and only ever reading from that subsequently.

Release note: None
Epic: CRDB-18499

108465: cloudccl: allow external connection tests to be run in parallel r=rhu713 a=rhu713

Currently external connection tests read and write to the same path in cloud storage. Add a random uint64 as part of the path so that test runs have unique paths and can be run in parallel.

Fixes: #107407
Release note: None

108481: acceptance: stabilize start-single-node in tcl test r=santamaura a=dhartunian

We've continued to see flakes on this test which contain messages of throttled stores on node startup. The hypothesis is that these are due to leftover data directories from prior startups during the same test. This change clears the `logs/db` data directory for those invocations and also adds the sql memory flag which the common tcl function also uses.

Resolves #108405
Epic: None
Release note: None

108496: kv: unit test `PrepareTransactionForRetry` and `TransactionRefreshTimestamp` r=miraradeva a=nvanbenschoten

Informs #104233.

This commit adds a pair of new unit tests to verify the behavior of `PrepareTransactionForRetry` and `TransactionRefreshTimestamp`. These functions will be getting more complex for #104233, so it will be helpful to have these tests in place. The tests also serve as good documentation.

Release note: None

Co-authored-by: Bilal Akhtar <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
If the sender node was created with a SharedStorage, switch to fast ingestion where we ScanInternal() the keys not in shared levels, and just share the metadata for files in shared levels. The sender of the snapshot specifies in the Header that it is using this ability, and the receiver rejects the snapshot if it cannot accept shared snapshots. If ScanInternal() returns an `ErrInvalidSkipSharedIteration`, we switch back to old-style snapshots where the entirety of the range is sent over the stream as SnapshotRequests. Future changes will add better support for detection of when different nodes point to different blob storage buckets / shared storage locations, and incorporate that in rebalancing. Fixes cockroachdb#103028. Release note (general change): Takes advantage of new CLI option, `--experimental-shared-storage` to rebalance faster from node to node.
105839: kvserver,storage: Update snapshot strategy to use shared storage r=sumeerbhola a=itsbilal

If the sender node was created with a SharedStorage, switch to fast ingestion where we ScanInternal() the keys not in shared levels, and just share the metadata for files in shared levels. The sender of the snapshot specifies in the Header that it is using this ability, and the receiver rejects the snapshot if it cannot accept shared snapshots.

If ScanInternal() returns an `ErrInvalidSkipSharedIteration`, we switch back to old-style snapshots where the entirety of the range is sent over the stream as SnapshotRequests.

Future changes will add better support for detection of when different nodes point to different blob storage buckets / shared storage locations, and incorporate that in rebalancing.

Fixes #103028.

Co-authored-by: Bilal Akhtar <[email protected]>
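To make the fallback path concrete, here is a minimal sender-side sketch under the assumptions above; the sentinel error variable, header struct, and function names are illustrative stand-ins rather than the actual kvserver code.

```go
package snapsketch

import "errors"

// errInvalidSkipSharedIteration stands in for the error Pebble's
// ScanInternal returns when skip-shared iteration cannot be used for the
// requested span.
var errInvalidSkipSharedIteration = errors.New("invalid skip-shared iteration")

// header mirrors the idea of the sender advertising shared-snapshot use so
// the receiver can reject the snapshot up front if it cannot ingest shared
// files.
type header struct {
	SharedReplicate bool
}

// sendSnapshot tries the fast, shared-sstable path first and transparently
// falls back to streaming the whole range as ordinary SnapshotRequests.
func sendSnapshot(
	senderHasSharedStorage bool,
	sendShared func(h header) error, // ScanInternal-based path
	sendRegular func(h header) error, // classic full-stream path
) error {
	if senderHasSharedStorage {
		err := sendShared(header{SharedReplicate: true})
		if err == nil {
			return nil
		}
		if !errors.Is(err, errInvalidSkipSharedIteration) {
			return err
		}
		// Shared levels could not be skipped for this range; fall through
		// to the old-style snapshot.
	}
	return sendRegular(header{SharedReplicate: false})
}
```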
The kvBatchSnapshotStrategy needs to be updated to recognize when the sending/recipient nodes are both using the same shared.Storage and, if so, use the new skip-shared iteration mode in `db.ScanInternal` to produce a list of shared sstable metadatas as well as some flat non-shared sstables that sit on top of them. On the recipient side, these shared sstable metadata structs can be ingested into Pebble using the new `IngestAndExcise` command (see cockroachdb/pebble#2520, which this issue has a dependency on).

Jira issue: CRDB-27800
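On the receive side, the ingest-and-excise step can be pictured roughly as: drop whatever currently sits under the range's user key span and link in the incoming local sstables plus references to the shared ones, as one atomic operation. The sketch below is purely illustrative; the types and the three-step decomposition are stand-ins, not the actual Pebble IngestAndExcise API.

```go
package ingestsketch

// SharedSSTMeta describes a shared sstable by reference rather than by
// copying its contents (illustrative stand-in).
type SharedSSTMeta struct {
	Backing string // object name in shared/blob storage
	Level   int
}

// exciseSpan is the user key span of the range being replaced.
type exciseSpan struct {
	Start, End []byte
}

// db is a toy stand-in for the storage engine on the receiving node.
type db interface {
	// Excise drops any existing data within the span.
	Excise(span exciseSpan) error
	// AttachShared records references to remote sstables so reads can be
	// served directly from shared storage.
	AttachShared(metas []SharedSSTMeta) error
	// Ingest links new local sstables into the LSM.
	Ingest(paths []string) error
}

// ingestAndExcise sketches the receiver-side sequence; a real engine would
// perform these steps as a single atomic operation rather than three calls.
func ingestAndExcise(d db, localPaths []string, shared []SharedSSTMeta, span exciseSpan) error {
	if err := d.Excise(span); err != nil {
		return err
	}
	if err := d.AttachShared(shared); err != nil {
		return err
	}
	return d.Ingest(localPaths)
}
```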