kvserver,storage: Update snapshot strategy to use shared storage #105839
Conversation
Force-pushed 3e43689 to 7b8fc89
Turning into a draft PR until I get to the bottom of the test failures.
Force-pushed e8c4db8 to 324d647
This is ready for a look now.
Force-pushed 74cba06 to 18fa524
Force-pushed 18fa524 to 1cd6602
Took a look at everything but tests, and it looks good, but my mental models for this code are bad. This also deserves a look from someone more familiar with this code, and also someone else from storage.
I'll look at the tests carefully on Monday.
Reviewed 2 of 40 files at r1, 1 of 9 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @pavelkalinnikov)
-- commits
line 6 at r3:
Unfamiliar with this, are snapshots used for replication? So if we replicate using fast ingestion, with a RF=3, will the actual data only be present on disk once?
pkg/cli/debug.go
line 475 at r3 (raw file):
var results int
return rditer.IterateReplicaKeySpans(&desc, snapshot, debugCtx.replicated,
    rditer.TableKeysInclude, /* skipTableKeys */
This should be /* includeTableKeys */.
pkg/kv/kvserver/store_snapshot.go
line 251 at r3 (raw file):
func (msstw *multiSSTWriter) addRangeDelForLastSpan() error {
    if !msstw.doExcise {
        panic("this code path is only meant for ")
Incomplete comment.
pkg/kv/kvserver/store_snapshot.go
line 803 at r3 (raw file):
flushBatch := func() error {
    if err := kvSS.sendBatch(ctx, stream, b, ssts, stopFastReplicate, timingTag); err != nil {
Could we have stopFastReplicate == true, but ssts also be non-empty?
pkg/kv/kvserver/store_snapshot.go
line 893 at r3 (raw file):
var valBuf []byte
if fastReplicate {
    err := rditer.IterateReplicaKeySpansShared(ctx, snap.State.Desc, snap.EngineSnap,
        func(key *pebble.InternalKey, value pebble.LazyValue) error {
The IterateReplicaKeySpans function takes a replicatedOnly field, which ensures that we are only iterating over fully replicated key spans. But this function doesn't, and it's directly calling ScanInternal. How are we ensuring that we're still only iterating over fully replicated key spans, or do we not need that property?
pkg/storage/batch.go
line 204 at r3 (raw file):
// EngineRangeKeys returns the engine range key values at the current entry.
func (r *BatchReader) EngineRangeKeys() ([]EngineRangeKeyValue, error) {
super nit: This can use the above function to get the raw keys, and then turn them into engine keys? Feel free to ignore this.
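One possible shape of that refactor, as a minimal self-contained sketch: the typed accessor delegates to a raw accessor and only does the conversion, so the entry-parsing logic lives in one place. All names here (batchReader, rawRangeKeys, EngineRangeKeyValue) are hypothetical stand-ins, not the actual pkg/storage API.

```go
package main

import "fmt"

// EngineRangeKeyValue is a stand-in for the typed range-key pair returned by
// the real BatchReader.EngineRangeKeys.
type EngineRangeKeyValue struct {
	Suffix, Value []byte
}

// batchReader is a hypothetical stand-in holding pre-split raw pairs.
type batchReader struct {
	rawSuffixes, rawValues [][]byte
}

// rawRangeKeys plays the role of the "above function": it returns raw
// suffix/value pairs without building the typed representation.
func (r *batchReader) rawRangeKeys() ([][2][]byte, error) {
	out := make([][2][]byte, len(r.rawSuffixes))
	for i := range r.rawSuffixes {
		out[i] = [2][]byte{r.rawSuffixes[i], r.rawValues[i]}
	}
	return out, nil
}

// EngineRangeKeys delegates to rawRangeKeys and only converts the result,
// instead of re-parsing the batch entry itself.
func (r *batchReader) EngineRangeKeys() ([]EngineRangeKeyValue, error) {
	raw, err := r.rawRangeKeys()
	if err != nil {
		return nil, err
	}
	rkvs := make([]EngineRangeKeyValue, 0, len(raw))
	for _, kv := range raw {
		rkvs = append(rkvs, EngineRangeKeyValue{Suffix: kv[0], Value: kv[1]})
	}
	return rkvs, nil
}

func main() {
	r := &batchReader{rawSuffixes: [][]byte{[]byte("@10")}, rawValues: [][]byte{[]byte("v")}}
	rkvs, _ := r.EngineRangeKeys()
	fmt.Printf("%d range key(s)\n", len(rkvs))
}
```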
Flushing some comments.
Is there any way to split this PR into logically independent commits (easier to review)? For example, the part with the "include/exclude user keys" + tests in rditer seems nicely separable.
Could you also indicate any particular area Replication reviewers should focus on, or provide tips on how to review this PR (because it's large)?
// If true, the snapshot could contain shared files present in a pre-configured
// or explicitly specified shared.Storage instance. Such files will have their
// metadata present in the snapshot, but not file contents.
bool fast_replicate = 12;
Is it absolutely "fast" when true, or are there trade-offs? Consider a more neutral name that reflects the mechanism itself rather than its performance characteristics, something like use_shared_storage / enable_shared_storage / enable_shared_snapshots.
Renamed it to SharedReplicate. I'm also good with SharedSnapshot; let me know if that sounds better. We don't want all uses of shared storage to result in shared replication (e.g. non-user keys, and in the future, different S3 buckets, e.g. multiregion), so a name that talks about shared storage is probably not ideal.
Thank you so much for the reviews @pavelkalinnikov and @bananabrick. I'll be getting to this shortly; oncall and code-yellow have been occupying more of my time this week.
Thanks for the reviews @pavelkalinnikov and @bananabrick. I just created #107297, which contains two commits: one for the pkg/storage changes, and one for the pkg/kvserver/rditer changes that use those pkg/storage changes. Both of you should be able to independently review those commits with much greater ease than this mega-PR. I should have also addressed all of your comments that applied to those packages in that PR. I'll now update this PR to address the remainder of those comments, and then rebase it on top of that PR (so it'll be the third commit).
Force-pushed 1cd6602 to ea998eb
Force-pushed 69aecc4 to e84a133
Force-pushed f4472d2 to 1314597
107297: storage,kvserver: Foundational changes for disaggregated ingestions r=sumeerbhola a=itsbilal

This change contains two commits (split off from the original mega-PR, #105839). The first is a pkg/storage change to add new interface methods to call pebble's db.ScanInternal as well as implement related helper methods in sstable writers/batch readers/writers to be able to do disaggregated snapshot ingestion. The second is a kvserver/rditer change to allow finer-grained control on what replicated spans we iterate on, as well as to be able to specifically opt into skip-shared iteration over the user key span through the use of `ScanInternal`.

---

**storage: Update Engine/Reader/Writer interfaces for ScanInternal**

This change updates pkg/storage interfaces and implementations to allow the use of ScanInternal in skip-shared iteration mode as well as writing/reading of internal point keys, range dels and range keys. Replication / snapshot code will soon rely on these changes to be able to replicate internal keys in higher levels plus metadata of shared sstables in lower levels, as opposed to just observed user keys.

Part of #103028
Epic: none
Release note: None

**kvserver: Add ability to filter replicated spans in Select/Iterate**

This change adds the ability to select for just the replicated span in rditer.Select and rditer.IterateReplicaKeySpans. Also adds a new rditer.IterateReplicaKeySpansShared that does a ScanInternal on just the user key span, to be able to collect metadata of shared sstables as well as any internal keys above them. We only use skip-shared iteration for the replicated user key span of a range, and in practice, only if it's a non-system range.

Part of #103028.
Epic: none
Release note: None

108336: sql: retry more distributed errors as local r=yuzefovich a=yuzefovich

This PR contains a couple of commits that increase the allow-list of errors that are retried locally. In particular, it allows us to hide some issues we have around using DistSQL and shutting down SQL pods.

Fixes: #106537.
Fixes: #108152.
Fixes: #108271.

Release note: None

108406: server,testutils: remove complexity r=yuzefovich,herkolategan a=knz

There is a saying (paraphrasing) that it always takes more work removing unwanted complexity than it takes to add it. This is an example of that.

Prior to this commit, there was an "interesting" propagation of the flag that decides whether or not to define a test tenant for test servers and clusters. In a nutshell, we had:

- an "input" flag in `base.TestServerArgs`, which remained mostly immutable
- a boolean decided once by `ShouldStartDefaultTestTenant()` either in:
  - `serverutils.StartServerOnlyE`
  - or `testcluster.Start`
- that boolean choice was then propagated to `server.testServer` via _another_ boolean config flag in `server.BaseConfig`
- both the 2nd boolean and the original input flag were then again checked when the time came to do the work (in `maybeStartDefaultTestTenant`).

Additional complexity was then incurred by the need of `TestCluster` to make the determination just once (and not once per server).

This commit cuts through all the layers of complexity by simply propagating the choice of `ShouldStartDefaultTestTenant()` back into the `TestServerArgs` and only ever reading from that subsequently.

Release note: None
Epic: CRDB-18499

108465: cloudccl: allow external connection tests to be run in parallel r=rhu713 a=rhu713

Currently external connection tests read and write to the same path in cloud storage. Add a random uint64 as part of the path so that test runs have unique paths and can be run in parallel.

Fixes: #107407

Release note: None

108481: acceptance: stabilize start-single-node in tcl test r=santamaura a=dhartunian

We've continued to see flakes on this test which contain messages of throttled stores on node startup. The hypothesis is that these are due to leftover data directories from prior startups during the same test. This change clears the `logs/db` data directory for those invocations and also adds the sql memory flag which the common tcl function also uses.

Resolves #108405
Epic: None
Release note: None

108496: kv: unit test `PrepareTransactionForRetry` and `TransactionRefreshTimestamp` r=miraradeva a=nvanbenschoten

Informs #104233.

This commit adds a pair of new unit tests to verify the behavior of `PrepareTransactionForRetry` and `TransactionRefreshTimestamp`. These functions will be getting more complex for #104233, so it will be helpful to have these tests in place. The tests also serve as good documentation.

Release note: None

Co-authored-by: Bilal Akhtar <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Force-pushed 1314597 to e44bb30
Just rebased on master, this should be a slimmer and easier-to-review PR now.
Reviewed 2 of 61 files at r5, 1 of 69 files at r7, 3 of 12 files at r9, 13 of 71 files at r10, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @itsbilal, and @pavelkalinnikov)
-- commits
line 11 at r10:
Do we have an existing issue to close this hole of ErrInvalidSkipSharedIteration, or at least to promote it to the first thing we do in ScanInternal before doing the iteration (check the files in the version and their largest seqnum)? Assuming we get the error immediately, the caller should have a way to take corrective action and retry. It's not high priority, but I'm worried about forgetting about this and then having an incident where a stray wide local file in L5 caused many snapshots to have to be sent in the old manner.
pkg/kv/kvserver/store_snapshot.go
line 342 at r10 (raw file):
return errors.AssertionFailedf("start key %s must be before end key %s", end, start)
}
for msstw.keySpans[msstw.currSpan].EndKey.Compare(decodedStart.Key) <= 0 {
can we lift this sst rollover logic from Put/PutRangeKey/PutInternalPointKey/PutInternalRangeDelete into a separate method?
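To make the suggestion concrete, here is a minimal, self-contained sketch of lifting the rollover loop into one helper that every Put variant calls first. multiSSTWriter, keySpan, and finishAndMoveToNextSST are simplified stand-ins for the real types in store_snapshot.go, not the actual implementation.

```go
package main

import (
	"bytes"
	"fmt"
)

type keySpan struct{ Key, EndKey []byte }

// multiSSTWriter is a simplified stand-in: one SST per key span, written in order.
type multiSSTWriter struct {
	keySpans []keySpan
	currSpan int
}

// finishAndMoveToNextSST stands in for flushing the current SST writer and
// opening a writer for the next key span.
func (m *multiSSTWriter) finishAndMoveToNextSST() error {
	m.currSpan++
	if m.currSpan >= len(m.keySpans) {
		return fmt.Errorf("key outside the bounds of the last span")
	}
	return nil
}

// rolloverSST is the lifted helper: it advances to the span containing key,
// so the rollover logic lives in exactly one place.
func (m *multiSSTWriter) rolloverSST(key []byte) error {
	for bytes.Compare(m.keySpans[m.currSpan].EndKey, key) <= 0 {
		if err := m.finishAndMoveToNextSST(); err != nil {
			return err
		}
	}
	if bytes.Compare(m.keySpans[m.currSpan].Key, key) > 0 {
		return fmt.Errorf("key %q is before the current span start", key)
	}
	return nil
}

// Put shows how each write path would start with the shared rollover step.
func (m *multiSSTWriter) Put(key, value []byte) error {
	if err := m.rolloverSST(key); err != nil {
		return err
	}
	// ... write key/value into the current SST ...
	return nil
}

func main() {
	m := &multiSSTWriter{keySpans: []keySpan{
		{Key: []byte("a"), EndKey: []byte("c")},
		{Key: []byte("c"), EndKey: []byte("z")},
	}}
	fmt.Println(m.Put([]byte("d"), []byte("v")), m.currSpan) // <nil> 1
}
```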
pkg/kv/kvserver/store_snapshot.go
line 363 at r10 (raw file):
}

func (msstw *multiSSTWriter) PutInternalRangeKey(
PutInternalRangeDelete and PutInternalRangeKey look very similar. Can the code be mostly shared?
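Similarly, a minimal sketch of how the two range-operation methods could share their common body behind a helper that takes the per-operation write as a closure. sstWriter, putInternalRangeOp, and the method signatures are hypothetical stand-ins, not the real kvserver code.

```go
package main

import (
	"bytes"
	"fmt"
)

type sstWriter struct{ ops []string }

type multiSSTWriter struct{ curr *sstWriter }

// putInternalRangeOp holds the shared work (bounds checks, and in the real
// code, engine-key decoding and span rollover) and delegates only the final
// write to the caller.
func (m *multiSSTWriter) putInternalRangeOp(start, end []byte, write func(w *sstWriter) error) error {
	if bytes.Compare(start, end) >= 0 {
		return fmt.Errorf("start key %q must be before end key %q", start, end)
	}
	return write(m.curr)
}

func (m *multiSSTWriter) PutInternalRangeDelete(start, end []byte) error {
	return m.putInternalRangeOp(start, end, func(w *sstWriter) error {
		w.ops = append(w.ops, fmt.Sprintf("RANGEDEL [%s,%s)", start, end))
		return nil
	})
}

func (m *multiSSTWriter) PutInternalRangeKey(start, end, suffix, value []byte) error {
	return m.putInternalRangeOp(start, end, func(w *sstWriter) error {
		w.ops = append(w.ops, fmt.Sprintf("RANGEKEY [%s,%s)@%s=%s", start, end, suffix, value))
		return nil
	})
}

func main() {
	m := &multiSSTWriter{curr: &sstWriter{}}
	_ = m.PutInternalRangeDelete([]byte("a"), []byte("c"))
	_ = m.PutInternalRangeKey([]byte("a"), []byte("c"), []byte("s"), []byte("v"))
	fmt.Println(m.curr.ops)
}
```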
pkg/kv/kvserver/store_snapshot.go
line 827 at r10 (raw file):
b = nil
ssts = ssts[:0]
stopSharedReplicate = false
why this stopSharedReplicate=false?
Is this because it is a single-time signal to the receiver to write a rangedel?
It reads oddly since IterateRKSpansVisitor, which is the old path, also uses this code. Perhaps rename it to transitionFromSharedToRegularReplicate -- then it is understood as a state transition.
Also, I see why this kinda works even though the caller may have written part of the sst before this switch, in that it will now add a RANGEDEL. But RANGEDELs don't apply to data in the same ingested sst, because they have the same seqnum, so maybe it doesn't really work. Is there a unit test for the switchover?
pkg/kv/kvserver/kvserverpb/raft.proto
line 260 at r10 (raw file):
}

// SharedTable represents one shared SSTable present in shared storage.
Is this the proto equivalent of pebble.SharedSSTMeta? If yes, can you add a comment here that these two should be kept in sync.
pkg/kv/kvserver/rditer/testdata/TestReplicaDataIterator/r1/all/output
line 1 at r6 (raw file):
echo
why are rditer testdata files being changed/deleted in this PR?
pkg/kv/kvserver/replica_sst_snapshot_storage_test.go
line 280 at r10 (raw file):
msstw, err := newMultiSSTWriter(
    ctx, cluster.MakeTestingClusterSettings(), scratch, keySpans, 0,
    false, /* skipRangeDelForLastSpan */
is there testing for the true case?
TFTR!
Dismissed @bananabrick and @pavelkalinnikov from 9 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @pavelkalinnikov, and @sumeerbhola)
Previously, bananabrick (Arjun Nair) wrote…
Unfamiliar with this, are snapshots used for replication? So if we replicate using fast ingestion, with a RF=3, will the actual data only be present on disk once?
Yes, that's right. But S3 is already replicated so we don't need to worry too much there. We do need to worry about S3 availability but that's a whole other can of worms and discussed elsewhere.
Previously, sumeerbhola wrote…
Do we have an existing issue to close this hole of ErrInvalidSkipSharedIteration, or at least to promote it to the first thing we do in ScanInternal before doing the iteration (check the files in the version and their largest seqnum)? Assuming we get the error immediately, the caller should have a way to take corrective action and retry. It's not high priority, but I'm worried about forgetting about this and then having an incident where a stray wide local file in L5 caused many snapshots to have to be sent in the old manner.
We already iterate over files in L5/L6 as the very first thing in ScanInternal and issue the error, if it's going to happen, before any keys are surfaced, so that issue is unnecessary. But 1) I didn't want this code to rely too heavily on that for clean abstraction reasons, and 2) we do the ScanInternal at the end of the KV snapshot iteration because the user key range is at the end. Hence the need for ReplicatedKeysUserOnly, because we could be forced to do a normal iteration on just the user keys.
pkg/kv/kvserver/store_snapshot.go
line 803 at r3 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
Could we have stopFastReplicate == true, but ssts also be non-empty?
We throw away sharedSSTs when that happens.
pkg/kv/kvserver/store_snapshot.go
line 893 at r3 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
The IterateReplicaKeySpans function takes a replicatedOnly field, which ensures that we are only iterating over fully replicated key spans. But this function doesn't, and it's directly calling ScanInternal. How are we ensuring that we're still only iterating over fully replicated key spans, or do we not need that property?
IterateReplicaKeySpansShared ensures that. It's not possible for Pebble to know what's replicated and what's not anyway, so that's the best place, abstraction-wise, to ensure it.
pkg/kv/kvserver/store_snapshot.go
line 342 at r10 (raw file):
Previously, sumeerbhola wrote…
can we lift this sst rollover logic from Put/PutRangeKey/PutInternalPointKey/PutInternalRangeDelete into a separate method?
Done.
pkg/kv/kvserver/store_snapshot.go
line 363 at r10 (raw file):
Previously, sumeerbhola wrote…
PutInternalRangeDelete and PutInternalRangeKey look very similar. Can the code be mostly shared?
Done. (Reduced a lot of code duplication)
pkg/kv/kvserver/store_snapshot.go
line 827 at r10 (raw file):
Previously, sumeerbhola wrote…
why this stopSharedReplicate=false?
Is this because it is a single-time signal to the receiver to write a rangedel?
It reads oddly since IterateRKSpansVisitor, which is the old path, also uses this code. Perhaps rename it to transitionFromSharedToRegularReplicate -- then it is understood as a state transition.
Also, I see why this kinda works even though the caller may have written part of the sst before this switch, in that it will now add a RANGEDEL. But RANGEDELs don't apply to data in the same ingested sst, because they have the same seqnum, so maybe it doesn't really work. Is there a unit test for the switchover?
This works because the caller will get the error before any keys are returned by IterateReplicaKeySpansShared, and so we'll throw this flag up and put down the rangedel. The rangedel won't need to apply to keys within the same sst, because at that point there won't be any keys in the sst in that span anyway. I've more explicitly called out in the raft.proto comment that this flag must go up before any user keys are streamed. We still need to send a snapshot batch over the wire with this bool as true so the receiver knows to put down the rangedel.
Also did the rename.
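A minimal, self-contained sketch of that switchover as described: if shared iteration fails up front, the collected shared-SST metadata is discarded and a one-time flag tells the receiver to lay a range deletion over the user-key span before the range is re-sent through the regular path. The names here (sharedSSTMeta, errInvalidSkipSharedIteration, sendSnapshot) are hypothetical stand-ins for the real pebble/kvserver identifiers.

```go
package main

import (
	"errors"
	"fmt"
)

var errInvalidSkipSharedIteration = errors.New("invalid skip-shared iteration")

type sharedSSTMeta struct{ fileNum int }

type snapshotBatch struct {
	sharedSSTs []sharedSSTMeta
	// transitionFromSharedToRegularReplicate is sent exactly once, before any
	// user keys, so the receiver can write a RANGEDEL over the user-key span.
	transitionFromSharedToRegularReplicate bool
}

// iterateShared stands in for IterateReplicaKeySpansShared: it either yields
// shared-file metadata or fails up front, before surfacing any keys.
func iterateShared(ok bool) ([]sharedSSTMeta, error) {
	if !ok {
		return nil, errInvalidSkipSharedIteration
	}
	return []sharedSSTMeta{{fileNum: 7}}, nil
}

func sendSnapshot(sharedOK bool) (snapshotBatch, error) {
	ssts, err := iterateShared(sharedOK)
	if errors.Is(err, errInvalidSkipSharedIteration) {
		// Throw away any collected shared-SST metadata and fall back to
		// regular, key-by-key replication.
		return snapshotBatch{transitionFromSharedToRegularReplicate: true}, nil
	} else if err != nil {
		return snapshotBatch{}, err
	}
	return snapshotBatch{sharedSSTs: ssts}, nil
}

func main() {
	fmt.Println(sendSnapshot(true))  // shared path: metadata only
	fmt.Println(sendSnapshot(false)) // fallback: flag set, no shared SSTs
}
```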
pkg/kv/kvserver/kvserverpb/raft.proto
line 260 at r10 (raw file):
Previously, sumeerbhola wrote…
Is this the proto equivalent of pebble.SharedSSTMeta? If yes, can you add a comment here that these two should be kept in sync.
Done.
pkg/kv/kvserver/rditer/testdata/TestReplicaDataIterator/r1/all/output
line 1 at r6 (raw file):
Previously, sumeerbhola wrote…
why are rditer testdata files being changed/deleted in this PR?
It's because ReplicatedKeysUserOnly now excludes ReplicatedByRangeID, a change made in this PR and not the last one.
pkg/kv/kvserver/replica_sst_snapshot_storage_test.go
line 280 at r10 (raw file):
Previously, sumeerbhola wrote…
is there testing for the true case?
Done. Now there is.
Force-pushed d3f6d60 to 1075f61
Reviewed 4 of 61 files at r5, 2 of 12 files at r9, 1 of 71 files at r10, 8 of 11 files at r11, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @bananabrick, @itsbilal, and @pavelkalinnikov)
pkg/kv/kvserver/store_snapshot.go
line 827 at r10 (raw file):
I've more explicitly called out in the raft.proto comment that this flag must go up before any user keys are streamed.
See comment above about checking this here too.
pkg/kv/kvserver/store_snapshot.go
line 265 at r11 (raw file):
}
// msstw.currSpan == len(msstw.keySpans)-1
if err := msstw.currSST.ClearRawRange(
Can you add a check that no Put call has happened for this span (since that won't be deleted by this rangedel and rangekeydel)? Or alternatively, since rollover to a new span happens only on a Put call, should we be asserting that we are not at the last span?
pkg/kv/kvserver/replica_sst_snapshot_storage_test.go
line 346 at r11 (raw file):
for the last span which only gets a rangedel when explicitly added.
so this test is only testing the case when it is explicitly added? How about testing the case when addRangeDelForLastSpan() is not called?
If the sender node was created with a SharedStorage, switch to fast ingestion where we ScanInternal() the keys not in shared levels, and just share the metadata for files in shared levels. The sender of the snapshot specifies in the Header that it is using this ability, and the receiver rejects the snapshot if it cannot accept shared snapshots.

If ScanInternal() returns an `ErrInvalidSkipSharedIteration`, we switch back to old-style snapshots where the entirety of the range is sent over the stream as SnapshotRequests.

Future changes will add better support for detection of when different nodes point to different blob storage buckets / shared storage locations, and incorporate that in rebalancing.

Fixes cockroachdb#103028.

Release note (general change): Takes advantage of the new CLI option `--experimental-shared-storage` to rebalance faster from node to node.
Force-pushed 1075f61 to 02a4b7e
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @bananabrick, @pavelkalinnikov, and @sumeerbhola)
pkg/kv/kvserver/store_snapshot.go
line 827 at r10 (raw file):
Previously, sumeerbhola wrote…
I've more explicitly called out in the raft.proto comment that this flag must go up before any user keys are streamed.
See comment above about checking this here too.
Done.
pkg/kv/kvserver/store_snapshot.go
line 265 at r11 (raw file):
Previously, sumeerbhola wrote…
Can you add a check that no Put call has happened for this span (since that won't be deleted by this rangedel and rangekeydel)? Or alternatively, since rollover to a new span happens only on a Put call, should we be asserting that we are not at the last span?
Done. Went with the latter.
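For illustration, a minimal sketch of the guard chosen here ("the latter"): since rollover to a new span only happens on a Put, being positioned at the last span already implies keys were written there, and a same-seqnum RANGEDEL in the same ingested sst would not clear them, so the method refuses to proceed. The types and messages are simplified stand-ins for multiSSTWriter, not the actual code.

```go
package main

import "fmt"

type span struct{ Key, EndKey []byte }

// multiSSTWriter is a simplified stand-in for the writer in store_snapshot.go.
type multiSSTWriter struct {
	keySpans []span
	currSpan int
}

// addRangeDelForLastSpan asserts that the writer has not rolled over into the
// last span yet before laying the deletions over it.
func (m *multiSSTWriter) addRangeDelForLastSpan() error {
	if m.currSpan >= len(m.keySpans)-1 {
		return fmt.Errorf("addRangeDelForLastSpan called after writing to the last span")
	}
	last := m.keySpans[len(m.keySpans)-1]
	fmt.Printf("RANGEDEL + RANGEKEYDEL over [%s, %s)\n", last.Key, last.EndKey)
	return nil
}

func main() {
	m := &multiSSTWriter{keySpans: []span{
		{Key: []byte("a"), EndKey: []byte("c")},
		{Key: []byte("c"), EndKey: []byte("z")},
	}}
	fmt.Println(m.addRangeDelForLastSpan()) // ok: still on the first span
	m.currSpan = 1
	fmt.Println(m.addRangeDelForLastSpan()) // error: already at the last span
}
```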
pkg/kv/kvserver/replica_sst_snapshot_storage_test.go
line 346 at r11 (raw file):
Previously, sumeerbhola wrote…
for the last span which only gets a rangedel when explicitly added.
so this test is only testing the case when it is explicitly added? How about testing the case when addRangeDelForLastSpan() is not called?
Done. Good catch - added a test case for the other case.
bors r=sumeerbhola
Build succeeded.
If the sender node was created with a SharedStorage, switch to fast ingestion where we ScanInternal() the keys not in shared levels, and just share the metadata for files in shared levels. The sender of the snapshot specifies in the Header that it is using this ability, and the receiver rejects the snapshot if it cannot accept shared snapshots.

If ScanInternal() returns an ErrInvalidSkipSharedIteration, we switch back to old-style snapshots where the entirety of the range is sent over the stream as SnapshotRequests.

Future changes will add better support for detection of when different nodes point to different blob storage buckets / shared storage locations, and incorporate that in rebalancing.

Fixes #103028.
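As a rough illustration of the handshake described in this commit message, a minimal sketch of the receiver-side check: a shared-replicate snapshot is rejected when the receiver is not configured with shared storage, so the sender can fall back to a regular snapshot. The field and function names here are hypothetical stand-ins, not the actual kvserverpb/kvserver API.

```go
package main

import (
	"errors"
	"fmt"
)

type snapshotHeader struct {
	// SharedReplicate mirrors the renamed proto field discussed above: the
	// snapshot carries metadata for shared files rather than their contents.
	SharedReplicate bool
}

var errSharedSnapshotsUnsupported = errors.New("receiver not configured for shared snapshots")

// checkSnapshotHeader stands in for the receiver-side validation step.
func checkSnapshotHeader(h snapshotHeader, receiverHasSharedStorage bool) error {
	if h.SharedReplicate && !receiverHasSharedStorage {
		return errSharedSnapshotsUnsupported
	}
	return nil
}

func main() {
	fmt.Println(checkSnapshotHeader(snapshotHeader{SharedReplicate: true}, true))  // <nil>
	fmt.Println(checkSnapshotHeader(snapshotHeader{SharedReplicate: true}, false)) // error
}
```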