kvserver,storage: Update snapshot strategy to use shared storage #105839

Merged: 1 commit, Aug 16, 2023

Conversation

@itsbilal (Member)

If the sender node was created with a SharedStorage, switch to fast ingestion where we ScanInternal() the keys not in shared levels, and just share the metadata for files in shared levels. The sender of the snapshot specifies in the Header that it is using this ability, and the receiver rejects the snapshot if it cannot accept shared snapshots.

If ScanInternal() returns an ErrInvalidSkipSharedIteration, we switch back to old-style snapshots where the entirety of the range is sent over the stream as SnapshotRequests.

Future changes will add better support for detection of when different nodes point to different blob storage buckets / shared storage locations, and incorporate that in rebalancing.

Fixes #103028.
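
A minimal, self-contained Go sketch of the sender-side flow described above, assuming hypothetical names (sendSharedSnapshot, sendRegularSnapshot, errInvalidSkipSharedIteration) rather than the real CockroachDB/Pebble identifiers:

```go
// A sketch of the sender-side decision, not the actual implementation.
package main

import (
	"errors"
	"fmt"
)

// errInvalidSkipSharedIteration is a stand-in for Pebble's sentinel error.
var errInvalidSkipSharedIteration = errors.New("invalid skip-shared iteration")

// sendSharedSnapshot models the fast path: stream only the keys above the
// shared levels, plus metadata for files in shared levels. It can fail with
// the sentinel error before any keys have been streamed.
func sendSharedSnapshot(mustFallBack bool) error {
	if mustFallBack {
		return errInvalidSkipSharedIteration
	}
	fmt.Println("streamed non-shared keys and metadata of shared files")
	return nil
}

// sendRegularSnapshot models the old-style path: the entire range is sent
// over the stream as SnapshotRequests.
func sendRegularSnapshot() error {
	fmt.Println("streamed the entire range")
	return nil
}

// sendSnapshot uses the shared path only when the sender has shared storage
// configured and the receiver accepts shared snapshots, falling back to an
// old-style snapshot when skip-shared iteration is rejected.
func sendSnapshot(senderHasSharedStorage, receiverAcceptsShared, mustFallBack bool) error {
	if senderHasSharedStorage && receiverAcceptsShared {
		err := sendSharedSnapshot(mustFallBack)
		if err == nil || !errors.Is(err, errInvalidSkipSharedIteration) {
			return err
		}
		// Skip-shared iteration was rejected: fall through to the old path.
	}
	return sendRegularSnapshot()
}

func main() {
	_ = sendSnapshot(true, true, true) // exercises the fallback path
}
```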

@itsbilal itsbilal requested review from a team as code owners June 29, 2023 18:31
@itsbilal itsbilal requested a review from a team June 29, 2023 18:31
@itsbilal itsbilal requested a review from a team as a code owner June 29, 2023 18:31
@itsbilal itsbilal requested a review from bananabrick June 29, 2023 18:31
@cockroach-teamcity (Member)

This change is Reviewable

@itsbilal itsbilal force-pushed the snapshot-strategy-disagg branch 2 times, most recently from 3e43689 to 7b8fc89 on June 29, 2023 20:12
@itsbilal (Member Author)

Turning into a draft PR until I get to the bottom of the test failures

@itsbilal itsbilal marked this pull request as draft June 29, 2023 20:21
@itsbilal itsbilal force-pushed the snapshot-strategy-disagg branch 6 times, most recently from e8c4db8 to 324d647 on July 4, 2023 17:23
@itsbilal (Member Author) commented Jul 4, 2023

This is ready for a look now.

@itsbilal itsbilal marked this pull request as ready for review July 4, 2023 17:53
@itsbilal itsbilal force-pushed the snapshot-strategy-disagg branch 2 times, most recently from 74cba06 to 18fa524 on July 4, 2023 18:49
@tbg tbg requested review from tbg and pav-kv and removed request for a team and tbg July 5, 2023 09:10
@itsbilal itsbilal force-pushed the snapshot-strategy-disagg branch from 18fa524 to 1cd6602 on July 5, 2023 21:05
@bananabrick (Contributor) left a comment

Took a look at everything but the tests, and it looks good, though my mental models for this code aren't great. This also deserves a look from someone more familiar with this code, as well as someone else from storage.

I'll look at the tests carefully on Monday.

Reviewed 2 of 40 files at r1, 1 of 9 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @pavelkalinnikov)


-- commits line 6 at r3:
Unfamiliar with this, are snapshots used for replication? So if we replicate using fast ingestion, with a RF=3, will the actual data only be present on disk once?


pkg/cli/debug.go line 475 at r3 (raw file):

	var results int
	return rditer.IterateReplicaKeySpans(&desc, snapshot, debugCtx.replicated, rditer.TableKeysInclude, /* skipTableKeys */

This should be /* includeTableKeys */.


pkg/kv/kvserver/store_snapshot.go line 251 at r3 (raw file):

func (msstw *multiSSTWriter) addRangeDelForLastSpan() error {
	if !msstw.doExcise {
		panic("this code path is only meant for ")

Incomplete comment.


pkg/kv/kvserver/store_snapshot.go line 803 at r3 (raw file):

	flushBatch := func() error {
		if err := kvSS.sendBatch(ctx, stream, b, ssts, stopFastReplicate, timingTag); err != nil {

Could we have stopFastReplicate == true, but ssts also be non-empty?


pkg/kv/kvserver/store_snapshot.go line 893 at r3 (raw file):

	var valBuf []byte
	if fastReplicate {
		err := rditer.IterateReplicaKeySpansShared(ctx, snap.State.Desc, snap.EngineSnap, func(key *pebble.InternalKey, value pebble.LazyValue) error {

The IterateReplicaKeySpans function takes a replicatedOnly field, which ensures that we are only iterating over fully replicated key spans. But this function doesn't, and it's directly calling ScanInternal. How are we ensuring that we're still only iterating over fully replicated key spans, or do we not need that property?


pkg/storage/batch.go line 204 at r3 (raw file):

// EngineRangeKeys returns the engine range key values at the current entry.
func (r *BatchReader) EngineRangeKeys() ([]EngineRangeKeyValue, error) {

super nit: This can use the above function to get the raw keys, and then turn them into engine keys? Feel free to ignore this.

@bananabrick bananabrick requested a review from a team July 8, 2023 01:41
@pav-kv (Collaborator) left a comment

Flushing some comments.

Is there any way to split this PR into logically independent commits (easier to review)? For example, the part with the "include/exclude user keys" + tests in rditer seems nicely separable.

Could you also indicate any particular area Replication reviewers should focus on, or provide tips on how to review this PR (because it's large)?

// If true, the snapshot could contain shared files present in a pre-configured
// or explicitly specified shared.Storage instance. Such files will have their
// metadata present in the snapshot, but not file contents.
bool fast_replicate = 12;
A Collaborator commented:

Is it always "fast" when true, or are there trade-offs? Consider a more neutral name that reflects the mechanism itself rather than its performance characteristics, something like use_shared_storage / enable_shared_storage / enable_shared_snapshots.

@itsbilal (Member Author) replied:

Renamed it to SharedReplicate; I'm also good with SharedSnapshot, let me know if that sounds better. We don't want all uses of shared storage to result in shared replication (e.g. non-user keys, and in the future, different S3 buckets for multiregion), so a name that refers to shared storage in general is probably not ideal.

Other review threads (resolved or outdated) on: pkg/kv/kvserver/kvserverpb/raft.proto, pkg/storage/sst_writer.go, pkg/cli/debug.go, pkg/kv/kvserver/rditer/replica_data_iter.go, pkg/kv/kvserver/rditer/replica_data_iter_test.go, pkg/kv/kvserver/store_snapshot.go, pkg/storage/pebble.go
@itsbilal (Member Author)

Thank you so much for the reviews @pavelkalinnikov and @bananabrick. I'll be getting to this shortly; oncall and code-yellow have been occupying more of my time this week.

@itsbilal (Member Author)

Thanks for the reviews @pavelkalinnikov and @bananabrick. I just created #107297, which contains two commits: one for the pkg/storage changes, and one for the pkg/kvserver/rditer changes that use those pkg/storage changes. Both of you should be able to review those commits independently, with much greater ease than this mega-PR. I should have also addressed all of your comments that applied to those packages in that PR. I'll now update this PR to address the remainder of the comments, and then rebase it on top of that PR (so this change will be the third commit).

@itsbilal itsbilal force-pushed the snapshot-strategy-disagg branch from 1cd6602 to ea998eb on July 21, 2023 16:46
@itsbilal (Member Author)

Finished updating this PR and addressed all changes suggested so far. Note that the first two commits are from #107297 and only the newest (i.e. third) commit should be reviewed here. I'll rebase this branch on master once #107297 merges.

@itsbilal itsbilal force-pushed the snapshot-strategy-disagg branch 2 times, most recently from 69aecc4 to e84a133 on July 21, 2023 19:54
@itsbilal itsbilal force-pushed the snapshot-strategy-disagg branch 2 times, most recently from f4472d2 to 1314597 on August 3, 2023 19:41
craig bot pushed a commit that referenced this pull request Aug 10, 2023
107297: storage,kvserver: Foundational changes for disaggregated ingestions r=sumeerbhola a=itsbilal

This change contains two commits (split off from the original mega-PR, #105839). The first is a pkg/storage change to add new interface methods to call pebble's db.ScanInternal as well as implement related helper methods in sstable writers/batch readers/writers to be able to do disaggregated snapshot ingestion. The second is a kvserver/rditer change to allow finer-grained control on what replicated spans we iterate on, as well as to be able to specifically opt into skip-shared iteration over the user key span through the use of `ScanInternal`.

---

**storage: Update Engine/Reader/Writer interfaces for ScanInternal**

This change updates pkg/storage interfaces and implementations to allow
the use of ScanInternal in skip-shared iteration mode as well as
writing/reading of internal point keys, range dels and range keys.
Replication / snapshot code will soon rely on these changes to
be able to replicate internal keys in higher levels plus metadata
of shared sstables in lower levels, as opposed to just observed
user keys.

Part of #103028

Epic: none

Release note: None

**kvserver: Add ability to filter replicated spans in Select/Iterate**

This change adds the ability to select for just the replicated
span in rditer.Select and rditer.IterateReplicaKeySpans. Also
adds a new rditer.IterateReplicaKeySpansShared that does a
ScanInternal on just the user key span, to be able to collect
metadata of shared sstables as well as any internal keys above
them.

We only use skip-shared iteration for the replicated user key span
of a range, and in practice, only if it's a non-system range.

Part of #103028.

Epic: none

Release note: None

108336: sql: retry more distributed errors as local r=yuzefovich a=yuzefovich

This PR contains a couple of commits that increase the allow-list of errors that are retried locally. In particular, it allows us to hide some issues we have around using DistSQL and shutting down SQL pods.

Fixes: #106537.
Fixes: #108152.
Fixes: #108271.

Release note: None

108406: server,testutils: remove complexity r=yuzefovich,herkolategan a=knz

There is a saying (paraphrasing) that it always takes more work removing unwanted complexity than it takes to add it. This is an example of that.

Prior to this commit, there was an "interesting" propagation of the flag that decides whether or not to define a test tenant for test servers and clusters. In a nutshell, we had:

- an "input" flag in `base.TestServerArgs`, which remained mostly immutable
- a boolean decided once by `ShouldStartDefaultTestTenant()` either in:
  - `serverutils.StartServerOnlyE`
  - or `testcluster.Start`
- that boolean choice was then propagated to `server.testServer` via _another_ boolean config flag in `server.BaseConfig`
- both the 2nd boolean and the original input flag were then again checked when the time came to do the work (in `maybeStartDefaultTestTenant`).

Additional complexity was then incurred by the need of `TestCluster` to make the determination just once (and not once per server).

This commit cuts through all the layers of complexity by simply propagating the choice of `ShouldStartDefaultTestTenant()` back into the `TestServerArgs` and only ever reading from that subsequently.

Release note: None
Epic: CRDB-18499

108465: cloudccl: allow external connection tests to be run in parallel r=rhu713 a=rhu713

Currently external connection tests read and write to the same path in cloud storage. Add a random uint64 as part of the path so that test runs have unique paths and can be run in parallel.

Fixes: #107407

Release note: None

108481: acceptance: stabilize start-single-node in tcl test r=santamaura a=dhartunian

We've continued to see flakes on this test which contain messages of throttled stores on node startup. The hypothesis is that these are due to leftover data directories from prior startups during the same test.

This change clears the `logs/db` data directory for those invocations and also adds the sql memory flag which the common tcl function also uses.

Resolves #108405
Epic: None

Release note: None

108496: kv: unit test `PrepareTransactionForRetry` and `TransactionRefreshTimestamp` r=miraradeva a=nvanbenschoten

Informs #104233.

This commit adds a pair of new unit tests to verify the behavior of `PrepareTransactionForRetry` and `TransactionRefreshTimestamp`. These functions will be getting more complex for #104233, so it will be helpful to have these tests in place. The tests also serve as good documentation.

Release note: None

Co-authored-by: Bilal Akhtar <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@itsbilal itsbilal force-pushed the snapshot-strategy-disagg branch from 1314597 to e44bb30 on August 10, 2023 16:36
@itsbilal (Member Author)

Just rebased on master; this should be a slimmer and easier-to-review PR now.

@sumeerbhola (Collaborator) left a comment

Reviewed 2 of 61 files at r5, 1 of 69 files at r7, 3 of 12 files at r9, 13 of 71 files at r10, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @itsbilal, and @pavelkalinnikov)


-- commits line 11 at r10:
Do we have an existing issue to close this hole with ErrInvalidSkipSharedIteration, or at least to promote that check to the first thing we do in ScanInternal, before doing the iteration (check the files in the version and their largest seqnums)? Assuming we get the error immediately, the caller should have a way to take corrective action and retry. It's not high priority, but I'm worried about forgetting about this and then having an incident where a stray wide local file in L5 causes many snapshots to have to be sent in the old manner.


pkg/kv/kvserver/store_snapshot.go line 342 at r10 (raw file):

		return errors.AssertionFailedf("start key %s must be before end key %s", end, start)
	}
	for msstw.keySpans[msstw.currSpan].EndKey.Compare(decodedStart.Key) <= 0 {

can we lift this sst rollover logic from Put/PutRangeKey/PutInternalPointKey/PutInternalRangeDelete into a separate method?


pkg/kv/kvserver/store_snapshot.go line 363 at r10 (raw file):

}

func (msstw *multiSSTWriter) PutInternalRangeKey(

PutInternalRangeDelete and PutInternalRangeKey look very similar. Can the code be mostly shared?


pkg/kv/kvserver/store_snapshot.go line 827 at r10 (raw file):

		b = nil
		ssts = ssts[:0]
		stopSharedReplicate = false

Why stopSharedReplicate=false here?
Is it because it is a one-time signal to the receiver to write a rangedel?
It reads oddly since IterateRKSpansVisitor, which is the old path, also uses this code. Perhaps rename it to transitionFromSharedToRegularReplicate so it reads as a state transition.

Also, I see why this kinda works even though the caller may have written part of the sst before this switch, in that it will now add a RANGEDEL. But RANGEDELs don't apply to data in the same ingested sst, because they have the same seqnum, so maybe it doesn't really work. Is there a unit test for the switchover?


pkg/kv/kvserver/kvserverpb/raft.proto line 260 at r10 (raw file):

  }

  // SharedTable represents one shared SSTable present in shared storage.

Is this the proto equivalent of pebble.SharedSSTMeta? If yes, can you add a comment here that these two should be kept in sync.


pkg/kv/kvserver/rditer/testdata/TestReplicaDataIterator/r1/all/output line 1 at r6 (raw file):

echo

why are rditer testdata files being changed/deleted in this PR?


pkg/kv/kvserver/replica_sst_snapshot_storage_test.go line 280 at r10 (raw file):

	msstw, err := newMultiSSTWriter(
		ctx, cluster.MakeTestingClusterSettings(), scratch, keySpans, 0,
		false, /* skipRangeDelForLastSpan */

is there testing for the true case?

@itsbilal (Member Author) left a comment

TFTR!

Dismissed @bananabrick and @pavelkalinnikov from 9 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @pavelkalinnikov, and @sumeerbhola)


-- commits line 6 at r3:

Previously, bananabrick (Arjun Nair) wrote…

Unfamiliar with this, are snapshots used for replication? So if we replicate using fast ingestion, with a RF=3, will the actual data only be present on disk once?

Yes, that's right. But S3 is already replicated so we don't need to worry too much there. We do need to worry about S3 availability but that's a whole other can of worms and discussed elsewhere.


-- commits line 11 at r10:

Previously, sumeerbhola wrote…

Do we have an existing issue to close this hole with ErrInvalidSkipSharedIteration, or at least to promote that check to the first thing we do in ScanInternal, before doing the iteration (check the files in the version and their largest seqnums)? Assuming we get the error immediately, the caller should have a way to take corrective action and retry. It's not high priority, but I'm worried about forgetting about this and then having an incident where a stray wide local file in L5 causes many snapshots to have to be sent in the old manner.

We already iterate over the files in L5/L6 as the very first thing in ScanInternal and return the error, if it's going to happen, before any keys are surfaced, so that issue is unnecessary. But 1) I didn't want this code to rely too heavily on that, for clean abstraction reasons, and 2) we do the ScanInternal at the end of the KV snapshot iteration because the user key range is at the end. Hence the need for ReplicatedKeysUserOnly: we could be forced to do a normal iteration over just the user keys.
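
A minimal sketch of the fail-fast property described in this reply, assuming hypothetical types and names (fileMeta, scanShared) rather than Pebble's real ScanInternal API: the files in the shared levels (L5/L6) are validated up front, so the sentinel error reaches the caller before any keys or metadata are surfaced.

```go
// A sketch of the fail-fast check, not Pebble's actual ScanInternal.
package main

import (
	"errors"
	"fmt"
)

var errInvalidSkipSharedIteration = errors.New("invalid skip-shared iteration")

// fileMeta is a simplified stand-in for an sstable's metadata.
type fileMeta struct {
	level  int
	shared bool // backed by shared (blob) storage
}

// scanShared validates every file in the shared levels (L5/L6) before
// emitting anything, so the sentinel error is returned before the caller
// ever sees a key or a shared-file metadata record.
func scanShared(files []fileMeta, emit func(fileMeta)) error {
	for _, f := range files {
		if f.level >= 5 && !f.shared {
			return errInvalidSkipSharedIteration
		}
	}
	for _, f := range files {
		if f.level >= 5 {
			emit(f) // only metadata of shared files is handed back
		}
	}
	return nil
}

func main() {
	files := []fileMeta{{level: 6, shared: true}, {level: 5, shared: false}}
	err := scanShared(files, func(f fileMeta) { fmt.Println("emit", f) })
	fmt.Println(err) // sentinel error, and nothing was emitted
}
```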


pkg/kv/kvserver/store_snapshot.go line 803 at r3 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Could we have stopFastReplicate == true, but ssts also be non-empty?

We throw away sharedSSTs when that happens.


pkg/kv/kvserver/store_snapshot.go line 893 at r3 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

The IterateReplicaKeySpans function takes a replicatedOnly field, which ensures that we are only iterating over fully replicated key spans. But this function doesn't, and it's directly calling ScanInternal. How are we ensuring that we're still only iterating over fully replicated key spans, or do we not need that property?

IterateReplicaKeySpansShared ensures that. It's not possible for Pebble to know what's replicated and what's not anyway, so that's the best place, abstraction-wise, to ensure it.
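
A minimal sketch of the abstraction split described here, with simplified stand-in types (rangeDescriptor, a toy scan function) rather than the real rditer/Pebble APIs: the caller derives the replicated user key span from the range descriptor and constrains the shared scan to exactly that span, so the storage layer never needs to know what is replicated.

```go
// A sketch of constraining the shared scan to the replicated user key span.
package main

import (
	"bytes"
	"fmt"
)

// rangeDescriptor is a simplified stand-in for a range descriptor.
type rangeDescriptor struct {
	startKey, endKey []byte
}

// iterateUserKeySpanShared only ever scans [desc.startKey, desc.endKey), the
// replicated user key span of the range, so the storage layer does not need
// to know anything about replication.
func iterateUserKeySpanShared(
	desc rangeDescriptor,
	scan func(lower, upper []byte, visit func(key []byte)) error,
	visit func(key []byte),
) error {
	return scan(desc.startKey, desc.endKey, visit)
}

func main() {
	desc := rangeDescriptor{startKey: []byte("b"), endKey: []byte("d")}
	// A toy "engine" with keys inside and outside the range's user key span.
	engine := [][]byte{[]byte("a1"), []byte("b1"), []byte("c1"), []byte("d1")}
	scan := func(lower, upper []byte, visit func(key []byte)) error {
		for _, k := range engine {
			if bytes.Compare(k, lower) >= 0 && bytes.Compare(k, upper) < 0 {
				visit(k)
			}
		}
		return nil
	}
	_ = iterateUserKeySpanShared(desc, scan, func(key []byte) { fmt.Println(string(key)) })
	// Prints b1 and c1 only: keys outside the user key span are never visited.
}
```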


pkg/kv/kvserver/store_snapshot.go line 342 at r10 (raw file):

Previously, sumeerbhola wrote…

can we lift this sst rollover logic from Put/PutRangeKey/PutInternalPointKey/PutInternalRangeDelete into a separate method?

Done.


pkg/kv/kvserver/store_snapshot.go line 363 at r10 (raw file):

Previously, sumeerbhola wrote…

PutInternalRangeDelete and PutInternalRangeKey look very similar. Can the code be mostly shared?

Done. (Reduced a lot of code duplication)


pkg/kv/kvserver/store_snapshot.go line 827 at r10 (raw file):

Previously, sumeerbhola wrote…

Why stopSharedReplicate=false here?
Is it because it is a one-time signal to the receiver to write a rangedel?
It reads oddly since IterateRKSpansVisitor, which is the old path, also uses this code. Perhaps rename it to transitionFromSharedToRegularReplicate so it reads as a state transition.

Also, I see why this kinda works even though the caller may have written part of the sst before this switch, in that it will now add a RANGEDEL. But RANGEDELs don't apply to data in the same ingested sst, because they have the same seqnum, so maybe it doesn't really work. Is there a unit test for the switchover?

This works because the caller will get the error before any keys are returned by IterateReplicaKeySpansShared, so we'll raise this flag and put down the rangedel. The rangedel won't need to apply to keys within the same sst because at that point there won't be any keys in the sst in that span anyway. I've more explicitly called out in the raft.proto comment that this flag must go up before any user keys are streamed. We still need to send a snapshot batch over the wire with this bool set to true so the receiver knows to put down the rangedel.

Also did the rename.
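
A minimal sketch of the switchover protocol being described, with hypothetical names (snapshotBatch, receiver) rather than the real kvserver types: the one-time transition flag is sent before any user keys are streamed, so the receiver can record the range deletion over the still-empty user key span and then apply the normally streamed keys on top.

```go
// A sketch of the one-time transition signal, not the real kvserver types.
package main

import "fmt"

// snapshotBatch stands in for one streamed snapshot request.
type snapshotBatch struct {
	transitionFromSharedToRegular bool // sent at most once, before any user keys
	userKeys                      []string
}

// receiver stands in for the snapshot receiver's apply loop.
type receiver struct {
	clearUserSpan bool // lay down a RANGEDEL over the user key span on ingest
	applied       []string
}

func (r *receiver) apply(b snapshotBatch) {
	if b.transitionFromSharedToRegular {
		// No user keys have been streamed yet, so the range deletion only has
		// to clear pre-existing state, never keys in the same ingested sst.
		r.clearUserSpan = true
	}
	r.applied = append(r.applied, b.userKeys...)
}

func main() {
	r := &receiver{}
	// The sender hit the error before streaming any user keys...
	r.apply(snapshotBatch{transitionFromSharedToRegular: true})
	// ...and then streamed the user keys the old-fashioned way.
	r.apply(snapshotBatch{userKeys: []string{"b1", "c1"}})
	fmt.Println(r.clearUserSpan, r.applied) // true [b1 c1]
}
```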


pkg/kv/kvserver/kvserverpb/raft.proto line 260 at r10 (raw file):

Previously, sumeerbhola wrote…

Is this the proto equivalent of pebble.SharedSSTMeta? If yes, can you add a comment here that these two should be kept in sync.

Done.


pkg/kv/kvserver/rditer/testdata/TestReplicaDataIterator/r1/all/output line 1 at r6 (raw file):

Previously, sumeerbhola wrote…

why are rditer testdata files being changed/deleted in this PR?

It's because ReplicatedKeysUserOnly now excludes ReplicatedByRangeID, a change made in this PR and not the last one.


pkg/kv/kvserver/replica_sst_snapshot_storage_test.go line 280 at r10 (raw file):

Previously, sumeerbhola wrote…

is there testing for the true case?

Done. Now there is.

@itsbilal itsbilal force-pushed the snapshot-strategy-disagg branch 2 times, most recently from d3f6d60 to 1075f61 on August 15, 2023 15:02
@sumeerbhola (Collaborator) left a comment

:lgtm:

Reviewed 4 of 61 files at r5, 2 of 12 files at r9, 1 of 71 files at r10, 8 of 11 files at r11, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @bananabrick, @itsbilal, and @pavelkalinnikov)


pkg/kv/kvserver/store_snapshot.go line 827 at r10 (raw file):

I've more explicitly called out in the raft.proto comment that this flag must go up before any user keys are streamed.

See comment above about checking this here too.


pkg/kv/kvserver/store_snapshot.go line 265 at r11 (raw file):

	}
	// msstw.currSpan == len(msstw.keySpans)-1
	if err := msstw.currSST.ClearRawRange(

Can you add a check that no Put call has happened for this span (since that won't be deleted by this rangedel and rangekeydel)?
Or alternatively, since rollover to a new span happens only on a Put call, should we be asserting that we are not at the last span?


pkg/kv/kvserver/replica_sst_snapshot_storage_test.go line 346 at r11 (raw file):

for the last span which only gets a rangedel when explicitly added.

so this test is only testing the case when it is explicitly added? How about testing the case when addRangeDelForLastSpan() is not called?

If the sender node was created with a SharedStorage, switch to
fast ingestion where we ScanInternal() the keys not in shared
levels, and just share the metadata for files in shared levels.
The sender of the snapshot specifies in the Header that it
is using this ability, and the receiver rejects the snapshot if
it cannot accept shared snapshots.

If ScanInternal() returns an `ErrInvalidSkipSharedIteration`,
we switch back to old-style snapshots where the entirety
of the range is sent over the stream as SnapshotRequests.

Future changes will add better support for detection of when
different nodes point to different blob storage buckets / shared
storage locations, and incorporate that in rebalancing.

Fixes cockroachdb#103028.

Release note (general change): Takes advantage of the new CLI option
`--experimental-shared-storage` to rebalance faster from node to node.
@itsbilal itsbilal force-pushed the snapshot-strategy-disagg branch from 1075f61 to 02a4b7e on August 16, 2023 15:53
@itsbilal (Member Author) left a comment

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @bananabrick, @pavelkalinnikov, and @sumeerbhola)


pkg/kv/kvserver/store_snapshot.go line 827 at r10 (raw file):

Previously, sumeerbhola wrote…

I've more explicitly called out in the raft.proto comment that this flag must go up before any user keys are streamed.

See comment above about checking this here too.

Done.


pkg/kv/kvserver/store_snapshot.go line 265 at r11 (raw file):

Previously, sumeerbhola wrote…

Can you add a check that no Put call has happened for this span (since that won't be deleted by this rangedel and rangekeydel)?
Or alternatively, since rollover to a new span happens only on a Put call, should we be asserting that we are not at the last span?

Done. Went with the latter.
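
A minimal sketch of the assertion chosen here, with hypothetical names (multiSSTSketch) rather than the real multiSSTWriter: since rollover to a new span only happens from a Put, trying to roll over while already at the last span is treated as an error.

```go
// A sketch of the "never roll over past the last span" assertion.
package main

import (
	"errors"
	"fmt"
)

// multiSSTSketch stands in for multiSSTWriter: one output SST per key span.
type multiSSTSketch struct {
	numSpans int
	currSpan int
}

// rollover advances to the next span's SST. Rollover only ever happens from a
// Put, so reaching the last span and trying to roll over again is a bug.
func (w *multiSSTSketch) rollover() error {
	if w.currSpan >= w.numSpans-1 {
		return errors.New("unexpected rollover past the last key span")
	}
	w.currSpan++
	return nil
}

func main() {
	w := &multiSSTSketch{numSpans: 2}
	fmt.Println(w.rollover()) // <nil>
	fmt.Println(w.rollover()) // error: unexpected rollover past the last key span
}
```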


pkg/kv/kvserver/replica_sst_snapshot_storage_test.go line 346 at r11 (raw file):

Previously, sumeerbhola wrote…

for the last span which only gets a rangedel when explicitly added.

so this test is only testing the case when it is explicitly added? How about testing the case when addRangeDelForLastSpan() is not called?

Done. Good catch - added a test case for the other case.

@itsbilal (Member Author)

bors r=sumeerbhola

@craig (Contributor)

craig bot commented Aug 16, 2023

Build succeeded:


Successfully merging this pull request may close these issues.

kvserver: Update snapshot strategy for disaggregated ingests
5 participants