Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage,kvserver: Foundational changes for disaggregated ingestions #107297

Merged
merged 2 commits into from
Aug 10, 2023

Conversation

itsbilal
Copy link
Member

This change contains two commits (split off from the original mega-PR, #105839). The first is a pkg/storage change to add new interface methods to call pebble's db.ScanInternal as well as implement related helper methods in sstable writers/batch readers/writers to be able to do disaggregated snapshot ingestion. The second is a kvserver/rditer change to allow finer-grained control on what replicated spans we iterate on, as well as to be able to specifically opt into skip-shared iteration over the user key span through the use of ScanInternal.


storage: Update Engine/Reader/Writer interfaces for ScanInternal

This change updates pkg/storage interfaces and implementations to allow
the use of ScanInternal in skip-shared iteration mode as well as
writing/reading of internal point keys, range dels and range keys.
Replication / snapshot code will soon rely on these changes to
be able to replicate internal keys in higher levels plus metadata
of shared sstables in lower levels, as opposed to just observed
user keys.

Part of #103028

Epic: none

Release note: None

kvserver: Add ability to filter replicated spans in Select/Iterate

This change adds the ability to select for just the replicated
span in rditer.Select and rditer.IterateReplicaKeySpans. Also
adds a new rditer.IterateReplicaKeySpansShared that does a
ScanInternal on just the user key span, to be able to collect
metadata of shared sstables as well as any internal keys above
them.

We only use skip-shared iteration for the replicated user key span
of a range, and in practice, only if it's a non-system range.

Part of #103028.

Epic: none

Release note: None

@itsbilal itsbilal requested review from a team as code owners July 20, 2023 19:25
@itsbilal itsbilal requested a review from a team July 20, 2023 19:25
@itsbilal itsbilal requested a review from a team as a code owner July 20, 2023 19:25
@itsbilal itsbilal requested a review from bananabrick July 20, 2023 19:25
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@itsbilal
Copy link
Member Author

This is two smaller parts plucked in from #105839 and split into separate commits, to ease code review.

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 8 files at r1, 4 of 49 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @itsbilal, and @pavelkalinnikov)


pkg/storage/engine.go line 786 at r1 (raw file):

	//
	// It is safe to modify the contents of the arguments after it returns.
	PutInternalKey(key *pebble.InternalKey, value []byte) error

Do these need to be a part of the Writer interface? Would it be feasible to extract them out into an InternalWriter that implements Writer and expands the interface? Since we don't expect any other callers to make use of these facilities, it'd be nice to quarantine them. Same goes for ScanInternal on Reader.

@itsbilal itsbilal force-pushed the rditer-shared branch 2 times, most recently from caefffc to 422c94e Compare July 24, 2023 16:10
Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @jbowens, and @pavelkalinnikov)


pkg/storage/engine.go line 786 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Do these need to be a part of the Writer interface? Would it be feasible to extract them out into an InternalWriter that implements Writer and expands the interface? Since we don't expect any other callers to make use of these facilities, it'd be nice to quarantine them. Same goes for ScanInternal on Reader.

Done. The InternalWriter stuff turned out to be pretty straightforward as we only use these methods in the SST writer (where we directly reference the implementation), and the WriteBatch which is easy enough to use with InternalWriter. The Reader side is more complex as 1) Pebble supports ScanInternal on both snapshots and DBs, 2) we use it with a Snapshot, 3) pebbleReadOnly should probably also support ScanInternal anyway, 4) the only case where we don't support a ScanInternal on a reader is a batch, and we can just continue to panic there. So I left the Reader side as-is with the ScanInternal addition

Copy link
Contributor

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take another look at this tomorrow.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @pavelkalinnikov)

Copy link
Contributor

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me. Haven't looked at all of the test files yet. Definitely deserves another look from someone familiar with the kvserver code.

Reviewed 3 of 8 files at r1, 11 of 66 files at r3, 12 of 61 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, and @pavelkalinnikov)


pkg/storage/engine.go line 878 at r3 (raw file):

	// start, end keys and bypasses the EngineKey encoding step. It also only
	// operates on point keys; for range keys, use ClearEngineRangeKey or
	// PutInternalRangeKey.

Is it not safe to modify contents of the slices after this function returns?


pkg/storage/pebble_batch.go line 577 at r3 (raw file):

}

// PutInternalKey implements the WriteBatch interface.

Is this still the WriteBatch interface?


pkg/storage/pebble_batch.go line 579 at r3 (raw file):

// PutInternalKey implements the WriteBatch interface.
func (p *pebbleBatch) PutInternalKey(key *pebble.InternalKey, value []byte) error {
	if len(key.UserKey) == 0 {

When does this happen?


pkg/storage/sst_writer.go line 274 at r3 (raw file):

}

// PutInternalKey implements the Writer interface.

Should this be ... InternalWriter interface?


pkg/kv/kvserver/rditer/testdata/TestReplicaDataIterator/r1/all/all/output line 1 at r4 (raw file):

echo

This test is very hard to read. All these lines are long enough that they don't fit on a screen, and it looks like the text itself is wrapped? My editor has line wrapping turned off. Is it the same for you?

Copy link
Contributor

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, and @pavelkalinnikov)


pkg/kv/kvserver/rditer/testdata/TestReplicaDataIterator/r1/all/all/output line 1 at r4 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

This test is very hard to read. All these lines are long enough that they don't fit on a screen, and it looks like the text itself is wrapped? My editor has line wrapping turned off. Is it the same for you?

I was able to fix this. Was a problem with my editor.

Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @jbowens, and @pavelkalinnikov)


pkg/storage/engine.go line 878 at r3 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Is it not safe to modify contents of the slices after this function returns?

Clarified. And yes.


pkg/storage/pebble_batch.go line 577 at r3 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Is this still the WriteBatch interface?

Yes, because WriteBatch composes InternalBatch.


pkg/storage/pebble_batch.go line 579 at r3 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

When does this happen?

The same check is in all other Batch methods too, so it's here for consistency.


pkg/storage/sst_writer.go line 274 at r3 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Should this be ... InternalWriter interface?

Done.


pkg/kv/kvserver/rditer/testdata/TestReplicaDataIterator/r1/all/all/output line 1 at r4 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

I was able to fix this. Was a problem with my editor.

It's largely autogenerated code, so I would mostly look at differences between this and the other files with the different replicated key filters, rather than review the whole thing as-is.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 8 files at r1, 2 of 66 files at r3, 12 of 65 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @itsbilal, @jbowens, and @pavelkalinnikov)


pkg/storage/engine.go line 786 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Done. The InternalWriter stuff turned out to be pretty straightforward as we only use these methods in the SST writer (where we directly reference the implementation), and the WriteBatch which is easy enough to use with InternalWriter. The Reader side is more complex as 1) Pebble supports ScanInternal on both snapshots and DBs, 2) we use it with a Snapshot, 3) pebbleReadOnly should probably also support ScanInternal anyway, 4) the only case where we don't support a ScanInternal on a reader is a batch, and we can just continue to panic there. So I left the Reader side as-is with the ScanInternal addition

In CRDB, we will be doing ScanInternal only on EFOS, yes?

I'm ok with it being defined on Reader if that is the simplest option.


pkg/storage/engine.go line 605 at r5 (raw file):

	// InternalKeys using a visitor pattern, while also allowing for keys in
	// shared files to be skipped if a visitor is provided for visitSharedFiles.
	// Useful for fast-replicating state from one Reader to another.

DB.ScanInternal allows visitSharedFile to be nil, and I don't see a use case for that. Can we slightly simplify the Pebble code base by requiring it to be non-nil?


pkg/storage/engine.go line 892 at r5 (raw file):

	//
	// It is safe to modify the contents of the arguments after it returns.
	PutInternalKey(key *pebble.InternalKey, value []byte) error

PutInternalPointKey?


pkg/storage/open.go line 187 at r5 (raw file):

		// version upgrade finalization. However, seeing as shared storage is
		// an experimental feature and upgrading from existing stores is not
		// supported, this is fine.

This will ratchet the Pebble format major version, so if someone starts such a node and then wants to start an older binary it will fail, yes? I suppose that can't be made to work anyway since we could have introduced remote files.
btw, have we finalized on the terminology in https://docs.google.com/document/d/1BnjKvE0RERV0QkTAfR5iNn8sYVo0gaVgDx4x6kCNCGg/edit and if yes, is there a place in the Pebble code which documents it?


pkg/storage/pebble.go line 2478 at r5 (raw file):

	visitSharedFile func(sst *pebble.SharedSSTMeta) error,
) error {
	return p.parent.ScanInternal(ctx, lower, upper, visitPointKey, visitRangeDel, visitRangeKey, visitSharedFile)

Given this appeals to the parent, it does not satisfy the behavior of ConsistentIterators. We should document in Reader.ScanInternal that it is not covered by the guarantees of ConsistentIterators.


pkg/storage/pebble_batch.go line 275 at r5 (raw file):

// ClearRawEncodedRange implements the InternalWriter interface.
func (p *pebbleBatch) ClearRawEncodedRange(start, end []byte) error {
	return p.batch.DeleteRange(start, end, pebble.Sync)

why pebble.Sync? I was expecting nil since this is a batch.


pkg/storage/pebble_batch.go line 504 at r5 (raw file):

// PutInternalRangeKey implements the InternalWriter interface.
func (p *pebbleBatch) PutInternalRangeKey(start, end []byte, key rangekey.Key) error {

We expose the seqnums from ScanInternal, but don't use them.

The code comment for DB.ScanInternal does not say that it is collapsing range keys, but it must be doing so -- can you update that comment. Also, similar comment for Reader.ScanInternal.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 61 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @itsbilal, @jbowens, and @pavelkalinnikov)


pkg/kv/kvserver/rditer/replica_data_iter.go line 388 at r6 (raw file):

		panic("reader must provide consistent iterators")
	}
	span := desc.KeySpan().AsRawSpanWithNoLocals()

Should we use Select to construct this with ReplicatedSpansUserOnly? I don't see any other reason for the existence of ReplicatedSpansUserOnly.
It also gives us better test coverage -- I had to check that desc.KeySpan() does the forwarding of the start key to LocalMax.

Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @jbowens, @pavelkalinnikov, and @sumeerbhola)


pkg/kv/kvserver/rditer/replica_data_iter.go line 388 at r6 (raw file):

Previously, sumeerbhola wrote…

Should we use Select to construct this with ReplicatedSpansUserOnly? I don't see any other reason for the existence of ReplicatedSpansUserOnly.
It also gives us better test coverage -- I had to check that desc.KeySpan() does the forwarding of the start key to LocalMax.

Done. The other use case of ReplicatedSpansUserOnly is when we restart iteration due to ErrInvalidSkipSharedIteration.


pkg/storage/engine.go line 786 at r1 (raw file):

Previously, sumeerbhola wrote…

In CRDB, we will be doing ScanInternal only on EFOS, yes?

I'm ok with it being defined on Reader if that is the simplest option.

Yeah, we'll probably only use it on EFOS from CRDB. There's the ScanStatistics case but that's probably going to be within Pebble.


pkg/storage/engine.go line 605 at r5 (raw file):

Previously, sumeerbhola wrote…

DB.ScanInternal allows visitSharedFile to be nil, and I don't see a use case for that. Can we slightly simplify the Pebble code base by requiring it to be non-nil?

The nil use case is ScanStatistics and potentially others in the future. We can require it to be non-nil in the Cockroach interface though, if we want.


pkg/storage/engine.go line 892 at r5 (raw file):

Previously, sumeerbhola wrote…

PutInternalPointKey?

Done.


pkg/storage/open.go line 187 at r5 (raw file):

Previously, sumeerbhola wrote…

This will ratchet the Pebble format major version, so if someone starts such a node and then wants to start an older binary it will fail, yes? I suppose that can't be made to work anyway since we could have introduced remote files.
btw, have we finalized on the terminology in https://docs.google.com/document/d/1BnjKvE0RERV0QkTAfR5iNn8sYVo0gaVgDx4x6kCNCGg/edit and if yes, is there a place in the Pebble code which documents it?

I think this use of SharedStorage is accurate as it'll only be used for shared sstables, not all remote sstables. But yes, we've finalized the terminology and the best definition for the different terms is around BackingType in db.go in pebble.


pkg/storage/pebble.go line 2478 at r5 (raw file):

Previously, sumeerbhola wrote…

Given this appeals to the parent, it does not satisfy the behavior of ConsistentIterators. We should document in Reader.ScanInternal that it is not covered by the guarantees of ConsistentIterators.

Done.


pkg/storage/pebble_batch.go line 275 at r5 (raw file):

Previously, sumeerbhola wrote…

why pebble.Sync? I was expecting nil since this is a batch.

All the other use-cases in pebbleBatch pass in pebble.Sync for the opt, so I just maintained consistency. Either way it won't make a difference as we're not committing the batch.


pkg/storage/pebble_batch.go line 504 at r5 (raw file):

Previously, sumeerbhola wrote…

We expose the seqnums from ScanInternal, but don't use them.

The code comment for DB.ScanInternal does not say that it is collapsing range keys, but it must be doing so -- can you update that comment. Also, similar comment for Reader.ScanInternal.

Done. DB.ScanInternal in Pebble already talks about this , though it could be more specific when talking about range key collapsing.

Copy link
Collaborator

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the rditer commit. Looks good. Some questions about the tests, there seem to be some leftover unused data files.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: once @pavelkalinnikov is happy with the rditer tests

Reviewed 1 of 61 files at r6, 5 of 66 files at r7, 5 of 61 files at r8, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @bananabrick, @itsbilal, @jbowens, and @pavelkalinnikov)

@itsbilal itsbilal requested a review from a team as a code owner August 8, 2023 17:36
@itsbilal
Copy link
Member Author

itsbilal commented Aug 8, 2023

TFTRs! Addressed @pavelkalinnikov 's outstanding comments. Also rebased on #108369 to make it build which should be merged soon anyway (I'll rebase it on master once that merges)

@itsbilal itsbilal removed the request for review from a team August 8, 2023 22:43
This change updates pkg/storage interfaces and implementations to allow
the use of ScanInternal in skip-shared iteration mode as well as
writing/reading of internal point keys, range dels and range keys.
Replication / snapshot code will soon rely on these changes to
be able to replicate internal keys in higher levels plus metadata
of shared sstables in lower levels, as opposed to just observed
user keys.

Part of cockroachdb#103028

Epic: none

Release note: None
This change adds the ability to select for just the replicated
span in rditer.Select and rditer.IterateReplicaKeySpans. Also
adds a new rditer.IterateReplicaKeySpansShared that does a
ScanInternal on just the user key span, to be able to collect
metadata of shared sstables as well as any internal keys above
them.

We only use skip-shared iteration for the replicated user key span
of a range, and in practice, only if it's a non-system range.

Part of cockroachdb#103028.

Epic: none

Release note: None
@itsbilal
Copy link
Member Author

Verbal good-to-go from @pavelkalinnikov , so I'll merge this and rebase #105839. Thanks for the reviews y'all!

bors r=sumeerbhola

@craig craig bot merged commit 1f8fa96 into cockroachdb:master Aug 10, 2023
@craig
Copy link
Contributor

craig bot commented Aug 10, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants