-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage,kvserver: Foundational changes for disaggregated ingestions #107297
Conversation
This is two smaller parts plucked in from #105839 and split into separate commits, to ease code review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 8 files at r1, 4 of 49 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @itsbilal, and @pavelkalinnikov)
pkg/storage/engine.go
line 786 at r1 (raw file):
// // It is safe to modify the contents of the arguments after it returns. PutInternalKey(key *pebble.InternalKey, value []byte) error
Do these need to be a part of the Writer
interface? Would it be feasible to extract them out into an InternalWriter
that implements Writer
and expands the interface? Since we don't expect any other callers to make use of these facilities, it'd be nice to quarantine them. Same goes for ScanInternal
on Reader
.
caefffc
to
422c94e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @jbowens, and @pavelkalinnikov)
pkg/storage/engine.go
line 786 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Do these need to be a part of the
Writer
interface? Would it be feasible to extract them out into anInternalWriter
that implementsWriter
and expands the interface? Since we don't expect any other callers to make use of these facilities, it'd be nice to quarantine them. Same goes forScanInternal
onReader
.
Done. The InternalWriter
stuff turned out to be pretty straightforward as we only use these methods in the SST writer (where we directly reference the implementation), and the WriteBatch
which is easy enough to use with InternalWriter
. The Reader side is more complex as 1) Pebble supports ScanInternal on both snapshots and DBs, 2) we use it with a Snapshot, 3) pebbleReadOnly
should probably also support ScanInternal
anyway, 4) the only case where we don't support a ScanInternal on a reader is a batch, and we can just continue to panic there. So I left the Reader side as-is with the ScanInternal addition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will take another look at this tomorrow.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @pavelkalinnikov)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good to me. Haven't looked at all of the test files yet. Definitely deserves another look from someone familiar with the kvserver code.
Reviewed 3 of 8 files at r1, 11 of 66 files at r3, 12 of 61 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, and @pavelkalinnikov)
pkg/storage/engine.go
line 878 at r3 (raw file):
// start, end keys and bypasses the EngineKey encoding step. It also only // operates on point keys; for range keys, use ClearEngineRangeKey or // PutInternalRangeKey.
Is it not safe to modify contents of the slices after this function returns?
pkg/storage/pebble_batch.go
line 577 at r3 (raw file):
} // PutInternalKey implements the WriteBatch interface.
Is this still the WriteBatch
interface?
pkg/storage/pebble_batch.go
line 579 at r3 (raw file):
// PutInternalKey implements the WriteBatch interface. func (p *pebbleBatch) PutInternalKey(key *pebble.InternalKey, value []byte) error { if len(key.UserKey) == 0 {
When does this happen?
pkg/storage/sst_writer.go
line 274 at r3 (raw file):
} // PutInternalKey implements the Writer interface.
Should this be ... InternalWriter interface
?
pkg/kv/kvserver/rditer/testdata/TestReplicaDataIterator/r1/all/all/output
line 1 at r4 (raw file):
echo
This test is very hard to read. All these lines are long enough that they don't fit on a screen, and it looks like the text itself is wrapped? My editor has line wrapping turned off. Is it the same for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, and @pavelkalinnikov)
pkg/kv/kvserver/rditer/testdata/TestReplicaDataIterator/r1/all/all/output
line 1 at r4 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
This test is very hard to read. All these lines are long enough that they don't fit on a screen, and it looks like the text itself is wrapped? My editor has line wrapping turned off. Is it the same for you?
I was able to fix this. Was a problem with my editor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @jbowens, and @pavelkalinnikov)
pkg/storage/engine.go
line 878 at r3 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
Is it not safe to modify contents of the slices after this function returns?
Clarified. And yes.
pkg/storage/pebble_batch.go
line 577 at r3 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
Is this still the
WriteBatch
interface?
Yes, because WriteBatch composes InternalBatch.
pkg/storage/pebble_batch.go
line 579 at r3 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
When does this happen?
The same check is in all other Batch methods too, so it's here for consistency.
pkg/storage/sst_writer.go
line 274 at r3 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
Should this be
... InternalWriter interface
?
Done.
pkg/kv/kvserver/rditer/testdata/TestReplicaDataIterator/r1/all/all/output
line 1 at r4 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
I was able to fix this. Was a problem with my editor.
It's largely autogenerated code, so I would mostly look at differences between this and the other files with the different replicated key filters, rather than review the whole thing as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 8 files at r1, 2 of 66 files at r3, 12 of 65 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @itsbilal, @jbowens, and @pavelkalinnikov)
pkg/storage/engine.go
line 786 at r1 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Done. The
InternalWriter
stuff turned out to be pretty straightforward as we only use these methods in the SST writer (where we directly reference the implementation), and theWriteBatch
which is easy enough to use withInternalWriter
. The Reader side is more complex as 1) Pebble supports ScanInternal on both snapshots and DBs, 2) we use it with a Snapshot, 3)pebbleReadOnly
should probably also supportScanInternal
anyway, 4) the only case where we don't support a ScanInternal on a reader is a batch, and we can just continue to panic there. So I left the Reader side as-is with the ScanInternal addition
In CRDB, we will be doing ScanInternal only on EFOS, yes?
I'm ok with it being defined on Reader
if that is the simplest option.
pkg/storage/engine.go
line 605 at r5 (raw file):
// InternalKeys using a visitor pattern, while also allowing for keys in // shared files to be skipped if a visitor is provided for visitSharedFiles. // Useful for fast-replicating state from one Reader to another.
DB.ScanInternal
allows visitSharedFile
to be nil, and I don't see a use case for that. Can we slightly simplify the Pebble code base by requiring it to be non-nil?
pkg/storage/engine.go
line 892 at r5 (raw file):
// // It is safe to modify the contents of the arguments after it returns. PutInternalKey(key *pebble.InternalKey, value []byte) error
PutInternalPointKey?
pkg/storage/open.go
line 187 at r5 (raw file):
// version upgrade finalization. However, seeing as shared storage is // an experimental feature and upgrading from existing stores is not // supported, this is fine.
This will ratchet the Pebble format major version, so if someone starts such a node and then wants to start an older binary it will fail, yes? I suppose that can't be made to work anyway since we could have introduced remote files.
btw, have we finalized on the terminology in https://docs.google.com/document/d/1BnjKvE0RERV0QkTAfR5iNn8sYVo0gaVgDx4x6kCNCGg/edit and if yes, is there a place in the Pebble code which documents it?
pkg/storage/pebble.go
line 2478 at r5 (raw file):
visitSharedFile func(sst *pebble.SharedSSTMeta) error, ) error { return p.parent.ScanInternal(ctx, lower, upper, visitPointKey, visitRangeDel, visitRangeKey, visitSharedFile)
Given this appeals to the parent, it does not satisfy the behavior of ConsistentIterators
. We should document in Reader.ScanInternal
that it is not covered by the guarantees of ConsistentIterators
.
pkg/storage/pebble_batch.go
line 275 at r5 (raw file):
// ClearRawEncodedRange implements the InternalWriter interface. func (p *pebbleBatch) ClearRawEncodedRange(start, end []byte) error { return p.batch.DeleteRange(start, end, pebble.Sync)
why pebble.Sync
? I was expecting nil since this is a batch.
pkg/storage/pebble_batch.go
line 504 at r5 (raw file):
// PutInternalRangeKey implements the InternalWriter interface. func (p *pebbleBatch) PutInternalRangeKey(start, end []byte, key rangekey.Key) error {
We expose the seqnums from ScanInternal
, but don't use them.
The code comment for DB.ScanInternal
does not say that it is collapsing range keys, but it must be doing so -- can you update that comment. Also, similar comment for Reader.ScanInternal
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 61 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @itsbilal, @jbowens, and @pavelkalinnikov)
pkg/kv/kvserver/rditer/replica_data_iter.go
line 388 at r6 (raw file):
panic("reader must provide consistent iterators") } span := desc.KeySpan().AsRawSpanWithNoLocals()
Should we use Select
to construct this with ReplicatedSpansUserOnly
? I don't see any other reason for the existence of ReplicatedSpansUserOnly
.
It also gives us better test coverage -- I had to check that desc.KeySpan()
does the forwarding of the start key to LocalMax
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @jbowens, @pavelkalinnikov, and @sumeerbhola)
pkg/kv/kvserver/rditer/replica_data_iter.go
line 388 at r6 (raw file):
Previously, sumeerbhola wrote…
Should we use
Select
to construct this withReplicatedSpansUserOnly
? I don't see any other reason for the existence ofReplicatedSpansUserOnly
.
It also gives us better test coverage -- I had to check thatdesc.KeySpan()
does the forwarding of the start key toLocalMax
.
Done. The other use case of ReplicatedSpansUserOnly
is when we restart iteration due to ErrInvalidSkipSharedIteration
.
pkg/storage/engine.go
line 786 at r1 (raw file):
Previously, sumeerbhola wrote…
In CRDB, we will be doing ScanInternal only on EFOS, yes?
I'm ok with it being defined on
Reader
if that is the simplest option.
Yeah, we'll probably only use it on EFOS from CRDB. There's the ScanStatistics case but that's probably going to be within Pebble.
pkg/storage/engine.go
line 605 at r5 (raw file):
Previously, sumeerbhola wrote…
DB.ScanInternal
allowsvisitSharedFile
to be nil, and I don't see a use case for that. Can we slightly simplify the Pebble code base by requiring it to be non-nil?
The nil use case is ScanStatistics
and potentially others in the future. We can require it to be non-nil in the Cockroach interface though, if we want.
pkg/storage/engine.go
line 892 at r5 (raw file):
Previously, sumeerbhola wrote…
PutInternalPointKey?
Done.
pkg/storage/open.go
line 187 at r5 (raw file):
Previously, sumeerbhola wrote…
This will ratchet the Pebble format major version, so if someone starts such a node and then wants to start an older binary it will fail, yes? I suppose that can't be made to work anyway since we could have introduced remote files.
btw, have we finalized on the terminology in https://docs.google.com/document/d/1BnjKvE0RERV0QkTAfR5iNn8sYVo0gaVgDx4x6kCNCGg/edit and if yes, is there a place in the Pebble code which documents it?
I think this use of SharedStorage is accurate as it'll only be used for shared sstables, not all remote sstables. But yes, we've finalized the terminology and the best definition for the different terms is around BackingType in db.go in pebble.
pkg/storage/pebble.go
line 2478 at r5 (raw file):
Previously, sumeerbhola wrote…
Given this appeals to the parent, it does not satisfy the behavior of
ConsistentIterators
. We should document inReader.ScanInternal
that it is not covered by the guarantees ofConsistentIterators
.
Done.
pkg/storage/pebble_batch.go
line 275 at r5 (raw file):
Previously, sumeerbhola wrote…
why
pebble.Sync
? I was expecting nil since this is a batch.
All the other use-cases in pebbleBatch
pass in pebble.Sync
for the opt, so I just maintained consistency. Either way it won't make a difference as we're not committing the batch.
pkg/storage/pebble_batch.go
line 504 at r5 (raw file):
Previously, sumeerbhola wrote…
We expose the seqnums from
ScanInternal
, but don't use them.The code comment for
DB.ScanInternal
does not say that it is collapsing range keys, but it must be doing so -- can you update that comment. Also, similar comment forReader.ScanInternal
.
Done. DB.ScanInternal in Pebble already talks about this , though it could be more specific when talking about range key collapsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed the rditer
commit. Looks good. Some questions about the tests, there seem to be some leftover unused data files.
pkg/kv/kvserver/rditer/testdata/TestReplicaDataIterator/r1/all/exclude-user/output
Show resolved
Hide resolved
pkg/kv/kvserver/rditer/testdata/TestReplicaDataIterator/r1/all/only/output
Outdated
Show resolved
Hide resolved
pkg/kv/kvserver/rditer/testdata/TestReplicaDataIterator/r3/replicatedOnly/skip/output
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
once @pavelkalinnikov is happy with the rditer
tests
Reviewed 1 of 61 files at r6, 5 of 66 files at r7, 5 of 61 files at r8, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @bananabrick, @itsbilal, @jbowens, and @pavelkalinnikov)
TFTRs! Addressed @pavelkalinnikov 's outstanding comments. Also rebased on #108369 to make it build which should be merged soon anyway (I'll rebase it on master once that merges) |
This change updates pkg/storage interfaces and implementations to allow the use of ScanInternal in skip-shared iteration mode as well as writing/reading of internal point keys, range dels and range keys. Replication / snapshot code will soon rely on these changes to be able to replicate internal keys in higher levels plus metadata of shared sstables in lower levels, as opposed to just observed user keys. Part of cockroachdb#103028 Epic: none Release note: None
This change adds the ability to select for just the replicated span in rditer.Select and rditer.IterateReplicaKeySpans. Also adds a new rditer.IterateReplicaKeySpansShared that does a ScanInternal on just the user key span, to be able to collect metadata of shared sstables as well as any internal keys above them. We only use skip-shared iteration for the replicated user key span of a range, and in practice, only if it's a non-system range. Part of cockroachdb#103028. Epic: none Release note: None
Verbal good-to-go from @pavelkalinnikov , so I'll merge this and rebase #105839. Thanks for the reviews y'all! bors r=sumeerbhola |
Build succeeded: |
This change contains two commits (split off from the original mega-PR, #105839). The first is a pkg/storage change to add new interface methods to call pebble's db.ScanInternal as well as implement related helper methods in sstable writers/batch readers/writers to be able to do disaggregated snapshot ingestion. The second is a kvserver/rditer change to allow finer-grained control on what replicated spans we iterate on, as well as to be able to specifically opt into skip-shared iteration over the user key span through the use of
ScanInternal
.storage: Update Engine/Reader/Writer interfaces for ScanInternal
This change updates pkg/storage interfaces and implementations to allow
the use of ScanInternal in skip-shared iteration mode as well as
writing/reading of internal point keys, range dels and range keys.
Replication / snapshot code will soon rely on these changes to
be able to replicate internal keys in higher levels plus metadata
of shared sstables in lower levels, as opposed to just observed
user keys.
Part of #103028
Epic: none
Release note: None
kvserver: Add ability to filter replicated spans in Select/Iterate
This change adds the ability to select for just the replicated
span in rditer.Select and rditer.IterateReplicaKeySpans. Also
adds a new rditer.IterateReplicaKeySpansShared that does a
ScanInternal on just the user key span, to be able to collect
metadata of shared sstables as well as any internal keys above
them.
We only use skip-shared iteration for the replicated user key span
of a range, and in practice, only if it's a non-system range.
Part of #103028.
Epic: none
Release note: None