db: add infrastructure to read from virtual sstables #2288
Conversation
Not ready for review. Copying code over from another branch.
Left some minor comments (I know it's not ready for review)
Reviewable status: 0 of 17 files reviewed, 4 unresolved discussions (waiting on @bananabrick)
internal/manifest/version.go
line 116 at r1 (raw file):
} // PhysicalFileMeta is used by functions which want a guarantee that their input
[nit] This abstraction seems a bit strange. The functions could simply check IsVirtual. We could add convenience methods FileMetadata.AssertVirtual() and AssertNotVirtual().
internal/manifest/version.go
line 132 at r1 (raw file):
func NewPhysicalMeta(meta *FileMetadata) PhysicalFileMeta { if meta.IsVirtual() { panic("pebble: invalid function call")
[nit] the message here is not useful at all, maybe pebble: expected virtual file
internal/manifest/version.go
line 179 at r1 (raw file):
// sstables, it is upto the creator of the virtual sstable to set refs to // the appropriate value. refs *int32
It seems like it would be a good idea to separate this into a struct, and move all the atomic code into that struct.
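(A minimal sketch of that suggestion, with hypothetical names; not the PR's code.)
package manifest

import "sync/atomic"

// refCounter would own the ref count and all atomic operations on it, so the
// rest of FileMetadata never touches sync/atomic directly.
type refCounter struct {
	val int32
}

func (r *refCounter) acquire() int32 { return atomic.AddInt32(&r.val, 1) }
func (r *refCounter) release() int32 { return atomic.AddInt32(&r.val, -1) }
func (r *refCounter) count() int32   { return atomic.LoadInt32(&r.val) }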
internal/manifest/version.go
line 284 at r1 (raw file):
// IsVirtual returns true if the FileMetadata belongs to a virtual sstable. func (m *FileMetadata) IsVirtual() bool { // TODO(bananabrick): Might want to get rid of PhysicalSSTNum.
What's the alternative?
This is almost complete. We have the FileMetadata changes, refcounting for virtual sstables, the VirtualReader API, table cache changes, and 90% of the sstable iterator changes (there are still some cases around SetBounds/bounds set by the caller of NewIterWithBlockPropertyFilters which I'm working on).
Remaining work:
- More testing around the iterator.
- Test which reads from virtual sstables through the table cache (this will test the entire implementation).
- ve encoding/decoding.
Reviewable status: 0 of 28 files reviewed, 6 unresolved discussions (waiting on @RaduBerinde)
checkpoint.go
line 267 at r3 (raw file):
for f := iter.First(); f != nil; f = iter.Next() { if f.IsVirtual() { panic("pebble: checkpointing not yet supported for virtual sstables")
Will support in a separate PR. Want to use this PR to unblock further disaggregated storage development.
version_set.go
line 89 at r3 (raw file):
// on the creation of every version. obsoleteFn func(obsolete []physicalMeta) obsoleteTables []physicalMeta
@RaduBerinde This physicalMeta abstraction is unusual, but it's been quite useful, and it's also easy to understand.
For example, I turned obsoleteTables []*manifest.FileMetadata -> obsoleteTables []physicalMeta, and then I started fixing compiler errors, and by the time I was done, I had refcounting working.
We could also have cases where we have some functions A, B, C where A only works on virtual sstables, B works on both, and C only works on virtual sstables. And we have A calling C, and B calling C, and C can also be called independently. Imo, it's cleaner to constrain the input using a type which indicates that the file is a virtual sstable rather than have assertions in all 3 functions.
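(A rough sketch of the pattern being argued for, assuming the package's FileMetadata type; names are illustrative, not the PR's exact code.)
// virtualMeta can only be constructed from a virtual sstable, so a function
// that takes it (function C above) is statically guaranteed a virtual file.
type virtualMeta struct {
	meta *FileMetadata
}

func newVirtualMeta(m *FileMetadata) virtualMeta {
	if !m.IsVirtual() {
		panic("pebble: expected a virtual sstable")
	}
	return virtualMeta{meta: m}
}

// funcC only accepts virtual sstables; no assertion needed in its body.
func funcC(vm virtualMeta) { _ = vm.meta }

// funcB works on both kinds of files and converts before calling funcC.
func funcB(m *FileMetadata) {
	if m.IsVirtual() {
		funcC(newVirtualMeta(m))
		return
	}
	// ... handle the physical case ...
}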
internal/manifest/version.go
line 116 at r1 (raw file):
Previously, RaduBerinde wrote…
[nit] This abstraction seems a bit strange. The functions could simply check IsVirtual. We could add convenience methods FileMetadata.AssertVirtual() and AssertNotVirtual().
Commented above why I think it's useful.
internal/manifest/version.go
line 179 at r1 (raw file):
Previously, RaduBerinde wrote…
It seems like it would be a good idea to separate this into a struct, and move all the atomic code into that struct.
We have an Atomic struct at the top of the FileMetadata. Not sure why this isn't a part of it, since I believe moving it will also decrease the overall size of the struct.
internal/manifest/version.go
line 284 at r1 (raw file):
Previously, RaduBerinde wrote…
What's the alternative?
Implemented a VirtualSSTState, since there's other state virtual sstables require. It's also optional, so I made it a pointer to avoid size bloat in the FileMetadata.
Note that I'll be moving version edit encoding/decoding into a separate PR (it's required for persisting virtual sstables to disk, but not for testing virtual sstables without Pebble restarts), so all that's left here is more testing around the iterators.
Reviewable status: 0 of 28 files reviewed, 6 unresolved discussions (waiting on @RaduBerinde)
I'm working on 3 things in this PR:
- Data driven tests which create virtual sstables -> write them to the manifest -> open iterators on the DB, and perform reads.
- Randomized test which does the same as 1.
- Verifying some invariants on how the sstable iterators are used by top level iterators. (Want to make sure the truncation logic/bounds passed by top level iterators into sstable iterators is correct).
All that said, this should be ready for a look. The code in reader.go
has the bulk of the complicated logic and deserves the most scrutiny.
Reviewable status: 0 of 28 files reviewed, 7 unresolved discussions (waiting on @itsbilal and @RaduBerinde)
sstable/reader.go
line 1002 at r5 (raw file):
// SeekLT implements internalIterator.SeekLT, as documented in the pebble // package. Note that SeekLT only checks the lower bound. It is up to the // caller to ensure that key is less than the upper bound.
I found this invariant ("Note that SeekLT only checks the lower bound. It is up to the caller to ensure that key is less than the upper bound.") unnecessary. It seems to be the case that, if the upper bound is exclusive, then the top level iterators just pass in the exclusive upper bound here, which seems to do the right thing, and that is also what my tests indicate.
Maybe the invariant should be: key should be <= the upper bound if the upper bound is exclusive?
Reviewable status: 0 of 28 files reviewed, 7 unresolved discussions (waiting on @itsbilal and @RaduBerinde)
sstable/reader.go
line 1002 at r5 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
I found this invariant ("Note that SeekLT only checks the lower bound. It is up to the caller to ensure that key is less than the upper bound.") unnecessary. It seems to be the case that, if the upper bound is exclusive, then the top level iterators just pass in the exclusive upper bound here, which seems to do the right thing, and that is also what my tests indicate. Maybe the invariant should be: key should be <= the upper bound if the upper bound is exclusive?
This also seems to be the case from the comment on the Last function:
// Last implements internalIterator.Last, as documented in the pebble
// package. Note that Last only checks the lower bound. It is up to the caller
// to ensure that key is less than the upper bound (e.g. via a call to
// SeekLT(upper))
It indicates that SeekLT(upper) should be called, but upper is not less than the upper bound.
Reviewable status: 0 of 28 files reviewed, 8 unresolved discussions (waiting on @itsbilal and @RaduBerinde)
sstable/reader.go
line 993 at r5 (raw file):
// Seek to the first internal key ikey, _ := i.SeekGE(i.upper, base.SeekGEFlagsNone)
We could implement a variation of the Last function which checks upper, so that we don't have to perform these SeekGEs and Nexts. But in terms of correctness, this should be okay.
Reviewable status: 0 of 28 files reviewed, 5 unresolved discussions (waiting on @bananabrick and @itsbilal)
version_set.go
line 89 at r3 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
@RaduBerinde This physicalMeta abstraction is unusual, but it's been quite useful, and it's also easy to understand. For example, I turned obsoleteTables []*manifest.FileMetadata -> obsoleteTables []physicalMeta, and then I started fixing compiler errors, and by the time I was done fixing compiler errors, I had refcounting working. We could also have cases where we have some functions A, B, C where A only works on virtual sstables, B works on both, and C only works on virtual sstables. And we have A calling C, and B calling C, and C can also be called independently. Imo, it's cleaner to constrain the input using a type which indicates that the file is a virtual sstable rather than have assertions in all 3 functions.
Interesting, I see.
sstable/reader.go
line 2776 at r5 (raw file):
} // VirtualReader wraps Reader. Its purpose is to restrict functionality of the
Consider making this a "virtual capable reader" rather than a "virtual only reader". If the meta is not virtual, it would just pass through to the Reader (which maybe should become an unexported physicalReader). Then various users would only need VirtualReader. Just a suggestion, I am not sure it would make things better.
I did a very rough pass over the manifest changes
Reviewed 2 of 28 files at r3.
Reviewable status: 2 of 28 files reviewed, 13 unresolved discussions (waiting on @bananabrick and @itsbilal)
version_set.go
line 95 at r5 (raw file):
// Zombie tables are tables which are no longer referenced directly by the // latest version. The files in the zombieTables list may be referenced by // an older Version, or may be refereced by virtual sstables in the latest
nit: referenced
version_set.go
line 96 at r5 (raw file):
// latest version. The files in the zombieTables list may be referenced by // an older Version, or may be refereced by virtual sstables in the latest // Version, or an older Version.
This is changing the meaning of zombie tables which were supposed to track tables that were not referenced in the latest version, so that we could realize that there was something like a long-lived Iterator that was holding onto an older version.
We had discussed a way to preserve this definition in the RFC PR, which was written in the RFC as:
type PhysicalState struct {
// Total refs across all virtual ssts * versions. That is, if the same virtual
// sst is present in multiple versions, it may have multiple refs, if the
// btree node is not the same.
totalRefs int32
// Number of virtual ssts in the latest version that refer to this physical
// SST. Will be 1 if there is only a physical sst, or there is only 1 virtual
// sst referencing this physical sst.
// INVARIANT: refsInLatestVersion <= totalRefs
// refsInLatestVersion == 0 is a zombie sstable.
refsInLatestVersion int32
I don't see the corresponding code.
Am I missing something?
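(For reference, the zombie check implied by the quoted invariant would be roughly the following; a sketch, ignoring synchronization, and not code from the PR or the RFC.)
// A physical sstable is a zombie when it is still referenced somewhere (an
// older Version, or virtual ssts in older Versions) but is no longer
// reachable from the latest Version.
func (s *PhysicalState) isZombie() bool {
	return s.refsInLatestVersion == 0 && s.totalRefs > 0
}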
internal/manifest/level_metadata.go
line 474 at r5 (raw file):
// make sure that the caller's usecase also works with virtual sstables. An // example of this is DB checkpointing, which will not work with virtual // sstables.
are you saying "currently will not work", but we will fix it to work?
internal/manifest/version.go
line 40 at r5 (raw file):
Largest InternalKey // SmallestSeqNum is the smallest sequence number in the table. SmallestSeqNum uint64
Were these removed because they are not needed?
IIRC, we pass this struct to users of Pebble via EventListener.
internal/manifest/version.go
line 216 at r5 (raw file):
// File creation time in seconds since the epoch (1970-01-01 00:00:00 // UTC). For ingested sstables, this corresponds to the time the file was // ingested.
Worth clarifying what CreationTime means for a virtual sst.
internal/manifest/version.go
line 245 at r5 (raw file):
// // Make sure to set Atomic.statsValid to 1 and set the table stats upon // virtual sstable creation, as asynchrnously computing these isn't
nit: asynchronously
internal/manifest/version.go
line 250 at r5 (raw file):
// TODO(bananabrick): TableStats are almost exclusively used for compaction // picking. Those heuristics make sense for physical sstables, but don't // necessarily make sense for virtual sstables. When implementing additional
The heuristic for picking the file in the level to compact is relevant even for virtual ssts. Maybe I am missing something.
The RFC says "TableStats will either be computed upon virtual sstable creation using linear interpolation on the block counts of the virtual/physical sstables or asynchronously using the file bounds of the virtual sstable."
internal/manifest/version_edit.go
line 109 at r5 (raw file):
// found that there was no overlapping file at the higher level). DeletedFiles map[DeletedFileEntry]*FileMetadata NewFiles []NewFileEntry
It is not clear to me how this works for virtual ssts. Unlike physical ssts, where the compaction could be explicit about what it had deleted, because there was only 1 ref in that version, now there can be multiple concurrent compactions that could have as input sstables vf1 and vf2 respectively, where both vf1 and vf2 refer to physical table pt. Neither compaction knows that it will remove the last reference to pt because they are racing with each other. So the version edit produced by the compaction is slightly incomplete. But when deciding to apply an edit, one can use the current state of the latest version to make this version edit complete, and then apply the edit.
Looking directionally good! Some comments around better handling of stuff in the table cache, as well as the exclusive sentinel cases.
Reviewed 11 of 28 files at r3, 1 of 2 files at r5, all commit messages.
Reviewable status: 14 of 28 files reviewed, 21 unresolved discussions (waiting on @bananabrick, @RaduBerinde, and @sumeerbhola)
table_cache.go
line 179 at r5 (raw file):
// withReader fetches an sstable Reader associated with a physical sstable // present on disk. func (c *tableCacheContainer) withReader(meta physicalMeta, fn func(*sstable.Reader) error) error {
This one function should operate on both metadatas, and we should wrap the physical reader with a virtual reader if the file is virtual. It might make sense to have an sstable.Reader interface, and physical/virtual reader implementations. That way we:
- only deal with one kind of reader up the stack, the interface type
- the table cache only stores physical readers, better matching the 1 table cache entry = 1 open file descriptor semantic we have going right now. We'd probably need to add logic for this in findNode and getShard so the table cache only operates on physical files.
- also allow virtual sstables to reuse the same open file descriptor from the physical sstable
We'd probably need to rethink the table cache to open FD mapping for shared.Storage sstables, but for now this makes the most sense and avoids running out of table cache room if we spawn too many virtual sstables.
internal/manifest/version.go
line 306 at r5 (raw file):
// sstable is created to verify that the fields of the virtual sstable are // sound. func (m *FileMetadata) ValidateVirtualSSTable(parent *FileMetadata) {
nit: ValidateVirtual
sstable/reader.go
line 490 at r5 (raw file):
return false } return (!upperInclusive && i.cmp(key, upper) >= 0) || i.cmp(key, upper) > 0
nit: can avoid two comparisons by saving the value of i.cmp before this line. Also I'd prefer to inline this everywhere.
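(For reference, a single-comparison form of that check, using the keyOutOfBounds signature quoted later at r13; it mirrors the quoted condition and is only a sketch.)
func (i *singleLevelIterator) keyOutOfBounds(key, upper []byte, upperInclusive bool) bool {
	// Compare once and branch on the result instead of calling i.cmp twice.
	c := i.cmp(key, upper)
	return c > 0 || (!upperInclusive && c == 0)
}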
sstable/reader.go
line 993 at r5 (raw file):
// Seek to the first internal key ikey, _ := i.SeekGE(i.upper, base.SeekGEFlagsNone)
Currently this isn't handling the case where i.upper is referencing an exclusive sentinel key as the virtual sstable upper bound. We'd want to either ensure virtualLast is not called if the virtual sst ends in an exclusive sentinel key (and instead we'd want to call SeekLT(i.upper)), or we'd need another bool to denote this case and handle it in virtualLast.
sstable/reader.go
line 1020 at r5 (raw file):
} } else if cmp > 0 { // This is a bit subtle, but if the end key is exclusive, then the
This won't always be true. The virtual sst could be truncated at a rangedel/rangekey exclusive sentinel key, in which case i.endKeyInclusive would be false and we'd want to virtualLast regardless.
sstable/reader.go
line 2200 at r5 (raw file):
// Seek to the first internal key ikey, _ := i.SeekGE(i.upper, base.SeekGEFlagsNone)
Same issue as above
sstable/reader.go
line 2776 at r5 (raw file):
Previously, RaduBerinde wrote…
Consider making this a "virtual capable reader" rather than a "virtual only reader". If the meta is not virtual, it would just pass through to the Reader (which maybe should become an unexported physicalReader). Then various users would only need VirtualReader. Just a suggestion, I am not sure it would make things better.
I second this suggestion. It is an anti-pattern to have iterators hold onto two readers, only one of which is actually doing the reading. We can replace all the i.vReader != nil with i.reader.IsVirtual().
sstable/reader.go
line 2893 at r5 (raw file):
// If we don't shrink the end key to fit within v.upper(end key is already // <= v.upper), then the exclusivity status of the key does not change. lastKeyInclusive = currEndInclusive
lastKeyInclusive = currEndInclusive && !(cmp(end, v.upper.UserKey) == 0 && v.upper.IsExclusiveSentinel())
sstable/reader.go
line 2897 at r5 (raw file):
// The new end key(last) is set to v.upper, but this new key should // be inclusive because the prev end key(end) was greater than this key. lastKeyInclusive = true
lastKeyInclusive = !v.upper.IsExclusiveSentinel()
Reviewed 1 of 28 files at r3, all commit messages.
Reviewable status: 14 of 28 files reviewed, 27 unresolved discussions (waiting on @bananabrick, @itsbilal, @RaduBerinde, and @sumeerbhola)
compaction.go
line 1236 at r5 (raw file):
rangeDelIter = keyspan.Truncate( c.cmp, rangeDelIter, lowerBound.UserKey, upperBound.UserKey, &f.Smallest, &f.Largest, false,
nit: bool arguments like this are hard to understand when reading. an inline comment helps, eg false /* panicOnTruncation */
internal/manifest/version.go
line 130 at r5 (raw file):
// NewPhysicalMeta should be the only source of creating the PhysicalFileMeta // wrapper type. func NewPhysicalMeta(meta *FileMetadata) PhysicalFileMeta {
nit: you might consider exposing methods:
func (f *FileMetadata) AsPhysicalFile() (PhysicalFileMetadata, bool)
for use in call sites that just want to do something with the physical/virtual file if it is one. eg,
if pm, ok := f.AsPhysicalFile(); ok {
// do physical file things with `pm`
}
internal/manifest/version.go
line 139 at r5 (raw file):
} // NewVirtualMeta should the only source of creating the NewVirtualMeta wrapper
nit: s/should the/should be/
internal/manifest/version.go
line 161 at r5 (raw file):
PhysicalSSTSize uint64 // Refs is a pointer to the parent sstable Refs pointer.
nit: parent sstable ref count? "Refs pointer" sounds off because refs itself isn't a pointer and it was moved from FileMetadata.Refs to FileMetadata.atomic.refs.
internal/manifest/version.go
line 250 at r5 (raw file):
Previously, sumeerbhola wrote…
The heuristic for picking the file in the level to compact is relevant even for virtual ssts. Maybe I am missing something.
The RFC says "TableStats will either be computed upon virtual sstable creation using linear interpolation on the block counts of the virtual/physical sstables or asynchronously using the file bounds of the virtual sstable."
Some combination of the two could make sense. RangeDeletionsBytesEstimate is always computed asynchronously if the file contains any range deletions, and PointDeletionsBytesEstimate is computed asynchronously if a table contains 10% or more tombstones.
I could imagine linearly interpolating the properties, and then calculating both the compensation estimates asynchronously just like physical sstables.
Speaking of compaction heuristics, are we worried about space amplification from the physical sstables underlying virtual sstables? It seems like we should have some kind of simple heuristic to stave off pathological cases.
One simple compaction heuristic suggestion is to maintain a VirtualizedSize on physical file metadata, which represents the sum of the virtual size estimates of all virtual files backed by the physical file. Whenever a virtual file becomes obsolete, the virtual file size estimate is removed from the physical file's VirtualizedSize. When compaction-picking is picking a file and examines a virtual file, compensate the file size by increasing it by (PhysicalFile.Size - PhysicalFile.VirtualizedSize) / PhysicalFile.Refs. The intuition is that every virtual file gets its compaction priority boosted by a share of the currently unused bytes of the physical file.
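(A sketch of that compensation with hypothetical names; not code from the PR.)
// compensatedVirtualSize boosts a virtual file's size by its share of the
// bytes in the backing physical file that no virtual file accounts for.
func compensatedVirtualSize(virtualSize, physicalSize, virtualizedSize uint64, refs int32) uint64 {
	if refs <= 0 || physicalSize <= virtualizedSize {
		return virtualSize
	}
	unused := physicalSize - virtualizedSize
	return virtualSize + unused/uint64(refs)
}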
internal/manifest/version.go
line 330 at r5 (raw file):
// RefPtr returns the pointer which should used to reference count a // FileMetadata. func (m *FileMetadata) RefPtr() *int32 {
imo it would be a little cleaner to expose Ref() int32, Unref() int32, and Refs() int32 methods like we do on *Version, and let how we ref track be an implementation detail.
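(Something along these lines, built on the RefPtr() accessor quoted above; a sketch, not the PR's code.)
func (m *FileMetadata) Ref() int32   { return atomic.AddInt32(m.RefPtr(), 1) }
func (m *FileMetadata) Unref() int32 { return atomic.AddInt32(m.RefPtr(), -1) }
func (m *FileMetadata) Refs() int32  { return atomic.LoadInt32(m.RefPtr()) }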
internal/manifest/version.go
line 1042 at r5 (raw file):
for _, f := range obsolete { if !f.IsVirtual() { physicalFiles = append(physicalFiles, NewPhysicalMeta(f))
should we make the callee (eg, versionSet.obsoleteFn) filter out virtual files? It would a) avoid the duplication between here and UnrefLocked and b) avoid unnecessarily materializing an ephemeral slice of physical obsolete sstables here.
If we want to provide the PhysicalFileMeta type safety for code outside of internal/manifest, we could expose a manifest.ForAllPhysicalFiles([]FileMetadata, func(PhysicalFileMeta)) method or the like
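(A sketch of the suggested helper; the signature is approximated from the comment above and is not an existing pebble API.)
// ForAllPhysicalFiles invokes fn for every non-virtual file, handing it the
// type-safe PhysicalFileMeta wrapper.
func ForAllPhysicalFiles(files []*FileMetadata, fn func(PhysicalFileMeta)) {
	for _, f := range files {
		if f.IsVirtual() {
			continue
		}
		fn(NewPhysicalMeta(f))
	}
}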
Reviewable status: 14 of 28 files reviewed, 27 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @RaduBerinde)
internal/manifest/version.go
line 250 at r5 (raw file):
One simple compaction heuristic suggestion is to maintain a VirtualizedSize on physical file metadata, which represents the sum of the virtual size estimates of all virtual files backed by the physical file. Whenever a virtual file becomes obsolete, the virtual file size estimate is removed from the physical file's VirtualizedSize.
+1 to this. I am doing something similar in my prototyping of blob files to decide when to rewrite them.
// LiveValueSize is the sum of the length of uncompressed values in this
// blob file that are still live (i.e., referred to by sstables in the
// latest version).
We only want the virtual files in the latest version contributing to this. And this is in-memory state (not written in a version edit), since it can be reconstructed based on the references.
Reviewed 1 of 28 files at r3.
Reviewable status: 15 of 28 files reviewed, 28 unresolved discussions (waiting on @bananabrick, @itsbilal, @RaduBerinde, and @sumeerbhola)
sstable/reader.go
line 902 at r5 (raw file):
if i.vReader != nil { // Callers of SeekPrefixGE aren't aware of virtual sstable bounds, so // we may have to internally restrict the bounds.
why is that? I would've thought the leveliter would know the bounds since it has the corresponding fileMetadata.
Reviewable status: 15 of 28 files reviewed, 28 unresolved discussions (waiting on @bananabrick, @itsbilal, @RaduBerinde, and @sumeerbhola)
sstable/reader.go
line 1002 at r5 (raw file):
I might be misinterpreting, but I think you're saying the same thing as the comment? Implementations of SeekLT do not need to check the upper bound and can assume that the provided seek key is less than the upper bound. They only need to check the lower bound, since we're iterating in the backward direction and may hit the lower bound.
It seems to be the case that, if the upper bound is exclusive, then the top level iterators just pass in the exclusive upper bound here
I believe this is what makes this invariant true, no?
Reviewable status: 15 of 28 files reviewed, 28 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)
sstable/reader.go
line 902 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
why is that? I would've thought the leveliter would know the bounds since it has the corresponding fileMetadata.
Do we already have an invariant in the code that the level iter will not pass in bounds outside file metadata bounds to the sstable iterators? I couldn't confirm this invariant, so instead I guarded against that case here.
Reviewable status: 15 of 28 files reviewed, 28 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)
sstable/reader.go
line 902 at r5 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
Do we already have an invariant in the code that the level iter will not pass in bounds outside file metadata bounds to the sstable iterators? I couldn't confirm this invariant, so instead I guarded against that case here.
If the invariant is not already present, maybe we can try and ensure it in the level iter, but I believe that has its own complexities which @itsbilal ran into. I didn't try that method, so I don't know what the complexities will be.
Reviewable status: 15 of 28 files reviewed, 28 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @RaduBerinde)
internal/manifest/version_edit.go
line 109 at r5 (raw file):
Previously, sumeerbhola wrote…
It is not clear to me how this works for virtual ssts. Unlike physical ssts, where the compaction could be explicit about what it had deleted, because there was only 1 ref in that version, now there can be multiple concurrent compactions that could have as input sstables vf1 and vf2 respectively, where both vf1 and vf2 refer to physical table pt. Neither compaction knows that it will remove the last reference to pt because they are racing with each other. So the version edit produced by the compaction is slightly incomplete. But when deciding to apply an edit, one can use the current state of the latest version to make this version edit complete, and then apply the edit.
Adding some stuff I mentioned in a slack chat for the sake of other reviewers:
A single version edit should be explicit about representing (a) s1 is being logically removed from a level, so it does not need to be represented in that level metadata and (b) s1 is no longer needed by the latest version. We are explicit about (a) and (b) now -- the edit has both added and deleted files, and if f \in deleted-added it is case (b) and if f \in added \intersection deleted it is case (a). Explicitly representing case (b) is important for debuggability, so that we know in the system's history when the system transitioned something to zombie.
Meeting this goal with virtual ssts requires some format changes for the version edit. Also needs some changes for how version edits are constructed by compactions.
Reviewable status: 15 of 28 files reviewed, 28 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @RaduBerinde)
internal/manifest/version_edit.go
line 109 at r5 (raw file):
Previously, sumeerbhola wrote…
Adding some stuff I mentioned in a slack chat for the sake of other reviewers:
A single version edit should be explicit about representing (a) s1 is being logically removed from a level, so it does not need to be represented in that level metadata and (b) s1 is no longer needed by the latest version. We are explicit about (a) and (b) now -- the edit has both added and deleted files, and if f \in deleted-added it is case (b) and if f \in added \intersection deleted it is case (a). Explicitly representing case (b) is important for debuggability, so that we know in the system's history when the system transitioned something to zombie.
Meeting this goal with virtual ssts requires some format changes for the version edit. Also needs some changes for how version edits are constructed by compactions.
Another reason to do that is when writing a snapshot of the latest version one should write the FileMetadata for the physical sst that is referred to by n virtual ssts, and write that physical sst's FileMetadata once. This physical sst no longer has a level, since it can be referenced by virtual ssts in different levels.
I would prefer to sequence this work as roughly the following sequence of PRs for review
- The in-memory and persistent "data-structure" representing virtual ssts: The file metadata and manifest changes, including the encoding/decoding. We need to get this properly nailed down including the version edit and application interfaces that compactions and ingests will use. Of course, there will be no virtual ssts in a real DB yet.
- The read path: the iterator changes for reading a virtual sst when encountered by a compaction or a user-facing read.
- The write path: the ingestion changes that will introduce virtual ssts into a DB and allow for end-to-end testing (including the metamorphic test).
Yes, this is correct. Unless I'm misreading what you're saying. The levelIter will clear a bound if the file metadata says that it doesn't need to be checked; if you have a file metadata
The complexities I ran into were around changing
It seems more intuitive to have the sstable iterators enforce virtual sstable bounds. So I generally like the approach in this PR (minus
As for the breakdown Sumeer suggested, I also think it'd be a good idea to split this PR up. However, seeing as the iter pieces are already here, we could do one PR with just the file metadata changes, one with just the iter (+ file metadata commit) which would basically be this PR, one with version edit (+ file metadata commit), and one with the ingestion write path.
Thanks for the comments. I like the idea of splitting it up.
Reviewable status: 15 of 28 files reviewed, 28 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)
I'm wary of duplicative key comparisons here. A Seek[Prefix]GE already needs to seek among the sstable boundaries (including the virtual bounds) in order to find the table to read. Our code is already structured to enforce iterator boundaries within the sstable iterator if necessary. The virtual sstable boundaries can be thought of as further constraining those boundaries within the context of the individual sstable. I think structuring the code to maintain a single set of "effective" bounds, and only letting virtual bounds constrain those effective bounds, is simpler.
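(A sketch of the "effective bounds" idea with a hypothetical helper, not the PR's code: the iterator keeps a single pair of bounds, and the virtual sstable bounds only ever tighten them.)
func constrainBounds(lower, upper, vLower, vUpper []byte, cmp func(a, b []byte) int) (effLower, effUpper []byte) {
	effLower, effUpper = lower, upper
	// A nil bound means unbounded, so any virtual bound tightens it.
	if vLower != nil && (effLower == nil || cmp(vLower, effLower) > 0) {
		effLower = vLower
	}
	if vUpper != nil && (effUpper == nil || cmp(vUpper, effUpper) < 0) {
		effUpper = vUpper
	}
	return effLower, effUpper
}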
That's a good point, key comparisons can be pretty inefficient. In that case I'm good with consolidating this logic higher up, i.e. in the levelIter around or inside
Need to respond to some review here + add some tests.
Reviewable status: 1 of 34 files reviewed, 18 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)
table_cache.go
line 179 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
This one function should operate on both metadatas, and we should wrap the physical reader with a virtual reader if the file is virtual. It might make sense to have an sstable.Reader interface, and physical/virtual reader implementations. That way we:
- only deal with one kind of reader up the stack, the interface type
- the table cache only stores physical readers, better matching the 1 table cache entry = 1 open file descriptor semantic we have going right now. We'd probably need to add logic for this in findNode and getShard so the table cache only operates on physical files.
- also allow virtual sstables to reuse the same open file descriptor from the physical sstable
We'd probably need to rethink the table cache to open FD mapping for shared.Storage sstables, but for now this makes the most sense and avoids running out of table cache room if we spawn too many virtual sstables.
I like the fact that the VirtualReader and the physical reader are separate.
version_set.go
line 96 at r5 (raw file):
Previously, sumeerbhola wrote…
This is changing the meaning of zombie tables which were supposed to track tables that were not referenced in the latest version, so that we could realize that there was something like a long-lived Iterator that was holding onto an older version.
We had discussed a way to preserve this definition in the RFC PR, which was written in the RFC as:
type PhysicalState struct {
// Total refs across all virtual ssts * versions. That is, if the same virtual
// sst is present in multiple versions, it may have multiple refs, if the
// btree node is not the same.
totalRefs int32
// Number of virtual ssts in the latest version that refer to this physical
// SST. Will be 1 if there is only a physical sst, or there is only 1 virtual
// sst referencing this physical sst.
// INVARIANT: refsInLatestVersion <= totalRefs
// refsInLatestVersion == 0 is a zombie sstable.
refsInLatestVersion int32
I don't see the corresponding code.
Am I missing something?
Done.
internal/manifest/level_metadata.go
line 474 at r5 (raw file):
Previously, sumeerbhola wrote…
are you saying "currently will not work", but we will fix it to work?
Done.
internal/manifest/version.go
line 40 at r5 (raw file):
Previously, sumeerbhola wrote…
Were these removed because they are not needed?
IIRC, we pass this struct to users of Pebble via EventListener.
Done.
Code quote:
// SmallestSeqNum is the smallest sequence number in the table.
SmallestSeqNum uint64
// LargestSeqNum is the largest sequence number in the table.
LargestSeqNum uint64
internal/manifest/version.go
line 216 at r5 (raw file):
Previously, sumeerbhola wrote…
Worth clarifying what CreationTime means for a virtual sst.
Done.
internal/manifest/version.go
line 250 at r5 (raw file):
Previously, sumeerbhola wrote…
One simple compaction heuristic suggestion is to maintain a VirtualizedSize on physical file metadata, which represents the sum of the virtual size estimates of all virtual files backed by the physical file. Whenever a virtual file becomes obsolete, the virtual file size estimate is removed from the physical file's VirtualizedSize.
+1 to this. I am doing something similar in my prototyping of blob files to decide when to rewrite them.
// LiveValueSize is the sum of the length of uncompressed values in this
// blob file that are still live (i.e., referred to by sstables in the
// latest version).
We only want the virtual files in the latest version contributing to this. And this is in-memory state (not written in a version edit), since it can be reconstructed based on the references.
Done.
internal/manifest/version.go
line 330 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
imo it would be a little cleaner to expose Ref() int32, Unref() int32, and Refs() int32 methods like we do on *Version, and let how we ref track be an implementation detail.
Done.
internal/manifest/version.go
line 1042 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
should we make the callee (eg, versionSet.obsoleteFn) filter out virtual files? It would a) avoid the duplication between here and UnrefLocked and b) avoid unnecessarily materializing an ephemeral slice of physical obsolete sstables here. If we want to provide the PhysicalFileMeta type safety for code outside of internal/manifest, we could expose a manifest.ForAllPhysicalFiles([]FileMetadata, func(PhysicalFileMeta)) method or the like
Done.
internal/manifest/version_edit.go
line 109 at r5 (raw file):
Previously, sumeerbhola wrote…
Another reason to do that is when writing a snapshot of the latest version one should write the FileMetadata for the physical sst that is referred to by n virtual ssts, and write that physical sst's FileMetadata once. This physical sst no longer has a level, since it can be referenced by virtual ssts in different levels.
I would prefer to sequence this work as roughly the following sequence of PRs for review
- The in-memory and persistent "data-structure" representing virtual ssts: The file metadata and manifest changes, including the encoding/decoding. We need to get this properly nailed down including the version edit and application interfaces that compactions and ingests will use. Of course, there will be no virtual ssts in a real DB yet.
- The read path: the iterator changes for reading a virtual sst when encountered by a compaction or a user-facing read.
- The write path: the ingestion changes that will introduce virtual ssts into a DB and allow for end-to-end testing (including the metamorphic test).
Done.
Reviewable status: 1 of 34 files reviewed, 18 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)
table_cache.go
line 179 at r5 (raw file):
I like the fact that the VirtualReader and the physical reader are separate.
This was accidentally published before I added additional context.
Think there are test failures due to a merge conflict since the tests don't fail locally. Looking.
Thanks for the previous review. Meandered quite a bit before coming back to this, but it's back on track now.
Reviewable status: 0 of 36 files reviewed, 17 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)
table_cache.go
line 179 at r5 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
I like the fact that the VirtualReader and the physical reader are separate.
This was accidentally published before I added additional context.
I implemented this here: #2454, and left a comment on why I think we shouldn't use interfaces.
sstable/reader.go
line 490 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
nit: can avoid two comparisons by saving the value of i.cmp before this line. Also I'd prefer to inline this everywhere.
Done.
sstable/reader.go
line 993 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Currently this isn't handling the case where i.upper is referencing an exclusive sentinel key as the virtual sstable upper bound. We'd want to either ensure virtualLast is not called if the virtual sst ends in an exclusive sentinel key (and instead we'd want to call SeekLT(i.upper)), or we'd need another bool to denote this case and handle it in virtualLast.
Done.
sstable/reader.go
line 1020 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
This won't always be true. The virtual sst could be truncated at a rangedel/rangekey exclusive sentinel key, in which case i.endKeyInclusive would be false and we'd want to virtualLast regardless.
Done.
sstable/reader.go
line 2200 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Same issue as above
Done.
sstable/reader.go
line 2776 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
I second this suggestion. It is an anti-pattern to have iterators hold onto two readers, only one of which is actually doing the reading. We can replace all the i.vReader != nil with i.reader.IsVirtual().
Imo, making a virtual capable reader doesn't make the code cleaner. The vast majority of places only need a Reader. The virtual reader (which is only a lightweight wrapper) only comes into play in the table cache to create iterators, and when tableCacheContainer.withVirtualReader is called.
The iterator wasn't really holding on to two readers. The iterator was making all reads through the Reader. To clarify this, I've removed the virtual reader from the iterator.
sstable/reader.go
line 2893 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
lastKeyInclusive = currEndInclusive && !(cmp(end, v.upper.UserKey) == 0 && v.upper.IsExclusiveSentinel())
Done.
sstable/reader.go
line 2897 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
lastKeyInclusive = !v.upper.IsExclusiveSentinel()
Done.
Reviewable status: 0 of 36 files reviewed, 18 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)
sstable/reader.go
line 1019 at r11 (raw file):
) (*InternalKey, base.LazyValue) { if i.vState != nil { // Might have to fix upper bound since virtual sstable bounds are not
Imo, this sort of bounds constraining makes sense outside of the level iter, because users of VirtualReaders should be able to create iterators and have the bounds constrained. That said, if this isn't performant for the level iter case, then we can ensure additional invariants from the level iter.
Looked through most of this, and it's looking very good. Splitting up the PR made this much easier.
Reviewed 3 of 28 files at r3, 1 of 25 files at r6, 5 of 26 files at r8, 14 of 21 files at r9, 1 of 6 files at r10, 7 of 10 files at r11, all commit messages.
Reviewable status: 31 of 36 files reviewed, 26 unresolved discussions (waiting on @bananabrick, @jbowens, @RaduBerinde, and @sumeerbhola)
table_cache.go
line 179 at r5 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
I implemented this here: #2454, and left a comment on why I think we shouldn't use interfaces.
Makes sense. I retract this comment as well in light of the wider conversation since. sorry for sending you down this rabbit hole!
table_cache.go
line 388 at r11 (raw file):
} var virtualReader sstable.VirtualReader
I think a good way to structure this code, instead of having repeated function calls that look exactly the same and if file.Virtual conditionals, is:
// this interface definition can even live inside the method
type iterCreator interface {
NewRawRangeDelIter() keyspan.Iterator
NewCompactionIter(....) ...
NewIterWithBlockPropertyFiltersAndContext(...)
}
var ic iterCreator
ic = v.reader
if file.Virtual {
ic = sstable.NewVirtualReader(...)
}
// just use ic instead of v.reader
table_cache.go
line 576 at r11 (raw file):
// Note that even if rp.file is a virtual sstable, it should be okay to // return an sstable.Reader, because the reader in this case is only used
I would move this to the comment for GetReader, i.e. a callout that "It's the caller's responsibility to not read outside virtual sstable bounds if this reader belongs to a virtual sstable".
table_cache.go
line 1050 at r11 (raw file):
type tableCacheNode struct { fileNum base.DiskFileNum
I like this simplification much more than the overuse of meta.
table_stats.go
line 559 at r11 (raw file):
var size uint64 var err error if file.Virtual {
Repeated use of this paradigm is sort of what I wanted to avoid with the interface. I wonder if we can still do something like it, eg. an sstable.IterFactory interface that has the NewIter methods plus EstimateDiskUsage, and VirtualReader and Reader implementations of it, and a d.tableCache.withIterFactory call if we're interested in an iter factory and don't care whether the underlying thing is virtual or physical.
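(A rough sketch of what such an interface could look like; the method set and signatures are approximated from the discussion, not an existing pebble API.)
// IterFactory is anything that can hand out iterators and disk-usage
// estimates, whether backed by a physical or a virtual sstable.
type IterFactory interface {
	NewRawRangeDelIter() (keyspan.FragmentIterator, error)
	NewRawRangeKeyIter() (keyspan.FragmentIterator, error)
	EstimateDiskUsage(start, end []byte) (uint64, error)
}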
internal/keyspan/truncate.go
line 18 at r11 (raw file):
start *base.InternalKey, end *base.InternalKey, endInclusive bool,
I don't see the need for this change? We will never generate a virtual sstable that cuts a rangedel short in practice anyway (if we do in the excise case, we can always just set the new rangedel's end key to the start of the excise span, and the same for the other bound, and that holds true with this logic without any changes). Or in other words, if we want a virtual sstable to truncate range keys/dels at d (exclusive), we'll set the virtual sstable's end bound to the rangedel sentinel at d anyway. And [b,b)
The panic is fine. I just don't see endInclusive being helpful, as end is already an internal key and we take that into account when we sort trailers too.
sstable/reader.go
line 490 at r5 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
Done.
I don't see it?
sstable/reader.go
line 2808 at r11 (raw file):
lower InternalKey upper InternalKey upperInclusive bool
Semantically it's better to either have lower and upper as user keys as well as upperInclusive bool, or just lower and upper as internal keys where upperInclusive is determined by upper.IsExclusiveSentinel(). I have a strong preference for the former (user keys) as it makes it clear we will never create virtual ssts that split user keys across different sstables.
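(The two options, roughly; field names are illustrative, not the PR's code.)
// Option 1: user keys plus an explicit inclusivity bit, which makes it
// obvious a user key is never split across virtual sstables.
type virtualStateUserKeys struct {
	lower, upper   []byte
	upperInclusive bool
}

// Option 2: internal keys, with inclusivity derived from the upper key via
// upper.IsExclusiveSentinel().
type virtualStateInternalKeys struct {
	lower, upper InternalKey
}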
sstable/reader.go
line 2881 at r11 (raw file):
return keyspan.Truncate( v.reader.Compare, iter, v.vState.lower.UserKey, v.vState.upper.UserKey, &v.vState.lower, &v.vState.upper, v.vState.upperInclusive,
As mentioned above, we don't need to pass upperInclusive here.
sstable/reader.go
line 2911 at r11 (raw file):
start, end []byte, endInclusive bool, ) (lastKeyInclusive bool, first []byte, last []byte) { if start == nil || v.Compare(start, v.lower.UserKey) < 0 {
nit:
first = start
if start == nil || v.Compare(start, v.lower.UserKey) < 0 {
first = v.lower.UserKey
}
sstable/reader.go
line 2919 at r11 (raw file):
// Note that we assume that start, end has some overlap with the virtual // sstable bounds. if end == nil {
nit: I'd move this out and make the stuff in the else if end != nil
sstable/reader.go
line 2924 at r11 (raw file):
} else { cmp := v.Compare(end, v.upper.UserKey) if cmp == 0 {
nit: switch cmp
Incomplete sentence at end of comment, I meant to say that
Reviewable status: 23 of 36 files reviewed, 23 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)
table_cache.go
line 388 at r11 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
I think a good way to structure this code, instead of having repeated function calls that look exactly the same and if file.Virtual conditionals, is:
// this interface definition can even live inside the method
type iterCreator interface {
NewRawRangeDelIter() keyspan.Iterator
NewCompactionIter(....) ...
NewIterWithBlockPropertyFiltersAndContext(...)
}
var ic iterCreator
ic = v.reader
if file.Virtual {
ic = sstable.NewVirtualReader(...)
}
// just use ic instead of v.reader
Done.
table_cache.go
line 576 at r11 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
I would move this to the comment for GetReader, i.e. a callout that "It's the caller's responsibility to not read outside virtual sstable bounds if this reader belongs to a virtual sstable".
Done.
table_stats.go
line 559 at r11 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Repeated use of this paradigm is sort of what I wanted to avoid with the interface. I wonder if we can still do something like it, eg. an sstable.IterFactory interface that has the NewIter methods plus EstimateDiskUsage, and VirtualReader and Reader implementations of it, and a d.tableCache.withIterFactory call if we're interested in an iter factory and don't care whether the underlying thing is virtual or physical.
I don't want to muddle the code with too many interfaces for this, but if you feel strongly about it, then I'll implement it. Leaving it as is until the next round of review.
internal/keyspan/truncate.go
line 18 at r11 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
I don't see the need for this change? We will never generate a virtual sstable that cuts a rangedel short in practice anyway (if we do in the excise case, we can always just set the new rangedel's end key to the start of the excise span, and the same for the other bound, and that holds true with this logic without any changes). Or in other words, if we want a virtual sstable to truncate range keys/dels at d (exclusive), we'll set the virtual sstable's end bound to the rangedel sentinel at d anyway. And [b,b)
The panic is fine. I just don't see endInclusive being helpful, as end is already an internal key and we take that into account when we sort trailers too.
Yea, you're right. We look at trailers while comparing the end key, so we don't need endInclusive.
sstable/reader.go
line 490 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
I don't see it?
Forgot to push those changes. Pushed now.
sstable/reader.go
line 2808 at r11 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Semantically it's better to either have lower and upper as user keys as well as upperInclusive bool, or just lower and upper as internal keys where upperInclusive is determined by upper.IsExclusiveSentinel(). I have a strong preference for the former (user keys) as it makes it clear we will never create virtual ssts that split user keys across different sstables.
You're right that we don't need to pass in endInclusive, because we're comparing trailers in Truncate, which does the same job.
I went with the second option you suggested.
sstable/reader.go
line 2881 at r11 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
As mentioned above, we don't need to pass upperInclusive here.
Done.
Reviewable status: 23 of 36 files reviewed, 26 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @sumeerbhola)
db.go
line 1894 at r13 (raw file):
// storing the range `[start, end]`. The estimation is computed as follows: // // - For sstables fully contained in the range the whole file size is included.
[nit] this deserves a note for virtual ssts
table_cache.go
line 522 at r13 (raw file):
v.reader, file.VirtualMeta(), ) iter, err = virtualReader.NewRawRangeKeyIter()
[nit] we should add this to iterCreator too
internal/keyspan/truncate.go
line 18 at r13 (raw file):
start *base.InternalKey, end *base.InternalKey, panicOnTruncation bool,
[nit] A Truncate iterator that panics on truncation is a weird construct (conceptually speaking).
Maybe rename Truncate to Clip or Restrict? And/or panicOnTruncation to panicOnPartialOverlap?
sstable/reader.go
line 2815 at r13 (raw file):
// NewVirtualReader is used to contruct a reader which can read from virtual // sstables. func NewVirtualReader(reader *Reader, meta manifest.VirtualFileMeta) VirtualReader {
[nit] usually we use Make instead of New when we don't return a pointer
(drive-by)
Reviewed 2 of 21 files at r9, 1 of 6 files at r10, 1 of 2 files at r12, 5 of 8 files at r13.
Reviewable status: 29 of 36 files reviewed, 28 unresolved discussions (waiting on @bananabrick, @itsbilal, and @jbowens)
table_cache.go
line 1050 at r11 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
I like this simplification much more than the overuse of meta.
+1
table_cache.go
line 562 at r13 (raw file):
// // Note that currently the Reader returned here is only used to read value // blocks This reader shouldn't be used for other purposes like reading keys
nit: missing period after "blocks"
sstable/reader.go
line 504 at r13 (raw file):
} func (i *singleLevelIterator) keyOutOfBounds(key []byte, upper []byte, upperInclusive bool) bool {
does this inline?
Reviewable status: 28 of 36 files reviewed, 28 unresolved discussions (waiting on @bananabrick, @itsbilal, and @jbowens)
sstable/reader.go
line 902 at r5 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
If the invariant is not already present, maybe we can try and ensure it in the level iter, but I believe that has its own complexities which @itsbilal ran into. I didn't try that method, so I don't know what the complexities will be.
I think the use of levelIter.{findFileGE,findFileLT} makes the guarantee that if the seek key is not within the file bounds, it will not get called on that sst.
But there may be some cases (including tools) that will use the sst iterator without a levelIter, so for those we would need to do the comparison. We could optimize for the levelIter case by setting a field when initing these sst iterators.
sstable/reader.go
line 993 at r5 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
We could implement a variation of the Last function which checks upper, so that we don't have to perform these SeekGEs and Nexts. But in terms of correctness, this should be okay.
I don't think First and Last are performance sensitive. CockroachDB only uses Seeks. Additionally, if the pebble.Iterator has a lower bound and First is called, it turns into a SeekGE (same for upper-bound and Last).
save for the virtualState user key comment, and a few nits. Thanks for getting this through!
Reviewed 1 of 26 files at r8, 1 of 6 files at r10, 2 of 10 files at r11, 2 of 8 files at r13, 2 of 2 files at r14, all commit messages.
Reviewable status: all files reviewed, 29 unresolved discussions (waiting on @bananabrick, @jbowens, and @sumeerbhola)
internal/keyspan/truncate.go
line 16 at r14 (raw file):
iter FragmentIterator, lower, upper []byte, start *base.InternalKey,
nit: start, end *base.InternalKey
sstable/reader.go
line 902 at r5 (raw file):
Previously, sumeerbhola wrote…
I think the use of levelIter.{findFileGE,findFileLT} makes the guarantee that if the seek key is not within the file bounds, it will not get called on that sst. But there may be some cases (including tools) that will use the sst iterator without a levelIter, so for those we would need to do the comparison. We could optimize for the levelIter case by setting a field when initing these sst iterators.
This can sit in SeekGEFlags, similar to TrySeekUsingNext i.e. constrainedSeekKey or so, that levelIter could always pass true to. I'd be okay with a TODO to handle this case in the future as an optimization.
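A rough sketch of how such a flag could look on a bitset-style flags type; the flag name and methods below are hypothetical stand-ins for the constrainedSeekKey idea, not the actual pebble SeekGEFlags API:

```go
package sketch

// SeekGEFlags is a simplified stand-in for a bitset-style seek-flags type.
type SeekGEFlags uint8

const (
	trySeekUsingNextBit uint8 = iota
	boundedToFileBit // hypothetical: seek key is known to be within the file's bounds
)

// BoundedToFile reports whether the caller (e.g. a levelIter) guarantees that
// the seek key falls within the file's bounds, letting the sstable iterator
// skip its own bounds comparison.
func (f SeekGEFlags) BoundedToFile() bool { return f&(1<<boundedToFileBit) != 0 }

// EnableBoundedToFile returns the flags with the hypothetical bit set; a
// levelIter could always pass flags with this bit enabled.
func (f SeekGEFlags) EnableBoundedToFile() SeekGEFlags { return f | 1<<boundedToFileBit }
```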
sstable/reader.go
line 2808 at r11 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
You're right that we don't need to pass in endInclusive
, because we're comparing trailers inTruncate
, which does the same job. I went with the second option you suggested.
I still strongly think we should switch to user keys and an endInclusive
bool, to make it abundantly clear that we won't split a user key across two VSSTs and to prevent us from unintentionally ever using base.InternalCompare
even by chance. We can just construct a sentinel key for the call to Truncate as necessary. And besides the call to Truncate is exactly why this is important; we don't want to unintentionally exclude some internal keys for a user key just because Truncate compares trailers while everywhere else we only look at User key and IsExclusiveSentinel()
.
sstable/reader.go
line 504 at r13 (raw file):
Previously, sumeerbhola wrote…
does this inline?
@bananabrick would be good to add a // gcaccert:inline
right before this line
sstable/reader.go
line 1527 at r14 (raw file):
// SetBounds implements internalIterator.SetBounds, as documented in the pebble
// package. Note that the upper field is exclusive.
I think this tidbit should also be copied to the InternalIterator interface: that only exclusive upper bounds can be specified using SetBounds.
sstable/reader.go
line 1540 at r14 (raw file):
} else {
	// TODO(bananabrick): Figure out the logic here to enable the boundsCmp
	// optimization for virtual sstables.
Do you plan to address this soon? If not, one easy way to address this is to check if lowerFromConstrainedBounds == passedInLower and same for upper, and that endKeyInclusive is false, and then we can use the same logic as before.
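For concreteness, the suggested condition could look roughly like the following; all parameter names are illustrative rather than actual fields of the iterator:

```go
package sketch

import "bytes"

// canReuseBoundsCmpOpt sketches the suggested check: the existing boundsCmp
// optimization can be reused when constraining by the virtual sstable bounds
// left the caller's bounds unchanged and the end key is exclusive.
func canReuseBoundsCmpOpt(
	constrainedLower, passedInLower, constrainedUpper, passedInUpper []byte,
	endKeyInclusive bool,
) bool {
	return bytes.Equal(constrainedLower, passedInLower) &&
		bytes.Equal(constrainedUpper, passedInUpper) &&
		!endKeyInclusive
}
```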
sstable/reader.go
line 1884 at r14 (raw file):
if i.vState != nil {
	// Callers of SeekGE don't know about virtual sstable bounds, so we may
	// have to internally restrict the bounds.
Let's add a TODO here to also optimize the levelIter case.
Reviewed 1 of 21 files at r9, 1 of 6 files at r10, 1 of 2 files at r12, 4 of 8 files at r13, 1 of 2 files at r14, all commit messages.
Reviewable status: all files reviewed, 35 unresolved discussions (waiting on @bananabrick, @itsbilal, @RaduBerinde, and @sumeerbhola)
-- commits
line 8 at r14:
is this bullet still accurate, or is this referring to the DiskFileNum introduced in an earlier commit?
format_major_version.go
line 527 at r14 (raw file):
// Any physical sstable which has been virtualized must
// have already undergone this migration, and we don't
// need to worry about the virtual sstable themselves.
can you add a format major version, eg FormatVirtualSSTables, maybe just commented out for now, and reference it in this comment? we'll be relying on that format major version to not create virtual sstables in any format major versions before this migration.
table_cache.go
line 388 at r11 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
Done.
i think this structure forces an unnecessary allocation for physical sstable readers.
does extracting the logic into a function generic over iterCreator avoid the allocation?
table_cache.go
line 408 at r14 (raw file):
virtualReader := sstable.NewVirtualReader(
	v.reader, file.VirtualMeta(),
)
every call to newIters needs to suffer an additional allocation for a virtual sstable. we might find that we need a mechanism for caching these virtual readers too so only one exists per virtual sstable.
table_stats.go
line 439 at r14 (raw file):
file.VirtualMeta(), func(v sstable.VirtualReader) (err error) {
	fileSum += file.Size
is file.Size the approximate size within the physical file?
table_stats.go
line 442 at r14 (raw file):
entryCount += file.Stats.NumEntries
keySum += v.Properties.RawKeySize
valSum += v.Properties.RawValueSize
how are these property values defined for virtual sstables? can you leave an explanatory comment here too?
sstable/reader.go
line 210 at r14 (raw file):
// inclusive while iterating instead of exclusive. This is used for virtual
// sstable iteration where the file bounds are inclusive.
endKeyInclusive bool
have we considered always using exclusive upper bounds for virtual sstables? it seems like it could simplify implementation here. we've discussed (#1741) using loose sstable bounds elsewhere. also with Comparer.ImmediateSuccessor, they don't even need to be loose—just recharacterized as an exclusive bound.
Reviewable status: all files reviewed, 28 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)
Previously, jbowens (Jackson Owens) wrote…
is this bullet still accurate, or is this referring to the DiskFileNum introduced in an earlier commit?
Done.
format_major_version.go
line 527 at r14 (raw file):
Previously, jbowens (Jackson Owens) wrote…
can you add a format major version, eg FormatVirtualSSTables, maybe just commented out for now, and reference it in this comment? we'll be relying on that format major version to not create virtual sstables in any format major versions before this migration.
Will follow up in a new pr.
table_cache.go
line 388 at r11 (raw file):
Previously, jbowens (Jackson Owens) wrote…
i think this structure forces an unnecessary allocation for physical sstable readers.
does extracting the logic into a function generic over iterCreator avoid the allocation?
I wrote a benchmark, BenchmarkNewItersAlloc, to see if calling newIters was leading to any allocations, and it doesn't look like it. I think this sort of assignment to a variable with an interface type would only perform an allocation if the Reader wasn't already a heap-allocated value (there seems to be some special casing in the internal Go code for this).
If I modify the benchmark and run it on a virtual sstable, there is one allocation performed.
I think some additional code here is worth it to avoid the allocation, left a todo to optimize the virtual sstable case.
table_cache.go
line 522 at r13 (raw file):
Previously, RaduBerinde wrote…
[nit] we should add this to iterCreator too
Not adding this since the iterator interface is causing an allocation for virtual readers.
table_cache.go
line 408 at r14 (raw file):
Previously, jbowens (Jackson Owens) wrote…
every call to newIters needs to suffer an additional allocation for a virtual sstable. we might find that we need a mechanism for caching these virtual readers too so only one exists per virtual sstable.
They're really just a lightweight wrapper around a reader, holding on to some fields from the virtual sstable file metadata.
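For readers following the thread, a rough sketch of the kind of wrapper being described; the placeholder types and field names below are illustrative and may not match the actual sstable.VirtualReader:

```go
package sketch

// Placeholder types standing in for pebble's sstable.Reader, base.InternalKey,
// and the reader's properties; the real definitions live in the pebble repo.
type (
	Reader      struct{}
	InternalKey struct{ UserKey []byte }
	Properties  struct{ NumEntries, RawKeySize, RawValueSize uint64 }
)

// VirtualReader sketches a lightweight wrapper: a pointer to the shared
// physical Reader plus the virtual bounds and extrapolated properties copied
// out of the virtual file's metadata.
type VirtualReader struct {
	reader     *Reader     // shared physical sstable reader
	lower      InternalKey // virtual sstable lower bound
	upper      InternalKey // virtual sstable upper bound (may be inclusive)
	Properties Properties  // extrapolated from the physical file's properties
}
```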
table_stats.go
line 439 at r14 (raw file):
Previously, jbowens (Jackson Owens) wrote…
is file.Size the approximate size within the physical file?
Yea. It's computed by calling EstimatedDiskUsage on the physical file using virtual sstable bounds.
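In other words, something along these lines; the function and variable names are illustrative, and this assumes EstimatedDiskUsage keeps its (start, end []byte) (uint64, error) signature on sstable.Reader:

```go
package sketch

import "github.com/cockroachdb/pebble/sstable"

// approximateVirtualSize estimates a virtual file's Size as the number of
// physical-file bytes spanned by the virtual bounds. Sketch only; see the PR
// for where this is actually computed.
func approximateVirtualSize(r *sstable.Reader, smallest, largest []byte) (uint64, error) {
	return r.EstimatedDiskUsage(smallest, largest)
}
```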
table_stats.go
line 442 at r14 (raw file):
Previously, jbowens (Jackson Owens) wrote…
how are these property values defined for virtual sstables? can you leave an explanatory comment here too?
They're extrapolated using the TableStats.NumEntries for the virtual sstable, and Reader.Properties.NumEntries for the physical sstable. They're defined in NewVirtualReader. How they're computed is mentioned as a comment in the struct where they're defined.
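A hedged sketch of that extrapolation; the function below is illustrative, and the real arithmetic lives in NewVirtualReader:

```go
package sketch

// extrapolateProperty scales a physical sstable property (e.g. RawKeySize or
// RawValueSize) by the ratio of the virtual file's estimated entry count to
// the physical file's entry count. Illustrative only.
func extrapolateProperty(physicalProp, physicalNumEntries, virtualNumEntries uint64) uint64 {
	if physicalNumEntries == 0 {
		return 0
	}
	return physicalProp * virtualNumEntries / physicalNumEntries
}
```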
sstable/reader.go
line 902 at r5 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
This can sit in SeekGEFlags, similar to TrySeekUsingNext i.e. constrainedSeekKey or so, that levelIter could always pass true to. I'd be okay with a TODO to handle this case in the future as an optimization.
Done.
sstable/reader.go
line 993 at r5 (raw file):
Previously, sumeerbhola wrote…
I don't think First and Last are performance sensitive. CockroachDB only uses Seeks. Additionally, if the pebble.Iterator has a lower bound and First is called, it turns into a SeekGE (same for upper-bound and Last).
SeekLT will also call this. I think we can optimize this away when the level iter is using the sstable iterators.
sstable/reader.go
line 2808 at r11 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
I still strongly think we should switch to user keys and an endInclusive bool, to make it abundantly clear that we won't split a user key across two VSSTs and to prevent us from unintentionally ever using base.InternalCompare even by chance. We can just construct a sentinel key for the call to Truncate as necessary. And besides the call to Truncate is exactly why this is important; we don't want to unintentionally exclude some internal keys for a user key just because Truncate compares trailers while everywhere else we only look at User key and IsExclusiveSentinel().
The call to truncate requires both smallest, largest internal keys. Wouldn't we need to store the trailer for the smallest key anyway? I've made a note to discuss this once you're back.
sstable/reader.go
line 504 at r13 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
@bananabrick would be good to add a // gcaccert:inline right before this line
No, this does not inline. Will inline this in a separate commit.
sstable/reader.go
line 210 at r14 (raw file):
Previously, jbowens (Jackson Owens) wrote…
have we considered always using exclusive upper bounds for virtual sstables? it seems like it could simplify implementation here. we've discussed (#1741) using loose sstable bounds elsewhere. also with Comparer.ImmediateSuccessor, they don't even need to be loose—just recharacterized as an exclusive bound.
Will leave this as is now since I've already implemented it. I haven't considered creating a loose upper bound. Maybe the IsExclusiveSentinel function for virtual sstable bounds can always return true, but I haven't considered all the ramifications. I fear it would simplify the sstable iterator code, but complicate invariants elsewhere. To be fair, I really haven't thought it through so maybe I'm wrong.
sstable/reader.go
line 1540 at r14 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Do you plan to address this soon? If not, one easy way to address this is to check if lowerFromConstrainedBounds == passedInLower and same for upper, and that endKeyInclusive is false, and then we can use the same logic as before.
Yea, I've made a note to address todos soon.
Addressed review but sumeer is OOO
- Support reading from virtual sstables through a virtual sstable reader.
- Support reading virtual sstables through the table cache.

RFC: https://github.com/cockroachdb/pebble/blob/master/docs/RFCS/20221122_virtual_sstable.md