
db: add infrastructure to read from virtual sstables #2288

Merged
merged 1 commit on May 3, 2023

Conversation

Contributor

@bananabrick bananabrick commented Jan 30, 2023

  • Support reading from virtual sstables through a Virtual
    sstable reader.
  • Support reading virtual sstables through the table cache.

RFC: https://github.com/cockroachdb/pebble/blob/master/docs/RFCS/20221122_virtual_sstable.md

@cockroach-teamcity
Member

This change is Reviewable

@bananabrick
Contributor Author

Not ready for review. Copying code over from another branch.

Member

@RaduBerinde RaduBerinde left a comment

Left some minor comments (I know it's not ready for review)

Reviewable status: 0 of 17 files reviewed, 4 unresolved discussions (waiting on @bananabrick)


internal/manifest/version.go line 116 at r1 (raw file):

}

// PhysicalFileMeta is used by functions which want a guarantee that their input

[nit] This abstraction seems a bit strange. The functions could simply check IsVirtual(). We could add convenience methods FileMetadata.AssertVirtual() and AssertNotVirtual().
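
As a rough sketch of what those convenience methods might look like (assuming FileMetadata already exposes IsVirtual(); the panic messages are only illustrative):

func (m *FileMetadata) AssertVirtual() {
	if !m.IsVirtual() {
		panic("pebble: expected a virtual sstable")
	}
}

func (m *FileMetadata) AssertNotVirtual() {
	if m.IsVirtual() {
		panic("pebble: expected a physical sstable")
	}
}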


internal/manifest/version.go line 132 at r1 (raw file):

func NewPhysicalMeta(meta *FileMetadata) PhysicalFileMeta {
	if meta.IsVirtual() {
		panic("pebble: invalid function call")

[nit] the message here is not useful at all, maybe pebble: expected virtual file


internal/manifest/version.go line 179 at r1 (raw file):

	// sstables, it is upto the creator of the virtual sstable to set refs to
	// the appropriate value.
	refs *int32

It seems like it would be a good idea to separate this into a struct, and move all the atomic code into that struct.


internal/manifest/version.go line 284 at r1 (raw file):

// IsVirtual returns true if the FileMetadata belongs to a virtual sstable.
func (m *FileMetadata) IsVirtual() bool {
	// TODO(bananabrick): Might want to get rid of PhysicalSSTNum.

What's the alternative?

Contributor Author

@bananabrick bananabrick left a comment

This is almost complete. We have the FileMetadata changes, refcounting for virtual sstables, the VirtualReader API, table cache changes, and 90% of the sstable iterator changes (there are still some cases around SetBounds/bounds set by the caller of NewIterWithBlockPropertyFilters which I'm working on).

Remaining work:

  1. More testing around the iterator.
  2. A test which reads from virtual sstables through the table cache (this will exercise the entire implementation).
  3. Version edit encoding/decoding.

Reviewable status: 0 of 28 files reviewed, 6 unresolved discussions (waiting on @RaduBerinde)


checkpoint.go line 267 at r3 (raw file):

		for f := iter.First(); f != nil; f = iter.Next() {
			if f.IsVirtual() {
				panic("pebble: checkpointing not yet supported for virtual sstables")

Will support in a separate PR. Want to use this PR to unblock further disaggregated storage development.


version_set.go line 89 at r3 (raw file):

	// on the creation of every version.
	obsoleteFn        func(obsolete []physicalMeta)
	obsoleteTables    []physicalMeta

@RaduBerinde This physicalMeta abstraction is unusual, but it's been quite useful, and it's also easy to understand.

For example, I turned obsoleteTables []*manifest.FileMetadata -> obsoleteTables []physicalMeta, and then I started fixing compiler errors, and by the time I was done fixing compiler errors, I had refcounting working.

We could also have cases where we have some functions A, B, C where A only works on virtual sstables, B works on both, and C only works on virtual sstables. And we have A calling C, and B calling C, and C can also be called independently. Imo, it's cleaner to constrain the input using a type which indicates that the file is a virtual sstable rather than have assertions in all 3 functions.
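
To make the argument concrete, here is a minimal sketch of the pattern, reusing the PR's PhysicalFileMeta/NewPhysicalMeta names (the helper functions below are hypothetical):

// deleteObsolete can only ever receive a physical file: the IsVirtual check
// happens once, when the wrapper is constructed, instead of being asserted
// inside this function and every one of its callers.
func deleteObsolete(meta PhysicalFileMeta) {
	// ... delete the physical sstable from disk ...
}

func onVersionObsolete(obsolete []*FileMetadata) {
	for _, f := range obsolete {
		if !f.IsVirtual() {
			deleteObsolete(NewPhysicalMeta(f))
		}
	}
}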


internal/manifest/version.go line 116 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] This abstraction seems a bit strange. The functions could simply check IsVirtual(). We could add convenience methods FileMetadata.AssertVirtual() and AssertNotVirtual().

Commented above why I think it's useful.


internal/manifest/version.go line 179 at r1 (raw file):

Previously, RaduBerinde wrote…

It seems like it would be a good idea to separate this into a struct, and move all the atomic code into that struct.

We have an Atomic struct at the top of the FileMetadata. Not sure why this isn't part of it, since I believe moving it would also decrease the overall size of the struct.


internal/manifest/version.go line 284 at r1 (raw file):

Previously, RaduBerinde wrote…

What's the alternative?

Implemented a VirtualSSTState, since there's other state that virtual sstables require. It's also optional, so I made it a pointer to avoid size bloat in the FileMetadata.
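
Roughly the shape being described, with assumed field names (a sketch, not the PR's exact definition):

// VirtualSSTState holds state that only virtual sstables need (bounds,
// backing physical sstable, etc.).
type VirtualSSTState struct {
	// ...
}

type FileMetadata struct {
	// ... existing fields ...

	// virtualSSTState is nil for physical sstables, so physical files pay
	// only the cost of a nil pointer rather than the whole struct.
	virtualSSTState *VirtualSSTState
}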

Contributor Author

@bananabrick bananabrick left a comment

Note that I'll be moving version edit encoding/decoding into a separate PR (it's required for persisting virtual sstables to disk, but not for testing virtual sstables without Pebble restarts), so all that's left here is more testing around the iterators.

Reviewable status: 0 of 28 files reviewed, 6 unresolved discussions (waiting on @RaduBerinde)

@bananabrick bananabrick requested review from itsbilal and a team February 13, 2023 23:33
@bananabrick bananabrick marked this pull request as ready for review February 13, 2023 23:34
Contributor Author

@bananabrick bananabrick left a comment

I'm working on 3 things in this PR:

  1. Data-driven tests which create virtual sstables -> write them to the manifest -> open iterators on the db, and perform reads.
  2. Randomized test which does the same as 1.
  3. Verifying some invariants on how the sstable iterators are used by top level iterators. (Want to make sure the truncation logic/bounds passed by top level iterators into sstable iterators are correct.)

All that said, this should be ready for a look. The code in reader.go has the bulk of the complicated logic and deserves the most scrutiny.

Reviewable status: 0 of 28 files reviewed, 7 unresolved discussions (waiting on @itsbilal and @RaduBerinde)


sstable/reader.go line 1002 at r5 (raw file):

// SeekLT implements internalIterator.SeekLT, as documented in the pebble
// package. Note that SeekLT only checks the lower bound. It is up to the
// caller to ensure that key is less than the upper bound.

I found this invariant ("Note that SeekLT only checks the lower bound. It is up to the caller to ensure that key is less than the upper bound.") unnecessary. It seems to be the case that, if the upper bound is exclusive, then the top level iterators just pass in the exclusive upper bound here, which seems to do the right thing, and that is also what my tests indicate.

Maybe the invariant should be: the key should be <= the upper bound if the upper bound is exclusive?

Contributor Author

@bananabrick bananabrick left a comment

Reviewable status: 0 of 28 files reviewed, 7 unresolved discussions (waiting on @itsbilal and @RaduBerinde)


sstable/reader.go line 1002 at r5 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

I found this invariant ("Note that SeekLT only checks the lower bound. It is up to the caller to ensure that key is less than the upper bound.") unnecessary. It seems to be the case that, if the upper bound is exclusive, then the top level iterators just pass in the exclusive upper bound here, which seems to do the right thing, and that is also what my tests indicate.

Maybe the invariant should be: the key should be <= the upper bound if the upper bound is exclusive?

This also seems to be the case from the comment on the Last function:

// Last implements internalIterator.Last, as documented in the pebble
// package. Note that Last only checks the lower bound. It is up to the caller
// to ensure that key is less than the upper bound (e.g. via a call to
// SeekLT(upper))

It indicates that SeekLT(upper) should be called, but upper is not less than the upper bound.

Contributor Author

@bananabrick bananabrick left a comment

Reviewable status: 0 of 28 files reviewed, 8 unresolved discussions (waiting on @itsbilal and @RaduBerinde)


sstable/reader.go line 993 at r5 (raw file):

	// Seek to the first internal key
	ikey, _ := i.SeekGE(i.upper, base.SeekGEFlagsNone)

We could implement a variation of the Last function which checks upper, so that we don't have to perform these SeekGEs and Nexts. But in terms of correctness, this should be okay.
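
For reference, a toy illustration (plain strings rather than Pebble's internal keys) of why the SeekGE-then-Next dance followed by a single step back lands on the last key within the virtual upper bound:

// lastWithinUpper returns the index of the last key within the upper bound,
// or -1 if no key qualifies. keys is sorted ascending.
func lastWithinUpper(keys []string, upper string, upperInclusive bool) int {
	// Equivalent of SeekGE(upper): find the first key >= upper.
	i := 0
	for i < len(keys) && keys[i] < upper {
		i++
	}
	// Equivalent of the Nexts: with an inclusive bound, keys equal to upper
	// are still in bounds, so step past them.
	if upperInclusive {
		for i < len(keys) && keys[i] == upper {
			i++
		}
	}
	// Equivalent of a final Prev: the previous position is the last key in
	// bounds.
	return i - 1
}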

Member

@RaduBerinde RaduBerinde left a comment

Reviewable status: 0 of 28 files reviewed, 5 unresolved discussions (waiting on @bananabrick and @itsbilal)


version_set.go line 89 at r3 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

@RaduBerinde This physicalMeta abstraction is unusual, but it's been quite useful, and it's also easy to understand.

For example, I turned obsoleteTables []*manifest.FileMetadata -> obsoleteTables []physicalMeta, and then I started fixing compiler errors, and by the time I was done fixing compiler errors, I had refcounting working.

We could also have cases where we have some functions A, B, C where A only works on virtual sstables, B works on both, and C only works on virtual sstables. And we have A calling C, and B calling C, and C can also be called independently. Imo, it's cleaner to constrain the input using a type which indicates that the file is a virtual sstable rather than have assertions in all 3 functions.

Interesting, I see.


sstable/reader.go line 2776 at r5 (raw file):

}

// VirtualReader wraps Reader. Its purpose is to restrict functionality of the

Consider making this a "virtual capable reader" rather than a "virtual only reader". If the meta is not virtual, it would just pass through to the Reader (which maybe should become an unexported physicalReader). Then various users would only need VirtualReader. Just a suggestion, I am not sure it would make things better.

Collaborator

@sumeerbhola sumeerbhola left a comment

I did a very rough pass over the manifest changes

Reviewed 2 of 28 files at r3.
Reviewable status: 2 of 28 files reviewed, 13 unresolved discussions (waiting on @bananabrick and @itsbilal)


version_set.go line 95 at r5 (raw file):

	// Zombie tables are tables which are no longer referenced directly by the
	// latest version. The files in the zombieTables list may be referenced by
	// an older Version, or may be refereced by virtual sstables in the latest

nit: referenced


version_set.go line 96 at r5 (raw file):

	// latest version. The files in the zombieTables list may be referenced by
	// an older Version, or may be refereced by virtual sstables in the latest
	// Version, or an older Version.

This is changing the meaning of zombie tables which were supposed to track tables that were not referenced in the latest version, so that we could realize that there was something like a long-lived Iterator that was holding onto an older version.
We had discussed a way to preserve this definition in the RFC PR, which was written in the RFC as:

type PhysicalState struct {
  // Total refs across all virtual ssts * versions. That is, if the same virtual
  // sst is present in multiple versions, it may have multiple refs, if the
  // btree node is not the same.
  totalRefs int32

  // Number of virtual ssts in the latest version that refer to this physical
  // SST. Will be 1 if there is only a physical sst, or there is only 1 virtual
  // sst referencing this physical sst.
  // INVARIANT: refsInLatestVersion <= totalRefs
  // refsInLatestVersion == 0 is a zombie sstable.
  refsInLatestVersion int32

I don't see the corresponding code.
Am I missing something?
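
For other readers, a minimal sketch of the zombie definition the RFC snippet above implies (field names follow the RFC; the method is hypothetical):

// A backing physical sstable becomes a zombie once the latest version no
// longer references it (directly or via virtual sstables), even if older
// versions pinned by long-lived iterators still do.
type physicalState struct {
	totalRefs           int32 // refs across all versions
	refsInLatestVersion int32 // refs from the latest version only
}

func (s *physicalState) isZombie() bool {
	// INVARIANT: refsInLatestVersion <= totalRefs.
	return s.refsInLatestVersion == 0 && s.totalRefs > 0
}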


internal/manifest/level_metadata.go line 474 at r5 (raw file):

// make sure that the caller's usecase also works with virtual sstables. An
// example of this is DB checkpointing, which will not work with virtual
// sstables.

are you saying "currently will not work", but we will fix it to work?


internal/manifest/version.go line 40 at r5 (raw file):

	Largest InternalKey
	// SmallestSeqNum is the smallest sequence number in the table.
	SmallestSeqNum uint64

Were these removed because they are not needed?
IIRC, we pass this struct to users of Pebble via EventListener.


internal/manifest/version.go line 216 at r5 (raw file):

	// File creation time in seconds since the epoch (1970-01-01 00:00:00
	// UTC). For ingested sstables, this corresponds to the time the file was
	// ingested.

Worth clarifying what CreationTime means for a virtual sst.


internal/manifest/version.go line 245 at r5 (raw file):

	//
	// Make sure to set Atomic.statsValid to 1 and set the table stats upon
	// virtual sstable creation, as asynchrnously computing these isn't

nit: asynchronously


internal/manifest/version.go line 250 at r5 (raw file):

	// TODO(bananabrick): TableStats are almost exclusively used for compaction
	// picking. Those heuristics make sense for physical sstables, but don't
	// necessarily make sense for virtual sstables. When implementing additional

The heuristic for picking the file in the level to compact is relevant even for virtual ssts. Maybe I am missing something.

The RFC says "TableStats will either be computed upon virtual sstable creation using linear interpolation on the block counts of the virtual/physical sstables or asynchronously using the file bounds of the virtual sstable."


internal/manifest/version_edit.go line 109 at r5 (raw file):

	// found that there was no overlapping file at the higher level).
	DeletedFiles map[DeletedFileEntry]*FileMetadata
	NewFiles     []NewFileEntry

It is not clear to me how this works for virtual ssts. Unlike physical ssts, where the compaction could be explicit about what it had deleted, because there was only 1 ref in that version, now there can be multiple concurrent compactions that could have as input sstables vf1 and vf2 respectively, where both vf1 and vf2 refer to physical table pt. Neither compaction knows that it will remove the last reference to pt because they are racing with each other. So the version edit produced by the compaction is slightly incomplete. But when deciding to apply an edit, one can use the current state of the latest version to make this version edit complete, and then apply the edit.

Contributor

@itsbilal itsbilal left a comment

Looking directionally good! Some comments around better handling of stuff in the table cache, as well as the exclusive sentinel cases.

Reviewed 11 of 28 files at r3, 1 of 2 files at r5, all commit messages.
Reviewable status: 14 of 28 files reviewed, 21 unresolved discussions (waiting on @bananabrick, @RaduBerinde, and @sumeerbhola)


table_cache.go line 179 at r5 (raw file):

// withReader fetches an sstable Reader associated with a physical sstable
// present on disk.
func (c *tableCacheContainer) withReader(meta physicalMeta, fn func(*sstable.Reader) error) error {

This one function should operate on both metadatas, and we should wrap the physical reader with a virtual reader if the file is virtual. It might make sense to have an sstable.Reader interface, and physical/virtual reader implementations. That way we:

  1. only deal with one kind of reader up the stack, the interface type
  2. The table cache only stores physical readers, better matching the 1 table cache entry = 1 open file descriptor semantic we have going right now. We'd probably need to add logic for this in findNode and getShard so the table cache only operates on physical files.
  3. Also allows virtual sstables to reuse the same open file descriptor from the physical sstable

We'd probably need to rethink the table cache-to-open-FD mapping for shared.Storage sstables, but for now this makes the most sense and keeps us from running out of table cache room if we spawn too many virtual sstables.


internal/manifest/version.go line 306 at r5 (raw file):

// sstable is created to verify that the fields of the virtual sstable are
// sound.
func (m *FileMetadata) ValidateVirtualSSTable(parent *FileMetadata) {

nit: ValidateVirtual


sstable/reader.go line 490 at r5 (raw file):

		return false
	}
	return (!upperInclusive && i.cmp(key, upper) >= 0) || i.cmp(key, upper) > 0

nit: can avoid two comparisons by saving the value of i.cmp before this line. Also I'd prefer to inline this everywhere.


sstable/reader.go line 993 at r5 (raw file):

	// Seek to the first internal key
	ikey, _ := i.SeekGE(i.upper, base.SeekGEFlagsNone)

Currently this isn't handling the case where i.upper is referencing an exclusive sentinel key as the virtual sstable upper bound. We'd want to either ensure virtualLast is not called if the virtual sst ends in an exclusive sentinel key (and instead we'd want to call SeekLT(i.upper)), or we'd need another bool to denote this case and handle it in virtualLast.


sstable/reader.go line 1020 at r5 (raw file):

			}
		} else if cmp > 0 {
			// This is a bit subtle, but if the end key is exclusive, then the

This won't always be true. The virtual sst could be truncated at a rangedel/rangekey exclusive sentinel key, in which case i.endKeyInclusive would be false and we'd want to virtualLast regardless.


sstable/reader.go line 2200 at r5 (raw file):

	// Seek to the first internal key
	ikey, _ := i.SeekGE(i.upper, base.SeekGEFlagsNone)

Same issue as above


sstable/reader.go line 2776 at r5 (raw file):

Previously, RaduBerinde wrote…

Consider making this a "virtual capable reader" rather than a "virtual only reader". If the meta is not virtual, it would just pass through to the Reader (which maybe should become an unexported physicalReader). Then various users would only need VirtualReader. Just a suggestion, I am not sure it would make things better.

I second this suggestion. It is an anti-pattern to have iterators hold onto two readers, only one of which is actually doing the reading. We can replace all the i.vReader != nil with i.reader.IsVirtual().


sstable/reader.go line 2893 at r5 (raw file):

	// If we don't shrink the end key to fit within v.upper(end key is already
	// <= v.upper), then the exclusivity status of the key does not change.
	lastKeyInclusive = currEndInclusive

lastKeyInclusive = currEndInclusive && !(cmp(end, v.upper.UserKey) == 0 && v.upper.IsExclusiveSentinel()).


sstable/reader.go line 2897 at r5 (raw file):

		// The new end key(last) is set to v.upper, but this new key should
		// be inclusive because the prev end key(end) was greater than this key.
		lastKeyInclusive = true

lastKeyInclusive = !v.upper.IsExclusiveSentinel()

Collaborator

@jbowens jbowens left a comment

Reviewed 1 of 28 files at r3, all commit messages.
Reviewable status: 14 of 28 files reviewed, 27 unresolved discussions (waiting on @bananabrick, @itsbilal, @RaduBerinde, and @sumeerbhola)


compaction.go line 1236 at r5 (raw file):

			rangeDelIter = keyspan.Truncate(
				c.cmp, rangeDelIter, lowerBound.UserKey, upperBound.UserKey,
				&f.Smallest, &f.Largest, false,

nit: bool arguments like this are hard to understand when reading. an inline comment helps, eg false /* panicOnTruncation */


internal/manifest/version.go line 130 at r5 (raw file):

// NewPhysicalMeta should be the only source of creating the PhysicalFileMeta
// wrapper type.
func NewPhysicalMeta(meta *FileMetadata) PhysicalFileMeta {

nit: you might consider exposing methods:

func (f *FileMetadata) AsPhysicalFile() (PhysicalFileMetadata, bool)

for use in call sites that just want to do something with the physical (or virtual) file if it is one. e.g.,

if pm, ok := f.AsPhysicalFile(); ok {
    // do physical file things with `pm`
}

internal/manifest/version.go line 139 at r5 (raw file):

}

// NewVirtualMeta should the only source of creating the NewVirtualMeta wrapper

nit: s/should the/should be/


internal/manifest/version.go line 161 at r5 (raw file):

	PhysicalSSTSize uint64

	// Refs is a pointer to the parent sstable Refs pointer.

nit: parent sstable ref count? "Refs pointer" sounds off because refs itself isn't a pointer and it was moved from FileMetadata.Refs to FileMetadata.atomic.refs.


internal/manifest/version.go line 250 at r5 (raw file):

Previously, sumeerbhola wrote…

The heuristic for picking the file in the level to compact is relevant even for virtual ssts. Maybe I am missing something.

The RFC says "TableStats will either be computed upon virtual sstable creation using linear interpolation on the block counts of the virtual/physical sstables or asynchronously using the file bounds of the virtual sstable."

Some combination of the two could make sense. RangeDeletionsBytesEstimate is always computed asynchronously if the file contains any range deletions and PointDeletionsBytesEstimate is computed asynchronously if a table contains 10% or more tombstones.

I could imagine linearly interpolating the properties, and then calculating both the compensation estimates asynchronously just like physical sstables.


Speaking of compaction heuristics, are we worried about space amplification from the physical sstables underlying virtual sstables? It seems like we should have some kind of simple heuristic to stave off pathological cases.

One simple compaction heuristic suggestion is to maintain a VirtualizedSize on physical file metadata that represents the sum of the virtual size estimates of all virtual files backed by the physical file. Whenever a virtual file becomes obsolete, the virtual file size estimate is removed from the physical file's VirtualizedSize. When compaction-picking is picking a file and examines a virtual file, compensate the file size by increasing it by (PhysicalFile.Size - PhysicalFile.VirtualizedSize) / PhysicalFile.Refs. The intuition is that every virtual file gets its compaction priority boosted by a share of the currently unused bytes of the physical file.
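
A small sketch of the compensation arithmetic being proposed here (all names are hypothetical, not existing Pebble fields):

// compensatedSize boosts a virtual file's size for compaction picking by an
// equal share of the backing physical file's currently unused bytes.
// Assumes virtualizedSize <= physicalSize and refs > 0.
func compensatedSize(virtualSize, physicalSize, virtualizedSize, refs uint64) uint64 {
	unused := physicalSize - virtualizedSize
	return virtualSize + unused/refs
}

For example, a 100 MB physical file backing two virtual files whose size estimates sum to 20 MB would add (100-20)/2 = 40 MB to each virtual file's compensated size.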


internal/manifest/version.go line 330 at r5 (raw file):

// RefPtr returns the pointer which should used to reference count a
// FileMetadata.
func (m *FileMetadata) RefPtr() *int32 {

imo it would be a little cleaner to expose Ref() int32, Unref() int32 and Refs() int32 methods like we do on *Version, and let how we ref track be an implementation detail.


internal/manifest/version.go line 1042 at r5 (raw file):

		for _, f := range obsolete {
			if !f.IsVirtual() {
				physicalFiles = append(physicalFiles, NewPhysicalMeta(f))

should we make the callee (e.g., versionSet.obsoleteFn) filter out virtual files? It would (a) avoid the duplication between here and UnrefLocked and (b) avoid unnecessarily materializing an ephemeral slice of physical obsolete sstables here.

If we want to provide the PhysicalFileMeta type safety for code outside of internal/manifest, we could expose a manifest.ForAllPhysicalFiles([]FileMetadata, func(PhysicalFileMeta)) method or the like
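
A possible shape for that helper (a sketch; not an existing manifest API, and the PR may choose a different signature):

// ForAllPhysicalFiles invokes fn for every non-virtual file, handing callers
// the PhysicalFileMeta wrapper without materializing an intermediate slice.
func ForAllPhysicalFiles(files []*FileMetadata, fn func(PhysicalFileMeta)) {
	for _, f := range files {
		if !f.IsVirtual() {
			fn(NewPhysicalMeta(f))
		}
	}
}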

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: 14 of 28 files reviewed, 27 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @RaduBerinde)


internal/manifest/version.go line 250 at r5 (raw file):

One simple compaction heuristic suggestion is to maintain a VirtualizedSize on physical file metadata that represents the sum of the virtual size estimates of all virtual files backed by the physical file. Whenever a virtual file becomes obsolete, the virtual file size estimate is removed from the physical file's VirtualizedSize.

+1 to this. I am doing something similar in my prototyping of blob files to decide when to rewrite them.

// LiveValueSize is the sum of the length of uncompressed values in this
// blob file that are still live (i.e., referred to by sstables in the
// latest version).

We only want the virtual files in the latest version contributing to this. And this is in-memory state (not written in a version edit), since it can be reconstructed based on the references.

@bananabrick bananabrick mentioned this pull request Feb 16, 2023
Collaborator

@jbowens jbowens left a comment

Reviewed 1 of 28 files at r3.
Reviewable status: 15 of 28 files reviewed, 28 unresolved discussions (waiting on @bananabrick, @itsbilal, @RaduBerinde, and @sumeerbhola)


sstable/reader.go line 902 at r5 (raw file):

	if i.vReader != nil {
		// Callers of SeekPrefixGE aren't aware of virtual sstable bounds, so
		// we may have to internally restrict the bounds.

Why is that? I would've thought the levelIter would know the bounds since it has the corresponding FileMetadata.

Collaborator

@jbowens jbowens left a comment

Reviewable status: 15 of 28 files reviewed, 28 unresolved discussions (waiting on @bananabrick, @itsbilal, @RaduBerinde, and @sumeerbhola)


sstable/reader.go line 1002 at r5 (raw file):
I might be misinterpreting, but I think you're saying the same thing as the comment? Implementations of SeekLT do not need to check the upper bound and can assume that the provided seek key is less than the upper bound. They only need to check the lower bound, since we're iterating in the backward direction and may hit the lower bound.

It seems to be the case that, if the upper bound is exclusive, then the top level iterators just pass in the exclusive upper bound here

I believe this is what makes this invariant true, no?

Contributor Author

@bananabrick bananabrick left a comment

Reviewable status: 15 of 28 files reviewed, 28 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)


sstable/reader.go line 902 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Why is that? I would've thought the levelIter would know the bounds since it has the corresponding FileMetadata.

Do we already have an invariant in the code that the level iter will not pass in bounds outside file metadata bounds to the sstable iterators? I couldn't confirm this invariant, so instead I guarded against that case here.

Contributor Author

@bananabrick bananabrick left a comment

Reviewable status: 15 of 28 files reviewed, 28 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)


sstable/reader.go line 902 at r5 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Do we already have an invariant in the code that the level iter will not pass in bounds outside file metadata bounds to the sstable iterators? I couldn't confirm this invariant, so instead I guarded against that case here.

If the invariant is not already present, maybe we can try and ensure it in the level iter, but I believe that has its own complexities which @itsbilal ran into. I didn't try that method, so I don't know what the complexities will be.

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: 15 of 28 files reviewed, 28 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @RaduBerinde)


internal/manifest/version_edit.go line 109 at r5 (raw file):

Previously, sumeerbhola wrote…

It is not clear to me how this works for virtual ssts. Unlike physical ssts, where the compaction could be explicit about what it had deleted, because there was only 1 ref in that version, now there can be multiple concurrent compactions that could have as input sstables vf1 and vf2 respectively, where both vf1 and vf2 refer to physical table pt. Neither compaction knows that it will remove the last reference to pt because they are racing with each other. So the version edit produced by the compaction is slightly incomplete. But when deciding to apply an edit, one can use the current state of the latest version to make this version edit complete, and then apply the edit.

Adding some stuff I mentioned in a slack chat for the sake of other reviewers:

A single version edit should be explicit about representing (a) s1 is being logically removed from a level, so it does not need to be represented in that level's metadata, and (b) s1 is no longer needed by the latest version. We are explicit about (a) and (b) now -- the edit has both added and deleted files and if f \in deleted-added it is case (b) and f \in added \intersection deleted it is case (a). Explicitly representing case (b) is important for debuggability, so that we know in the system's history when the system transitioned something to zombie.

Meeting this goal with virtual ssts requires some format changes for the version edit. Also needs some changes for how version edits are constructed by compactions.
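
A concrete reading of that set arithmetic, as a hedged sketch (plain file numbers, hypothetical helper):

// classify splits a version edit's deleted files into the two cases above:
// a file also present in the added set was merely moved within the version
// (case a), while a file only in the deleted set is no longer needed by the
// latest version (case b) and transitions to zombie.
func classify(added, deleted map[uint64]bool) (moved, removed []uint64) {
	for f := range deleted {
		if added[f] {
			moved = append(moved, f) // case (a)
		} else {
			removed = append(removed, f) // case (b)
		}
	}
	return moved, removed
}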

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: 15 of 28 files reviewed, 28 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @RaduBerinde)


internal/manifest/version_edit.go line 109 at r5 (raw file):

Previously, sumeerbhola wrote…

Adding some stuff I mentioned in a slack chat for the sake of other reviewers:

A single version edit should be explicit about representing (a) s1 is being logically removed from a level, so it does not need to be represented in that level's metadata, and (b) s1 is no longer needed by the latest version. We are explicit about (a) and (b) now -- the edit has both added and deleted files and if f \in deleted-added it is case (b) and f \in added \intersection deleted it is case (a). Explicitly representing case (b) is important for debuggability, so that we know in the system's history when the system transitioned something to zombie.

Meeting this goal with virtual ssts requires some format changes for the version edit. Also needs some changes for how version edits are constructed by compactions.

Another reason to do that is that when writing a snapshot of the latest version, one should write the FileMetadata for a physical sst that is referred to by n virtual ssts, and write that physical sst's FileMetadata only once. This physical sst no longer has a level, since it can be referenced by virtual ssts in different levels.

I would prefer to sequence this work as roughly the following PRs for review:

  • The in-memory and persistent "data-structure" representing virtual ssts: The file metadata and manifest changes, including the encoding/decoding. We need to get this properly nailed down, including the version edit and application interfaces that compactions and ingests will use. Of course, there will be no virtual ssts in a real DB yet.
  • The read path: the iterator changes for reading a virtual sst when encountered by a compaction or a user-facing read.
  • The write path: the ingestion changes that will introduce virtual ssts into a DB and allow for end-to-end testing (including the metamorphic test).

@itsbilal
Contributor

Do we already have an invariant in the code that the level iter will not pass in bounds outside file metadata bounds to the sstable iterators? I couldn't confirm this invariant, so instead I guarded against that case here.

Yes, this is correct. initTableBounds in level_iter is where we do this logic, and we always call this before calling newIters.

Unless I'm misreading what you're saying. The levelIter will clear a bound if the file metadata says that it doesn't need to be checked; if you have a file metadata [a-d] and your iter bounds are [c,j), the levelIter will pass bounds [c, nil) instead of [c, j). For virtual sstables this necessitates additional bounds-checking because there could be keys greater than d that you're masking in the file metadata.
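
A rough sketch of that bound clearing (illustrative only; not the actual initTableBounds code):

// tableBounds drops an iterator bound that the file's own bounds already
// enforce. With file bounds [a, d] and iterator bounds [c, j), the upper
// bound j can never be reached inside this file, so nil is passed instead.
// For a virtual sstable the metadata bounds may mask keys that do exist in
// the backing file, which is why this shortcut needs extra care there.
func tableBounds(cmp func(a, b []byte) int, fileSmallest, fileLargest, lower, upper []byte) (lo, hi []byte) {
	lo, hi = lower, upper
	if lower != nil && cmp(lower, fileSmallest) <= 0 {
		lo = nil // every key in the file is already >= lower
	}
	if upper != nil && cmp(upper, fileLargest) > 0 {
		hi = nil // no key in the file can reach upper
	}
	return lo, hi
}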

If the invariant is not already present, maybe we can try and ensure it in the level iter, but I believe that has its own complexities which @itsbilal ran into. I didn't try that method, so I don't know what the complexities will be.

The complexities I ran into were around changing initTableBounds to enforce virtual sstable bounds, instead of teaching the sstable iterators about those. They aren't particularly complex; you just need to ensure you disable the appropriate merging iter optimizations at those bounds, e.g. isIgnorableBoundaryKey, so you don't end up stalling the merging iterator at a virtual sstable bound (as that doesn't mean you've exhausted the iterator as a whole). You can see all the changes between master and itsbilal/disagg-storage-w-sharing in level_iter.go as a starting point on what it'd take to make that change: master...itsbilal:pebble:disagg-storage-w-sharing#diff-2e057bed983b2f4632a29e106a88c55f496fece30e9b3f6e0717de34544047bf

It seems more intuitive to have the sstable iterators enforce virtual sstable bounds. So I generally like the approach in this PR (minus the virtualReader refactor that I recommended).

As for the breakdown Sumeer suggested, I also think it'd be a good idea to split this PR up. However, seeing as the iter pieces are already here, we could do one PR with just the file metadata changes, one with just the iter (+ file metadata commit) which would basically be this PR, one with the version edit (+ file metadata commit), and one with the ingestion write path.

Contributor Author

@bananabrick bananabrick left a comment

Thanks for the comments. I like the idea of splitting it up.

Reviewable status: 15 of 28 files reviewed, 28 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)

@jbowens
Collaborator

jbowens commented Feb 21, 2023

It seems more intuitive to have the sstable iterators enforce virtual sstable bounds. So I generally like the approach in this PR (minus the virtualReader refactor that I recommended).

I'm wary of duplicative key comparisons here. A Seek[Prefix]GE already needs to seek among the sstable boundaries (including the virtual bounds) in order to find the table to read.

Our code is already structured to enforce iterator boundaries within the sstable iterator if necessary. The virtual sstable boundaries can be thought of as further constraining those boundaries within the context of the individual sstable. I think structuring the code to maintain a single set of "effective" bounds, and only letting virtual bounds constrain those effective bounds is simpler.
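
A minimal sketch of the single-set-of-effective-bounds idea (names are made up):

// effectiveBounds narrows the caller's bounds by the virtual sstable's
// bounds, so the sstable iterator only ever checks one pair of bounds.
func effectiveBounds(cmp func(a, b []byte) int, lower, upper, virtLower, virtUpper []byte) (lo, hi []byte) {
	lo, hi = lower, upper
	if lo == nil || (virtLower != nil && cmp(virtLower, lo) > 0) {
		lo = virtLower
	}
	if hi == nil || (virtUpper != nil && cmp(virtUpper, hi) < 0) {
		hi = virtUpper
	}
	return lo, hi
}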

@itsbilal
Contributor

I'm wary of duplicative key comparisons here. A Seek[Prefix]GE already needs to seek among the sstable boundaries (including the virtual bounds) in order to find the table to read.

That's a good point, key comparisons can be pretty inefficient. In that case I'm good with consolidating this logic higher up, i.e. in the levelIter around or inside initTableBounds. That's the approach the prototype took as well. There shouldn't be any additional challenges beyond what the prototype has already worked around, so at least that should speed things up.

@bananabrick bananabrick changed the title db: Add initial virtual sstable support db: add infrastructure to read from virtual sstables Apr 4, 2023
@bananabrick bananabrick marked this pull request as draft April 4, 2023 01:56
@bananabrick
Contributor Author

Need to respond to some review here + add some tests.

Contributor Author

@bananabrick bananabrick left a comment

Reviewable status: 1 of 34 files reviewed, 18 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)


table_cache.go line 179 at r5 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

This one function should operate on both metadatas, and we should wrap the physical reader with a virtual reader if the file is virtual. It might make sense to have an sstable.Reader interface, and physical/virtual reader implementations. That way we:

  1. only deal with one kind of reader up the stack, the interface type
  2. The table cache only stores physical readers, better matching the 1 table cache entry = 1 open file descriptor semantic we have going right now. We'd probably need to add logic for this in findNode and getShard so the table cache only operates on physical files.
  3. Also allows virtual sstables to reuse the same open file descriptor from the physical sstable

We'd probably need to rethink the table cache-to-open-FD mapping for shared.Storage sstables, but for now this makes the most sense and keeps us from running out of table cache room if we spawn too many virtual sstables.

I like the fact that the VirtualReader and the physical reader are separate.


version_set.go line 96 at r5 (raw file):

Previously, sumeerbhola wrote…

This is changing the meaning of zombie tables which were supposed to track tables that were not referenced in the latest version, so that we could realize that there was something like a long-lived Iterator that was holding onto an older version.
We had discussed a way to preserve this definition in the RFC PR, which was written in the RFC as:

type PhysicalState struct {
  // Total refs across all virtual ssts * versions. That is, if the same virtual
  // sst is present in multiple versions, it may have multiple refs, if the
  // btree node is not the same.
  totalRefs int32

  // Number of virtual ssts in the latest version that refer to this physical
  // SST. Will be 1 if there is only a physical sst, or there is only 1 virtual
  // sst referencing this physical sst.
  // INVARIANT: refsInLatestVersion <= totalRefs
  // refsInLatestVersion == 0 is a zombie sstable.
  refsInLatestVersion int32

I don't see the corresponding code.
Am I missing something?

Done.


internal/manifest/level_metadata.go line 474 at r5 (raw file):

Previously, sumeerbhola wrote…

are you saying "currently will not work", but we will fix it to work?

Done.


internal/manifest/version.go line 40 at r5 (raw file):

Previously, sumeerbhola wrote…

Were these removed because they are not needed?
IIRC, we pass this struct to users of Pebble via EventListener.

Done.

Code quote:

	// SmallestSeqNum is the smallest sequence number in the table.
	SmallestSeqNum uint64
	// LargestSeqNum is the largest sequence number in the table.
	LargestSeqNum uint64

internal/manifest/version.go line 216 at r5 (raw file):

Previously, sumeerbhola wrote…

Worth clarifying what CreationTime means for a virtual sst.

Done.


internal/manifest/version.go line 250 at r5 (raw file):

Previously, sumeerbhola wrote…

One simple compaction heuristic suggestion is to maintain a VirtualizedSize on physical file metadata that represents the sum of the virtual size estimates of all virtual files backed by the physical file. Whenever a virtual file becomes obsolete, the virtual file size estimate is removed from the physical file's VirtualizedSize.

+1 to this. I am doing something similar in my prototyping of blob files to decide when to rewrite them.

// LiveValueSize is the sum of the length of uncompressed values in this
// blob file that are still live (i.e., referred to by sstables in the
// latest version).

We only want the virtual files in the latest version contributing to this. And this is in-memory state (not written in a version edit), since it can be reconstructed based on the references.

Done.


internal/manifest/version.go line 330 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

imo it would be a little cleaner to expose Ref() int32, Unref() int32 and Refs() int32 methods like we do on *Version, and let how we ref track be an implementation detail.

Done.


internal/manifest/version.go line 1042 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

should we make the callee (e.g., versionSet.obsoleteFn) filter out virtual files? It would (a) avoid the duplication between here and UnrefLocked and (b) avoid unnecessarily materializing an ephemeral slice of physical obsolete sstables here.

If we want to provide the PhysicalFileMeta type safety for code outside of internal/manifest, we could expose a manifest.ForAllPhysicalFiles([]FileMetadata, func(PhysicalFileMeta)) method or the like

Done.


internal/manifest/version_edit.go line 109 at r5 (raw file):

Previously, sumeerbhola wrote…

Another reason to do that is that when writing a snapshot of the latest version, one should write the FileMetadata for a physical sst that is referred to by n virtual ssts, and write that physical sst's FileMetadata only once. This physical sst no longer has a level, since it can be referenced by virtual ssts in different levels.

I would prefer to sequence this work as roughly the following PRs for review:

  • The in-memory and persistent "data-structure" representing virtual ssts: The file metadata and manifest changes, including the encoding/decoding. We need to get this properly nailed down, including the version edit and application interfaces that compactions and ingests will use. Of course, there will be no virtual ssts in a real DB yet.
  • The read path: the iterator changes for reading a virtual sst when encountered by a compaction or a user-facing read.
  • The write path: the ingestion changes that will introduce virtual ssts into a DB and allow for end-to-end testing (including the metamorphic test).

Done.

Contributor Author

@bananabrick bananabrick left a comment

Reviewable status: 1 of 34 files reviewed, 18 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)


table_cache.go line 179 at r5 (raw file):

I like the fact that the VirtualReader and the physical reader are separate.
This was accidentally published before I added additional context.

@bananabrick
Contributor Author

Think there are test failures due to a merge conflict, since the tests don't fail locally. Looking.

@bananabrick bananabrick marked this pull request as ready for review April 18, 2023 23:40
Contributor Author

@bananabrick bananabrick left a comment

Thanks for the previous review. Meandered quite a bit before coming back to this, but it's back on track now.

Reviewable status: 0 of 36 files reviewed, 17 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)


table_cache.go line 179 at r5 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

I like the fact that the VirtualReader and the physical reader are separate.
This was accidentally published before I added additional context.

I implemented this here: #2454, and left a comment on why I think we shouldn't use interfaces.


sstable/reader.go line 490 at r5 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

nit: can avoid two comparisons by saving the value of i.cmp before this line. Also I'd prefer to inline this everywhere.

Done.


sstable/reader.go line 993 at r5 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Currently this isn't handling the case where i.upper is referencing an exclusive sentinel key as the virtual sstable upper bound. We'd want to either ensure virtualLast is not called if the virtual sst ends in an exclusive sentinel key (and instead we'd want to call SeekLT(i.upper)), or we'd need another bool to denote this case and handle it in virtualLast.

Done.


sstable/reader.go line 1020 at r5 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

This won't always be true. The virtual sst could be truncated at a rangedel/rangekey exclusive sentinel key, in which case i.endKeyInclusive would be false and we'd want to virtualLast regardless.

Done.


sstable/reader.go line 2200 at r5 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Same issue as above

Done.


sstable/reader.go line 2776 at r5 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I second this suggestion. It is an anti-pattern to have iterators hold onto two readers, only one of which is actually doing the reading. We can replace all the i.vReader != nil with i.reader.IsVirtual().

IMO, making a virtual capable reader doesn't make the code cleaner. The vast majority of places only need a Reader. The virtual reader (which is only a lightweight wrapper) only comes into play in the table cache to create iterators, and when tableCacheContainer.withVirtualReader is called.

The iterator wasn't really holding on to two readers. The iterator was making all reads through the Reader. To clarify this, I've removed the virtual reader from the iterator.


sstable/reader.go line 2893 at r5 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

lastKeyInclusive = currEndInclusive && !(cmp(end, v.upper.UserKey) == 0 && v.upper.IsExclusiveSentinel()).

Done.


sstable/reader.go line 2897 at r5 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

lastKeyInclusive = !v.upper.IsExclusiveSentinel()

Done.

Contributor Author

@bananabrick bananabrick left a comment

Reviewable status: 0 of 36 files reviewed, 18 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)


sstable/reader.go line 1019 at r11 (raw file):

) (*InternalKey, base.LazyValue) {
	if i.vState != nil {
		// Might have to fix upper bound since virtual sstable bounds are not

IMO, this sort of bounds constraining makes sense outside of the level iter, because users of VirtualReaders should be able to create iterators and have the bounds constrained. That said, if this isn't performant for the level iter case, then we can ensure additional invariants from the level iter.

Contributor

@itsbilal itsbilal left a comment

Looked through most of this, and it's looking very good. Splitting up the PR made this much easier.

Reviewed 3 of 28 files at r3, 1 of 25 files at r6, 5 of 26 files at r8, 14 of 21 files at r9, 1 of 6 files at r10, 7 of 10 files at r11, all commit messages.
Reviewable status: 31 of 36 files reviewed, 26 unresolved discussions (waiting on @bananabrick, @jbowens, @RaduBerinde, and @sumeerbhola)


table_cache.go line 179 at r5 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

I implemented this here: #2454, and left a comment on why I think we shouldn't use interfaces.

Makes sense. I retract this comment as well in light of the wider conversation since. Sorry for sending you down this rabbit hole!


table_cache.go line 388 at r11 (raw file):

	}

	var virtualReader sstable.VirtualReader

I think a good way to structure this code, instead of having repeated function calls that look exactly the same and if file.Virtual conditionals, is:

// this interface definition can even live inside the method
type iterCreator interface {
  NewRawRangeDelIter() keyspan.Iterator
  NewCompactionIter(....) ...
  NewIterWithBlockPropertyFiltersAndContext(...)
}

var ic iterCreator
ic = v.reader

if file.Virtual {
  ic = sstable.NewVirtualReader(...)
}

// just use ic instead of v.reader

table_cache.go line 576 at r11 (raw file):

	// Note that even if rp.file is a virtual sstable, it should be okay to
	// return an sstable.Reader, because the reader in this case is only used

I would move this to the comment for GetReader, i.e. a callout that "It's the caller's responsibility to not read outside virtual sstable bounds if this reader belongs to a virtual sstable".


table_cache.go line 1050 at r11 (raw file):

type tableCacheNode struct {
	fileNum base.DiskFileNum

I like this simplification much more than the overuse of meta.


table_stats.go line 559 at r11 (raw file):

				var size uint64
				var err error
				if file.Virtual {

Repeated use of this paradigm is sort of what I wanted to avoid with the interface. I wonder if we can still do something like it, eg. sstable.IterFactory interface that has the NewIter methods plus EstimateDiskUsage, and VirtualReader and Reader implementations of it, and a d.tableCache.withIterFactory call if we're interested in an iter factory and don't care whether the underlying thing is virtual or physical.


internal/keyspan/truncate.go line 18 at r11 (raw file):

	start *base.InternalKey,
	end *base.InternalKey,
	endInclusive bool,

I don't see the need for this change? We will never generate a virtual sstable that cuts a rangedel short in practice anyway (if we do in the excise case, we can always just set the new rangedel's end key to the start of the excise span, and the same for the other bound, and that holds true with this logic without any changes). Or in other words, if we want a virtual sstable to truncate range keys/dels at d (exclusive), we'll set the virtual sstable's end bound to the rangedel sentinel at d anyway. And [b,b))

The panic is fine. I just don't see endInclusive being helpful, as end is already an internal key and we take that into account when we sort trailers too.


sstable/reader.go line 490 at r5 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Done.

I don't see it?


sstable/reader.go line 2808 at r11 (raw file):

	lower          InternalKey
	upper          InternalKey
	upperInclusive bool

Semantically it's better to either have lower and upper as user keys as well as upperInclusive bool, or just lower and upper as internal keys where upperInclusive is determined by upper.IsExclusiveSentinel(). I have a strong preference for the former (user keys) as it makes it clear we will never create virtual ssts that split user keys across different sstables.


sstable/reader.go line 2881 at r11 (raw file):

	return keyspan.Truncate(
		v.reader.Compare, iter, v.vState.lower.UserKey, v.vState.upper.UserKey,
		&v.vState.lower, &v.vState.upper, v.vState.upperInclusive,

As mentioned above, we don't need to pass upperInclusive here.


sstable/reader.go line 2911 at r11 (raw file):

	start, end []byte, endInclusive bool,
) (lastKeyInclusive bool, first []byte, last []byte) {
	if start == nil || v.Compare(start, v.lower.UserKey) < 0 {

nit:

first = start
if start == nil || v.Compare(start, v.lower.UserKey) < 0 {
  first = v.lower
}

sstable/reader.go line 2919 at r11 (raw file):

	// Note that we assume that start, end has some overlap with the virtual
	// sstable bounds.
	if end == nil {

nit: I'd move this out and make the stuff in the else if end != nil


sstable/reader.go line 2924 at r11 (raw file):

	} else {
		cmp := v.Compare(end, v.upper.UserKey)
		if cmp == 0 {

nit: switch cmp

@itsbilal
Contributor

I don't see the need for this change? We will never generate a virtual sstable that cuts a rangedel short in practice anyway (if we do in the excise case, we can always just set the new rangedel's end key to the start of the excise span, and the same for the other bound, and that holds true with this logic without any changes). Or in other words, if we want a virtual sstable to truncate range keys/dels at d (exclusive), we'll set the virtual sstable's end bound to the rangedel sentinel at d anyway. And [b,b))

Incomplete sentence at the end of my comment: I meant to say that [b,b) is an empty range anyway, so we will never have to emit something like it. The other issue with inclusive end handling in keyspan.Truncate is that it has no way to generate ImmediateSuccessor(b), which would be the correct place to truncate an inclusive-end range key/del ending at b.

Contributor Author

@bananabrick bananabrick left a comment

Reviewable status: 23 of 36 files reviewed, 23 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)


table_cache.go line 388 at r11 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I think a good way to structure this code, instead of having repeated function calls that look exactly the same and if file.Virtual conditionals, is:

// this interface definition can even live inside the method
type iterCreator interface {
  NewRawRangeDelIter() keyspan.Iterator
  NewCompactionIter(....) ...
  NewIterWithBlockPropertyFiltersAndContext(...)
}

var ic iterCreator
ic = v.reader

if file.Virtual {
  ic = sstable.NewVirtualReader(...)
}

// just use ic instead of v.reader

Done.


table_cache.go line 576 at r11 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I would move this to the comment for GetReader, i.e. a callout that "It's the caller's responsibility to not read outside virtual sstable bounds if this reader belongs to a virtual sstable".

Done.


table_stats.go line 559 at r11 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Repeated use of this paradigm is sort of what I wanted to avoid with the interface. I wonder if we can still do something like it, eg. sstable.IterFactory interface that has the NewIter methods plus EstimateDiskUsage, and VirtualReader and Reader implementations of it, and a d.tableCache.withIterFactory call if we're interested in an iter factory and don't care whether the underlying thing is virtual or physical.

I don't want to muddle the code with too many interfaces for this, but if you feel strongly about it, then I'll implement it. Leaving it as is until the next round of review.


internal/keyspan/truncate.go line 18 at r11 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I don't see the need for this change? We will never generate a virtual sstable that cuts a rangedel short in practice anyway (if we do in the excise case, we can always just set the new rangedel's end key to the start of the excise span, and the same for the other bound, and that holds true with this logic without any changes). Or in other words, if we want a virtual sstable to truncate range keys/dels at d (exclusive), we'll set the virtual sstable's end bound to the rangedel sentinel at d anyway. And [b,b))

The panic is fine. I just don't see endInclusive being helpful, as end is already an internal key and we take that into account when we sort trailers too.

Yea, you're right. We look at trailers while comparing the end key, so we don't need endInclusive.


sstable/reader.go line 490 at r5 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I don't see it?

Forgot to push those changes. Pushed now.


sstable/reader.go line 2808 at r11 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Semantically it's better to either have lower and upper as user keys as well as upperInclusive bool, or just lower and upper as internal keys where upperInclusive is determined by upper.IsExclusiveSentinel(). I have a strong preference for the former (user keys) as it makes it clear we will never create virtual ssts that split user keys across different sstables.

You're right that we don't need to pass in endInclusive, because we're comparing trailers in Truncate, which does the same job.

I went with the second option you suggested.


sstable/reader.go line 2881 at r11 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

As mentioned above, we don't need to pass upperInclusive here.

Done.

Copy link
Member

@RaduBerinde RaduBerinde left a comment


Reviewable status: 23 of 36 files reviewed, 26 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @sumeerbhola)


db.go line 1894 at r13 (raw file):

// storing the range `[start, end]`. The estimation is computed as follows:
//
//   - For sstables fully contained in the range the whole file size is included.

[nit] this deserves a note for virtual ssts


table_cache.go line 522 at r13 (raw file):

			v.reader, file.VirtualMeta(),
		)
		iter, err = virtualReader.NewRawRangeKeyIter()

[nit] we should add this to iterCreator too


internal/keyspan/truncate.go line 18 at r13 (raw file):

	start *base.InternalKey,
	end *base.InternalKey,
	panicOnTruncation bool,

[nit] A Truncate iterator that panics on truncation is a weird construct (conceptually speaking).

Maybe rename Truncate to Clip or Restrict? And/or panicOnTruncation to panicOnPartialOverlap?


sstable/reader.go line 2815 at r13 (raw file):

// NewVirtualReader is used to construct a reader which can read from virtual
// sstables.
func NewVirtualReader(reader *Reader, meta manifest.VirtualFileMeta) VirtualReader {

[nit] usually we use Make instead of New when we don't return a pointer

@bananabrick bananabrick requested a review from itsbilal April 21, 2023 00:27
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment


(drive-by)

Reviewed 2 of 21 files at r9, 1 of 6 files at r10, 1 of 2 files at r12, 5 of 8 files at r13.
Reviewable status: 29 of 36 files reviewed, 28 unresolved discussions (waiting on @bananabrick, @itsbilal, and @jbowens)


table_cache.go line 1050 at r11 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I like this simplification much more than the overuse of meta.

+1


table_cache.go line 562 at r13 (raw file):

//
// Note that currently the Reader returned here is only used to read value
// blocks This reader shouldn't be used for other purposes like reading keys

nit: missing period after "blocks"


sstable/reader.go line 504 at r13 (raw file):

}

func (i *singleLevelIterator) keyOutOfBounds(key []byte, upper []byte, upperInclusive bool) bool {

does this inline?

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewable status: 28 of 36 files reviewed, 28 unresolved discussions (waiting on @bananabrick, @itsbilal, and @jbowens)


sstable/reader.go line 902 at r5 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

If the invariant is not already present, maybe we can try and ensure it in the level iter, but I believe that has its own complexities which @itsbilal ran into. I didn't try that method, so I don't know what the complexities will be.

I think the use of levelIter.{findFileGE,findFileLT} makes the guarantee that if the seek key is not within the file bounds, it will not get called on that sst.
But there may be some cases (including tools) that will use the sst iterator without a levelIter, so for those we would need to do the comparison. We could optimize for the levelIter case by setting a field when initing these sst iterators.


sstable/reader.go line 993 at r5 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

We could implement a variation of the Last function which checks upper, so that we don't have to perform these SeekGEs and Nexts. But in terms of correctness, this should be okay.

I don't think First and Last are performance sensitive. CockroachDB only uses Seeks. Additionally, if the pebble.Iterator has a lower bound and First is called, it turns into a SeekGE (same for upper-bound and Last).

Copy link
Contributor

@itsbilal itsbilal left a comment


:lgtm: save for the virtualState user key comment, and a few nits. Thanks for getting this through!

Reviewed 1 of 26 files at r8, 1 of 6 files at r10, 2 of 10 files at r11, 2 of 8 files at r13, 2 of 2 files at r14, all commit messages.
Reviewable status: all files reviewed, 29 unresolved discussions (waiting on @bananabrick, @jbowens, and @sumeerbhola)


internal/keyspan/truncate.go line 16 at r14 (raw file):

	iter FragmentIterator,
	lower, upper []byte,
	start *base.InternalKey,

nit: start, end *base.InternalKey


sstable/reader.go line 902 at r5 (raw file):

Previously, sumeerbhola wrote…

I think the use of levelIter.{findFileGE,findFileLT} makes the guarantee that if the seek key is not within the file bounds, it will not get called on that sst.
But there may be some cases (including tools) that will use the sst iterator without a levelIter, so for those we would need to do the comparison. We could optimize for the levelIter case by setting a field when initing these sst iterators.

This can sit in SeekGEFlags, similar to TrySeekUsingNext i.e. constrainedSeekKey or so, that levelIter could always pass true to. I'd be okay with a TODO to handle this case in the future as an optimization.
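A toy sketch of what such a flag could look like, modeled loosely on how TrySeekUsingNext is exposed as a bit in the seek flags; the names here are made up:

package sketch

// seekGEFlags is a toy bitset of hints passed along with a SeekGE call.
type seekGEFlags uint8

const (
  trySeekUsingNext seekGEFlags = 1 << iota
  // seekKeyIsWithinBounds would tell the sstable iterator that the caller
  // (e.g. a levelIter that already consulted the file's bounds) guarantees
  // the seek key lies inside them, so the bounds check can be skipped.
  seekKeyIsWithinBounds
)

func (f seekGEFlags) TrySeekUsingNext() bool      { return f&trySeekUsingNext != 0 }
func (f seekGEFlags) SeekKeyIsWithinBounds() bool { return f&seekKeyIsWithinBounds != 0 }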


sstable/reader.go line 2808 at r11 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

You're right that we don't need to pass in endInclusive, because we're comparing trailers in Truncate, which does the same job.

I went with the second option you suggested.

I still strongly think we should switch to user keys and an endInclusive bool, to make it abundantly clear that we won't split a user key across two VSSTs and to prevent us from ever unintentionally using base.InternalCompare, even by chance. We can just construct a sentinel key for the call to Truncate as necessary. And besides, the call to Truncate is exactly why this is important; we don't want to unintentionally exclude some internal keys for a user key just because Truncate compares trailers while everywhere else we only look at the user key and IsExclusiveSentinel().


sstable/reader.go line 504 at r13 (raw file):

Previously, sumeerbhola wrote…

does this inline?

@bananabrick would be good to add a // gcassert:inline right before this line


sstable/reader.go line 1527 at r14 (raw file):

// SetBounds implements internalIterator.SetBounds, as documented in the pebble
// package. Note that the upper field is exclusive.

I think this tidbit should also be copied to the InternalIterator interface, that only exclusive upper bounds can be specified using SetBounds.


sstable/reader.go line 1540 at r14 (raw file):

	} else {
		// TODO(bananabrick): Figure out the logic here to enable the boundsCmp
		// optimization for virtual sstables.

Do you plan to address this soon? If not, one easy way to address this is to check if lowerFromConstrainedBounds == passedInLower and same for upper, and that endKeyInclusive is false, and then we can use the same logic as before.
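A minimal sketch of the check being suggested; the parameter names are illustrative rather than the iterator's actual fields:

package sketch

import "bytes"

// canReuseBoundsCmp reports whether constraining the caller-supplied bounds
// to the virtual sstable's bounds left them unchanged. If so, the existing
// boundsCmp fast path could be applied exactly as for physical sstables.
func canReuseBoundsCmp(passedLower, passedUpper, constrainedLower, constrainedUpper []byte, endKeyInclusive bool) bool {
  return bytes.Equal(passedLower, constrainedLower) &&
    bytes.Equal(passedUpper, constrainedUpper) &&
    !endKeyInclusive
}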


sstable/reader.go line 1884 at r14 (raw file):

	if i.vState != nil {
		// Callers of SeekGE don't know about virtual sstable bounds, so we may
		// have to internally restrict the bounds.

Let's add a TODO here to also optimize the levelIter case.

Copy link
Collaborator

@jbowens jbowens left a comment


Reviewed 1 of 21 files at r9, 1 of 6 files at r10, 1 of 2 files at r12, 4 of 8 files at r13, 1 of 2 files at r14, all commit messages.
Reviewable status: all files reviewed, 35 unresolved discussions (waiting on @bananabrick, @itsbilal, @RaduBerinde, and @sumeerbhola)


-- commits line 8 at r14:
is this bullet still accurate, or is this referring to the DiskFileNum introduced in an earlier commit?


format_major_version.go line 527 at r14 (raw file):

					// Any physical sstable which has been virtualized must
					// have already undergone this migration, and we don't
					// need to worry about the virtual sstable themselves.

can you add a format major version, eg FormatVirtualSSTables, maybe just commented out for now, and reference it in this comment? we'll be relying on that format major version to not create virtual sstables in any format major versions before this migration.
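Sketched with stand-in types, since the constant doesn't exist yet, the gate would look something like this:

package sketch

// formatMajorVersion stands in for pebble's FormatMajorVersion type.
type formatMajorVersion uint64

const (
  formatEarlierVersion formatMajorVersion = iota + 1
  // formatVirtualSSTables is the hypothetical version at or above which
  // virtual sstables may be created; databases below it never contain them,
  // which is what lets this migration ignore virtual files.
  formatVirtualSSTables
)

// canCreateVirtualSSTables gates virtual sstable creation on the database's
// current format major version.
func canCreateVirtualSSTables(v formatMajorVersion) bool {
  return v >= formatVirtualSSTables
}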


table_cache.go line 388 at r11 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Done.

i think this structure forces an unnecessary allocation for physical sstable readers.

does extracting the logic into a function generic over iterCreator avoid the allocation?


table_cache.go line 408 at r14 (raw file):

		virtualReader := sstable.NewVirtualReader(
			v.reader, file.VirtualMeta(),
		)

every call to newIters needs to suffer an additional allocation for a virtual sstable. we might find that we need a mechanism for caching these virtual readers too so only one exists per virtual sstable.


table_stats.go line 439 at r14 (raw file):

					file.VirtualMeta(),
					func(v sstable.VirtualReader) (err error) {
						fileSum += file.Size

is file.Size the approximate size within the physical file?


table_stats.go line 442 at r14 (raw file):

						entryCount += file.Stats.NumEntries
						keySum += v.Properties.RawKeySize
						valSum += v.Properties.RawValueSize

how are these property values defined for virtual sstables? can you leave an explanatory comment here too?


sstable/reader.go line 210 at r14 (raw file):

	// inclusive while iterating instead of exclusive. This is used for virtual
	// sstable iteration where the file bounds are inclusive.
	endKeyInclusive bool

have we considered always using exclusive upper bounds for virtual sstables? it seems like it could simplify implementation here. we've discussed (#1741) using loose sstable bounds elsewhere. also with Comparer.ImmediateSuccessor, they don't even need to be loose—just recharacterized as an exclusive bound.

Copy link
Contributor Author

@bananabrick bananabrick left a comment


Reviewable status: all files reviewed, 28 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)


-- commits line 8 at r14:

Previously, jbowens (Jackson Owens) wrote…

is this bullet still accurate, or is this referring to the DiskFileNum introduced in an earlier commit?

Done.


format_major_version.go line 527 at r14 (raw file):

Previously, jbowens (Jackson Owens) wrote…

can you add a format major version, eg FormatVirtualSSTables, maybe just commented out for now, and reference it in this comment? we'll be relying on that format major version to not create virtual sstables in any format major versions before this migration.

Will follow up in a new pr.


table_cache.go line 388 at r11 (raw file):

Previously, jbowens (Jackson Owens) wrote…

i think this structure forces an unnecessary allocation for physical sstable readers.

does extracting the logic into a function generic over iterCreator avoid the allocation?

I wrote a benchmark, BenchmarkNewItersAlloc, to see if calling newIters was leading to any allocations, and it doesn't look like it. I think this sort of assignment to a variable with an interface type only performs an allocation if the Reader isn't already a heap-allocated value (there seems to be some special casing for this in the Go runtime).

If I modify the benchmark and run it on a virtual sstable, there is one allocation performed.

I think some additional code here is worth it to avoid the allocation, left a todo to optimize the virtual sstable case.
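For reference, the general shape of such a check, written as a standalone benchmark in a _test.go file rather than the actual BenchmarkNewItersAlloc: storing an already heap-allocated pointer in an interface variable copies only the pointer, so no allocation is expected, whereas wrapping the reader in a fresh value (the virtual case) adds one.

package sketch

import "testing"

type reader struct{ path string }

func (r *reader) kind() string { return "physical" }

// iterCreator is a toy interface the reader is assigned to.
type iterCreator interface{ kind() string }

// BenchmarkInterfaceAssignAlloc reports zero allocs/op because the *reader
// pointer is stored directly in the interface's data word.
func BenchmarkInterfaceAssignAlloc(b *testing.B) {
  r := &reader{path: "000001.sst"}
  b.ReportAllocs()
  var ic iterCreator
  for i := 0; i < b.N; i++ {
    ic = r
  }
  _ = ic
}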


table_cache.go line 522 at r13 (raw file):

Previously, RaduBerinde wrote…

[nit] we should add this to iterCreator too

Not adding this since the iterator interface is causing an allocation for virtual readers.


table_cache.go line 408 at r14 (raw file):

Previously, jbowens (Jackson Owens) wrote…

every call to newIters needs to suffer an additional allocation for a virtual sstable. we might find that we need a mechanism for caching these virtual readers too so only one exists per virtual sstable.

They're really just a lightweight wrapper around a reader, holding on to some fields from the virtual sstable file metadata.


table_stats.go line 439 at r14 (raw file):

Previously, jbowens (Jackson Owens) wrote…

is file.Size the approximate size within the physical file?

Yea. It's computed by calling EstimateDiskUsage on the physical file using the virtual sstable bounds.


table_stats.go line 442 at r14 (raw file):

Previously, jbowens (Jackson Owens) wrote…

how are these property values defined for virtual sstables? can you leave an explanatory comment here too?

They're extrapolated using the TableStats.NumEntries for the virtual sstable, and Reader.Properties.NumEntries for the physical sstable. They're defined in NewVirtualReader. How they're computed is mentioned as a comment in the struct where they're defined.
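Roughly, the extrapolation described amounts to scaling each physical property by the virtual sstable's share of the physical file's entries; a small sketch with illustrative names:

package sketch

// extrapolateProperty scales a property of the backing physical sstable
// (e.g. RawKeySize or RawValueSize) by the fraction of its entries that the
// virtual sstable is estimated to contain.
func extrapolateProperty(physicalValue, physicalNumEntries, virtualNumEntries uint64) uint64 {
  if physicalNumEntries == 0 {
    return 0
  }
  return physicalValue * virtualNumEntries / physicalNumEntries
}

So a virtual sstable estimated to hold a quarter of the physical file's entries would report roughly a quarter of its RawKeySize and RawValueSize.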


sstable/reader.go line 902 at r5 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

This can sit in SeekGEFlags, similar to TrySeekUsingNext i.e. constrainedSeekKey or so, that levelIter could always pass true to. I'd be okay with a TODO to handle this case in the future as an optimization.

Done.


sstable/reader.go line 993 at r5 (raw file):

Previously, sumeerbhola wrote…

I don't think First and Last are performance sensitive. CockroachDB only uses Seeks. Additionally, if the pebble.Iterator has a lower bound and First is called, it turns into a SeekGE (same for upper-bound and Last).

SeekLT will also call this. I think we can optimize this away when the level iter is using the sstable iterators.


sstable/reader.go line 2808 at r11 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I still strongly think we should switch to user keys and an endInclusive bool, to make it abundantly clear that we won't split a user key across two VSSTs and to prevent us from unintentionally ever using base.InternalCompare even by chance. We can just construct a sentinel key for the call to Truncate as necessary. And besides the call to Truncate is exactly why this is important; we don't want to unintentionally exclude some internal keys for a user key just because Truncate compares trailers while everywhere else we only look at User key and IsExclusiveSentinel().

The call to Truncate requires both the smallest and largest internal keys. Wouldn't we need to store the trailer for the smallest key anyway? I've made a note to discuss this once you're back.


sstable/reader.go line 504 at r13 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

@bananabrick would be good to add a // gcassert:inline right before this line

No, this does not inline. Will inline this in a separate commit.


sstable/reader.go line 210 at r14 (raw file):

Previously, jbowens (Jackson Owens) wrote…

have we considered always using exclusive upper bounds for virtual sstables? it seems like it could simplify implementation here. we've discussed (#1741) using loose sstable bounds elsewhere. also with Comparer.ImmediateSuccessor, they don't even need to be loose—just recharacterized as an exclusive bound.

Will leave this as is for now since I've already implemented it. I haven't considered creating a loose upper bound. Maybe the IsExclusiveSentinel function for virtual sstable bounds can always return true, but I haven't considered all the ramifications. My worry is that it would simplify the sstable iterator code but complicate invariants elsewhere. To be fair, I really haven't thought it through, so maybe I'm wrong.


sstable/reader.go line 1540 at r14 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Do you plan to address this soon? If not, one easy way to address this is to check if lowerFromConstrainedBounds == passedInLower and same for upper, and that endKeyInclusive is false, and then we can use the same logic as before.

Yea, I've made a note to address todos soon.

@bananabrick bananabrick dismissed sumeerbhola’s stale review May 3, 2023 20:05

Addressed review but sumeer is OOO

- Support reading from virtual sstables through a Virtual
sstable reader.
- Support reading virtual sstables through the table cache.

RFC: https://github.com/cockroachdb/pebble/blob/master/docs/RFCS/20221122_virtual_sstable.md