Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfc: virtual sstables in the ingestion path #2116

Merged
merged 1 commit into from
Nov 22, 2022
Merged

rfc: virtual sstables in the ingestion path #2116

merged 1 commit into from
Nov 22, 2022

Conversation

bananabrick
Copy link
Contributor

@bananabrick bananabrick commented Nov 9, 2022

Issue: #1683

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@bananabrick
Copy link
Contributor Author

Need to fix a few todos before review

Copy link
Contributor Author

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be good for review. I have some minor todos in there, but they shouldn't hinder review.

The RFC discusses the components required to ensure correctness. It doesn't delve into precise implementation details. So, for Reader.NewRawRangeKeyIter we mention that we need to perform filtering within the fragmentBlockIter, but we don't touch on the precise algorithm to perform such filtering.

As the implementation progresses, we can probably sneak those details into the RFC if necessary.

Reviewable status: 0 of 1 files reviewed, all discussions resolved

@bananabrick bananabrick requested review from itsbilal, sumeerbhola, jbowens and a team November 14, 2022 20:37
@bananabrick bananabrick marked this pull request as ready for review November 14, 2022 20:37
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good.
Some of my comments are not about this RFC but ensuring that whatever we do here can be extended by @itsbilal without a lot of refactoring, for the disaggregated shared storage masking use case.

Reviewed 1 of 1 files at r1.
Reviewable status: all files reviewed (commit messages unreviewed), 10 unresolved discussions (waiting on @bananabrick, @itsbilal, and @jbowens)


docs/RFCS/virtual_sstable.md line 25 at r1 (raw file):

files, which have no data overlap with other files in the lsm, lower in the lsm
because of file boundary overlap with files in the lsm. In this case, we are
forces to place files higher in the lsm, sometimes in L0, which can cause higher

nit: forced


docs/RFCS/virtual_sstable.md line 76 at r1 (raw file):

We cannot use virtual sstables in the above scenario for two reasons:

  1. We don't have a quick method of detecting no data overlap.

I think for the virtual sst use case where we will mask data (not this RFC, but we should make sure what we are doing here is easily extendable) we are still good in this case since the first virtual sst will have e and the second one p--r.
@itsbilal


docs/RFCS/virtual_sstable.md line 102 at r1 (raw file):

  would be additional implementation complexity(see the `FilemetaData` section).
  We also want to avoid too many virtual sstables in the lsm as they can lead to
  space amp(see `Compaction` section).

Again, the masking case will need to do this, but seems like an easy extension.


docs/RFCS/virtual_sstable.md line 109 at r1 (raw file):

  1. For CockroachDB's snapshot ingestion, there is one large sst (up to 512MB)
     and many tiny ones. We can choose the apply this splitting logic only for
     the large sst. It is ok for the tiny ssts to be ingested into L0.

For masking (for disaggregated shared storage) we could avoid keeping the local keyspace in blobstore and do the usual snapshot ingest where we lay down a rangedel and ingest a unshared sst.


docs/RFCS/virtual_sstable.md line 150 at r1 (raw file):

metadata field to indicate the file number of the parent sstable. In the latter
case, we can use a few of the most significant bits of the 64 bit file number to
distinguish the virtual sstables.

I thought the latter was a good idea when originally writing #1683 but for the masking use case in disaggregated shared storage we cannot put any limit on the number of times a sst will be virtualized (we will prioritize compactions for small ssts to avoid any performance issues, but we can't make any guarantees there). So I now think the former is better
@itsbilal


docs/RFCS/virtual_sstable.md line 179 at r1 (raw file):

These fields are used for l0 sublevels, pebble tooling, delete compaction hints,
and a lot of plumbing. We don't need to worry about the L0 sublevels use case
because we won't have virtual sstables in L0. For the rest of the use cases

@itsbilal
Hmm, so we if we do virtualize L0 files for the masking case will we need to iterate over it to construct tight seqnum bounds?
I think its worth looking at the L0SubLevels logic again and see if we can manage with these being lower and upper bounds for the sublevel assignment. I suspect it can since these virtual ssts pointing to the same sst will never overlap in their keyspans.


docs/RFCS/virtual_sstable.md line 220 at r1 (raw file):

virtual sstables, we can have a case where the last version which references a
physical sstable is unrefed, but the physical sstable is not added to the
obsolete table list because its `FilemetaData.refs` count is not 0. Since the

I didn't understand this scenario where there is no one referencing the physical sstable but the refs count is not zero. Could you elaborate?


docs/RFCS/virtual_sstable.md line 235 at r1 (raw file):

Deleted files still referenced by older versions are considered zombie sstables.
For the sake of simplicity, we can modify the definition of zombie sstables to
include physical sstables which have been virtualized. When the physical sstable

Don't quite understand this. The clean definition would be that zombie tables are those physical sstables that are not directly or indirectly (via virtual ssts) referenced by the latest version. This is not easy to compute given the incremental computation in BulkVersionEdit.Apply. But I think if we kept an additional refcount pointer in the FileMetadata that counts the number of refs in the latest version, we could compute this incrementally. I think it's important not to overcount zombie ssts.

We should consider having some additional counts related to number of physical ssts that have been virtualized, and number of virtual ssts. It would be useful for diagnosis to know if the latter is much larger than the former.


docs/RFCS/virtual_sstable.md line 261 at r1 (raw file):

We could either have an abstraction over the physical sstable `Reader` per
virtual sstable, or update the `Reader` API to accept file bounds of the
sstable. In the latter case, we would create one `Reader` on the physical

+1 to the latter. We should be caching a single Reader per physical sst.
I think most (all?) of the tableCacheShard methods that lookup or construct a Reader take a FileMetadata parameter, so they can use that to find the Reader for the physical sstable, and then use the bounds in the FileMetadata (if it is a virtual sst) in creating the sstable iterator (in tableCacheShard.newIters).


docs/RFCS/virtual_sstable.md line 303 at r1 (raw file):

sstable `b` still refers to it. In the worst case, sstable `b` will never be
picked for compaction and will never be compacted into and we'll have permanent
space amplification.

This is a good point. I think we need to track stats for this at the DB level. And on a per virtual sst level (so we can decide what to compact).
For the latter, the shared entity across all the virtual FileMetadatas could be a struct, which includes the overall ref, and the sum of the virtual sizes of the ssts pointing to this in the latest Version. As new Versions get installed and virtual ssts are deleted, they will decrement their size from the sum. When the ratio of the virtual sst size to the sum-size becomes small (high space amplification), it should be prioritized for compaction.

Copy link
Contributor Author

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed (commit messages unreviewed), 9 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)


docs/RFCS/virtual_sstable.md line 76 at r1 (raw file):

Previously, sumeerbhola wrote…

I think for the virtual sst use case where we will mask data (not this RFC, but we should make sure what we are doing here is easily extendable) we are still good in this case since the first virtual sst will have e and the second one p--r.
@itsbilal

Yea that makes sense. Would the goal be to mask f-n in every single sstable already existing in the lsm?


docs/RFCS/virtual_sstable.md line 102 at r1 (raw file):

Previously, sumeerbhola wrote…

Again, the masking case will need to do this, but seems like an easy extension.

Yea should be easy as long as we can fall back to the worst case of scanning the sstables in L0 to compute smallest/largest sequence numbers.


docs/RFCS/virtual_sstable.md line 109 at r1 (raw file):

Previously, sumeerbhola wrote…

For masking (for disaggregated shared storage) we could avoid keeping the local keyspace in blobstore and do the usual snapshot ingest where we lay down a rangedel and ingest a unshared sst.

Is this saying that in the case where we're ingesting tiny ssts, we avoid the masking + virtualizing and just write a range delete like we currently do?


docs/RFCS/virtual_sstable.md line 150 at r1 (raw file):

Previously, sumeerbhola wrote…

I thought the latter was a good idea when originally writing #1683 but for the masking use case in disaggregated shared storage we cannot put any limit on the number of times a sst will be virtualized (we will prioritize compactions for small ssts to avoid any performance issues, but we can't make any guarantees there). So I now think the former is better
@itsbilal

Yea, it seems simpler to assign a unique file number to every virtual sst.


docs/RFCS/virtual_sstable.md line 220 at r1 (raw file):

we can have a case where the last version which references a
physical sstable is unrefed, but the physical sstable is not added to the
obsolete table list because its FilemetaData.refs count is not 0. Since the

Example:
There are virtual sstables referring to some physical sstable 'a' and therefore the refs count of the FilemetaData associated with 'a' is > 1. The sstable 'a' is only directly referenced by an older Version 'b'. When version 'b' is unrefed, the refcount of the FilemetaData associated with a will be decremented but still be at least one. We're going to determine whether to add sstable 'a' to the obsolete table list, and we won't because the ref count of sstable 'a' is not 0.


docs/RFCS/virtual_sstable.md line 261 at r1 (raw file):

Previously, sumeerbhola wrote…

+1 to the latter. We should be caching a single Reader per physical sst.
I think most (all?) of the tableCacheShard methods that lookup or construct a Reader take a FileMetadata parameter, so they can use that to find the Reader for the physical sstable, and then use the bounds in the FileMetadata (if it is a virtual sst) in creating the sstable iterator (in tableCacheShard.newIters).

Yea, the latter makes sense to me.


docs/RFCS/virtual_sstable.md line 303 at r1 (raw file):

When the ratio of the virtual sst size to the sum-size becomes small (high space amplification), it should be prioritized for compaction.

This ratio will start small and only get larger as other virtual sstables are deleted. For example, if we have to two virtual sstables 'a' and 'b' with sizes 1MB each. Then for sstable 'a' the initial ratio would be 0.5, and once sstable 'b' is deleted the ratio will become 1. Not sure how that indicates space amp. I would think we'd want to improve the odds of a virtual sstable being picked for a compaction as other virtual sstables with the same backing file are deleted.

We can still use the ratio, but it seems like we'd want to pick a virtual sstable for compaction when the ratio becomes too large?

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed (commit messages unreviewed), 9 unresolved discussions (waiting on @bananabrick, @itsbilal, and @jbowens)


docs/RFCS/virtual_sstable.md line 76 at r1 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Yea that makes sense. Would the goal be to mask f-n in every single sstable already existing in the lsm?

Yes, that would be the goal.


docs/RFCS/virtual_sstable.md line 109 at r1 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Is this saying that in the case where we're ingesting tiny ssts, we avoid the masking + virtualizing and just write a range delete like we currently do?

Correct. We can use the rangedel scheme when we are not importing shared ssts (from another LSM) directly into L5/L6.


docs/RFCS/virtual_sstable.md line 220 at r1 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

we can have a case where the last version which references a
physical sstable is unrefed, but the physical sstable is not added to the
obsolete table list because its FilemetaData.refs count is not 0. Since the

Example:
There are virtual sstables referring to some physical sstable 'a' and therefore the refs count of the FilemetaData associated with 'a' is > 1. The sstable 'a' is only directly referenced by an older Version 'b'. When version 'b' is unrefed, the refcount of the FilemetaData associated with a will be decremented but still be at least one. We're going to determine whether to add sstable 'a' to the obsolete table list, and we won't because the ref count of sstable 'a' is not 0.

I'm still not getting it. Currently FileMetadata.refs is incremented/decremented via the btree in LevelMetadata which lives in a Version. So at most 1 ref per Version for a FileMetadata. Different versions with the same file may share a ref if they share a btree node. We are changing FileMetadata.refs to be a pointer, so that virtual sst FileMetadata will use the same refs pointer. So a LevelMetadata LM2 that has 2 virtual ssts that refer to the same physical file will hold 2 refs. LevelMetadata LM1 that has only 1 physical sst will hold 1 ref. So the total is 3 (let's say the btree nodes were different). When LM1 goes away the refs value will be decremented once, and will equal 2. When LM2 goes away the refs value will be decremented twice and will become 0.
What am I missing?


docs/RFCS/virtual_sstable.md line 303 at r1 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

When the ratio of the virtual sst size to the sum-size becomes small (high space amplification), it should be prioritized for compaction.

This ratio will start small and only get larger as other virtual sstables are deleted. For example, if we have to two virtual sstables 'a' and 'b' with sizes 1MB each. Then for sstable 'a' the initial ratio would be 0.5, and once sstable 'b' is deleted the ratio will become 1. Not sure how that indicates space amp. I would think we'd want to improve the odds of a virtual sstable being picked for a compaction as other virtual sstables with the same backing file are deleted.

We can still use the ratio, but it seems like we'd want to pick a virtual sstable for compaction when the ratio becomes too large?

Yes, something like that, but I didn't mean to define the ratio using the size of 'a'. The ratio is one value defined for the physical sst in the latest version. The virtual ssts "inherit" that value too, since they are the ones who will need to be picked in compactions. More specifically, I was imagining something like:

type FileMetadata struct {
  // State needed by both physical and virtual ssts and that is different for each.
  ...
  
  // Non-nil if virtual sst. 
  physicalSST *FileMetadata
  
  // Non-nil if physical sst.
  physicalState *physicalState
}

type physicalState struct {
  // Total refs across all virtual ssts * versions. That is, if the same virtual sst is
  // present in multiple versions, it may have multiple refs, if the btree node is not
  // the same.
  totalRefs int32
  
  // Number of virtual ssts in the latest version that refer to this physical SST. Will
  // be 1 if there is only a physical sst, or there is only 1 virtual sst referencing
  // this physical sst.
  // INVARIANT: refsInLatestVersion <= totalRefs
  // refsInLatestVersion == 0 is a zombie sstable.
  refsInLatestVersion int32

  fileSize uint64

  // If sst is not virtualized and in latest version virtualSizeSumInLatestVersion == fileSize.
  // If virtualSizeSumInLatestVersion > 0 and virtualSizeSumInLatestVersion/fileSize is very small,
  // the corresponding virtual sst(s) should be candidates for compaction. These candidates
  // can be tracked via btree annotations.
  // Incrementlly updated in BulkVersionEdit.Apply, when updating refsInLatestVersion.
  virtualSizeSumInLatestVersion uint64
}

Copy link
Member

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did one pass-through, looking pretty good!

Reviewable status: all files reviewed (commit messages unreviewed), 11 unresolved discussions (waiting on @bananabrick, @jbowens, and @sumeerbhola)


docs/RFCS/virtual_sstable.md line 67 at r1 (raw file):

There are cases where the ingesting sstables have no data overlap with existing
sstables, but we can't make use of virtual sstables. Consider:

Probably also worth mentioning here (or in the section before) that the easier-to-solve case happens very regularly when an sstable spans a range boundary (which pebble has no knowledge of), and we ingest a snapshot of a range in between the two already-present ranges. The harder case over here is less likely to happen.


docs/RFCS/virtual_sstable.md line 76 at r1 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Yea that makes sense. Would the goal be to mask f-n in every single sstable already existing in the lsm?

@bananabrick Yes, we'd mask the entirety of f-n in every sstable in the LSM. You could possibly mention this here as a feature that would be supported with the virtual sstable work (i.e. not every physical sstable key will belong to a virtual sstable), and some , and that uses of it will come later.


docs/RFCS/virtual_sstable.md line 102 at r1 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Yea should be easy as long as we can fall back to the worst case of scanning the sstables in L0 to compute smallest/largest sequence numbers.

The masking case will also, in the common disaggregated storage CockroachDB use case, only split one L0 sstable into at most two virtual sstables. But yeah not important to mention that here.


docs/RFCS/virtual_sstable.md line 109 at r1 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Is this saying that in the case where we're ingesting tiny ssts, we avoid the masking + virtualizing and just write a range delete like we currently do?

@bananabrick that's correct, the tiny ssts will be the local keyspace in a snapshot ingest.


docs/RFCS/virtual_sstable.md line 150 at r1 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Yea, it seems simpler to assign a unique file number to every virtual sst.

I also agree we should use a new FileNum, incrementally assigned like all other FileNums. An additional field can hold the physical FileNum. That also makes it easier to incorporate into the disaggregated storage case (where the physical file could be owned by another node).


docs/RFCS/virtual_sstable.md line 179 at r1 (raw file):

Previously, sumeerbhola wrote…

@itsbilal
Hmm, so we if we do virtualize L0 files for the masking case will we need to iterate over it to construct tight seqnum bounds?
I think its worth looking at the L0SubLevels logic again and see if we can manage with these being lower and upper bounds for the sublevel assignment. I suspect it can since these virtual ssts pointing to the same sst will never overlap in their keyspans.

That's accurate, we can get away with a looser seqnum bound in this case because the split files will never by definition overlap in user keys. Also easier to do now that we never split user keys across multiple files.


docs/RFCS/virtual_sstable.md line 287 at r1 (raw file):

   could filter keys above the `fragmentBlockIter` or add filtering within the
   `fragmentBlockIter`. To add filtering within the `fragmentBlockIter` we will
   initialize it with two additional `lower/upper []byte` fields.

In addition to this, you'd also need to update the SetBounds logic to never set the bounds of an sstable iterator (either singleLevelIterator or twoLevelIterator) to be outside the bounds of the virtual sstable. The bounds code does a lot of optimizations to just stop bounds-checking if the entirety of an sstable is within the bounds, but for virtual sstables you'd still need to keep bounds checks around because even if the entirety of a virtual sstable is within bounds, the physical sstable could span outside it, and you don't want to unintentionally surface keys that are masked or in other virtual sstables.

Copy link
Contributor Author

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 11 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)


docs/RFCS/virtual_sstable.md line 67 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Probably also worth mentioning here (or in the section before) that the easier-to-solve case happens very regularly when an sstable spans a range boundary (which pebble has no knowledge of), and we ingest a snapshot of a range in between the two already-present ranges. The harder case over here is less likely to happen.

Done.


docs/RFCS/virtual_sstable.md line 76 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

@bananabrick Yes, we'd mask the entirety of f-n in every sstable in the LSM. You could possibly mention this here as a feature that would be supported with the virtual sstable work (i.e. not every physical sstable key will belong to a virtual sstable), and some , and that uses of it will come later.

Done.


docs/RFCS/virtual_sstable.md line 102 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

The masking case will also, in the common disaggregated storage CockroachDB use case, only split one L0 sstable into at most two virtual sstables. But yeah not important to mention that here.

Added a bit about the masking case use case in L0.


docs/RFCS/virtual_sstable.md line 109 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

@bananabrick that's correct, the tiny ssts will be the local keyspace in a snapshot ingest.

Added a bit about the masking case.


docs/RFCS/virtual_sstable.md line 150 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I also agree we should use a new FileNum, incrementally assigned like all other FileNums. An additional field can hold the physical FileNum. That also makes it easier to incorporate into the disaggregated storage case (where the physical file could be owned by another node).

Done.


docs/RFCS/virtual_sstable.md line 179 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

That's accurate, we can get away with a looser seqnum bound in this case because the split files will never by definition overlap in user keys. Also easier to do now that we never split user keys across multiple files.

Done.


docs/RFCS/virtual_sstable.md line 220 at r1 (raw file):
When LM1 goes away the refcount will be decremented to 2. Since the refcount isn't 0, the physical sst number will not be added to the obsolete table list and it shouldn't be.

Since virtual sstables keep track of their parent physical sstable, we can just add the physical sstable
to the obsolete table list when the last virtual sstable which references it is deleted.

So we'd have to change the code a bit to make sure that the physical sst number is added to the obsolete file list when the refcount eventually hits 0 and the refcount will hit 0 when the last virtual sst is unrefed. Maybe I'm missing something.


docs/RFCS/virtual_sstable.md line 235 at r1 (raw file):

Previously, sumeerbhola wrote…

Don't quite understand this. The clean definition would be that zombie tables are those physical sstables that are not directly or indirectly (via virtual ssts) referenced by the latest version. This is not easy to compute given the incremental computation in BulkVersionEdit.Apply. But I think if we kept an additional refcount pointer in the FileMetadata that counts the number of refs in the latest version, we could compute this incrementally. I think it's important not to overcount zombie ssts.

We should consider having some additional counts related to number of physical ssts that have been virtualized, and number of virtual ssts. It would be useful for diagnosis to know if the latter is much larger than the former.

Updated.


docs/RFCS/virtual_sstable.md line 287 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

In addition to this, you'd also need to update the SetBounds logic to never set the bounds of an sstable iterator (either singleLevelIterator or twoLevelIterator) to be outside the bounds of the virtual sstable. The bounds code does a lot of optimizations to just stop bounds-checking if the entirety of an sstable is within the bounds, but for virtual sstables you'd still need to keep bounds checks around because even if the entirety of a virtual sstable is within bounds, the physical sstable could span outside it, and you don't want to unintentionally surface keys that are masked or in other virtual sstables.

Done.


docs/RFCS/virtual_sstable.md line 303 at r1 (raw file):
Updated. Didn't add all the code details from the struct, but wrote a paragraph on how we'll manage zombie sstables and compaction picking.

// Non-nil if virtual sst.
physicalSST *FileMetadata

We seem to be accumulating many fields which are optional in the file metadata. Each of these occupy 8 bytes per file metadata. I wonder if we should embed a struct in the DB struct for these optional fields.

Something like:

type DB struct {

    optionalFileMeta struct {
          physicalState map[FileNum]physicalState
          parentSST map[FileNum]FileNum

          // Other range key state
          ...

          // Other L0 sublevel state
          ...
}

Could save > 32 bytes per FilemetaData overall in the general case where an sst is not virtual, not in L0, doesn't have range keys.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @bananabrick, @itsbilal, and @jbowens)


docs/RFCS/virtual_sstable.md line 220 at r1 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

When LM1 goes away the refcount will be decremented to 2. Since the refcount isn't 0, the physical sst number will not be added to the obsolete table list and it shouldn't be.

Since virtual sstables keep track of their parent physical sstable, we can just add the physical sstable
to the obsolete table list when the last virtual sstable which references it is deleted.

So we'd have to change the code a bit to make sure that the physical sst number is added to the obsolete file list when the refcount eventually hits 0 and the refcount will hit 0 when the last virtual sst is unrefed. Maybe I'm missing something.

Maybe we are saying the same thing. I was objecting to the phrasing:

With virtual sstables, we can have a case where the last version which references a physical sstable is unrefed, but the physical sstable is not added to the obsolete table list because its FilemetaData.refs count is not 0.

We decrement FileMetadata refs when the btree node is being deleted because no version is referencing the node, so I still don't see a case where there is no version referencing a file and the refs is > 0.


docs/RFCS/virtual_sstable.md line 303 at r1 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Updated. Didn't add all the code details from the struct, but wrote a paragraph on how we'll manage zombie sstables and compaction picking.

// Non-nil if virtual sst.
physicalSST *FileMetadata

We seem to be accumulating many fields which are optional in the file metadata. Each of these occupy 8 bytes per file metadata. I wonder if we should embed a struct in the DB struct for these optional fields.

Something like:

type DB struct {

    optionalFileMeta struct {
          physicalState map[FileNum]physicalState
          parentSST map[FileNum]FileNum

          // Other range key state
          ...

          // Other L0 sublevel state
          ...
}

Could save > 32 bytes per FilemetaData overall in the general case where an sst is not virtual, not in L0, doesn't have range keys.

I don't think optional fields should be emdedded in the DB struct. But we should move them into single struct pointers in FileMetadata. I am all for FileMetadata cleanup since we already have optional stuff there like in #2047.


docs/RFCS/virtual_sstable.md line 81 at r2 (raw file):

Note that in Cockroach, the easier to solve case where an ingesting sstable can
slide in between two existing sstables is more likely to happen. It occurs when

nit: "slide in between two existing sstables" sounds like we don't need virtual sstables. Did you mean that the ingesting sstable overlaps with the bounds of one existing sstable and can slide in between two virtual ssts constructed from that existing sstable?

I prefer @itsbilal 's wording in his comment, since it talks about the actual CockroachDB scenario: two existing ranges with one sstable spanning the two ranges, and a new range being added that falls between these two ranges.

Copy link
Member

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: modulo open comments

Reviewed all commit messages.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @bananabrick, @jbowens, and @sumeerbhola)


docs/RFCS/virtual_sstable.md line 28 at r2 (raw file):

read-amp and unnecessary write-amp as the file is moved lower down the lsm. See
https://github.com/cockroachdb/cockroach/issues/80589 for the problem occurring
in practice.

Not a big deal, but it's worthwhile mentioning here that this design will eventually also be used to enable the disaggregated storage use-case. You could link to that RFC: https://github.com/cockroachdb/cockroach/pull/70419/files


docs/RFCS/virtual_sstable.md line 81 at r2 (raw file):

Previously, sumeerbhola wrote…

nit: "slide in between two existing sstables" sounds like we don't need virtual sstables. Did you mean that the ingesting sstable overlaps with the bounds of one existing sstable and can slide in between two virtual ssts constructed from that existing sstable?

I prefer @itsbilal 's wording in his comment, since it talks about the actual CockroachDB scenario: two existing ranges with one sstable spanning the two ranges, and a new range being added that falls between these two ranges.

Yeah I agree that "slide in between two existing sstables" is confusing. Feel free to copy the wording in my comment.


docs/RFCS/virtual_sstable.md line 120 at r2 (raw file):

     higher level is smaller, so there is lower risk of littering the LSM with
     long-lived small virtual ssts.
  3. For disaggregated storage implementation, we can avoid masking for tiny

I believe this is the first mention of masking / excising but it's not defined otherwise in this RFC. You could end this sentence with "more details on that are out of scope for this RFC" to avoid questions about this, and that also avoids you having to define what you mean by masking.


docs/RFCS/virtual_sstable.md line 191 at r2 (raw file):

and an upper bound for the largest seq number work.

For the disaggregated storage masking use case, we would need to support the

I... don't think this is accurate, or in line with my comment above? We don't need tight seqnum bounds for L0 virtual sstables. Either way it's probably a good idea for this RFC to not be overly prescriptive on disaggregated storage just to keep its scope small. I would just delete this paragraph entirely.

Copy link
Contributor Author

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)


docs/RFCS/virtual_sstable.md line 220 at r1 (raw file):

Previously, sumeerbhola wrote…

Maybe we are saying the same thing. I was objecting to the phrasing:

With virtual sstables, we can have a case where the last version which references a physical sstable is unrefed, but the physical sstable is not added to the obsolete table list because its FilemetaData.refs count is not 0.

We decrement FileMetadata refs when the btree node is being deleted because no version is referencing the node, so I still don't see a case where there is no version referencing a file and the refs is > 0.

Clarified the wording. Added a notion of direct/indirect references.


docs/RFCS/virtual_sstable.md line 303 at r1 (raw file):

Previously, sumeerbhola wrote…

I don't think optional fields should be emdedded in the DB struct. But we should move them into single struct pointers in FileMetadata. I am all for FileMetadata cleanup since we already have optional stuff there like in #2047.

Done.


docs/RFCS/virtual_sstable.md line 120 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I believe this is the first mention of masking / excising but it's not defined otherwise in this RFC. You could end this sentence with "more details on that are out of scope for this RFC" to avoid questions about this, and that also avoids you having to define what you mean by masking.

Done.


docs/RFCS/virtual_sstable.md line 191 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I... don't think this is accurate, or in line with my comment above? We don't need tight seqnum bounds for L0 virtual sstables. Either way it's probably a good idea for this RFC to not be overly prescriptive on disaggregated storage just to keep its scope small. I would just delete this paragraph entirely.

Done.

@bananabrick bananabrick merged commit 936e011 into cockroachdb:master Nov 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants