rfc: virtual sstables in the ingestion path
Issue: #1683
bananabrick committed Nov 22, 2022
1 parent fcf9e40 commit 936e011
Showing 1 changed file with 366 additions and 0 deletions.
366 changes: 366 additions & 0 deletions docs/RFCS/virtual_sstable.md
@@ -0,0 +1,366 @@
- Feature Name: Virtual sstables
- Status: draft
- Start Date: 2022-10-27
- Authors: Arjun Nair
- RFC PR:
- Pebble Issues:
https://github.com/cockroachdb/pebble/issues/1683


**Design Draft**

# Summary

This RFC outlines a design for virtualizing physical sstables
in Pebble.

A virtual sstable has no associated physical data on disk, and is instead backed
by an existing physical sstable. Each physical sstable may be shared by one or
more virtual sstables.

Initially, the design will be used to lower the read-amp and the write-amp
caused by certain ingestions. Sometimes, an ingested file has no data overlap
with the files in a lower level of the lsm, but its boundaries overlap with the
boundaries of a file in that level, so the file cannot be placed there. In this
case, we are forced to place the file higher in the lsm, sometimes in L0, which
can cause higher read-amp, and unnecessary write-amp as the file is later
compacted down the lsm. See
https://github.com/cockroachdb/cockroach/issues/80589 for the problem occurring
in practice.

Eventually, the design will also be used for the disaggregated storage masking
use-case: https://github.com/cockroachdb/cockroach/pull/70419/files.

This document describes the design of virtual sstables in Pebble with enough
detail to aid the implementation and code review.

# Design

### Ingestion

When an sstable is ingested into Pebble, we try to place it in the lowest level
without any data overlap, or any file boundary overlap. We can make use of
virtual sstables in the cases where we're forced to place the ingested sstable
at a higher level due to file boundary overlap, but no data overlap.

```
                             s2
ingest:                [i-j-------n]
                             s1
L6:            [e---g-----------------p---r]
        a b c d e f g h i j k l m n o p q r s t u v w x y z
```

Consider the sstable s1 in L6 and the ingesting sstable s2. It is clear that
the file boundaries of s1 and s2 overlap, but there is no data overlap as shown
in the diagram. Currently, we will be forced to ingest the sstable s2 into a
level higher than L6. With virtual sstables, we can split the existing sstable
s1 into two sstables s3 and s4 as shown in the following diagram.

```
                 s3         s2         s4
L6:            [e---g]-[i-j-------n]-[p---r]
        a b c d e f g h i j k l m n o p q r s t u v w x y z
```

The sstable s1 will be deleted from the lsm. If s1 was a physical sstable, then
we will keep the file on disk as long as we need to so that it can back the
virtual sstables.
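
The split can be sketched as follows. This is only an illustration: it assumes that the keys of the physical sstable are available as a sorted slice of strings, whereas Pebble would discover the split point with an iterator. The `bounds` type and the `splitAround` function are hypothetical names and are not part of Pebble.

```go
package main

import (
	"fmt"
	"sort"
)

// bounds is a hypothetical inclusive key range [smallest, largest].
type bounds struct {
	smallest, largest string
}

// splitAround returns the bounds of the two virtual sstables obtained by
// splitting a physical sstable around an ingested key range. keys must hold
// the sorted keys of the physical sstable. ok is false if a key of the
// physical sstable falls at or after ingest.smallest and at or before
// ingest.largest, or if the physical sstable does not have keys on both
// sides of the ingested range.
func splitAround(keys []string, ingest bounds) (left, right bounds, ok bool) {
	// Index of the first key >= ingest.smallest.
	i := sort.SearchStrings(keys, ingest.smallest)
	if i == 0 || i == len(keys) || keys[i] <= ingest.largest {
		return bounds{}, bounds{}, false
	}
	left = bounds{smallest: keys[0], largest: keys[i-1]}
	right = bounds{smallest: keys[i], largest: keys[len(keys)-1]}
	return left, right, true
}

func main() {
	// s1 from the diagram holds the keys e, g, p and r; s2 spans [i, n].
	s1 := []string{"e", "g", "p", "r"}
	s3, s4, ok := splitAround(s1, bounds{smallest: "i", largest: "n"})
	fmt.Println(s3, s4, ok)
}
```

With the keys from the diagram, the sketch yields the bounds `[e, g]` for s3 and `[p, r]` for s4.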

There are cases where the ingesting sstables have no data overlap with existing
sstables, but we can't make use of virtual sstables. Consider:
```
                          s2
ingest:          [f-----i-j-------n]
                             s1
L6:            [e---g-----------------p---r]
        a b c d e f g h i j k l m n o p q r s t u v w x y z
```
We cannot use virtual sstables in the above scenario for two reasons:
1. We don't have a quick method of detecting no data overlap.
2. We will be forced to split the sstable in L6 into more than two virtual
sstables, but we want to avoid many small virtual sstables in the lsm.

Note that in Cockroach, the easier-to-solve case, in which the ingested sstable
slides in between the keys of an existing sstable, happens very regularly. It
occurs when an sstable spans a range boundary (which Pebble has no knowledge
of), and we ingest a snapshot of a range in between the two already-present
ranges.

`ingestFindTargetLevel` changes:
- The `ingestFindTargetLevel` function is used to determine the target level
of the file which is being ingested. Currently, this function returns an `int`
which is the target level for the ingesting file. Two additional return
parameters, `[]manifest.NewFileEntry` and `*manifest.DeletedFileEntry`, will be
added to the function.
- If `ingestFindTargetLevel` decides to split an existing sstable into virtual
sstables, then it will return new and deleted entries. Otherwise, it will only
return the target level of the ingesting file.
- Within the `ingestFindTargetLevel` function, the `overlapWithIterator`
function is used to quickly detect data overlap. In the case with file
boundary overlap, but no data overlap, in the lowest possible level, we will
split the existing sstable into virtual sstables and generate the
`NewFileEntry`s and the `DeletedFileEntry`. The `FileMetadata` section
describes how the various fields in the `FileMetadata` will be computed for
the newly created virtual sstables.

- Note that we will not split physical sstables into virtual sstables in L0 for
the use case described in this RFC. The benefit of doing so would be to reduce
the number of L0 sublevels, but the cost would be additional implementation
complexity (see the `FileMetadata` section). We also want to avoid too many
virtual sstables in the lsm as they can lead to space amp (see the `Compactions`
section). However, in the future, for the disaggregated storage masking case,
we would need to support ingestion and use of virtual sstables in L0.

- Note that we may need an upper bound on the number of times an sstable is
split into smaller virtual sstables. We can further reduce the risk of many
small sstables:
1. For CockroachDB's snapshot ingestion, there is one large sst (up to 512MB)
and many tiny ones. We can choose to apply this splitting logic only for
the large sst. It is ok for the tiny ssts to be ingested into L0.
2. Split only if the ingested sst is at least half the size of the sst being
split. So if we have a smaller ingested sst, we will pick a higher level to
split at (where the ssts are smaller). The lifetime of virtual ssts at a
higher level is smaller, so there is lower risk of littering the LSM with
long-lived small virtual ssts.
3. For disaggregated storage implementation, we can avoid masking for tiny
sstables being ingested and instead write a range delete like we currently
do. Precise details on the masking use case are out of the scope of this
RFC.
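
The size heuristic from item 2, combined with the rule that sstables in L0 are never split, could look roughly like the sketch below. The function name `shouldSplit` and its parameters are hypothetical; the RFC does not prescribe a signature.

```go
package main

import "fmt"

// shouldSplit reports whether an existing sstable in the given level should be
// split into virtual sstables to make room for an ingested sstable. It never
// splits in L0, and it only splits if the ingested sstable is at least half
// the size of the existing sstable. Sizes are in bytes.
func shouldSplit(level int, ingestedSize, existingSize uint64) bool {
	if level == 0 {
		return false
	}
	// Round half of the existing size up, so that an odd existingSize is not
	// truncated in favor of splitting.
	half := existingSize/2 + existingSize%2
	return ingestedSize >= half
}

func main() {
	// A 512MB snapshot sstable against a 128MB sstable in L6.
	fmt.Println(shouldSplit(6, 512<<20, 128<<20))
	// A 1MB sstable against the same 128MB sstable.
	fmt.Println(shouldSplit(6, 1<<20, 128<<20))
}
```

If the check fails at one level, the caller would retry at a higher level, where sstables are smaller.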

`ingestApply` changes:
- The new and deleted file entries returned by the `ingestFindTargetLevel`
function will be added to the version edit in `ingestApply`.
- We will appropriately update the `levelMetrics` based on the new information
returned by `ingestFindTargetLevel`.


### `FileMetadata` changes

Each virtual sstable will have a unique file metadata value associated with it.
The metadata may be borrowed from the backing physical sstable, or it may be
unique to the virtual sstable.

This RFC lists the fields in the `FileMetadata` struct with information on
how each field will be populated.

`Atomic.AllowedSeeks`: Field is used for read triggered compactions, and we can
populate this field for each virtual sstable since virtual sstables can be
picked for compactions.

`Atomic.statsValid`: We can set this to true (`1`) when the virtual sstable is
created. On virtual sstable creation we will estimate the table stats of the
virtual sstable based on the table stats of the physical sstable. We can also
set this to `0` and let the table stats job asynchronously compute the stats.

`refs`: This field will be turned into a pointer which will be shared by the
virtual/physical sstables. See the deletion section of the RFC to learn how the
`refs` count will be used.

`FileNum`: We could give each virtual sstable its own file number or share
the file number between all the virtual sstables. In the former case, the virtual
sstables will be distinguished by the file number, and will have an additional
metadata field to indicate the file number of the parent sstable. In the latter
case, we can use a few of the most significant bits of the 64 bit file number to
distinguish the virtual sstables.

The benefit of sharing a single file number among all the virtual sstables of a
physical sstable is that we don't need to use additional space to store the file
number of the backing physical sstable.
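
As an illustration of the shared file number option, the sketch below packs a virtual sstable index into the most significant bits of the 64 bit file number. The 8/56 bit split and the function names are assumptions made for this example only; the RFC does not fix a layout.

```go
package main

import "fmt"

const (
	// physicalBits is the number of low bits holding the physical file number.
	// The remaining 8 high bits hold the virtual sstable index. Hypothetical
	// layout, chosen only for this sketch.
	physicalBits = 56
	physicalMask = 1<<physicalBits - 1
)

// encodeFileNum combines a physical file number with a virtual sstable index.
// Index 0 denotes the physical sstable itself.
func encodeFileNum(physical uint64, virtualIndex uint8) uint64 {
	return uint64(virtualIndex)<<physicalBits | physical&physicalMask
}

// decodeFileNum splits an encoded file number back into its two parts.
func decodeFileNum(encoded uint64) (physical uint64, virtualIndex uint8) {
	return encoded & physicalMask, uint8(encoded >> physicalBits)
}

func main() {
	encoded := encodeFileNum(936, 3)
	physical, virtualIndex := decodeFileNum(encoded)
	fmt.Println(physical, virtualIndex)
}
```

Note that this layout caps the number of virtual sstables per physical sstable and the range of physical file numbers.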

It might make sense to give each virtual sstable its own file number. Virtual
sstables are picked for compactions, and compactions and compaction picking
expect a unique file number for each of the files which it is compacting.
For example, read compactions will use the file number of the file to determine
if a file picked for compaction has already been compacted, the version edit
will expect a different file number for each virtual sstable, etc.

There are direct references to the `FileMetadata.FileNum` throughout Pebble. For
example, the file number is accessed when the `DB.Checkpoint` function is
called. This function iterates through the files in each level of the lsm,
constructs the filepath using the file number, and reads the file from disk. In
such cases, it is important to exclude virtual sstables.

`Size`: We compute this using linear interpolation on the number of blocks in
the parent sstable and the number of blocks in the newly created virtual sstable.
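
A minimal sketch of this interpolation, assuming that data blocks are of roughly uniform size. The function name `estimateVirtualSize` and its parameters are hypothetical.

```go
package main

import "fmt"

// estimateVirtualSize estimates the size in bytes of a virtual sstable by
// scaling the size of the backing physical sstable by the fraction of its
// blocks that the virtual sstable covers. It assumes that
// physicalSize*virtualBlocks fits in a uint64.
func estimateVirtualSize(physicalSize, physicalBlocks, virtualBlocks uint64) uint64 {
	if physicalBlocks == 0 {
		return 0
	}
	// A virtual sstable cannot cover more blocks than its backing sstable.
	if virtualBlocks > physicalBlocks {
		virtualBlocks = physicalBlocks
	}
	return physicalSize * virtualBlocks / physicalBlocks
}

func main() {
	// A virtual sstable covering 250 of the 1000 blocks of a 100MB sstable.
	fmt.Println(estimateVirtualSize(100<<20, 1000, 250))
}
```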

`SmallestSeqNum/LargestSeqNum`: These fields depend on the parent sstable,
but we would need to perform a scan of the physical sstable to compute these
accurately for the virtual sstable upon creation. Instead, we could convert
these fields into lower and upper bounds of the sequence numbers in a file.

These fields are used for L0 sublevels, pebble tooling, delete compaction hints,
and a lot of plumbing. We don't need to worry about the L0 sublevels use case
because we won't have virtual sstables in L0 for the use case in this RFC. For
the rest of the use cases, a lower bound for the smallest seq number and an
upper bound for the largest seq number should work.

TODO(bananabrick): Add more detail for any delete compaction hint changes if
necessary.

`Smallest/Largest`: These, along with the smallest/largest ranges for the range
and point keys can be computed upon virtual sstable creation. Precisely, these
can be computed when we try and detect data overlap in the `overlapWithIterator`
function during ingestion.

`Stats`: `TableStats` will either be computed upon virtual sstable creation
using linear interpolation on the block counts of the virtual/physical sstables
or asynchronously using the file bounds of the virtual sstable.

`PhysicalState`: We can add an additional struct with state associated with
physical ssts which have been virtualized.

```
type PhysicalState struct {
	// Total refs across all virtual ssts * versions. That is, if the same virtual
	// sst is present in multiple versions, it may have multiple refs, if the
	// btree node is not the same.
	totalRefs int32
	// Number of virtual ssts in the latest version that refer to this physical
	// SST. Will be 1 if there is only a physical sst, or there is only 1 virtual
	// sst referencing this physical sst.
	// INVARIANT: refsInLatestVersion <= totalRefs
	// refsInLatestVersion == 0 is a zombie sstable.
	refsInLatestVersion int32
	fileSize            uint64
	// If sst is not virtualized and in latest version
	// virtualSizeSumInLatestVersion == fileSize. If
	// virtualSizeSumInLatestVersion > 0 and
	// virtualSizeSumInLatestVersion/fileSize is very small, the corresponding
	// virtual sst(s) should be candidates for compaction. These candidates can be
	// tracked via btree annotations. Incrementally updated in
	// BulkVersionEdit.Apply, when updating refsInLatestVersion.
	virtualSizeSumInLatestVersion uint64
}
```

The `Deletion` section and the `Compactions` section describe why we need to
store the `PhysicalState`.

### Deletion of physical and virtual sstables

We want to ensure that the physical sstable is only deleted from disk when no
version references it, and when there are no virtual sstables which are backed
by the physical sstable.

Since `FileMetadata.refs` is a pointer which is shared by the physical and
virtual sstables, the physical sstable won't be deleted when it is removed
from the latest version as the `FileMetadata.refs` will have been increased
when the virtual sstable is added to a version. Therefore, we only need to
ensure that the physical sstable is eventually deleted when there are no
versions which reference it.

Sstables are deleted from disk by the `DB.doDeleteObsoleteFiles` function which
looks for files to delete in the `DB.mu.versions.obsoleteTables` slice.
So we need to ensure that any physical sstable which was virtualized is added to
the obsolete tables list iff `FileMetadata.refs` is 0.

Sstables are added to the obsolete file list when a `Version` is unrefed and
when `DB.scanObsoleteFiles` is called when Pebble is opened.

When a `Version` is unrefed, sstables referenced by it are only added to the
obsolete table list if the `FileMetadata.refs` hits 0 for the sstable. With
virtual sstables, we can have a case where the last version which directly
references a physical sstable is unrefed, but the physical sstable is not added
to the obsolete table list because its `FileMetadata.refs` count is not 0
because of indirect references through virtual sstables. Since the last Version
which directly references the physical sstable is deleted, the physical sstable
will never get added to the obsolete table list. Since virtual sstables keep
track of their parent physical sstable, we can just add the physical sstable to
the obsolete table list when the last virtual sstable which references it is
deleted.
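
The shared reference count can be sketched as follows. The `backing` and `fileMeta` types are hypothetical simplifications, and the counter is a plain integer for brevity; a real implementation would need atomic operations.

```go
package main

import "fmt"

// backing models a physical sstable whose reference count is shared with the
// virtual sstables that it backs.
type backing struct {
	fileNum uint64
	refs    int32
}

// fileMeta models the metadata of an sstable. A physical sstable and all of
// its virtual sstables point at the same backing.
type fileMeta struct {
	virtual bool
	backing *backing
}

// ref records that a version references m.
func (m *fileMeta) ref() { m.backing.refs++ }

// unref drops a reference to m. When the shared count reaches zero, the
// backing file number is appended to the obsolete list, regardless of whether
// m is a physical or a virtual sstable.
func (m *fileMeta) unref(obsolete []uint64) []uint64 {
	m.backing.refs--
	if m.backing.refs == 0 {
		obsolete = append(obsolete, m.backing.fileNum)
	}
	return obsolete
}

func main() {
	b := &backing{fileNum: 1}
	s1 := &fileMeta{backing: b}
	s3 := &fileMeta{virtual: true, backing: b}
	s4 := &fileMeta{virtual: true, backing: b}
	s1.ref() // An older version references s1 directly.
	s3.ref() // The latest version references s3 and s4.
	s4.ref()

	var obsolete []uint64
	obsolete = s1.unref(obsolete) // Virtual sstables keep the file alive.
	obsolete = s3.unref(obsolete)
	obsolete = s4.unref(obsolete) // Last reference: the file becomes obsolete.
	fmt.Println(obsolete)
}
```

Because the obsolete check runs on every unref, the physical file is queued for deletion by whichever sstable drops the last reference.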

`DB.scanObsoleteFiles` will delete any file which isn't referenced by the
`VersionSet.versions` list. So, it's possible that a physical sstable associated
with a virtual sstable will be deleted. This problem can be fixed by a small
tweak in the `d.mu.versions.addLiveFileNums` to treat the parent sstable of
a virtual sstable as a live file.

Deleted files still referenced by older versions are considered zombie sstables.
We can extend the definition of zombie sstables to be any sstable which is not
directly, or indirectly through virtual sstables, referenced by the latest
version. See the `PhysicalState` subsection of the `FileMetadata` section
where we describe how the references in the latest version will be tracked.


### Reading from virtual sstables

Since virtual sstables do not exist on disk, we will have to redirect reads
to the physical sstable which backs the virtual sstable.

All reads to the physical files go through the table cache which opens the file
on disk and creates a `Reader` for the reads. The table cache currently creates
a `FileNum` -> `Reader` mapping for the physical sstables.

Most of the functions in table cache API take the file metadata of the file as
a parameter. Examples include `newIters`, `newRangeKeyIter`, `withReader`, etc.
Each of these functions then calls a subsequent function on the sstable
`Reader`.

In the `Reader` API, some functions only really need to be called on physical
sstables, whereas some functions need to be called on both physical and virtual
sstables. For example, the `Reader.EstimateDiskUsage` function and the
`Reader.Layout` function only need to be called on physical sstables, whereas
functions like `Reader.NewIter` and `Reader.NewCompactionIter` need to
work with virtual sstables.

We could either have an abstraction over the physical sstable `Reader` per
virtual sstable, or update the `Reader` API to accept file bounds of the
sstable. In the latter case, we would create one `Reader` on the physical
sstable for all of the virtual sstables, and update the `Reader` API to accept
the file bounds of the sstable.

Changes required to share a `Reader` on the physical sstable among the virtual
sstables:
- If the file metadata of the virtual sstable is passed into the table cache, on
a table cache miss, the table cache will load the Reader for the physical
sstable. This step can be performed in the `tableCacheValue.load` function. On
a table cache hit, the file number of the parent sstable will be used to fetch
the appropriate sstable `Reader`.
- The `Reader` API will be updated to support reads from virtual sstables. For
example, the `NewCompactionIter` function will take additional
`lower,upper []byte` parameters.

Updates to iterators:
- `Reader.NewIter` already has `lower,upper []byte` parameters so this requires
no change.
- Add `lower,upper` fields to the `Reader.NewCompactionIter`. The function
initializes single level and two level iterators, and we can pass in the
`lower,upper` values to those. TODO(bananabrick): Make sure that the value
of `bytesIterated` in the compaction iterator is still accurate.
- `Reader.NewRawRangeKeyIter/NewRawRangeDelIter`: We need to add `lower/upper`
fields to the functions. Both iterators make use of a `fragmentBlockIter`. We
could filter keys above the `fragmentBlockIter` or add filtering within the
`fragmentBlockIter`. To add filtering within the `fragmentBlockIter` we will
initialize it with two additional `lower/upper []byte` fields.
- We would need to update the `SetBounds` logic for the sstable iterators to
never set bounds for the iterators outside the virtual sstable bounds.
Otherwise, keys outside the virtual sstable bounds, but inside the physical
sstable bounds, could be surfaced.
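
The bound clamping for `SetBounds` could be sketched as below. The sketch makes two simplifying assumptions: keys are compared bytewise with `bytes.Compare` rather than with a user-supplied comparer, and the virtual sstable bounds are given as an inclusive lower and an exclusive upper key. The function name `clampBounds` is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// clampBounds restricts the requested iterator bounds to the bounds of a
// virtual sstable. lower is inclusive and upper is exclusive; a nil requested
// bound means unbounded. vLower and vUpper are the virtual sstable bounds and
// must be non-nil.
func clampBounds(lower, upper, vLower, vUpper []byte) (l, u []byte) {
	l, u = lower, upper
	if l == nil || bytes.Compare(l, vLower) < 0 {
		l = vLower
	}
	if u == nil || bytes.Compare(u, vUpper) > 0 {
		u = vUpper
	}
	return l, u
}

func main() {
	// The caller asks for [a, z), but the virtual sstable only covers [e, h).
	l, u := clampBounds([]byte("a"), []byte("z"), []byte("e"), []byte("h"))
	fmt.Printf("%s %s\n", l, u)
}
```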

TODO(bananabrick): Add a section about sstable properties, if necessary.

### Compactions

Virtual sstables can be picked for compactions. If the `FileMetadata` and the
iterator stack changes work, then compaction shouldn't require much, if any,
additional work.

Virtual sstables which are picked for compactions may cause space amplification.
For example, suppose we have two virtual sstables `a` and `b` in L5, backed by a
physical sstable `c`, and the sstable `a` is picked for a compaction. We will
write some additional data into L6, but we won't delete sstable `c` because
sstable `b` still refers to it. In the worst case, sstable `b` will never be
picked for compaction and will never be compacted into and we'll have permanent
space amplification. We should try to prioritize compaction of sstable `b` to
prevent such a scenario.

See the `PhysicalState` subsection in the `FileMetadata` section to see how
we'll store compaction picking metrics to reduce virtual sstable space-amp.
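
One way the `PhysicalState` fields might drive compaction picking is sketched below. The `physicalState` type mirrors only the two relevant fields, and both the function name and the threshold value are assumptions for illustration; the RFC does not specify how small the ratio must be.

```go
package main

import "fmt"

// physicalState mirrors the two PhysicalState fields used by this sketch.
type physicalState struct {
	fileSize                      uint64
	virtualSizeSumInLatestVersion uint64
}

// isSpaceAmpCandidate reports whether the virtual sstables backed by a
// physical sstable should be prioritized for compaction. That is the case
// when the virtual sstables in the latest version account for less than the
// given fraction of the physical file size.
func isSpaceAmpCandidate(s physicalState, threshold float64) bool {
	if s.fileSize == 0 || s.virtualSizeSumInLatestVersion == 0 {
		return false
	}
	ratio := float64(s.virtualSizeSumInLatestVersion) / float64(s.fileSize)
	return ratio < threshold
}

func main() {
	// Only 10MB of a 128MB physical sstable is still referenced.
	s := physicalState{fileSize: 128 << 20, virtualSizeSumInLatestVersion: 10 << 20}
	fmt.Println(isSpaceAmpCandidate(s, 0.25))
}
```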

### `VersionEdit` decode/encode
Any additional fields added to the `FileMetadata` need to be supported in the
version edit `decode/encode` functions.
