Skip to content

Latest commit

 

History

History
366 lines (298 loc) · 18.1 KB

20221122_virtual_sstable.md

File metadata and controls

366 lines (298 loc) · 18.1 KB
  • Feature Name: Virtual sstables
  • Status: in-progress
  • Start Date: 2022-10-27
  • Authors: Arjun Nair
  • RFC PR: #2116
  • Pebble Issues: #1683

** Design Draft**

Summary

The RFC outlines the design to enable virtualizing of physical sstables in Pebble.

A virtual sstable has no associated physical data on disk, and is instead backed by an existing physical sstable. Each physical sstable may be shared by one, or more than one virtual sstable.

Initially, the design will be used to lower the read-amp and the write-amp caused by certain ingestions. Sometimes, ingestions are unable to place incoming files, which have no data overlap with other files in the lsm, lower in the lsm because of file boundary overlap with files in the lsm. In this case, we are forced to place files higher in the lsm, sometimes in L0, which can cause higher read-amp and unnecessary write-amp as the file is moved lower down the lsm. See cockroachdb/cockroach#80589 for the problem occurring in practice.

Eventually, the design will also be used for the disaggregated storage masking use-case: https://github.com/cockroachdb/cockroach/pull/70419/files.

This document describes the design of virtual sstables in Pebble with enough detail to aid the implementation and code review.

Design

Ingestion

When an sstable is ingested into Pebble, we try to place it in the lowest level without any data overlap, or any file boundary overlap. We can make use of virtual sstables in the cases where we're forced to place the ingested sstable at a higher level due to file boundary overlap, but no data overlap.

                                  s2
ingest:                     [i-j-------n]
                                  s1
L6:                 [e---g-----------------p---r]
             a b c d e f g h i j k l m n o p q r s t u v w x y z

Consider the sstable s1 in L6 and the ingesting sstable s2. It is clear that the file boundaries of s1 and s2 overlap, but there is no data overlap as shown in the diagram. Currently, we will be forced to ingest the sstable s2 into a level higher than L6. With virtual sstables, we can split the existing sstable s1 into two sstables s3 and s4 as shown in the following diagram.

                       s3         s2        s4
L6:                 [e---g]-[i-j-------n]-[p---r]
             a b c d e f g h i j k l m n o p q r s t u v w x y z

The sstable s1 will be deleted from the lsm. If s1 was a physical sstable, then we will keep the file on disk as long as we need to so that it can back the virtual sstables.

There are cases where the ingesting sstables have no data overlap with existing sstables, but we can't make use of virtual sstables. Consider:

                                  s2
ingest:               [f-----i-j-------n]
                                  s1
L6:                 [e---g-----------------p---r]
             a b c d e f g h i j k l m n o p q r s t u v w x y z

We cannot use virtual sstables in the above scenario for two reasons:

  1. We don't have a quick method of detecting no data overlap.
  2. We will be forced to split the sstable in L6 into more than two virtual sstables, but we want to avoid many small virtual sstables in the lsm.

Note that in Cockroach, the easier-to-solve case happens very regularly when an sstable spans a range boundary (which pebble has no knowledge of), and we ingest a snapshot of a range in between the two already-present ranges.

slide in between two existing sstables is more likely to happen. It occurs when we ingest a snapshot of a range in between two already present ranges.

ingestFindTargetLevel changes:

  • The ingestFindTargetLevel function is used to determine the target level of the file which is being ingested. Currently, this function returns an int which is the target level for the ingesting file. Two additional return parameters, []manifest.NewFileEntry and *manifest.DeletedFileEntry, will be added to the function.

  • If ingestFindTargetLevel decides to split an existing sstable into virtual sstables, then it will return new and deleted entries. Otherwise, it will only return the target level of the ingesting file.

  • Within the ingestFindTargetLevel function, the overlapWithIterator function is used to quickly detect data overlap. In the case with file boundary overlap, but no data overlap, in the lowest possible level, we will split the existing sstable into virtual sstables and generate the NewFileEntrys and the DeletedFileEntry. The FilemetaData section describes how the various fields in the FilemetaData will be computed for the newly created virtual sstables.

  • Note that we will not split physical sstables into virtual sstables in L0 for the use case described in this RFC. The benefit of doing so would be to reduce the number of L0 sublevels, but the cost would be additional implementation complexity(see the FilemetaData section). We also want to avoid too many virtual sstables in the lsm as they can lead to space amp(see Compaction section). However, in the future, for the disaggregated storage masking case, we would need to support ingestion and use of virtual sstables in L0.

  • Note that we may need an upper bound on the number of times an sstable is split into smaller virtual sstables. We can further reduce the risk of many small sstables:

    1. For CockroachDB's snapshot ingestion, there is one large sst (up to 512MB) and many tiny ones. We can choose the apply this splitting logic only for the large sst. It is ok for the tiny ssts to be ingested into L0.
    2. Split only if the ingested sst is at least half the size of the sst being split. So if we have a smaller ingested sst, we will pick a higher level to split at (where the ssts are smaller). The lifetime of virtual ssts at a higher level is smaller, so there is lower risk of littering the LSM with long-lived small virtual ssts.
    3. For disaggregated storage implementation, we can avoid masking for tiny sstables being ingested and instead write a range delete like we currently do. Precise details on the masking use case are out of the scope of this RFC.

ingestApply changes:

  • The new and deleted file entries returned by the ingestFindTargetLevel function will be added to the version edit in ingestApply.
  • We will appropriately update the levelMetrics based on the new information returned by ingestFindTargetLevel.

FilemetaData changes

Each virtual sstables will have a unique file metadata value associated with it. The metadata may be borrowed from the backing physical sstable, or it may be unique to the virtual sstable.

This rfc lists out the fields in the FileMetadata struct with information on how each field will be populated.

Atomic.AllowedSeeks: Field is used for read triggered compactions, and we can populate this field for each virtual sstable since virtual sstables can be picked for compactions.

Atomic.statsValid: We can set this to true(1) when the virtual sstable is created. On virtual sstable creation we will estimate the table stats of the virtual sstable based on the table stats of the physical sstable. We can also set this to 0 and let the table stats job asynchronously compute the stats.

refs: The will be turned into a pointer which will be shared by the virtual/physical sstables. See the deletion section of the RFC to learn how the refs count will be used.

FileNum: We could give each virtual sstable its own file number or share the file number between all the virtual sstables. In the former case, the virtual sstables will be distinguished by the file number, and will have an additional metadata field to indicate the file number of the parent sstable. In the latter case, we can use a few of the most significant bits of the 64 bit file number to distinguish the virtual sstables.

The benefit of using a single file number for each virtual sstable, is that we don't need to use additional space to store the file number of the backing physical sstable.

It might make sense to give each virtual sstable its own file number. Virtual sstables are picked for compactions, and compactions and compaction picking expect a unique file number for each of the files which it is compacting. For example, read compactions will use the file number of the file to determine if a file picked for compaction has already been compacted, the version edit will expect a different file number for each virtual sstable, etc.

There are direct references to the FilemetaData.FileNum throughout Pebble. For example, the file number is accessed when the DB.Checkpoint function is called. This function iterates through the files in each level of the lsm, constructs the filepath using the file number, and reads the file from disk. In such cases, it is important to exclude virtual sstables.

Size: We compute this using linear interpolation on the number of blocks in the parent sstable and the number of blocks in the newly created virtual sstable.

SmallestSeqNum/LargestSeqNum: These fields depend on the parent sstable, but we would need to perform a scan of the physical sstable to compute these accurately for the virtual sstable upon creation. Instead, we could convert these fields into lower and upper bounds of the sequence numbers in a file.

These fields are used for l0 sublevels, pebble tooling, delete compaction hints, and a lot of plumbing. We don't need to worry about the L0 sublevels use case because we won't have virtual sstables in L0 for the use case in this RFC. For the rest of the use cases we can use lower bound for the smallest seq number, and an upper bound for the largest seq number work.

TODO(bananabrick): Add more detail for any delete compaction hint changes if necessary.

Smallest/Largest: These, along with the smallest/largest ranges for the range and point keys can be computed upon virtual sstable creation. Precisely, these can be computed when we try and detect data overlap in the overlapWithIterator function during ingestion.

Stats: TableStats will either be computed upon virtual sstable creation using linear interpolation on the block counts of the virtual/physical sstables or asynchronously using the file bounds of the virtual sstable.

PhysicalState: We can add an additional struct with state associated with physical ssts which have been virtualized.

type PhysicalState struct {
  // Total refs across all virtual ssts * versions. That is, if the same virtual
  // sst is present in multiple versions, it may have multiple refs, if the
  // btree node is not the same.
  totalRefs int32

  // Number of virtual ssts in the latest version that refer to this physical
  // SST. Will be 1 if there is only a physical sst, or there is only 1 virtual
  // sst referencing this physical sst.
  // INVARIANT: refsInLatestVersion <= totalRefs
  // refsInLatestVersion == 0 is a zombie sstable.
  refsInLatestVersion int32

  fileSize uint64

  // If sst is not virtualized and in latest version
  // virtualSizeSumInLatestVersion == fileSize. If
  // virtualSizeSumInLatestVersion > 0 and
  // virtualSizeSumInLatestVersion/fileSize is very small, the corresponding
  // virtual sst(s) should be candidates for compaction. These candidates can be
  // tracked via btree annotations. Incrementlly updated in
  // BulkVersionEdit.Apply, when updating refsInLatestVersion.
  virtualSizeSumInLatestVersion uint64
}

The Deletion section and the Compactions section describe why we need to store the PhysicalState.

Deletion of physical and virtual sstables

We want to ensure that the physical sstable is only deleted from disk when no version references it, and when there are no virtual sstables which are backed by the physical sstable.

Since FilemetaData.refs is a pointer which is shared by the physical and virtual sstables, the physical sstable won't be deleted when it is removed from the latest version as the FilemetaData.refs will have been increased when the virtual sstable is added to a version. Therefore, we only need to ensure that the physical sstable is eventually deleted when there are no versions which reference it.

Sstables are deleted from disk by the DB.doDeleteObsoleteFiles function which looks for files to delete in the DB.mu.versions.obsoleteTables slice. So we need to ensure that any physical sstable which was virtualized is added to the obsolete tables list iff FilemetaData.refs is 0.

Sstable are added to the obsolete file list when a Version is unrefed and when DB.scanObsoleteFiles is called when Pebble is opened.

When a Version is unrefed, sstables referenced by it are only added to the obsolete table list if the FilemetaData.refs hits 0 for the sstable. With virtual sstables, we can have a case where the last version which directly references a physical sstable is unrefed, but the physical sstable is not added to the obsolete table list because its FilemetaData.refs count is not 0 because of indirect references through virtual sstables. Since the last Version which directly references the physical sstable is deleted, the physical sstable will never get added to the obsolete table list. Since virtual sstables keep track of their parent physical sstable, we can just add the physical sstable to the obsolete table list when the last virtual sstable which references it is deleted.

DB.scanObsoleteFiles will delete any file which isn't referenced by the VersionSet.versions list. So, it's possible that a physical sstable associated with a virtual sstable will be deleted. This problem can be fixed by a small tweak in the d.mu.versions.addLiveFileNums to treat the parent sstable of a virtual sstable as a live file.

Deleted files still referenced by older versions are considered zombie sstables. We can extend the definition of zombie sstables to be any sstable which is not directly, or indirectly through virtual sstables, referenced by the latest version. See the PhysicalState subsection of the FilemetaData section where we describe how the references in the latest version will be tracked.

Reading from virtual sstables

Since virtual sstables do not exist on disk, we will have to redirect reads to the physical sstable which backs the virtual sstable.

All reads to the physical files go through the table cache which opens the file on disk and creates a Reader for the reads. The table cache currently creates a FileNum -> Reader mapping for the physical sstables.

Most of the functions in table cache API take the file metadata of the file as a parameter. Examples include newIters, newRangeKeyIter, withReader, etc. Each of these functions then calls a subsequent function on the sstable Reader.

In the Reader API, some functions only really need to be called on physical sstables, whereas some functions need to be called on both physical and virtual sstables. For example, the Reader.EstimateDiskUsage usage function, or the Reader.Layout function only need to be called on physical sstables, whereas, some function like, Reader.NewIter, and Reader.NewCompactionIter need to work with virtual sstables.

We could either have an abstraction over the physical sstable Reader per virtual sstable, or update the Reader API to accept file bounds of the sstable. In the latter case, we would create one Reader on the physical sstable for all of the virtual sstables, and update the Reader API to accept the file bounds of the sstable.

Changes required to share a Reader on the physical sstable among the virtual sstable:

  • If the file metadata of the virtual sstable is passed into the table cache, on a table cache miss, the table cache will load the Reader for the physical sstable. This step can be performed in the tableCacheValue.load function. On a table cache hit, the file number of the parent sstable will be used to fetch the appropriate sstable Reader.
  • The Reader api will be updated to support reads from virtual sstables. For example, the NewCompactionIter function will take additional lower,upper []byte parameters.

Updates to iterators:

  • Reader.NewIter already has lower,upper []byte parameters so this requires no change.
  • Add lower,upper fields to the Reader.NewCompactionIter. The function initializes single level and two level iterators, and we can pass in the lower,upper values to those. TODO(bananabrick): Make sure that the value of bytesIterated in the compaction iterator is still accurate.
  • Reader.NewRawRangeKeyIter/NewRawRangeDelIter: We need to add lower/upper fields to the functions. Both iterators make use of a fragmentBlockIter. We could filter keys above the fragmentBlockIter or add filtering within the fragmentBlockIter. To add filtering within the fragmentBlockIter we will initialize it with two additional lower/upper []byte fields.
  • We would need to update the SetBounds logic for the sstable iterators to never set bounds for the iterators outside the virtual sstable bounds. This could lead to keys outside the virtual sstable bounds, but inside the physical sstable bounds, to be surfaced.

TODO(bananabrick): Add a section about sstable properties, if necessary.

Compactions

Virtual sstables can be picked for compactions. If the FilemetaData and the iterator stack changes work, then compaction shouldn't require much, if any, additional work.

Virtual sstables which are picked for compactions may cause space amplification. For example, if we have two virtual sstables a and b in L5, backed by a physical sstable c, and the sstable a is picked for a compaction. We will write some additional data into L6, but we won't delete sstable c because sstable b still refers to it. In the worst case, sstable b will never be picked for compaction and will never be compacted into and we'll have permanent space amplification. We should try prioritize compaction of sstable b to prevent such a scenario.

See the PhysicalState subsection in the FilemetaData section to see how we'll store compaction picking metrics to reduce virtual sstable space-amp.

VersionEdit decode/encode

Any additional fields added to the FilemetaData need to be supported in the version edit decode/encode functions.