perf: investigate WAL file reuse #41

Closed
petermattis opened this issue Feb 22, 2019 · 4 comments · Fixed by #95

Comments

@petermattis
Collaborator

RocksDB supports preallocating space for the WAL, which reduces the amount of data that needs to be synced whenever fdatasync is called. In particular, if we pre-extend the size of the file, fdatasync only needs to write the data blocks and not the inode. In RocksDB this is achieved by setting fallocate_with_keep_size == false.

See preallocate.go, preallocate_unix.go, and preallocate_darwin.go for the etcd code that performs the OS-specific system calls.
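
For reference, a minimal Go sketch of what the Linux path might look like using golang.org/x/sys/unix. This is illustrative rather than the etcd code; the darwin path would use fcntl with F_PREALLOCATE instead, and the function name preallocExtend is an assumption.

// wal_prealloc_sketch.go: a minimal Linux-only sketch, not the etcd code.
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

// preallocExtend reserves sizeInBytes of space for the WAL and extends the
// file size in the same call (mode == 0), mirroring what
// fallocate_with_keep_size == false does in RocksDB. With the size already
// extended, a later fdatasync only has to flush data blocks, not the inode.
func preallocExtend(f *os.File, sizeInBytes int64) error {
	err := unix.Fallocate(int(f.Fd()), 0, 0, sizeInBytes)
	if err == unix.EOPNOTSUPP {
		// Fall back to a plain truncate on filesystems without fallocate support.
		return f.Truncate(sizeInBytes)
	}
	return err
}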

@petermattis
Collaborator Author

In addition to preallocating WAL file space, we should investigate reusing WAL files. RocksDB does this via the recycle_log_file_num option.

  // If non-zero, we will reuse previously written log files for new
  // logs, overwriting the old data.  The value indicates how many
  // such files we will keep around at any point in time for later
  // use.  This is more efficient because the blocks are already
  // allocated and fdatasync does not need to update the inode after
  // each write.

@ajkr expanded on this in another comment:

  // On ext4 and xfs, at least, `fallocate()`ing a large empty WAL is not enough
  // to avoid inode writeback on every `fdatasync()`. Although `fallocate()` can
  // preallocate space and preset the file size, it marks the preallocated
  // "extents" as unwritten in the inode to guarantee readers cannot be exposed
  // to data belonging to others. Every time `fdatasync()` happens, an inode
  // writeback happens for the update to split an unwritten extent and mark part
  // of it as written.
  //
  // Setting `recycle_log_file_num > 0` circumvents this as it'll eventually
  // reuse WALs where extents are already all marked as written. When the DB
  // opens, the first WAL will have its space preallocated as unwritten extents,
  // so will still incur frequent inode writebacks. The second WAL will as well
  // since the first WAL cannot be recycled until the first flush completes.
  // From the third WAL onwards, however, we will have a previously written WAL
  // readily available to recycle.
  //
  // We could pick a higher value if we see memtable flush backing up, or if we
  // start using column families (WAL changes every time any column family
  // initiates a flush, and WAL cannot be reused until that flush completes).

@ajkr also notes that there is a small possibility of badness with the RocksDB implementation of WAL reuse:

There appears to be an infinitesimally small chance of a wrong record being replayed during recovery -- a user key or value written to an old WAL could contain bytes that form a valid entry for the recycled WAL, and those bytes would have to immediately follow the final entry written to the recycled WAL.

@petermattis
Collaborator Author

In order to support recycling WAL files, RocksDB extends the WAL entry to include the log number:

 * Legacy record format:
 *
 * +---------+-----------+-----------+--- ... ---+
 * |CRC (4B) | Size (2B) | Type (1B) | Payload   |
 * +---------+-----------+-----------+--- ... ---+
 *
 * CRC = 32bit hash computed over the record type and payload using CRC
 * Size = Length of the payload data
 * Type = Type of record
 *        (kZeroType, kFullType, kFirstType, kLastType, kMiddleType )
 *        The type is used to group a bunch of records together to represent
 *        blocks that are larger than kBlockSize
 * Payload = Byte stream as long as specified by the payload size
 *
 * Recyclable record format:
 *
 * +---------+-----------+-----------+----------------+--- ... ---+
 * |CRC (4B) | Size (2B) | Type (1B) | Log number (4B)| Payload   |
 * +---------+-----------+-----------+----------------+--- ... ---+
 *
 * Same as above, with the addition of
 * Log number = 32bit log file number, so that we can distinguish between
 * records written by the most recent log writer vs a previous one.
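
For illustration, a minimal Go sketch of how those header fields might be laid out when writing a chunk. This is not Pebble's or RocksDB's actual encoder; the CRC handling is simplified (RocksDB additionally masks the CRC32C), and the names are assumptions.

package main

import (
	"encoding/binary"
	"hash/crc32"
)

const recyclableHeaderSize = 4 + 2 + 1 + 4 // CRC + Size + Type + Log number

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// encodeRecyclableHeader fills buf[:recyclableHeaderSize] for a single chunk.
// The CRC covers the type byte, the log number, and the payload; the size
// field is excluded, following the LevelDB/RocksDB convention.
func encodeRecyclableHeader(buf []byte, recordType byte, logNum uint32, payload []byte) {
	binary.LittleEndian.PutUint16(buf[4:6], uint16(len(payload)))
	buf[6] = recordType
	binary.LittleEndian.PutUint32(buf[7:11], logNum)
	crc := crc32.Update(0, castagnoli, buf[6:recyclableHeaderSize])
	crc = crc32.Update(crc, castagnoli, payload)
	binary.LittleEndian.PutUint32(buf[0:4], crc)
}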

@petermattis
Collaborator Author

Supporting the recyclable record format looks relatively straightforward. The Type field is extended with "recyclable" versions:

enum RecordType {
  // Zero is reserved for preallocated files
  kZeroType = 0,
  kFullType = 1,

  // For fragments
  kFirstType = 2,
  kMiddleType = 3,
  kLastType = 4,

  // For recycled log files
  kRecyclableFullType = 5,
  kRecyclableFirstType = 6,
  kRecyclableMiddleType = 7,
  kRecyclableLastType = 8,
};

Log reading examines the record header. If the Type is one of the recyclable types, the reader verifies that the Log number matches the expected value; a mismatch is treated as EOF. As is often the case, adding tests will likely be the largest chunk of work.
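
A hedged Go sketch of that read-side decision; the constant value follows the RecordType enum above, but the function and its signature are illustrative rather than Pebble's actual API.

package main

const kRecyclableFullType = 5 // first of the recyclable chunk types

// checkChunk decides whether a decoded chunk belongs to the current WAL.
// Recyclable chunks whose log number differs from the expected one are stale
// data left over from the file's previous life and are treated as EOF.
func checkChunk(recordType byte, chunkLogNum, expectedLogNum uint32) (valid, eof bool) {
	if recordType < kRecyclableFullType {
		// Legacy chunk types carry no log number.
		return true, false
	}
	if chunkLogNum != expectedLogNum {
		return false, true
	}
	return true, false
}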

@petermattis changed the title from "perf: preallocate WAL file space" to "perf: investigate WAL file reuse" on Apr 14, 2019
petermattis added a commit that referenced this issue Apr 17, 2019
On Linux, preallocation makes a huge difference in sync performance.
WAL reuse (aka recycling...not implemented yet) provides a further
improvement. And direct IO provides more stable performance on GCE Local
SSD. Note that direct IO implies WAL reuse. The numbers below were
gathered on an AWS m5.xlarge.

name                                  time/op
DirectIOWrite/wsize=4096-4            34.4µs ± 1%
DirectIOWrite/wsize=8192-4            61.0µs ± 0%
DirectIOWrite/wsize=16384-4            122µs ± 0%
DirectIOWrite/wsize=32768-4            244µs ± 0%
SyncWrite/no-prealloc/wsize=64-4       128µs ± 8%
SyncWrite/no-prealloc/wsize=512-4      146µs ± 0%
SyncWrite/no-prealloc/wsize=1024-4     155µs ± 0%
SyncWrite/no-prealloc/wsize=2048-4     172µs ± 0%
SyncWrite/no-prealloc/wsize=4096-4     206µs ± 0%
SyncWrite/no-prealloc/wsize=8192-4     206µs ± 0%
SyncWrite/no-prealloc/wsize=16384-4    274µs ± 0%
SyncWrite/no-prealloc/wsize=32768-4    407µs ± 4%
SyncWrite/prealloc-4MB/wsize=64-4     34.2µs ± 7%
SyncWrite/prealloc-4MB/wsize=512-4    47.5µs ± 0%
SyncWrite/prealloc-4MB/wsize=1024-4   60.4µs ± 0%
SyncWrite/prealloc-4MB/wsize=2048-4   86.4µs ± 0%
SyncWrite/prealloc-4MB/wsize=4096-4    137µs ± 0%
SyncWrite/prealloc-4MB/wsize=8192-4    143µs ± 7%
SyncWrite/prealloc-4MB/wsize=16384-4   214µs ± 0%
SyncWrite/prealloc-4MB/wsize=32768-4   337µs ± 0%
SyncWrite/reuse/wsize=64-4            31.6µs ± 4%
SyncWrite/reuse/wsize=512-4           31.8µs ± 4%
SyncWrite/reuse/wsize=1024-4          32.4µs ± 7%
SyncWrite/reuse/wsize=2048-4          31.3µs ± 1%
SyncWrite/reuse/wsize=4096-4          32.2µs ± 5%
SyncWrite/reuse/wsize=8192-4          61.1µs ± 0%
SyncWrite/reuse/wsize=16384-4          122µs ± 0%
SyncWrite/reuse/wsize=32768-4          244µs ± 0%

See #41
@petermattis
Collaborator Author

The benchmarks added in #76 point to WAL file reuse being a win. They also point to direct IO as an additional win, providing more consistent sync performance across supported filesystems. Direct IO comes with caveats under Linux (see clarifying direct IO semantics). In particular, direct IO should be viewed as an additional specialization on top of WAL file reuse.

Even with grouping of write batches, most WAL syncs are small. Instrumentation of cockroach on a TPCC workload shows that 90% of WAL syncs are for less than 4KB of data, and 99% are for less than 16KB.

Direct IO requires writing full filesystem pages aligned on page boundaries. Under ext4 this is 4KB aligned (TODO: check the alignment requirements for xfs, though they are likely similar). There is a question of what to do with the tail of the WAL that doesn't fill a page. We could overwrite that data repeatedly, though there might be a problem with concurrently modifying that tail buffer while writing it to disk (it is unclear if this is safe). An alternative is to pad the WAL to a page boundary whenever a sync occurs. On a TPCC workload, this would increase space usage for the WAL by 30%. For a KV write-only workload, such padding would increase space usage by nearly 100%. Neither increase seems problematic as the WAL files are on the order of hundreds of megabytes in size, a small fraction of the database size. Note that the amount of data being written to the SSD isn't changing. And padding to page boundaries might actually be kinder to the SSD hardware (I'm always a bit unclear on how wear leveling is done).

How to pad to a page boundary? Add a LogData chunk of the desired size.
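
A sketch of the padding arithmetic, assuming a 4KB filesystem page; the constant and helper name are assumptions, not Pebble code.

package main

// pageSize is the assumed filesystem alignment (4KB on ext4).
const pageSize = 4096

// padToPageBoundary returns the number of filler bytes (e.g. a LogData chunk
// of this size) needed so that the next sync ends on a page boundary. For
// example, with the WAL tail at offset 5000 it returns 3192 (5000+3192 = 8192).
func padToPageBoundary(offset int64) int64 {
	return (pageSize - offset%pageSize) % pageSize
}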

petermattis added a commit that referenced this issue Apr 22, 2019
The recyclable log record format associates a log file number with each
record in a file, which allows a log file to be reused safely without
truncating the file. This is an important performance optimization in
itself and a prerequisite for writing the WAL using direct IO.

Changed `record.LogWriter` to always use the recyclable log record
format, which in turn changes Pebble to use the recyclable format for
the WAL. `record.Writer` (used to write the `MANIFEST`) continues to use
the legacy log record format. `record.Reader` can now read either format
and will return an error when a record's log number does not match the
supplied log number.

Took the opportunity to unexport some methods that were not being used
outside of the `internal/record` package such as `Reader.SeekRecord` and
`LogWriter.Flush`.

Actual reuse of WAL files will be implemented in a future change.

See #41
petermattis added a commit that referenced this issue May 1, 2019
WAL recycling is an important performance optimization as it is faster
to sync a file that has already been written than one which is being
written for the first time. This is due to the need to sync file
metadata when a file is being written for the first time. Note this is
true even if file preallocation is performed (e.g. fallocate).

Fixes #41