perf: investigate WAL file reuse #41

Closed
petermattis opened this issue Feb 22, 2019 · 4 comments · Fixed by #95

Comments

@petermattis
Collaborator

RocksDB supports preallocating space for the WAL, which reduces the amount of data that needs to be synced whenever fdatasync is called. In particular, if we pre-extend the size of the file, fdatasync only needs to write the data blocks and not the inode. In RocksDB this is achieved by setting fallocate_with_keep_size == false.

See preallocate.go, preallocate_unix.go, and preallocate_darwin.go for the etcd code that performs the OS-specific system calls.
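
For reference, a minimal Go sketch of what the Linux path might look like using golang.org/x/sys/unix. This is illustrative rather than the etcd code; the darwin path would use fcntl with F_PREALLOCATE instead, and the function name preallocExtend is an assumption.

// wal_prealloc_sketch.go: a minimal Linux-only sketch, not the etcd code.
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

// preallocExtend reserves sizeInBytes of space for the WAL and extends the
// file size in the same call (mode == 0), mirroring what
// fallocate_with_keep_size == false does in RocksDB. With the size already
// extended, a later fdatasync only has to flush data blocks, not the inode.
func preallocExtend(f *os.File, sizeInBytes int64) error {
	err := unix.Fallocate(int(f.Fd()), 0, 0, sizeInBytes)
	if err == unix.EOPNOTSUPP {
		// Fall back to a plain truncate on filesystems without fallocate support.
		return f.Truncate(sizeInBytes)
	}
	return err
}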

@petermattis
Collaborator Author

In addition to preallocating WAL file space, we should investigate reusing WAL files. RocksDB does this via the recycle_log_file_num option.

  // If non-zero, we will reuse previously written log files for new
  // logs, overwriting the old data.  The value indicates how many
  // such files we will keep around at any point in time for later
  // use.  This is more efficient because the blocks are already
  // allocated and fdatasync does not need to update the inode after
  // each write.

@ajkr expanded on this in another comment:

  // On ext4 and xfs, at least, `fallocate()`ing a large empty WAL is not enough
  // to avoid inode writeback on every `fdatasync()`. Although `fallocate()` can
  // preallocate space and preset the file size, it marks the preallocated
  // "extents" as unwritten in the inode to guarantee readers cannot be exposed
  // to data belonging to others. Every time `fdatasync()` happens, an inode
  // writeback happens for the update to split an unwritten extent and mark part
  // of it as written.
  //
  // Setting `recycle_log_file_num > 0` circumvents this as it'll eventually
  // reuse WALs where extents are already all marked as written. When the DB
  // opens, the first WAL will have its space preallocated as unwritten extents,
  // so will still incur frequent inode writebacks. The second WAL will as well
  // since the first WAL cannot be recycled until the first flush completes.
  // From the third WAL onwards, however, we will have a previously written WAL
  // readily available to recycle.
  //
  // We could pick a higher value if we see memtable flush backing up, or if we
  // start using column families (WAL changes every time any column family
  // initiates a flush, and WAL cannot be reused until that flush completes).

@ajkr also notes that there is a small possibility of badness with the RocksDB implementation of WAL reuse:

There appears to be an infinitesimally small chance of a wrong record being replayed during recovery -- a user key or value written to an old WAL could contain bytes that form a valid entry for the recycled WAL, and those bytes would have to immediately follow the final entry written to the recycled WAL.

@petermattis
Collaborator Author

In order to support recycling WAL files, RocksDB extends the WAL entry to include the log number:

 * Legacy record format:
 *
 * +---------+-----------+-----------+--- ... ---+
 * |CRC (4B) | Size (2B) | Type (1B) | Payload   |
 * +---------+-----------+-----------+--- ... ---+
 *
 * CRC = 32bit hash computed over the record type and payload using CRC
 * Size = Length of the payload data
 * Type = Type of record
 *        (kZeroType, kFullType, kFirstType, kLastType, kMiddleType )
 *        The type is used to group a bunch of records together to represent
 *        blocks that are larger than kBlockSize
 * Payload = Byte stream as long as specified by the payload size
 *
 * Recyclable record format:
 *
 * +---------+-----------+-----------+----------------+--- ... ---+
 * |CRC (4B) | Size (2B) | Type (1B) | Log number (4B)| Payload   |
 * +---------+-----------+-----------+----------------+--- ... ---+
 *
 * Same as above, with the addition of
 * Log number = 32bit log file number, so that we can distinguish between
 * records written by the most recent log writer vs a previous one.
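
For illustration, a minimal Go sketch of how those header fields might be laid out when writing a chunk. This is not Pebble's or RocksDB's actual encoder; the CRC handling is simplified (RocksDB additionally masks the CRC32C), and the names are assumptions.

package main

import (
	"encoding/binary"
	"hash/crc32"
)

const recyclableHeaderSize = 4 + 2 + 1 + 4 // CRC + Size + Type + Log number

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// encodeRecyclableHeader fills buf[:recyclableHeaderSize] for a single chunk.
// The CRC covers the type byte, the log number, and the payload; the size
// field is excluded, following the LevelDB/RocksDB convention.
func encodeRecyclableHeader(buf []byte, recordType byte, logNum uint32, payload []byte) {
	binary.LittleEndian.PutUint16(buf[4:6], uint16(len(payload)))
	buf[6] = recordType
	binary.LittleEndian.PutUint32(buf[7:11], logNum)
	crc := crc32.Update(0, castagnoli, buf[6:recyclableHeaderSize])
	crc = crc32.Update(crc, castagnoli, payload)
	binary.LittleEndian.PutUint32(buf[0:4], crc)
}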

@petermattis
Collaborator Author

Supporting the recyclable record format looks relatively straightforward. The Type field is extended with "recyclable" versions:

enum RecordType {
  // Zero is reserved for preallocated files
  kZeroType = 0,
  kFullType = 1,

  // For fragments
  kFirstType = 2,
  kMiddleType = 3,
  kLastType = 4,

  // For recycled log files
  kRecyclableFullType = 5,
  kRecyclableFirstType = 6,
  kRecyclableMiddleType = 7,
  kRecyclableLastType = 8,
};

Log reading examines the record header. If the Type is one of the recyclable types, the reader verifies that the Log number matches the expected value; a mismatch is treated as EOF. As is often the case, adding tests will likely be the largest chunk of work.
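
A hedged Go sketch of that read-side decision; the constant value follows the RecordType enum above, but the function and its signature are illustrative rather than Pebble's actual API.

package main

const kRecyclableFullType = 5 // first of the recyclable chunk types

// checkChunk decides whether a decoded chunk belongs to the current WAL.
// Recyclable chunks whose log number differs from the expected one are stale
// data left over from the file's previous life and are treated as EOF.
func checkChunk(recordType byte, chunkLogNum, expectedLogNum uint32) (valid, eof bool) {
	if recordType < kRecyclableFullType {
		// Legacy chunk types carry no log number.
		return true, false
	}
	if chunkLogNum != expectedLogNum {
		return false, true
	}
	return true, false
}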

@petermattis changed the title from "perf: preallocate WAL file space" to "perf: investigate WAL file reuse" on Apr 14, 2019
petermattis added a commit that referenced this issue Apr 17, 2019
On Linux, preallocation makes a huge difference in sync performance.
WAL reuse (aka recycling...not implemented yet) provides a further
improvement. And direct IO provides more stable performance on GCE Local
SSD. Note that direct IO implies WAL reuse. The numbers below were
gathered on an AWS m5.xlarge.

name                                  time/op
DirectIOWrite/wsize=4096-4            34.4µs ± 1%
DirectIOWrite/wsize=8192-4            61.0µs ± 0%
DirectIOWrite/wsize=16384-4            122µs ± 0%
DirectIOWrite/wsize=32768-4            244µs ± 0%
SyncWrite/no-prealloc/wsize=64-4       128µs ± 8%
SyncWrite/no-prealloc/wsize=512-4      146µs ± 0%
SyncWrite/no-prealloc/wsize=1024-4     155µs ± 0%
SyncWrite/no-prealloc/wsize=2048-4     172µs ± 0%
SyncWrite/no-prealloc/wsize=4096-4     206µs ± 0%
SyncWrite/no-prealloc/wsize=8192-4     206µs ± 0%
SyncWrite/no-prealloc/wsize=16384-4    274µs ± 0%
SyncWrite/no-prealloc/wsize=32768-4    407µs ± 4%
SyncWrite/prealloc-4MB/wsize=64-4     34.2µs ± 7%
SyncWrite/prealloc-4MB/wsize=512-4    47.5µs ± 0%
SyncWrite/prealloc-4MB/wsize=1024-4   60.4µs ± 0%
SyncWrite/prealloc-4MB/wsize=2048-4   86.4µs ± 0%
SyncWrite/prealloc-4MB/wsize=4096-4    137µs ± 0%
SyncWrite/prealloc-4MB/wsize=8192-4    143µs ± 7%
SyncWrite/prealloc-4MB/wsize=16384-4   214µs ± 0%
SyncWrite/prealloc-4MB/wsize=32768-4   337µs ± 0%
SyncWrite/reuse/wsize=64-4            31.6µs ± 4%
SyncWrite/reuse/wsize=512-4           31.8µs ± 4%
SyncWrite/reuse/wsize=1024-4          32.4µs ± 7%
SyncWrite/reuse/wsize=2048-4          31.3µs ± 1%
SyncWrite/reuse/wsize=4096-4          32.2µs ± 5%
SyncWrite/reuse/wsize=8192-4          61.1µs ± 0%
SyncWrite/reuse/wsize=16384-4          122µs ± 0%
SyncWrite/reuse/wsize=32768-4          244µs ± 0%

See #41
@petermattis
Collaborator Author

The benchmarks added in #76 point to WAL file reuse being a win. They also point to direct IO as an additional win, providing more consistent sync performance across supported filesystems. Direct IO comes with caveats under Linux (see clarifying direct IO semantics). In particular, direct IO should be viewed as an additional specialization on top of WAL file reuse.

Even with grouping of write batches, most WAL syncs are small. Instrumentation of cockroach on a TPCC workload shows that 90% of WAL syncs are for less than 4KB of data, and 99% are for less than 16KB.

Direct IO requires writing full filesystem pages aligned on page boundaries. Under ext4 this is 4KB aligned (TODO: check the alignment requirements for xfs, though they are likely similar). There is a question of what to do with the tail of the WAL that doesn't fill a page. We could overwrite that data repeatedly, though there might be a problem with concurrently modifying that tail buffer while writing it to disk (it is unclear if this is safe). An alternative is to pad the WAL to a page boundary whenever a sync occurs. On a TPCC workload, this would increase space usage for the WAL by 30%. For a KV write-only workload, such padding would increase space usage by nearly 100%. Neither increase seems problematic as the WAL files are on the order of hundreds of megabytes in size, a small fraction of the database size. Note that the amount of data being written to the SSD isn't changing. And padding to page boundaries might actually be kinder to the SSD hardware (I'm always a bit unclear on how wear leveling is done).

How to pad to a page boundary? Add a LogData chunk of the desired size.
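
A sketch of the padding arithmetic, assuming a 4KB filesystem page; the constant and helper name are assumptions, not Pebble code.

package main

// pageSize is the assumed filesystem alignment (4KB on ext4).
const pageSize = 4096

// padToPageBoundary returns the number of filler bytes (e.g. a LogData chunk
// of this size) needed so that the next sync ends on a page boundary. For
// example, with the WAL tail at offset 5000 it returns 3192 (5000+3192 = 8192).
func padToPageBoundary(offset int64) int64 {
	return (pageSize - offset%pageSize) % pageSize
}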

petermattis added a commit that referenced this issue Apr 22, 2019
The recyclable log record format associates a log file number with each
record in a file, which allows a log file to be reused safely without
truncating the file. This is an important performance optimization in
itself and a prerequisite for writing the WAL using direct IO.

Changed `record.LogWriter` to always use the recyclable log record
format, which in turn changes Pebble to use the recyclable format for
the WAL. `record.Writer` (used to write the `MANIFEST`) continues to use
the legacy log record format. `record.Reader` can now read either format
and will return an error when a record's log number does not match the
supplied log number.

Took the opportunity to unexport some methods that were not being used
outside of the `internal/record` package such as `Reader.SeekRecord` and
`LogWriter.Flush`.

Actual reuse of WAL files will be implemented in a future change.

See #41
petermattis added a commit that referenced this issue May 1, 2019
WAL recycling is an important performance optimization as it is faster
to sync a file that has already been written than one which is being
written for the first time. This is due to the need to sync file
metadata when a file is being written for the first time. Note this is
true even if file preallocation is performed (e.g. fallocate).

Fixes #41