perf: investigate WAL file reuse #41
Comments
In addition to preallocating WAL file space, we should investigate reusing WAL files. RocksDB does this via the `recycle_log_file_num` option.
@ajkr expanded on this in another comment:
@ajkr also notes that there is a small possibility of badness with the RocksDB implementation of WAL reuse:
In order to support recycling WAL files, RocksDB extends the WAL entry to include the log file number.
Supporting the recyclable record format looks relatively straightforward. Log reading examines the first 6 bytes of the record. If the …
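For concreteness, here is a minimal sketch of the two header layouts, assuming the RocksDB-style format (a 7-byte legacy header of checksum, length, and type, with the recyclable variant appending a 4-byte log number). The helper and constant names are illustrative, not Pebble's actual code:

```go
package walsketch

import "encoding/binary"

const (
	// Legacy header: checksum (4B) + length (2B) + type (1B).
	legacyHeaderSize = 7
	// Recyclable header: legacy header plus a 4-byte log number.
	recyclableHeaderSize = 11
)

// encodeRecyclableHeader is a hypothetical helper showing where each field
// of the recyclable header lives; it is not Pebble's actual implementation.
func encodeRecyclableHeader(dst []byte, checksum uint32, length uint16, typ byte, logNum uint32) {
	binary.LittleEndian.PutUint32(dst[0:4], checksum)
	binary.LittleEndian.PutUint16(dst[4:6], length)
	dst[6] = typ
	binary.LittleEndian.PutUint32(dst[7:recyclableHeaderSize], logNum)
}
```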
On Linux, preallocation makes a huge difference in sync performance. WAL reuse (aka recycling... not implemented yet) provides a further improvement. And direct IO provides more stable performance on GCE Local SSD. Note that direct IO implies WAL reuse. The numbers below were gathered on an AWS m5.xlarge.

name                                    time/op
DirectIOWrite/wsize=4096-4              34.4µs ± 1%
DirectIOWrite/wsize=8192-4              61.0µs ± 0%
DirectIOWrite/wsize=16384-4             122µs ± 0%
DirectIOWrite/wsize=32768-4             244µs ± 0%
SyncWrite/no-prealloc/wsize=64-4        128µs ± 8%
SyncWrite/no-prealloc/wsize=512-4       146µs ± 0%
SyncWrite/no-prealloc/wsize=1024-4      155µs ± 0%
SyncWrite/no-prealloc/wsize=2048-4      172µs ± 0%
SyncWrite/no-prealloc/wsize=4096-4      206µs ± 0%
SyncWrite/no-prealloc/wsize=8192-4      206µs ± 0%
SyncWrite/no-prealloc/wsize=16384-4     274µs ± 0%
SyncWrite/no-prealloc/wsize=32768-4     407µs ± 4%
SyncWrite/prealloc-4MB/wsize=64-4       34.2µs ± 7%
SyncWrite/prealloc-4MB/wsize=512-4      47.5µs ± 0%
SyncWrite/prealloc-4MB/wsize=1024-4     60.4µs ± 0%
SyncWrite/prealloc-4MB/wsize=2048-4     86.4µs ± 0%
SyncWrite/prealloc-4MB/wsize=4096-4     137µs ± 0%
SyncWrite/prealloc-4MB/wsize=8192-4     143µs ± 7%
SyncWrite/prealloc-4MB/wsize=16384-4    214µs ± 0%
SyncWrite/prealloc-4MB/wsize=32768-4    337µs ± 0%
SyncWrite/reuse/wsize=64-4              31.6µs ± 4%
SyncWrite/reuse/wsize=512-4             31.8µs ± 4%
SyncWrite/reuse/wsize=1024-4            32.4µs ± 7%
SyncWrite/reuse/wsize=2048-4            31.3µs ± 1%
SyncWrite/reuse/wsize=4096-4            32.2µs ± 5%
SyncWrite/reuse/wsize=8192-4            61.1µs ± 0%
SyncWrite/reuse/wsize=16384-4           122µs ± 0%
SyncWrite/reuse/wsize=32768-4           244µs ± 0%

See #41
The benchmarks added in #76 point to WAL file reuse being a win. They also point to using direct IO as being an additional win, providing more regular sync performance across supported filesystems. Direct IO comes with caveats under Linux. See clarifying direct IO semantics. In particular, direct IO should be viewed as an additional specialization on top of WAL file reuse.

Even with grouping of write batches, most WAL syncs are small. Instrumentation of cockroach shows that 90% of WAL syncs are for less than 4KB of data on a TPCC workload, and 99% of WAL syncs are for less than 16KB. Direct IO requires writing full filesystem pages aligned on page boundaries. Under ext4 this is 4KB aligned (TODO: check what the alignment requirements are for xfs, though they are likely similar).

There is a question of what to do with the tail of the WAL that doesn't fill a page. We could overwrite that data repeatedly, though there might be a problem with concurrently modifying that tail buffer and writing it to disk (it is unclear if this is safe). An alternative is to pad the WAL to a page boundary whenever a sync occurs. On a TPCC workload, this would increase space usage for the WAL by 30%. For a KV write-only workload, such padding would increase space usage by nearly 100%. Neither increase seems problematic as the WAL files are on the order of hundreds of megabytes in size, a small fraction of the database size. Note that the amount of data being written to the SSD isn't changing. And padding to page boundaries might actually be kinder to the SSD hardware (I'm always a bit unclear on how wear leveling is done).

How to pad to a page boundary? Add a …
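A minimal sketch of the padding arithmetic, assuming a fixed 4096-byte page (the function and constant names are illustrative):

```go
package walsketch

// pageSize assumes the 4KB ext4 alignment discussed above; the real value
// would need to be determined per filesystem.
const pageSize = 4096

// padToPageBoundary returns how many filler bytes must follow a sync ending
// at the given WAL offset so that the next direct-IO write starts on a page
// boundary.
func padToPageBoundary(offset int64) int64 {
	remainder := offset % pageSize
	if remainder == 0 {
		return 0
	}
	return pageSize - remainder
}
```

For example, a sync ending at offset 10000 would need 2288 bytes of padding to reach the page boundary at offset 12288.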
On Linux, preallocation makes a huge difference in sync performance. WAL reuse (aka recycling...not implemented yet) provides a further improvement. And direct IO provides more stable performance on GCE Local SSD. Note that direct IO implies WAL reuse. The numbers below were gathered on an AWS m5.xlarge. name time/op DirectIOWrite/wsize=4096-4 34.4µs ± 1% DirectIOWrite/wsize=8192-4 61.0µs ± 0% DirectIOWrite/wsize=16384-4 122µs ± 0% DirectIOWrite/wsize=32768-4 244µs ± 0% SyncWrite/no-prealloc/wsize=64-4 128µs ± 8% SyncWrite/no-prealloc/wsize=512-4 146µs ± 0% SyncWrite/no-prealloc/wsize=1024-4 155µs ± 0% SyncWrite/no-prealloc/wsize=2048-4 172µs ± 0% SyncWrite/no-prealloc/wsize=4096-4 206µs ± 0% SyncWrite/no-prealloc/wsize=8192-4 206µs ± 0% SyncWrite/no-prealloc/wsize=16384-4 274µs ± 0% SyncWrite/no-prealloc/wsize=32768-4 407µs ± 4% SyncWrite/prealloc-4MB/wsize=64-4 34.2µs ± 7% SyncWrite/prealloc-4MB/wsize=512-4 47.5µs ± 0% SyncWrite/prealloc-4MB/wsize=1024-4 60.4µs ± 0% SyncWrite/prealloc-4MB/wsize=2048-4 86.4µs ± 0% SyncWrite/prealloc-4MB/wsize=4096-4 137µs ± 0% SyncWrite/prealloc-4MB/wsize=8192-4 143µs ± 7% SyncWrite/prealloc-4MB/wsize=16384-4 214µs ± 0% SyncWrite/prealloc-4MB/wsize=32768-4 337µs ± 0% SyncWrite/reuse/wsize=64-4 31.6µs ± 4% SyncWrite/reuse/wsize=512-4 31.8µs ± 4% SyncWrite/reuse/wsize=1024-4 32.4µs ± 7% SyncWrite/reuse/wsize=2048-4 31.3µs ± 1% SyncWrite/reuse/wsize=4096-4 32.2µs ± 5% SyncWrite/reuse/wsize=8192-4 61.1µs ± 0% SyncWrite/reuse/wsize=16384-4 122µs ± 0% SyncWrite/reuse/wsize=32768-4 244µs ± 0% See #41
The recyclable log record format associates a log file number with each record in a file, which allows a log file to be reused safely without truncating the file. This is an important performance optimization in itself and a prerequisite for writing the WAL using direct IO. Changed `record.LogWriter` to always use the recyclable log record format, which in turn changes Pebble to use the recyclable format for the WAL. `record.Writer` (used to write the `MANIFEST`) continues to use the legacy log record format. `record.Reader` can now read either format and will return an error when a record's log number does not match the supplied log number. Took the opportunity to unexport some methods that were not being used outside of the `internal/record` package, such as `Reader.SeekRecord` and `LogWriter.Flush`. Actual reuse of WAL files will be implemented in a future change. See #41
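Conceptually, the read-side check compares the log number embedded in each recyclable header against the log number the reader was opened with. A hedged sketch, assuming the header layout shown earlier (the helper and error names are hypothetical, not the actual `internal/record` code):

```go
package walsketch

import (
	"encoding/binary"
	"errors"
)

// errStaleRecord is a stand-in name; the error actually returned by
// record.Reader differs.
var errStaleRecord = errors.New("record: log number does not match WAL")

// validateRecyclableHeader rejects a record whose embedded log number (the
// last 4 bytes of the 11-byte recyclable header) differs from the log number
// the reader was opened with: such bytes are leftovers from a previous use
// of the recycled file.
func validateRecyclableHeader(header []byte, expectedLogNum uint32) error {
	if len(header) < 11 {
		return errors.New("record: header too short")
	}
	if binary.LittleEndian.Uint32(header[7:11]) != expectedLogNum {
		return errStaleRecord
	}
	return nil
}
```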
WAL recycling is an important performance optimization as it is faster to sync a file that has already been written than one which is being written for the first time. This is due to the need to sync file metadata when a file is being written for the first time. Note this is true even if file preallocation is performed (e.g. `fallocate`). Fixes #41
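A minimal sketch of what reuse amounts to at the filesystem level (the rename-based approach and names are illustrative, not necessarily how Pebble implements it): an obsolete WAL is renamed to the new log's name and opened for overwriting without truncation, so writes land on already-allocated blocks and a data-only sync suffices.

```go
package walsketch

import "os"

// reuseWAL renames an obsolete WAL file into place and reopens it for
// overwriting. Crucially it does not pass os.O_TRUNC: truncating would
// deallocate the file's blocks and reintroduce the metadata syncs that
// recycling is meant to avoid.
func reuseWAL(oldPath, newPath string) (*os.File, error) {
	if err := os.Rename(oldPath, newPath); err != nil {
		return nil, err
	}
	return os.OpenFile(newPath, os.O_RDWR, 0644)
}
```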
RocksDB supports preallocating space for the WAL, which can reduce the data that needs to be synced whenever `fdatasync` is called. In particular, if we pre-extend the size of the file, calling `fdatasync` will only need to write the data blocks and not the inode. Under RocksDB this is achieved by setting `fallocate_with_keep_size == false`. See preallocate.go, preallocate_unix.go, and preallocate_darwin.go for the etcd code which performs the OS-specific system calls.
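For illustration, a Linux-only sketch in the spirit of etcd's preallocate_unix.go, using golang.org/x/sys/unix (the function name and signature are illustrative):

```go
//go:build linux

package walsketch

import "golang.org/x/sys/unix"

// preallocate reserves sizeInBytes of space for the open file descriptor.
// With keepSize == false the file size is extended as part of the allocation
// (the fallocate_with_keep_size == false behavior described above), so a
// later fdatasync only has to flush data blocks rather than an inode size
// change.
func preallocate(fd int, sizeInBytes int64, keepSize bool) error {
	var mode uint32
	if keepSize {
		// FALLOC_FL_KEEP_SIZE allocates blocks without changing the
		// file's reported size.
		mode = unix.FALLOC_FL_KEEP_SIZE
	}
	return unix.Fallocate(fd, mode, 0, sizeInBytes)
}
```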