[dnm] kvserver: play with O_DIRECT #91272
Conversation
O_DIRECT is meant to be used with multiple I/O threads, up to the number of I/O channels your hardware provides. It doesn't parallelize operations internally, unlike a sync orchestrated by the OS itself.
Here's the pebble flavor (cockroachdb/pebble#76).

@knz the benchmark here writes 4k blocks so there wouldn't be any chance for concurrency anyway. I think you're saying that if this were production code and it were given larger (SSD-block-aligned) writes, we'd need to do our own concurrency to get better throughput, right?
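To make that concrete, here's a minimal sketch of what "doing our own concurrency" could look like. This is not code from this PR; the file path, worker count, and 4 KiB alignment are illustrative assumptions. Several goroutines issue aligned O_DIRECT writes to disjoint regions of a preexisting, preallocated file, so the device sees more than one outstanding request.

```go
package main

import (
	"log"
	"os"
	"sync"
	"unsafe"

	"golang.org/x/sys/unix"
)

const align = 4096 // assumed SSD block size; must match the device

// alignedBuf over-allocates and slices so the buffer start is align-ed,
// which O_DIRECT requires for both the buffer and the file offset.
func alignedBuf(n int) []byte {
	b := make([]byte, n+align)
	off := align - int(uintptr(unsafe.Pointer(&b[0]))%align)
	if off == align {
		off = 0
	}
	return b[off : off+n]
}

func main() {
	// Hypothetical target file; assumed to already exist and be
	// preallocated so writes don't change file metadata.
	f, err := os.OpenFile("/mnt/data1/wal.bin",
		os.O_WRONLY|unix.O_DIRECT|unix.O_DSYNC, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	const workers = 8 // up to the number of I/O channels the device offers
	const writesPerWorker = 1024
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			buf := alignedBuf(align)
			for i := 0; i < writesPerWorker; i++ {
				// Each worker owns a disjoint, aligned region of the file.
				off := int64(w*writesPerWorker+i) * align
				if _, err := f.WriteAt(buf, off); err != nil {
					log.Fatal(err)
				}
			}
		}(w)
	}
	wg.Wait()
}
```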
Using a combination of preallocated files, aligned writes, direct I/O and fdatasync it is supposedly[^1] possible to get fast durable writes. This is appealing and I wanted to see if I could actually make it happen in a toy experiment.

Bench results show that whatever I'm doing isn't producing the desired effect of high-throughput, durable writes on either gp3 or GCE local SSD. Just using O_DIRECT alone is enough to move throughput well below the 10mb/s threshold. It is true that O_DSYNC doesn't add a large additional penalty, but the damage is already done at that point. The one, maybe, exception is pd-ssd (GCE's attached storage), where we at least get 48mb/s, though with the same pattern of O_DIRECT alone causing the major regression.

Detailed results below.

**a) AWS gp3**

```
roachprod create -n 1 --clouds aws --aws-ebs-iops 16000 \
  --aws-ebs-throughput 1000 --aws-ebs-volume-size 500 \
  --aws-ebs-volume-type=gp3 --aws-machine-type=m5.4xlarge tobias-dio \
  --local-ssd=false
```

```
$ export HOME=/mnt/data1
$ cd $HOME
$ ./syncexp.test -test.benchtime 10240x -test.bench . -test.cpu 1
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/cockroach/pkg/kv/kvserver/syncexp
cpu: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
BenchmarkFoo/none              10240        6523 ns/op    598.8 mb/s
BenchmarkFoo/dsync             10240      716752 ns/op    5.450 mb/s
BenchmarkFoo/direct            10240      694162 ns/op    5.627 mb/s
BenchmarkFoo/dsync,direct      10240      708828 ns/op    5.511 mb/s
PASS
```

**b) gceworker local SSD**

```
$ go test -benchtime=10240x . -bench . -cpu 1
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/cockroach/pkg/kv/kvserver/syncexp
cpu: Intel(R) Xeon(R) CPU @ 2.30GHz
BenchmarkFoo/none              10240        6833 ns/op    571.6 mb/s
BenchmarkFoo/dsync             10240      476861 ns/op    8.192 mb/s
BenchmarkFoo/direct            10240      411426 ns/op    9.494 mb/s
BenchmarkFoo/dsync,direct      10240      498408 ns/op    7.837 mb/s
PASS
ok      github.com/cockroachdb/cockroach/pkg/kv/kvserver/syncexp       14.283s
```

**c) GCE pd-ssd**

```
$ ./syncexp.test -test.benchtime 10240x -test.bench . -test.cpu 1
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/cockroach/pkg/kv/kvserver/syncexp
cpu: Intel(R) Xeon(R) CPU @ 2.30GHz
BenchmarkFoo/none              10240        6869 ns/op    568.5 mb/s
--- BENCH: BenchmarkFoo/none
    sync_test.go:70: initialized /mnt/data1/wal-4096.bin (4.0 KiB)
    sync_test.go:70: initialized /mnt/data1/wal-41943040.bin (40 MiB)
BenchmarkFoo/dsync             10240       86123 ns/op    45.36 mb/s
BenchmarkFoo/direct            10240       80876 ns/op    48.30 mb/s
BenchmarkFoo/dsync,direct      10240       80814 ns/op    48.34 mb/s
PASS
```

Release note: None

[^1]: cockroachdb#88442 (comment)
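For context, here is a sketch of the setup such an experiment needs. This is not the syncexp benchmark source (which isn't shown in this thread), just an assumed shape of it: preallocate the target file so writes don't have to update metadata, and open it with the flag combination under test.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// openVariant preallocates path to size bytes and opens it with the flag
// combination being benchmarked (none, dsync, direct, dsync+direct).
// Names and sizes here are assumptions, not the PR's actual code.
func openVariant(path string, size int64, dsync, direct bool) (*os.File, error) {
	flags := os.O_RDWR | os.O_CREATE
	if dsync {
		flags |= unix.O_DSYNC
	}
	if direct {
		flags |= unix.O_DIRECT
	}
	f, err := os.OpenFile(path, flags, 0o644)
	if err != nil {
		return nil, err
	}
	// Reserve all blocks up front, mirroring the "initialized wal-*.bin"
	// lines in the benchmark output above.
	if err := unix.Fallocate(int(f.Fd()), 0, 0, size); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

func main() {
	f, err := openVariant("/mnt/data1/wal-41943040.bin", 40<<20, true, true)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	fmt.Println("opened", f.Name())
}
```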
That makes sense - my code basically corresponds to that. I added some even larger write sizes and was able to get up to the 300mb/s that is likely the throughput limit of the gceworker local SSD. On a beefy AWS gp3 volume (1000mb/s, 16k IOPS) I wasn't able to push past 170mb/s, and in fact the peak declined as the writes got even bigger.
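A hedged sketch of how one might sweep write sizes in a Go benchmark (not the code used for the numbers above; file location and sizes are made up): `b.SetBytes` makes `go test -bench` report throughput, so the ceiling at larger writes becomes visible.

```go
package syncexp

import (
	"fmt"
	"os"
	"testing"

	"golang.org/x/sys/unix"
)

// BenchmarkWriteSizes sweeps the write size and reports MB/s for a plain
// write+fdatasync loop. Illustrative only; no O_DIRECT or preallocation.
func BenchmarkWriteSizes(b *testing.B) {
	for _, size := range []int{4 << 10, 64 << 10, 256 << 10, 1 << 20} {
		b.Run(fmt.Sprintf("%dKiB", size>>10), func(b *testing.B) {
			f, err := os.CreateTemp("", "wal")
			if err != nil {
				b.Fatal(err)
			}
			defer os.Remove(f.Name())
			defer f.Close()
			buf := make([]byte, size)
			b.SetBytes(int64(size))
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				if _, err := f.WriteAt(buf, int64(i)*int64(size)); err != nil {
					b.Fatal(err)
				}
				if err := unix.Fdatasync(int(f.Fd())); err != nil {
					b.Fatal(err)
				}
			}
		})
	}
}
```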
Long story short, unsurprisingly, it looks like pebble's WAL is already close to optimal, with its WAL reuse + fdatasync strategy. As far as I can tell, the latencies in cockroachdb/pebble#76 are basically identical.
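For readers unfamiliar with that strategy, here's a rough sketch of the WAL reuse + fdatasync idea (hypothetical file names, not pebble's actual implementation): recycle an obsolete, fully-written log instead of creating a fresh one, so that fdatasync after each batch doesn't need to flush file metadata.

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Reuse an old, fully-written log by renaming it; every block is
	// already allocated and written, so later fdatasync calls only need
	// to flush data, not metadata.
	if err := os.Rename("000001.log.obsolete", "000002.log"); err != nil {
		log.Fatal(err)
	}
	f, err := os.OpenFile("000002.log", os.O_RDWR, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	payload := make([]byte, 4096)
	var off int64
	for batch := 0; batch < 1024; batch++ {
		// Buffered write of the batch, then make it durable.
		if _, err := f.WriteAt(payload, off); err != nil {
			log.Fatal(err)
		}
		off += int64(len(payload))
		if err := unix.Fdatasync(int(f.Fd())); err != nil {
			log.Fatal(err)
		}
	}
}
```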
Something else I find interesting is that @nvanbenschoten is able to push (at least) 583mb/s of WAL writes in the experiment in etcd-io/etcd#14627 (comment) (no dirty pages in this experiment since the state machine is essentially kept off disk). For the AWS runs I've been using the same provisioned hardware and machine type, yet I can't seem to push past 170mb/s. I'm likely holding something wrong.
^-- I can also hit ~600mb/s with
and also with
I did some more testing to understand the results better. In all the tests I am using an 8KB write size. Some unnecessary lines are removed. The following set of lines repeats during the run (with different offsets):

O_DIRECT
DSYNC
O_DIRECT + DSYNC
What I was really expecting for the last test was just the write command with the FUA bit set:
And then no calls from kworker. I'm not sure if there is a way to make that happen, but I'll look at a few other flags and see what they do.
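One per-write flag that might be worth including in that list (my suggestion, not something confirmed in this thread) is RWF_DSYNC via pwritev2(2), which requests durability for an individual write instead of setting O_DSYNC on the whole file descriptor. Whether this changes what the block layer emits (e.g. a single FUA write) depends on the device and filesystem. A sketch, assuming the target file already exists:

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Open without O_DSYNC; durability is requested per write below.
	f, err := os.OpenFile("/mnt/data1/wal.bin", os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	buf := make([]byte, 8192) // 8KB, matching the write size in the tests above
	iovs := [][]byte{buf}
	// pwritev2 with RWF_DSYNC: this single write is durable when it returns.
	if _, err := unix.Pwritev2(int(f.Fd()), iovs, 0, unix.RWF_DSYNC); err != nil {
		log.Fatal(err)
	}
}
```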
Here are the results from running with and without a background load. The background load is created by running:

Without background
With light synchronous background
With heavy non-sync background
For reference, "pre" means the file is pre-allocated. So in all cases there are huge differences when there is a background job running on the disk: 20x or 400x slower depending on the background job running.
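The exact load-generation command isn't shown above; purely as an illustration, a heavy non-sync background writer could be as simple as the following, which keeps dirtying the page cache on the same disk without ever syncing (path and sizes are assumptions):

```go
package main

import (
	"log"
	"os"
)

func main() {
	f, err := os.Create("/mnt/data1/background.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	buf := make([]byte, 1<<20)   // 1 MiB per write, never fsynced
	const maxSize = 10 << 30     // rewind after ~10 GiB so the disk doesn't fill
	var written int64
	for {
		n, err := f.Write(buf)
		if err != nil {
			log.Fatal(err)
		}
		written += int64(n)
		if written >= maxSize {
			if _, err := f.Seek(0, 0); err != nil {
				log.Fatal(err)
			}
			written = 0
		}
	}
}
```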
Just for completeness, I ran the same tests using GCP local SSD. Results are similar to before.

No load
Low load
High load
Thanks Andrew! The comparison we're most interested in is preallocated+dsync vs preallocated+direct+dsync, right? (Though in reality we'd want to compare recycled files, not just preallocated, or is that really the same in this test?) Looking at the numbers, it seems like O_DIRECT is slightly better but not by a ton, even on p99. Is that correct?