[dnm] kvserver: play with O_DIRECT #91272
Conversation
O_DIRECT is meant to be used with multiple I/O threads, up to the number of I/O channels your hardware provides. It doesn't parallelize operations internally, unlike a sync orchestrated by the OS itself.
Here's the pebble flavor (cockroachdb/pebble#76).

@knz the benchmark here writes 4k blocks so there wouldn't be any chance for concurrency anyway. I think you're saying that if this were production code and it were given larger (SSD-block-aligned) writes, we'd need to do our own concurrency to get better throughput, right?
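To make that concrete, here's a minimal sketch of what "doing our own concurrency" could look like. This is not code from this PR; the file path, worker count, and 4 KiB alignment are illustrative assumptions. Several goroutines issue aligned O_DIRECT writes to disjoint regions of a preexisting, preallocated file, so the device sees more than one outstanding request.

```go
package main

import (
	"log"
	"os"
	"sync"
	"unsafe"

	"golang.org/x/sys/unix"
)

const align = 4096 // assumed SSD block size; must match the device

// alignedBuf over-allocates and slices so the buffer start is align-ed,
// which O_DIRECT requires for both the buffer and the file offset.
func alignedBuf(n int) []byte {
	b := make([]byte, n+align)
	off := align - int(uintptr(unsafe.Pointer(&b[0]))%align)
	if off == align {
		off = 0
	}
	return b[off : off+n]
}

func main() {
	// Hypothetical target file; assumed to already exist and be
	// preallocated so writes don't change file metadata.
	f, err := os.OpenFile("/mnt/data1/wal.bin",
		os.O_WRONLY|unix.O_DIRECT|unix.O_DSYNC, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	const workers = 8 // up to the number of I/O channels the device offers
	const writesPerWorker = 1024
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			buf := alignedBuf(align)
			for i := 0; i < writesPerWorker; i++ {
				// Each worker owns a disjoint, aligned region of the file.
				off := int64(w*writesPerWorker+i) * align
				if _, err := f.WriteAt(buf, off); err != nil {
					log.Fatal(err)
				}
			}
		}(w)
	}
	wg.Wait()
}
```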
Using a combination of preallocated files, aligned writes, direct I/O and fdatasync it is supposedly[^1] possible to get fast durable writes. This is appealing and I wanted to see if I could actually make it happen in a toy experiment.

Bench results show that whatever I'm doing isn't producing the desired effect of high-throughput, durable writes on either gp3 or GCE local SSD. Just using O_DIRECT alone is enough to move throughput well below the 10mb/s threshold. It is true that O_DSYNC doesn't add a large additional penalty, but the damage is already done at that point. The one, maybe, exception is pd-ssd (GCE's attached storage), where we at least get 48mb/s, though with the same pattern of O_DIRECT alone causing the major regression.

Detailed results below.

**a) AWS gp3**

```
roachprod create -n 1 --clouds aws --aws-ebs-iops 16000 \
  --aws-ebs-throughput 1000 --aws-ebs-volume-size 500 \
  --aws-ebs-volume-type=gp3 --aws-machine-type=m5.4xlarge tobias-dio \
  --local-ssd=false
```

```
$ export HOME=/mnt/data1
$ cd $HOME
$ ./syncexp.test -test.benchtime 10240x -test.bench . -test.cpu 1
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/cockroach/pkg/kv/kvserver/syncexp
cpu: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
BenchmarkFoo/none              10240        6523 ns/op    598.8 mb/s
BenchmarkFoo/dsync             10240      716752 ns/op    5.450 mb/s
BenchmarkFoo/direct            10240      694162 ns/op    5.627 mb/s
BenchmarkFoo/dsync,direct      10240      708828 ns/op    5.511 mb/s
PASS
```

**b) gceworker local SSD**

```
$ go test -benchtime=10240x . -bench . -cpu 1
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/cockroach/pkg/kv/kvserver/syncexp
cpu: Intel(R) Xeon(R) CPU @ 2.30GHz
BenchmarkFoo/none              10240        6833 ns/op    571.6 mb/s
BenchmarkFoo/dsync             10240      476861 ns/op    8.192 mb/s
BenchmarkFoo/direct            10240      411426 ns/op    9.494 mb/s
BenchmarkFoo/dsync,direct      10240      498408 ns/op    7.837 mb/s
PASS
ok      github.com/cockroachdb/cockroach/pkg/kv/kvserver/syncexp       14.283s
```

**c) GCE pd-ssd**

```
$ ./syncexp.test -test.benchtime 10240x -test.bench . -test.cpu 1
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/cockroach/pkg/kv/kvserver/syncexp
cpu: Intel(R) Xeon(R) CPU @ 2.30GHz
BenchmarkFoo/none              10240        6869 ns/op    568.5 mb/s
--- BENCH: BenchmarkFoo/none
    sync_test.go:70: initialized /mnt/data1/wal-4096.bin (4.0 KiB)
    sync_test.go:70: initialized /mnt/data1/wal-41943040.bin (40 MiB)
BenchmarkFoo/dsync             10240       86123 ns/op    45.36 mb/s
BenchmarkFoo/direct            10240       80876 ns/op    48.30 mb/s
BenchmarkFoo/dsync,direct      10240       80814 ns/op    48.34 mb/s
PASS
```

Release note: None

[^1]: cockroachdb#88442 (comment)
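For context, here is a sketch of the setup such an experiment needs. This is not the syncexp benchmark source (which isn't shown in this thread), just an assumed shape of it: preallocate the target file so writes don't have to update metadata, and open it with the flag combination under test.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// openVariant preallocates path to size bytes and opens it with the flag
// combination being benchmarked (none, dsync, direct, dsync+direct).
// Names and sizes here are assumptions, not the PR's actual code.
func openVariant(path string, size int64, dsync, direct bool) (*os.File, error) {
	flags := os.O_RDWR | os.O_CREATE
	if dsync {
		flags |= unix.O_DSYNC
	}
	if direct {
		flags |= unix.O_DIRECT
	}
	f, err := os.OpenFile(path, flags, 0o644)
	if err != nil {
		return nil, err
	}
	// Reserve all blocks up front, mirroring the "initialized wal-*.bin"
	// lines in the benchmark output above.
	if err := unix.Fallocate(int(f.Fd()), 0, 0, size); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

func main() {
	f, err := openVariant("/mnt/data1/wal-41943040.bin", 40<<20, true, true)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	fmt.Println("opened", f.Name())
}
```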
That makes sense - my code basically corresponds to that. I added some even larger write sizes and was able to get up to the 300mb/s that is likely the throughput limit of the gceworker local SSD. On a beefy AWS gp3 volume (1000mb/s, 16k IOPS) I wasn't able to push past 170mb/s, and in fact the peak declined as the writes got even bigger.
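A hedged sketch of how one might sweep write sizes in a Go benchmark (not the code used for the numbers above; file location and sizes are made up): `b.SetBytes` makes `go test -bench` report throughput, so the ceiling at larger writes becomes visible.

```go
package syncexp

import (
	"fmt"
	"os"
	"testing"

	"golang.org/x/sys/unix"
)

// BenchmarkWriteSizes sweeps the write size and reports MB/s for a plain
// write+fdatasync loop. Illustrative only; no O_DIRECT or preallocation.
func BenchmarkWriteSizes(b *testing.B) {
	for _, size := range []int{4 << 10, 64 << 10, 256 << 10, 1 << 20} {
		b.Run(fmt.Sprintf("%dKiB", size>>10), func(b *testing.B) {
			f, err := os.CreateTemp("", "wal")
			if err != nil {
				b.Fatal(err)
			}
			defer os.Remove(f.Name())
			defer f.Close()
			buf := make([]byte, size)
			b.SetBytes(int64(size))
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				if _, err := f.WriteAt(buf, int64(i)*int64(size)); err != nil {
					b.Fatal(err)
				}
				if err := unix.Fdatasync(int(f.Fd())); err != nil {
					b.Fatal(err)
				}
			}
		})
	}
}
```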
Long story short, unsurprisingly, it looks like pebble's WAL is already close to optimal, with its WAL reuse + fdatasync strategy. As far as I can tell, the latencies in cockroachdb/pebble#76 are basically identical.
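For readers unfamiliar with that strategy, here's a rough sketch of the WAL reuse + fdatasync idea (hypothetical file names, not pebble's actual implementation): recycle an obsolete, fully-written log instead of creating a fresh one, so that fdatasync after each batch doesn't need to flush file metadata.

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Reuse an old, fully-written log by renaming it; every block is
	// already allocated and written, so later fdatasync calls only need
	// to flush data, not metadata.
	if err := os.Rename("000001.log.obsolete", "000002.log"); err != nil {
		log.Fatal(err)
	}
	f, err := os.OpenFile("000002.log", os.O_RDWR, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	payload := make([]byte, 4096)
	var off int64
	for batch := 0; batch < 1024; batch++ {
		// Buffered write of the batch, then make it durable.
		if _, err := f.WriteAt(payload, off); err != nil {
			log.Fatal(err)
		}
		off += int64(len(payload))
		if err := unix.Fdatasync(int(f.Fd())); err != nil {
			log.Fatal(err)
		}
	}
}
```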
Something else I find interesting is that @nvanbenschoten is able to push (at least) 583mb/s of WAL writes in the experiment in etcd-io/etcd#14627 (comment) (no dirty pages in this experiment since the state machine is essentially kept off disk). For the AWS runs I've been using the same provisioned hardware and machine type, yet I can't seem to push past 170mb/s. I'm likely holding something wrong.
^-- I can also hit ~600mb/s with
and also with
I did some more testing to understand the results better. In all the tests I am using an 8KB write size. Some unnecessary lines are removed. The following set of lines repeats during the run (with different offsets):

O_DIRECT
DSYNC
O_DIRECT + DSYNC
What I was really expecting for the last test was just the write command with the FUA bit set:
And then no calls from kworker. I'm not sure if there is a way to make that happen, but I'll look at a few other flags and see what they do.
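One per-write flag that might be worth including in that list (my suggestion, not something confirmed in this thread) is RWF_DSYNC via pwritev2(2), which requests durability for an individual write instead of setting O_DSYNC on the whole file descriptor. Whether this changes what the block layer emits (e.g. a single FUA write) depends on the device and filesystem. A sketch, assuming the target file already exists:

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Open without O_DSYNC; durability is requested per write below.
	f, err := os.OpenFile("/mnt/data1/wal.bin", os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	buf := make([]byte, 8192) // 8KB, matching the write size in the tests above
	iovs := [][]byte{buf}
	// pwritev2 with RWF_DSYNC: this single write is durable when it returns.
	if _, err := unix.Pwritev2(int(f.Fd()), iovs, 0, unix.RWF_DSYNC); err != nil {
		log.Fatal(err)
	}
}
```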
Here are the results from running with and without a background load. The background load is created by running:

Without background
With light synchronous background
With heavy non-sync background
For reference, "pre" means the file is pre-allocated. So in all cases there are huge differences when there is a background job running on the disk: 20x or 400x slower depending on the background job running.
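The exact load-generation command isn't shown above; purely as an illustration, a heavy non-sync background writer could be as simple as the following, which keeps dirtying the page cache on the same disk without ever syncing (path and sizes are assumptions):

```go
package main

import (
	"log"
	"os"
)

func main() {
	f, err := os.Create("/mnt/data1/background.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	buf := make([]byte, 1<<20)   // 1 MiB per write, never fsynced
	const maxSize = 10 << 30     // rewind after ~10 GiB so the disk doesn't fill
	var written int64
	for {
		n, err := f.Write(buf)
		if err != nil {
			log.Fatal(err)
		}
		written += int64(n)
		if written >= maxSize {
			if _, err := f.Seek(0, 0); err != nil {
				log.Fatal(err)
			}
			written = 0
		}
	}
}
```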
Just for completeness, I ran the same tests using GCP local SSD. Results are similar to before.

No load
Low load
High load
Thanks Andrew! The comparison we're most interested in is preallocated+dsync vs preallocated+direct+dsync, right? (Though in reality we'd want to compare recycled files, not just preallocated, or is that really the same in this test?) Looking at the numbers, it seems like O_DIRECT is slightly better but not by a ton, even on p99. Is that correct?