OpenZFS 8585 - improve batching done in zil_commit() #6566

prakashsurya · 2017-08-25T19:27:33Z

Authored by: Prakash Surya [email protected]
Reviewed by: Brad Lewis [email protected]
Reviewed by: Matt Ahrens [email protected]
Reviewed by: George Wilson [email protected]
Ported-by: Prakash Surya [email protected]

Problem

The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:

When there's outstanding ZIL blocks being written (i.e. there's
already a "writer thread" in progress), then any new calls to
zil_commit() will block waiting for the currently oustanding ZIL
blocks to complete. The blocks written for each "writer thread" is
coined a "batch", and there can only ever be a single "batch" being
written at a time. When a batch is being written, any new ZIL
transactions will have to wait for the next batch to be written,
which won't occur until the current batch finishes.

As a result, the underlying storage may not be used as efficiently
as possible. While "new" threads enter zil_commit() and are blocked
waiting for the next batch, it's possible that the underlying
storage isn't fully utilized by the current batch of ZIL blocks. In
that case, it'd be better to allow these new threads to generate
(and issue) a new ZIL block, such that it could be serviced by the
underlying storage concurrently with the other ZIL blocks that are
being serviced.
Any call to zil_commit() must wait for all ZIL blocks in its "batch"
to complete, prior to zil_commit() returning. The size of any given
batch is proportional to the number of ZIL transaction in the queue
at the time that the batch starts processing the queue; which
doesn't occur until the previous batch completes. Thus, if there's a
lot of transactions in the queue, the batch could be composed of
many ZIL blocks, and each call to zil_commit() will have to wait for
all of these writes to complete (even if the thread calling
zil_commit() only cared about one of the transactions in the batch).

To further complicate the situation, these two issues result in the
following side effect:

If a given batch takes longer to complete than normal, this results
in larger batch sizes, which then take longer to complete and
further drive up the latency of zil_commit(). This can occur for a
number of reasons, including (but not limited to): transient changes
in the workload, and storage latency irregularites.

Solution

The solution attempted by this change has the following goals:

no on-disk changes; maintain current on-disk format.
modify the "batch size" to be equal to the "ZIL block size".
allow new batches to be generated and issued to disk, while there's
already batches being serviced by the disk.
allow zil_commit() to wait for as few ZIL blocks as possible.
use as few ZIL blocks as possible, for the same amount of ZIL
transactions, without introducing significant latency to any
individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.

In theory, with these goals met, the new allgorithm will allow the
following improvements:

new ZIL blocks can be generated and issued, while there's already
oustanding ZIL blocks being serviced by the storage.
the latency of zil_commit() should be proportional to the underlying
storage latency, rather than the incoming synchronous workload.

Porting Notes

Due to the changes made in commit 119a394, the lifetime of an itx
structure differs than in OpenZFS. Specifically, the itx structure is
kept around until the data associated with the itx is considered to be
safe on disk; this is so that the itx's callback can be called after the
data is committed to stable storage. Since OpenZFS doesn't have this itx
callback mechanism, it's able to destroy the itx structure immediately
after the itx is committed to an lwb (before the lwb is written to
disk).

To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:

A list of itxs was added to the lwb structure. This list contains
all of the itxs that have been committed to the lwb, such that the
callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(),
after the data for the itxs is committed to disk.
A list of itxs was added on the stack of the zil_process_commit_list()
function; the "nolwb_itxs" list. In some circumstances, an itx may
not be committed to an lwb (e.g. if allocating the "next" ZIL block
on disk fails), so this list is used to keep track of which itxs
fall into this state, such that their callbacks can be called after
the ZIL's writer pipeline is "stalled".
The logic to actually call the itx's callback was moved into the
zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
were effectively performing the same logic (i.e. if callback is
non-null, call the callback), it seemed like useful code cleanup to
consolidate this logic into a single function.

Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:

The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
removed from "trace_zil.h" as well.
Some of the zilog structure's fields were removed, which affected
the tracepoint definitions of the structure.
New tracepoints had to be added for the following 3 new probes:
- zil__process__commit__itx
- zil__process__normal__itx
- zil__commit__io__error

OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: openzfs/openzfs@5d95a3a

prakashsurya · 2017-08-25T19:43:38Z

@sempervictus this is a second attempt at porting what I previously had in #6337, and attempts to address the same performance issues as #6133 .. if you can, it'd be great if you could try this out in your test environment.

I've fixed the prior issue with msync and the zfstest suite, and was able to successfully run this version through the test suite in a local VM. Additionally, I believe the build issues that you previously hit w.r.t. the new tracepoints have been fixed (tested and verified they worked in my VM).

The OpenZFS PR contains links to data on the performance testing that I previously did for this change (on illumos); see the "Performance Testing" section of that PR's description: openzfs/openzfs#447

sempervictus · 2017-08-25T20:46:04Z

@prakashsurya: adding to the test queue, thanks for digging in so quickly.

prakashsurya · 2017-08-28T15:19:08Z

@behlendorf is the nopwrite/nopwrite_volume test reliable? that appears to have failed, so I'm curious if that's a known failure mode in the existing code, or if it's likely a result of my changes (and I should dig into the root cause). here's what the log shows:

Test: /usr/share/zfs/zfs-tests/tests/functional/nopwrite/nopwrite_volume (run as root) [00:14] [FAIL]
05:35:21.18 ASSERTION: nopwrite works on volumes
05:35:21.33 SUCCESS: zfs set compress=on testpool/testvol.nopwrite
05:35:21.42 SUCCESS: zfs set checksum=sha256 testpool/testvol.nopwrite
05:35:31.03 SUCCESS: zfs clone testpool/testvol.nopwrite@a testpool/clone
05:35:31.33 SUCCESS: zfs set compress=on testpool/clone
05:35:31.43 SUCCESS: zfs set checksum=sha256 testpool/clone
05:35:33.59 SUCCESS: zpool sync testpool
05:35:33.76 NOTE: Comparing 33886720 and 377856 given 2% (calculated: 1.11%)
05:35:33.78 NOTE: Comparing 33907712 and 0 given 2% (calculated: 0%)
05:35:33.79 NOTE: Comparing 33886720 and 377856 given 2% (calculated: 1.11%)
05:35:33.81 NOTE: Comparing 33907712 and 33886720 given 98% (calculated: 99.93%)
05:35:33.83 NOTE: Comparing 33886720 and 33886720 given 98% (calculated: 100.00%)
05:35:35.12 NOTE: Comparing 377856 and 356864 given 98% (calculated: 94.44%)
05:35:35.13 ERROR: verify_nopwrite testpool/testvol.nopwrite testpool/testvol.nopwrite@a testpool/clone exited 1

behlendorf · 2017-08-28T17:49:58Z

@prakashsurya it hasn't been 100% reliable and we've opened #5510 to track the known failure. We see a failure maybe 1/50 runs on the 32-bit tester but haven't had a chance completely root cause and fix the issue. Let me run it through again and see if it's reproducible, below is the link to the previous test failures so we don't lose track of it.

http://build.zfsonlinux.org/builders/Ubuntu%2014.04%20i686%20%28TEST%29/builds/3556

sempervictus · 2017-08-28T17:56:56Z

A bunch of zloop later, still haven't blown it up. Going to push to some metal today, see how it fares on real world ops.

sempervictus · 2017-08-29T15:28:45Z

This is promising - two personal production systems now running with this PR, send/recv working, zvols being used for KVM instances. Contended writes are a lot better - double digit reduction in iowait with multiple VMS running FIO. Thank you @prakashsurya

prakashsurya · 2017-08-29T16:01:31Z

appreciate the feedback; thanks @sempervictus.

behlendorf · 2017-08-30T20:19:16Z

@prakashsurya it looks like there's still a failure on the 32-bit testers which needs to be run down. I ran the tests a few times to make sure and it's reproducible. The test case zpool_clear_001_pos appears to cause the issue and I was able to pull the following stack out of the ZTS logs. ~~Unfortunately, the exact ASSERT we hit was lost~~.

VERIFY3(size != 0) failed (0 != 0)
PANIC at zfs_vnops.c:1014:zfs_get_data()
Showing stack for process 5466
CPU: 0 PID: 5466 Comm: z_vdev_file Tainted: P           OX 3.13.0-117-generic #164-Ubuntu

Call Trace:
 [<c16633c3>] schedule+0x23/0x60
 [<d5606485>] spl_panic+0xc5/0xe0 [spl]
 [<d5ddbac2>] zfs_get_data+0x1b2/0x570 [zfs] 
 [<d5dea586>] zil_commit+0x1266/0x1fd0 [zfs] 
 [<d5dd32dc>] zfs_fsync+0x6c/0xd0 [zfs]
 [<d5e061af>] zpl_fsync+0x7f/0x130 [zfs]
 [<c11a6f9f>] vfs_fsync+0x2f/0x40
 [<d560591a>] vn_fsync+0x3a/0xd0 [spl]
 [<d5d914d0>] vdev_file_io_fsync+0x30/0x50
 [<d5603997>] taskq_thread+0x267/0x5a0 [spl] 
 [<c1076fdb>] kthread+0x9b/0xb0

[edited] I was able to reproduce this issue by running the zpool_clear test group. Updated the stack trace to include the ASSERT hit.

prakashsurya · 2017-08-31T05:46:19Z

Thanks for the report. Briefly looking at the assertion, I don't know why that'd be occurring; I'll have to dig more into it and maybe set up an environment to reproduce it locally. It appears as though the write record passed into zfs_get_data has an lr_length of zero, but I don't immediately know how/why that would occur (nor why it'd only occur on the 32-bit system). This needs more investigation.

sempervictus · 2017-09-01T14:14:58Z

@prakashsurya: saw similar error in a long and laggy send/recv. Not sure if its the patch or a very slow recv host though - hung task didn't crash hosts. The original 64b p4 in there is not loving it. Shows how far chips have come in a decade I guess.

prakashsurya · 2017-09-05T18:58:42Z

While running some tests of this patch in a VM, I managed trip the following assertion:

[ 1806.416279] VERIFY3(itx->itx_wr_state == WR_INDIRECT) failed (4294967295 == 0)
[ 1806.416315] PANIC at zil.c:1506:zil_lwb_commit()
[ 1806.416333] Showing stack for process 28365
[ 1806.416336] CPU: 5 PID: 28365 Comm: fio Tainted: P           OE   4.4.0-93-generic #116-Ubuntu
[ 1806.416337] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
[ 1806.416338]  0000000000000286 16a1c125c5c4eec7 ffff887cd21678c0 ffffffff813f9f83
[ 1806.416341]  ffffffffc136d454 00000000000005e2 ffff887cd21678d0 ffffffffc02d6fd2
[ 1806.416342]  ffff887cd2167a58 ffffffffc02d70a5 ffff887b665d75d0 ffff887c00000030
[ 1806.416344] Call Trace:
[ 1806.416350]  [<ffffffff813f9f83>] dump_stack+0x63/0x90
[ 1806.416362]  [<ffffffffc02d6fd2>] spl_dumpstack+0x42/0x50 [spl]
[ 1806.416366]  [<ffffffffc02d70a5>] spl_panic+0xc5/0x100 [spl]
[ 1806.416428]  [<ffffffffc127ab1e>] ? zio_rewrite+0x6e/0x70 [zfs]
[ 1806.416469]  [<ffffffffc126bd90>] ? zil_lwb_write_open+0x520/0x520 [zfs]
[ 1806.416509]  [<ffffffffc126b96f>] ? zil_lwb_write_open+0xff/0x520 [zfs]
[ 1806.416551]  [<ffffffffc126bd90>] ? zil_lwb_write_open+0x520/0x520 [zfs]
[ 1806.416591]  [<ffffffffc1271daa>] zil_commit+0x14ba/0x1d90 [zfs]
[ 1806.416635]  [<ffffffffc1260d6d>] zfs_write+0xe0d/0x1020 [zfs]
[ 1806.416677]  [<ffffffffc128b391>] zpl_write_common_iovec+0x91/0x130 [zfs]
[ 1806.416717]  [<ffffffffc128b4e1>] zpl_iter_write+0xb1/0xf0 [zfs]
[ 1806.416721]  [<ffffffff8120ec0b>] new_sync_write+0x9b/0xe0
[ 1806.416723]  [<ffffffff8120ec76>] __vfs_write+0x26/0x40
[ 1806.416724]  [<ffffffff8120f5f9>] vfs_write+0xa9/0x1a0
[ 1806.416726]  [<ffffffff81210465>] SyS_pwrite64+0x95/0xb0
[ 1806.416729]  [<ffffffff81039d03>] ? fpu__restore+0xf3/0x150
[ 1806.416734]  [<ffffffff818431f2>] entry_SYSCALL_64_fastpath+0x16/0x71

this was running on a VM with 8 SSDs, 32 vCPUs, 512GB of RAM, and 512 FIO threads performing synchronous writes; I'm not yet sure what the root cause for this is.

sempervictus · 2017-09-05T19:05:56Z

I believe I saw that on a prior revision of this PR. Couldn't repro however.

prakashsurya · 2017-09-05T19:44:14Z

OK, that's good to know. I can reproduce this every time I run this workload on this system, so I hope to root cause it sometime this week (it's a shame not having mdb at my disposal).

sempervictus · 2017-10-07T17:45:28Z

We've had this running for a few weeks including a few production hosts backing cinder with pcie ssds for slog on spinning pools. Writes are definitely snappier and the stalls are greatly reduced under concurrent load. If there's no blocker outstanding, I think this should get merged for wider eval.

kernelOfTruth · 2017-10-07T18:58:12Z

Just to understand this a bit better: this doesn't apply with no dedicated ZIL device connected, right ? [from sempervictus' comment dedicated devices appear to be needed to benefit from the improvements]

In any case those lower latencies and higher IOPS are impressive !

Thanks a lot !

prakashsurya · 2017-10-08T19:05:46Z

@kernelOfTruth The "ZIL" cannot be removed, so I think you meant "SLOG" right? If so, yes, this should improve latency even with no SLOG connected. My performance tests done on illumos/delphixos were done without an SLOG, and showed improved latencies; see here for more details.

@sempervictus There's two blockers which I haven't yet gotten around to fixing:

I hit a panic when running my performance tests on an Ubuntu VM with an all SSD pool, and 32 vCPUs. See here.
Brian hit a panic when running on a 32-bit Ubuntu VM. See here.

IMO, both of these should be investigated and fixed prior to landing.

kernelOfTruth · 2017-10-08T20:06:46Z

@prakashsurya yes, sorry I meant SLOG device, all of the time I was reading on the web about it only was called "ZIL device"

[the following seems to be one article that uses the correct term https://www.ixsystems.com/blog/o-slog-not-slog-best-configure-zfs-intent-log/ ]

Thanks

sempervictus · 2017-10-08T21:43:36Z

@prakashsurya: regarding the crash you experienced, i'm rather suspicious of your hypervisor. Type 2 HVs are known for doing "the wrong thing" when it comes to IO, specifically as relating to synchronous semantics/commit order/other things sane people expect to work. VBox is most guilty of this, but any type 2 is suspect. Can you repro on ESXi?
Far as the failure on 32bit goes, i'm glad to see some support for the platform, but the vast majority of real world consumers are 64bit. Maybe leave in the existing codepath defined out only for 32bit? Are newer 32bit distros seeing this as well? Compiler version does make a difference...
On a related note - I've realized that running Arch Linux (rolling distro) i'm always getting the newest version of gcc (which makes for a nightmarish experience with randstruct and other plugins affecting my kernel and dkms modules). Apaprently there's benefit over the older versions found in more mainline distributions in terms of compatibility and performance. There's a good example of this on the kernel-hardening list in discussions around PIE performance impact. It might make sense to have a builder on such a distribution which can produce packages for other distros so we can compare them against natively built versions when we have reproducers (to track down the problem to the distro or code). It would also give us "current baseline" binaries for each tested commit, which may not hurt either.

@kernelOfTruth: Sorry, wasn't my intent to say its required. We run plenty of SSD-only pools as well, but the majority of data serviced by our cinder-volume containers resides on mirrored spans with PCI-based SLOG mirrors and SAS SSD cache spans. Thats pretty much the most abusive consistent consumer we have (clouds dont sleep apparently), so i was referencing it as the high-water mark.

kernelOfTruth · 2017-10-08T22:35:14Z

@sempervictus no worries - now I know the internals a bit better :)

Makes sense to employ the change where it matters the most - ZFS and the hardware must be beaten pretty heavily in that usage scenario - that'll (as shown) most likely also will reveal any existing, dormant issues (if applicable) 👍

prakashsurya · 2017-10-09T04:23:13Z

@sempervictus For my performance testing, and the panic I linked to, I was using VMware for the hypervisor. It was the same server/hypervisor that I used for the tests that I ran on illumos.

$ vmware -vl
VMware ESXi 6.0.0 build-3620759
VMware ESXi 6.0.0 Update 2
$ uname -a
VMkernel zfsperfesx1 6.0.0 #1 SMP Release build-3620759 Mar  3 2016 18:41:52 x86_64 x86_64 x86_64 ESXi

behlendorf · 2017-10-11T00:06:29Z

@prakashsurya if you get a chance can you rebase this one more time on master. I'd like to see if the 32-bit fix from #6734, which was merged today, has any impact of the i686 test results.

prakashsurya · 2017-10-11T00:14:17Z

@behlendorf done.

behlendorf · 2017-10-12T00:51:31Z

That's a bit better, no ASSERT. Those all look like known spurious test failures, which suggests the root cause for this issue may be aggravated by a low memory situation. I'll send the build through a few more times to see if we always get the same set of test case failures.

prakashsurya · 2017-10-12T15:48:17Z

To make sure I understand, you're saying the 32-bit failure was due to an unrelated bug (i.e. not in this patch) that's been fixed? (pending re-running the tests a few more times)

behlendorf · 2017-10-12T17:35:52Z

Not exactly. The 32-bug, which was recently fixed, caused the ARC to miscalculate the amount of available free memory because it wasn't excluding free highmem pages. As a result the ARC wasn't shrinking when it needed to which put low memory stress on the rest of the system. This definitely resulted in OOM events and possibly memory allocation failures. How that ended up causing the ASSERT isn't clear.

However, since resolving that issue I've been unable to reproduce the ASSERT reported above in any of my 32-bit testing. The additional buildbot runs (4 so far) also only showed known spurious unrelated ZTS failures. And in fact we did get one entirely clean test run.

My suspicion is that the flaw isn't 32-bit specific but was simply easier to reproduce there do to this other problem. The ASSERT probably could happen on an x86_64 system too given the right circumstances. But since we don't actually know the root cause that's all really just guess work.

@sempervictus @prakashsurya I'd be open to merging this for broader testing, assuming we're no longer able to reproduce either of the two above ASSERTs. And we can verify the merging this doesn't increase the frequency of those unrelated ZTS test failures.

sempervictus · 2017-11-12T02:47:12Z

Does this PR change on-disk structures at all in the ZIL? Trying to import a pool which suffered a crash using master-only builds results in:

[Sun Nov 12 01:51:55 2017] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [dmu_objset_find:6907]
[Sun Nov 12 01:51:55 2017] Modules linked in: serpent_sse2_x86_64 serpent_generic lrw glue_helper ablk_helper cryptd jc42 nls_iso8859_1 nls_cp437 vfat fat ext4 crc16 jbd2 fscrypto mbcache iTCO_wdt iTCO_vendor_support gpio_ich ppdev ttm drm_kms_helper intel_powerclamp drm coretemp intel_cstate input_leds evdev i2c_algo_bit fb_sys_fops led_class joydev syscopyarea mousedev mac_hid sysfillrect usbmouse usbkbd sysimgblt psmouse uio_pdrv_genirq pcspkr uio i2c_i801 parport_pc i2c_smbus ioatdma i7core_edac parport button edac_core shpchp i5500_temp ipmi_ssif dca lpc_ich acpi_cpufreq sch_fq_codel 8021q garp mrp iscsi_scst scst_local scst dlm dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c crc32c_generic kvm_intel kvm irqbypass ipmi_devintf ipmi_si ipmi_msghandler br_netfilter bridge stp llc ip_tables x_tables
[Sun Nov 12 01:51:55 2017]  zfs(POE) zunicode(POE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) zlib_deflate xts gf128mul algif_skcipher af_alg dm_crypt dm_mod sd_mod mlx4_en hid_generic usbhid hid uas usb_storage uhci_hcd serio_raw atkbd libps2 crc32c_intel mlx4_core ahci devlink libahci ehci_pci ehci_hcd usbcore e1000e mpt3sas usb_common ptp raid_class scsi_transport_sas pps_core i8042 serio fjes
[Sun Nov 12 01:51:55 2017] CPU: 0 PID: 6907 Comm: dmu_objset_find Tainted: P           OEL  4.9.61 #1
[Sun Nov 12 01:51:55 2017] Hardware name: Intel Thurley/Greencity, BIOS GEMAV200 01/28/2011
[Sun Nov 12 01:51:55 2017] task: ffff8807eb65ee00 task.stack: ffffc9003e910000
[Sun Nov 12 01:51:55 2017] RIP: 0010:[<ffffffffa0626649>]  [<ffffffffa0626649>] zil_claim_log_record+0x19/0x60 [zfs]
[Sun Nov 12 01:51:55 2017] RSP: 0000:ffffc9003e913a38  EFLAGS: 00000246
[Sun Nov 12 01:51:55 2017] RAX: 0000000000000000 RBX: ffffc90036d86efe RCX: 0000000002305788
[Sun Nov 12 01:51:55 2017] RDX: 0000000000000000 RSI: ffffc90036d86efe RDI: ffff880813d56800
[Sun Nov 12 01:51:55 2017] RBP: ffffc9003e913a58 R08: ffff8807e0324b58 R09: 00000001802e001c
[Sun Nov 12 01:51:55 2017] R10: 0100000000000000 R11: 0100000000000000 R12: ffff880813d56800
[Sun Nov 12 01:51:55 2017] R13: 0000000000000000 R14: 0000000002305788 R15: ffffffffffffffff
[Sun Nov 12 01:51:55 2017] FS:  0000000000000000(0000) GS:ffff88081fa00000(0000) knlGS:0000000000000000
[Sun Nov 12 01:51:55 2017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sun Nov 12 01:51:55 2017] CR2: 00000061b0b41000 CR3: 0000000001d5a000 CR4: 00000000000006f0
[Sun Nov 12 01:51:55 2017] Stack:
[Sun Nov 12 01:51:55 2017]  0000000000000000 ffffffffa0626630 ffffc90036d86efe 00000002540ee182
[Sun Nov 12 01:51:55 2017]  ffffc9003e913c68 ffffffffa06280b8 ffffc90001800190 ffffc9003e913aec
[Sun Nov 12 01:51:55 2017]  ffffc9003e913af8 ffffffffa0626630 ffffc90037d97000 0000000000000000
[Sun Nov 12 01:51:55 2017] Call Trace:
[Sun Nov 12 01:51:55 2017]  [<ffffffffa0626630>] ? zil_claim_log_block+0x90/0x90 [zfs]
[Sun Nov 12 01:51:55 2017]  [<ffffffffa06280b8>] zil_parse+0x688/0x910 [zfs]
[Sun Nov 12 01:51:55 2017]  [<ffffffffa0626630>] ? zil_claim_log_block+0x90/0x90 [zfs]
[Sun Nov 12 01:51:55 2017]  [<ffffffffa06265a0>] ? zil_replaying+0x70/0x70 [zfs]
[Sun Nov 12 01:51:55 2017]  [<ffffffffa0628d91>] zil_check_log_chain+0xe1/0x1a0 [zfs]
[Sun Nov 12 01:51:55 2017]  [<ffffffffa0574e11>] dmu_objset_find_dp_impl+0x151/0x400 [zfs]
[Sun Nov 12 01:51:55 2017]  [<ffffffffa05750e6>] dmu_objset_find_dp_cb+0x26/0x40 [zfs]
[Sun Nov 12 01:51:55 2017]  [<ffffffffa014cca1>] taskq_thread+0x2b1/0x4d0 [spl]
[Sun Nov 12 01:51:55 2017]  [<ffffffff810a0160>] ? wake_up_q+0x90/0x90
[Sun Nov 12 01:51:55 2017]  [<ffffffffa014c9f0>] ? task_done+0xa0/0xa0 [spl]
[Sun Nov 12 01:51:55 2017]  [<ffffffff81095ddd>] kthread+0xfd/0x120
[Sun Nov 12 01:51:55 2017]  [<ffffffff81095ce0>] ? kthread_parkme+0x40/0x40
[Sun Nov 12 01:51:55 2017]  [<ffffffff8180c5e7>] ret_from_fork+0x37/0x50

sempervictus · 2017-11-13T19:55:05Z

Good news: while my pool is dead as a doornail, the cause did not come from here. I rebuilt and retested with all the PRs/commits in my test stack atop master and seem to have localized this to the datto work around scrubs and the likely abandoned crypt fmt fix branch. I'll keep a long running zloop on this build to try and work out any kinks, but far as my issues go, i'm 99% sure this didnt cause them, just happened to be in play when the fan was hit.

behlendorf · 2017-11-14T00:17:28Z

@sempervictus no ZIL format changes in this PR or the scrub work. If you do have any suspicions about the current scrub patches please let me know now since those will likely be merged this week.

behlendorf · 2017-11-16T00:40:11Z

@prakashsurya I was looking at the ztest ASSERT from CentOS 6 in zil_lwb_write_issue. Specifically it ASSERT'ed because dmu_tx_assign returned ERESTART due to a suspended pool. While far from likely there are a few other ways ERESTART can be returned here.

In all of these rare cases it looks like the right thing to do is abort the TX and return NULL. What do you think about removing this ASSERT and updating the comment accordingly. Incidentally this isn't a new issue introduced by this PR, but since we've been focusing on the ZIL we may be just more aware of it.

error == 5 (0x55 == 0x5)
ASSERT at ../../module/zfs/zil.c:1304:zil_lwb_write_issue()/sbin/ztest[0x407c13]
/lib64/libpthread.so.0(+0xf7e0)[0x7f8f078707e0]
/lib64/libc.so.6(gsignal+0x35)[0x7f8f074ff495]
/lib64/libc.so.6(abort+0x175)[0x7f8f07500c75]
/lib64/libzpool.so.2(+0x162c33)[0x7f8f0868ac33]
/lib64/libzpool.so.2(+0x1669b8)[0x7f8f0868e9b8]
/lib64/libzpool.so.2(zil_commit+0x9fe)[0x7f8f0869119e]
/sbin/ztest[0x408a87]
/sbin/ztest[0x408ddd]
/lib64/libpthread.so.0(+0x7aa1)[0x7f8f07868aa1]
/lib64/libc.so.6(clone+0x6d)[0x7f8f075b5bcd]
child died with signal 6

behlendorf · 2017-11-17T17:41:05Z

@sempervictus I believe @prakashsurya's fix in e93b150 could explain you corrupt ZIL record.

sempervictus · 2017-11-17T19:08:30Z

Could that result in the bad dva issue?

behlendorf · 2017-11-17T19:40:24Z

@sempervictus I believe so under an unlikely set of circumstances. You'd need to write a ZIL record which included a dva which was damaged due to the memory it was stored in being free'd and reused as described in e93b150. Then you'd have to replay that damaged ZIL record on mount. You'd need to be really really unlucky, but I think it could happen.

prakashsurya · 2017-11-17T22:42:36Z

Now that I have a working implementation, I was able to run the same performance tests that I used on illumos, to test this PR on Ubuntu 16.04. See here for the results.

@behlendorf we can discuss the results in detail after the holiday, but the SSD numbers are a little concerning. I've copied the graphs of the percentage change in latency between master and this PR below (these are also contain in the jupyter notebooks which I linked to above):

Using HDDs -- each `fio` thread submitting writes as fast as possible:

Using HDDs -- each `fio` thread submitting 64 writes per second:

Using SSDs -- each `fio` thread submitting writes as fast as possible:

Using SSDs -- each `fio` thread submitting 64 writes per second:

sempervictus · 2017-11-18T02:53:19Z

In the graphs, does the term "using" indicate for slog or for data?

sempervictus · 2017-11-18T04:20:19Z

@prakashsurya: the pool which keeps getting "murdolated" in #6856 is a 5 member RAIDZ1 pool consisting of 250G Samsung EVO 850s on a SAS HBA. A 5-wide 2-way mirror span running its disks and cache from the same HBA (SLOG are m2's mirrored on PCI for that pool), is unaffected during these events, as is a 4 member RAID1 1T EVO 8 50 pool (but that's seeing no IO, and is on separate datapaths internally). Its getting killed, repeatedly, on a build consisting of nothing but master (at time of build) and this PR. I just refreshed the system with the same thing built a few hours ago with your updates along with those that went into master, destroyed the unimportable pool (again), rebuilt it, reseeded it with the VMs and containers it needed from their last known good snap, and lost the thing again in a matter of days. Something's rotten in the state of Denmark, and i'm apparently down-wind.
I cant say for sure that this PR is causing that failure, but whatever IS going on, its requiring manual recovery efforts in a place about as automated as Megatron's bowels, at a somewhat less -than-optimal time :-). Be glad to keep going through this till Monday, but i'll have to get a master branch build at that point if it keeps happening and we dont know why (identical system is still running an older revision of this PR, as are many others, with none of the stack traces related to this unfortunate set of events).

prakashsurya · 2017-11-20T18:19:41Z

@sempervictus For my tests, I was not using an SLOG. "Using" simply means the disks used by the zpool itself; i.e. I had pools with 1 to 8 HDDs, and pools with 1 to 8 SSDs (no redundancy, no SLOG).

As far as your failures.. you're still seeing problems with the latest code of this PR, with my follow-up fixes? Are the issues seen on ZIL replay (i.e. initial mount), or after the FS/dataset is mounted?

sempervictus · 2017-11-20T20:15:30Z

@prakashsurya: i've had to pull this PR out of the stack because it seems to be causing two distinct issues:
1 - ARC, for some reason, starts regulating itself despite sysctl settings. i have no idea how in the world this is happening from this PR, but dropping it from the stack restores normal behavior.
2 - With the rebased code, a RAIDZ1 pool consisting entirely of SSDs went bad on me once. Its running with this PR, currently, and sending its deltas hourly to another system (hoping to "validate in flight"). I am thinking that the issues, both times, were caused by IO from ZVOLs backing control nodes for openstack (MySQL, Rabbit, etc, in KVM instances on raw ZVOLs). That particular pool also serves an LXC container with Zenoss doing all the same MySQL and Rabbit stuff, along with a ton of file writes via the ZPL as opposed to ZVOL path. Errors always seem to start from ZVOL ops though, so the ZPL IO may simply be adding contenttion. Same pool went bad before the rebase as well, so i'm not sure if the cause of the error was what you fixed, or what i'm now thinking #6294 is designed to fix (#6888). Reason for this logic is that what i saw last night didnt have this PR in it, on a wholly separate system, in a another location. It was just a master build, and too much IO to a ZVOL killed in that specific manner. I do however suspect that the write improvements here are causing those failures to amplify as opposed to the "serialized failure" without the optimizations here, potentially explaining my bad pools. Do scrubs find DVA errors?

behlendorf · 2017-11-27T20:33:39Z

@sempervictus were you able to confirm that removing this PR from your stack resolved the two issue you mentioned?

Do scrubs find DVA errors?

It depends exactly what;'s wrong with the DVA but in general yes.

dweeezil · 2017-11-27T20:41:19Z

I've been loosely following this issue, along with #6874, and it seems that the bad blkptrs are being written to the pool on systems which do have ECC memory. In that case, it might be useful to put some calls to zfs_blkptr_verify() in the write path, potentially even in zio_create() (when bp isn't null) to try to catch the condition under which they're being written.

behlendorf · 2017-11-27T22:01:53Z

@dweeezil good thought. For debugging it might be useful to additionally verify the blkptr in zio_ready and zio_done.

prakashsurya · 2017-11-30T17:23:09Z

@behlendorf I rebased this onto the latest master, and squashed to a single commit. The only failure I see is the following, from the 32-bit tester:

Test: /usr/share/zfs/zfs-tests/tests/functional/vdev_zaps/vdev_zaps_004_pos (run as root) [00:03] [FAIL]
23:16:12.42 ASSERTION: Per-vdev ZAPs are transferred properly on attach/detach
23:16:12.56 SUCCESS: zpool create -f testpool loop0
23:16:15.42 SUCCESS: zpool attach testpool loop0 loop1
23:16:15.81 zdb: can't open 'testpool': No such device or address

Is this a known flakey test? This doesn't seem related to my changes at first glance.

behlendorf · 2017-11-30T17:32:15Z

@prakashsurya thanks! That's a known spurious failure, I've resubmitting that build.

Authored by: Prakash Surya <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Prakash Surya <[email protected]> Problem ======= The current implementation of zil_commit() can introduce significant latency, beyond what is inherent due to the latency of the underlying storage. The additional latency comes from two main problems: 1. When there's outstanding ZIL blocks being written (i.e. there's already a "writer thread" in progress), then any new calls to zil_commit() will block waiting for the currently oustanding ZIL blocks to complete. The blocks written for each "writer thread" is coined a "batch", and there can only ever be a single "batch" being written at a time. When a batch is being written, any new ZIL transactions will have to wait for the next batch to be written, which won't occur until the current batch finishes. As a result, the underlying storage may not be used as efficiently as possible. While "new" threads enter zil_commit() and are blocked waiting for the next batch, it's possible that the underlying storage isn't fully utilized by the current batch of ZIL blocks. In that case, it'd be better to allow these new threads to generate (and issue) a new ZIL block, such that it could be serviced by the underlying storage concurrently with the other ZIL blocks that are being serviced. 2. Any call to zil_commit() must wait for all ZIL blocks in its "batch" to complete, prior to zil_commit() returning. The size of any given batch is proportional to the number of ZIL transaction in the queue at the time that the batch starts processing the queue; which doesn't occur until the previous batch completes. Thus, if there's a lot of transactions in the queue, the batch could be composed of many ZIL blocks, and each call to zil_commit() will have to wait for all of these writes to complete (even if the thread calling zil_commit() only cared about one of the transactions in the batch). To further complicate the situation, these two issues result in the following side effect: 3. If a given batch takes longer to complete than normal, this results in larger batch sizes, which then take longer to complete and further drive up the latency of zil_commit(). This can occur for a number of reasons, including (but not limited to): transient changes in the workload, and storage latency irregularites. Solution ======== The solution attempted by this change has the following goals: 1. no on-disk changes; maintain current on-disk format. 2. modify the "batch size" to be equal to the "ZIL block size". 3. allow new batches to be generated and issued to disk, while there's already batches being serviced by the disk. 4. allow zil_commit() to wait for as few ZIL blocks as possible. 5. use as few ZIL blocks as possible, for the same amount of ZIL transactions, without introducing significant latency to any individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks. In theory, with these goals met, the new allgorithm will allow the following improvements: 1. new ZIL blocks can be generated and issued, while there's already oustanding ZIL blocks being serviced by the storage. 2. the latency of zil_commit() should be proportional to the underlying storage latency, rather than the incoming synchronous workload. Porting Notes ============= Due to the changes made in commit 119a394, the lifetime of an itx structure differs than in OpenZFS. Specifically, the itx structure is kept around until the data associated with the itx is considered to be safe on disk; this is so that the itx's callback can be called after the data is committed to stable storage. Since OpenZFS doesn't have this itx callback mechanism, it's able to destroy the itx structure immediately after the itx is committed to an lwb (before the lwb is written to disk). To support this difference, and to ensure the itx's callbacks can still be called after the itx's data is on disk, a few changes had to be made: * A list of itxs was added to the lwb structure. This list contains all of the itxs that have been committed to the lwb, such that the callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(), after the data for the itxs is committed to disk. * A list of itxs was added on the stack of the zil_process_commit_list() function; the "nolwb_itxs" list. In some circumstances, an itx may not be committed to an lwb (e.g. if allocating the "next" ZIL block on disk fails), so this list is used to keep track of which itxs fall into this state, such that their callbacks can be called after the ZIL's writer pipeline is "stalled". * The logic to actually call the itx's callback was moved into the zil_itx_destroy() function. Since all consumers of zil_itx_destroy() were effectively performing the same logic (i.e. if callback is non-null, call the callback), it seemed like useful code cleanup to consolidate this logic into a single function. Additionally, the existing Linux tracepoint infrastructure dealing with the ZIL's probes and structures had to be updated to reflect these code changes. Specifically: * The "zil__cw1" and "zil__cw2" probes were removed, so they had to be removed from "trace_zil.h" as well. * Some of the zilog structure's fields were removed, which affected the tracepoint definitions of the structure. * New tracepoints had to be added for the following 3 new probes: * zil__process__commit__itx * zil__process__normal__itx * zil__commit__io__error OpenZFS-issue: https://www.illumos.org/issues/8585 OpenZFS-commit: openzfs/openzfs@5d95a3a

behlendorf

I've rerun @prakashsurya's performance tests on a different system and those results are more inline with what was observed on Illumos. This change improves ZIL performance for the majority of workloads. The one exception, which can be further investigated after the merge, is a penalty for single threaded applications issuing sync IO at the maximum rate. During all of my testing I did not observe any functional issues. This LGTM.

codecov · 2017-12-05T12:45:35Z

Codecov Report

Merging #6566 into master will decrease coverage by <.01%.
The diff coverage is 93.86%.

@@            Coverage Diff            @@
##           master   #6566      +/-   ##
=========================================
- Coverage    75.2%   75.2%   -0.01%     
=========================================
  Files         296     298       +2     
  Lines       95200   95521     +321     
=========================================
+ Hits        71599   71833     +234     
- Misses      23601   23688      +87

Flag	Coverage Δ
#kernel	`74.6% <87.07%> (+0.3%)`	⬆️
#user	`67.51% <91.19%> (+0.1%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7b34070...9948347. Read the comment docs.

codecov · 2017-12-05T20:53:29Z

Codecov Report

Merging #6566 into master will decrease coverage by <.01%.
The diff coverage is 93.86%.

@@            Coverage Diff            @@
##           master   #6566      +/-   ##
=========================================
- Coverage    75.2%   75.2%   -0.01%     
=========================================
  Files         296     298       +2     
  Lines       95200   95521     +321     
=========================================
+ Hits        71599   71833     +234     
- Misses      23601   23688      +87

Flag	Coverage Δ
#kernel	`74.6% <87.07%> (+0.3%)`	⬆️
#user	`67.51% <91.19%> (+0.1%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7b34070...9948347. Read the comment docs.

Authored by: Prakash Surya <[email protected]> Reviewed by: Brad Lewis <[email protected]> Reviewed by: Matt Ahrens <[email protected]> Reviewed by: George Wilson <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Approved by: Dan McDonald <[email protected]> Ported-by: Prakash Surya <[email protected]> Problem ======= The current implementation of zil_commit() can introduce significant latency, beyond what is inherent due to the latency of the underlying storage. The additional latency comes from two main problems: 1. When there's outstanding ZIL blocks being written (i.e. there's already a "writer thread" in progress), then any new calls to zil_commit() will block waiting for the currently oustanding ZIL blocks to complete. The blocks written for each "writer thread" is coined a "batch", and there can only ever be a single "batch" being written at a time. When a batch is being written, any new ZIL transactions will have to wait for the next batch to be written, which won't occur until the current batch finishes. As a result, the underlying storage may not be used as efficiently as possible. While "new" threads enter zil_commit() and are blocked waiting for the next batch, it's possible that the underlying storage isn't fully utilized by the current batch of ZIL blocks. In that case, it'd be better to allow these new threads to generate (and issue) a new ZIL block, such that it could be serviced by the underlying storage concurrently with the other ZIL blocks that are being serviced. 2. Any call to zil_commit() must wait for all ZIL blocks in its "batch" to complete, prior to zil_commit() returning. The size of any given batch is proportional to the number of ZIL transaction in the queue at the time that the batch starts processing the queue; which doesn't occur until the previous batch completes. Thus, if there's a lot of transactions in the queue, the batch could be composed of many ZIL blocks, and each call to zil_commit() will have to wait for all of these writes to complete (even if the thread calling zil_commit() only cared about one of the transactions in the batch). To further complicate the situation, these two issues result in the following side effect: 3. If a given batch takes longer to complete than normal, this results in larger batch sizes, which then take longer to complete and further drive up the latency of zil_commit(). This can occur for a number of reasons, including (but not limited to): transient changes in the workload, and storage latency irregularites. Solution ======== The solution attempted by this change has the following goals: 1. no on-disk changes; maintain current on-disk format. 2. modify the "batch size" to be equal to the "ZIL block size". 3. allow new batches to be generated and issued to disk, while there's already batches being serviced by the disk. 4. allow zil_commit() to wait for as few ZIL blocks as possible. 5. use as few ZIL blocks as possible, for the same amount of ZIL transactions, without introducing significant latency to any individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks. In theory, with these goals met, the new allgorithm will allow the following improvements: 1. new ZIL blocks can be generated and issued, while there's already oustanding ZIL blocks being serviced by the storage. 2. the latency of zil_commit() should be proportional to the underlying storage latency, rather than the incoming synchronous workload. Porting Notes ============= Due to the changes made in commit 119a394, the lifetime of an itx structure differs than in OpenZFS. Specifically, the itx structure is kept around until the data associated with the itx is considered to be safe on disk; this is so that the itx's callback can be called after the data is committed to stable storage. Since OpenZFS doesn't have this itx callback mechanism, it's able to destroy the itx structure immediately after the itx is committed to an lwb (before the lwb is written to disk). To support this difference, and to ensure the itx's callbacks can still be called after the itx's data is on disk, a few changes had to be made: * A list of itxs was added to the lwb structure. This list contains all of the itxs that have been committed to the lwb, such that the callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(), after the data for the itxs is committed to disk. * A list of itxs was added on the stack of the zil_process_commit_list() function; the "nolwb_itxs" list. In some circumstances, an itx may not be committed to an lwb (e.g. if allocating the "next" ZIL block on disk fails), so this list is used to keep track of which itxs fall into this state, such that their callbacks can be called after the ZIL's writer pipeline is "stalled". * The logic to actually call the itx's callback was moved into the zil_itx_destroy() function. Since all consumers of zil_itx_destroy() were effectively performing the same logic (i.e. if callback is non-null, call the callback), it seemed like useful code cleanup to consolidate this logic into a single function. Additionally, the existing Linux tracepoint infrastructure dealing with the ZIL's probes and structures had to be updated to reflect these code changes. Specifically: * The "zil__cw1" and "zil__cw2" probes were removed, so they had to be removed from "trace_zil.h" as well. * Some of the zilog structure's fields were removed, which affected the tracepoint definitions of the structure. * New tracepoints had to be added for the following 3 new probes: * zil__process__commit__itx * zil__process__normal__itx * zil__commit__io__error OpenZFS-issue: https://www.illumos.org/issues/8585 OpenZFS-commit: openzfs/openzfs@5d95a3a Closes openzfs#6566

behlendorf added the Ready for Testing label Aug 28, 2017

behlendorf mentioned this pull request Aug 29, 2017

ZIL Pipelining #6133

Closed

12 tasks

openzfs deleted a comment from codecov bot Sep 23, 2017

prakashsurya requested review from nedbass and removed request for nedbass October 16, 2017 19:53

sempervictus mentioned this pull request Nov 13, 2017

Broken ZIL (during heavy testing of #6566) #6856

Closed

sempervictus mentioned this pull request Nov 17, 2017

PANIC: blkptr at ffff880eb6742800 DVA 0 has invalid VDEV 4294967295 #6874

Closed

prakashsurya mentioned this pull request Nov 27, 2017

Hang when running zloop - apparently due to "zio_wait" #6901

Closed

behlendorf mentioned this pull request Nov 30, 2017

Preserve itx alloc size for zio_data_buf_free() #6912

Merged

13 tasks

behlendorf approved these changes Dec 4, 2017

View reviewed changes

behlendorf merged commit 1ce23dc into openzfs:master Dec 5, 2017

scineram mentioned this pull request Jul 27, 2018

Want ZIL Pipelining #6097

Closed

OpenZFS 8585 - improve batching done in zil_commit() #6566

OpenZFS 8585 - improve batching done in zil_commit() #6566

Conversation

prakashsurya commented Aug 25, 2017 • edited Loading

Problem

Solution

Porting Notes

prakashsurya commented Aug 25, 2017 • edited Loading

sempervictus commented Aug 25, 2017

prakashsurya commented Aug 28, 2017 • edited Loading

behlendorf commented Aug 28, 2017 • edited Loading

sempervictus commented Aug 28, 2017

sempervictus commented Aug 29, 2017

prakashsurya commented Aug 29, 2017

behlendorf commented Aug 30, 2017 • edited Loading

prakashsurya commented Aug 31, 2017

sempervictus commented Sep 1, 2017

prakashsurya commented Sep 5, 2017

sempervictus commented Sep 5, 2017 via email

prakashsurya commented Sep 5, 2017 • edited Loading

sempervictus commented Oct 7, 2017

kernelOfTruth commented Oct 7, 2017 • edited Loading

prakashsurya commented Oct 8, 2017

kernelOfTruth commented Oct 8, 2017 • edited Loading

sempervictus commented Oct 8, 2017 • edited Loading

kernelOfTruth commented Oct 8, 2017

prakashsurya commented Oct 9, 2017 • edited Loading

behlendorf commented Oct 11, 2017

prakashsurya commented Oct 11, 2017

behlendorf commented Oct 12, 2017

prakashsurya commented Oct 12, 2017

behlendorf commented Oct 12, 2017

sempervictus commented Nov 12, 2017

sempervictus commented Nov 13, 2017

behlendorf commented Nov 14, 2017

behlendorf commented Nov 16, 2017 • edited Loading

behlendorf commented Nov 17, 2017

sempervictus commented Nov 17, 2017 via email

behlendorf commented Nov 17, 2017

prakashsurya commented Nov 17, 2017

Using HDDs -- each fio thread submitting writes as fast as possible:

Using HDDs -- each fio thread submitting 64 writes per second:

Using SSDs -- each fio thread submitting writes as fast as possible:

Using SSDs -- each fio thread submitting 64 writes per second:

sempervictus commented Nov 18, 2017 via email

sempervictus commented Nov 18, 2017

prakashsurya commented Nov 20, 2017

sempervictus commented Nov 20, 2017

behlendorf commented Nov 27, 2017

dweeezil commented Nov 27, 2017

behlendorf commented Nov 27, 2017

prakashsurya commented Nov 30, 2017

behlendorf commented Nov 30, 2017

behlendorf left a comment

Choose a reason for hiding this comment

codecov bot commented Dec 5, 2017

Codecov Report

codecov bot commented Dec 5, 2017

Codecov Report

prakashsurya commented Aug 25, 2017 •

edited

Loading

prakashsurya commented Aug 25, 2017 •

edited

Loading

prakashsurya commented Aug 28, 2017 •

edited

Loading

behlendorf commented Aug 28, 2017 •

edited

Loading

behlendorf commented Aug 30, 2017 •

edited

Loading

prakashsurya commented Sep 5, 2017 •

edited

Loading

kernelOfTruth commented Oct 7, 2017 •

edited

Loading

kernelOfTruth commented Oct 8, 2017 •

edited

Loading

sempervictus commented Oct 8, 2017 •

edited

Loading

prakashsurya commented Oct 9, 2017 •

edited

Loading

behlendorf commented Nov 16, 2017 •

edited

Loading

Using HDDs -- each `fio` thread submitting writes as fast as possible:

Using HDDs -- each `fio` thread submitting 64 writes per second:

Using SSDs -- each `fio` thread submitting writes as fast as possible:

Using SSDs -- each `fio` thread submitting 64 writes per second: