OpenZFS 8585 - improve batching done in zil_commit() #6566

Merged: behlendorf merged 1 commit into openzfs:master on Dec 5, 2017

Conversation

prakashsurya
Member

@prakashsurya prakashsurya commented Aug 25, 2017

Authored by: Prakash Surya [email protected]
Reviewed by: Brad Lewis [email protected]
Reviewed by: Matt Ahrens [email protected]
Reviewed by: George Wilson [email protected]
Ported-by: Prakash Surya [email protected]

Problem

The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:

  1. When there are outstanding ZIL blocks being written (i.e. there's
    already a "writer thread" in progress), then any new calls to
    zil_commit() will block waiting for the currently outstanding ZIL
    blocks to complete. The blocks written for each "writer thread" are
    coined a "batch", and there can only ever be a single "batch" being
    written at a time. When a batch is being written, any new ZIL
    transactions will have to wait for the next batch to be written,
    which won't occur until the current batch finishes.

    As a result, the underlying storage may not be used as efficiently
    as possible. While "new" threads enter zil_commit() and are blocked
    waiting for the next batch, it's possible that the underlying
    storage isn't fully utilized by the current batch of ZIL blocks. In
    that case, it'd be better to allow these new threads to generate
    (and issue) a new ZIL block, such that it could be serviced by the
    underlying storage concurrently with the other ZIL blocks that are
    being serviced.

  2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
    to complete, prior to zil_commit() returning. The size of any given
    batch is proportional to the number of ZIL transactions in the queue
    at the time that the batch starts processing the queue; which
    doesn't occur until the previous batch completes. Thus, if there's a
    lot of transactions in the queue, the batch could be composed of
    many ZIL blocks, and each call to zil_commit() will have to wait for
    all of these writes to complete (even if the thread calling
    zil_commit() only cared about one of the transactions in the batch).

To further complicate the situation, these two issues result in the
following side effect:

  1. If a given batch takes longer to complete than normal, this results
    in larger batch sizes, which then take longer to complete and
    further drive up the latency of zil_commit(). This can occur for a
    number of reasons, including (but not limited to): transient changes
    in the workload, and storage latency irregularities.
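
To make the feedback loop above concrete, here is a small user-space toy model (not ZFS code; the arrival rate, itxs-per-block ratio, and per-block latency are invented, and it assumes, purely for simplicity, that the blocks of a batch complete one after another). Each new batch holds every transaction that queued up while the previous batch was being written:

#include <math.h>
#include <stdio.h>

int
main(void)
{
        double arrival_rate = 2000.0;   /* itxs per second (assumed) */
        double block_lat = 0.001;       /* seconds per ZIL block write (assumed) */
        double itxs_per_block = 1.5;    /* average itxs packed per block (assumed) */
        double batch = 100.0;           /* itxs in the first batch */

        for (int i = 0; i < 10; i++) {
                double blocks = ceil(batch / itxs_per_block);
                double batch_time = blocks * block_lat;

                printf("batch %d: %6.0f itxs, %5.0f blocks, %.3f s\n",
                    i, batch, blocks, batch_time);

                /* Everything that arrived meanwhile forms the next batch. */
                batch = arrival_rate * batch_time;
        }
        return (0);
}

With these made-up numbers each batch comes out roughly a third larger than its predecessor, which is exactly the compounding zil_commit() latency described above.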

Solution

The solution attempted by this change has the following goals:

  1. no on-disk changes; maintain current on-disk format.
  2. modify the "batch size" to be equal to the "ZIL block size".
  3. allow new batches to be generated and issued to disk, while there's
    already batches being serviced by the disk.
  4. allow zil_commit() to wait for as few ZIL blocks as possible.
  5. use as few ZIL blocks as possible, for the same amount of ZIL
    transactions, without introducing significant latency to any
    individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.

In theory, with these goals met, the new algorithm will allow the
following improvements:

  1. new ZIL blocks can be generated and issued, while there are already
    outstanding ZIL blocks being serviced by the storage.
  2. the latency of zil_commit() should be proportional to the underlying
    storage latency, rather than the incoming synchronous workload.
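
As a rough illustration of these points, the sketch below (plain pthreads, not ZFS code; the toy_lwb_t type and every name in it are invented) gives each log write block ("lwb") its own completion flag, so several blocks can be in flight concurrently and a committer sleeps only on the block that carries its record:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Toy stand-in for an lwb: a block with its own "on stable storage" flag. */
typedef struct toy_lwb {
        pthread_mutex_t lock;
        pthread_cond_t  done_cv;
        int             done;
} toy_lwb_t;

static void
toy_lwb_init(toy_lwb_t *lwb)
{
        pthread_mutex_init(&lwb->lock, NULL);
        pthread_cond_init(&lwb->done_cv, NULL);
        lwb->done = 0;
}

/* "I/O completion" path: mark this block done and wake only its waiters. */
static void *
toy_lwb_write(void *arg)
{
        toy_lwb_t *lwb = arg;

        usleep(1000);                   /* pretend the device took ~1ms */
        pthread_mutex_lock(&lwb->lock);
        lwb->done = 1;
        pthread_cond_broadcast(&lwb->done_cv);
        pthread_mutex_unlock(&lwb->lock);
        return (NULL);
}

/* Commit path: block until *this* lwb is done, not the whole batch. */
static void
toy_commit_wait(toy_lwb_t *lwb)
{
        pthread_mutex_lock(&lwb->lock);
        while (!lwb->done)
                pthread_cond_wait(&lwb->done_cv, &lwb->lock);
        pthread_mutex_unlock(&lwb->lock);
}

int
main(void)
{
        toy_lwb_t lwb[4];
        pthread_t tid[4];

        for (int i = 0; i < 4; i++) {
                toy_lwb_init(&lwb[i]);
                pthread_create(&tid[i], NULL, toy_lwb_write, &lwb[i]);
        }
        /* A committer whose itx landed in lwb[2] waits on lwb[2] alone. */
        toy_commit_wait(&lwb[2]);
        printf("commit satisfied by lwb 2 alone\n");
        for (int i = 0; i < 4; i++)
                pthread_join(tid[i], NULL);
        return (0);
}

This only models the waiting discipline; the real change also has to handle block allocation, write ordering, vdev flushes, and error paths.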

Porting Notes

Due to the changes made in commit 119a394, the lifetime of an itx
structure differs from that in OpenZFS. Specifically, the itx structure is
kept around until the data associated with the itx is considered to be
safe on disk; this is so that the itx's callback can be called after the
data is committed to stable storage. Since OpenZFS doesn't have this itx
callback mechanism, it's able to destroy the itx structure immediately
after the itx is committed to an lwb (before the lwb is written to
disk).

To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:

  • A list of itxs was added to the lwb structure. This list contains
    all of the itxs that have been committed to the lwb, such that the
    callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(),
    after the data for the itxs is committed to disk.

  • A list of itxs was added on the stack of the zil_process_commit_list()
    function; the "nolwb_itxs" list. In some circumstances, an itx may
    not be committed to an lwb (e.g. if allocating the "next" ZIL block
    on disk fails), so this list is used to keep track of which itxs
    fall into this state, such that their callbacks can be called after
    the ZIL's writer pipeline is "stalled".

  • The logic to actually call the itx's callback was moved into the
    zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
    were effectively performing the same logic (i.e. if callback is
    non-null, call the callback), it seemed like useful code cleanup to
    consolidate this logic into a single function.
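
A minimal user-space sketch of that last point (illustrative only; the real itx_t, its allocation size, and its log-record payload are more involved):

#include <stdio.h>
#include <stdlib.h>

/* Simplified itx carrying an optional commit callback, as described above. */
typedef struct itx {
        void (*itx_callback)(void *);   /* may be NULL */
        void *itx_callback_data;
} itx_t;

/*
 * Every former caller duplicated "if the callback is non-null, call it"
 * before freeing; folding that check into the destroy path guarantees the
 * callback fires exactly once, whether the itx was committed to an lwb or
 * ended up on the "nolwb_itxs" list.
 */
static void
itx_destroy(itx_t *itx)
{
        if (itx->itx_callback != NULL)
                itx->itx_callback(itx->itx_callback_data);
        free(itx);
}

static void
commit_cb(void *arg)
{
        printf("itx %s is on stable storage\n", (const char *)arg);
}

int
main(void)
{
        itx_t *itx = calloc(1, sizeof (*itx));

        if (itx == NULL)
                return (1);
        itx->itx_callback = commit_cb;
        itx->itx_callback_data = "example";
        itx_destroy(itx);       /* in ZFS this happens only once the data is safe */
        return (0);
}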

Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:

  • The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
    removed from "trace_zil.h" as well.

  • Some of the zilog structure's fields were removed, which affected
    the tracepoint definitions of the structure.

  • New tracepoints had to be added for the following 3 new probes:

    • zil__process__commit__itx
    • zil__process__normal__itx
    • zil__commit__io__error

OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: openzfs/openzfs@5d95a3a

@prakashsurya
Member Author

prakashsurya commented Aug 25, 2017

@sempervictus this is a second attempt at porting what I previously had in #6337, and attempts to address the same performance issues as #6133. If you can, it'd be great if you could try this out in your test environment.

I've fixed the prior issue with msync and the zfstest suite, and was able to successfully run this version through the test suite in a local VM. Additionally, I believe the build issues that you previously hit w.r.t. the new tracepoints have been fixed (tested and verified they worked in my VM).

The OpenZFS PR contains links to data on the performance testing that I previously did for this change (on illumos); see the "Performance Testing" section of that PR's description: openzfs/openzfs#447

@sempervictus
Contributor

@prakashsurya: adding to the test queue, thanks for digging in so quickly.

@prakashsurya
Member Author

prakashsurya commented Aug 28, 2017

@behlendorf is the nopwrite/nopwrite_volume test reliable? That appears to have failed, so I'm curious whether it's a known failure mode in the existing code, or likely a result of my changes (and I should dig into the root cause). Here's what the log shows:

Test: /usr/share/zfs/zfs-tests/tests/functional/nopwrite/nopwrite_volume (run as root) [00:14] [FAIL]
05:35:21.18 ASSERTION: nopwrite works on volumes
05:35:21.33 SUCCESS: zfs set compress=on testpool/testvol.nopwrite
05:35:21.42 SUCCESS: zfs set checksum=sha256 testpool/testvol.nopwrite
05:35:31.03 SUCCESS: zfs clone testpool/testvol.nopwrite@a testpool/clone
05:35:31.33 SUCCESS: zfs set compress=on testpool/clone
05:35:31.43 SUCCESS: zfs set checksum=sha256 testpool/clone
05:35:33.59 SUCCESS: zpool sync testpool
05:35:33.76 NOTE: Comparing 33886720 and 377856 given 2% (calculated: 1.11%)
05:35:33.78 NOTE: Comparing 33907712 and 0 given 2% (calculated: 0%)
05:35:33.79 NOTE: Comparing 33886720 and 377856 given 2% (calculated: 1.11%)
05:35:33.81 NOTE: Comparing 33907712 and 33886720 given 98% (calculated: 99.93%)
05:35:33.83 NOTE: Comparing 33886720 and 33886720 given 98% (calculated: 100.00%)
05:35:35.12 NOTE: Comparing 377856 and 356864 given 98% (calculated: 94.44%)
05:35:35.13 ERROR: verify_nopwrite testpool/testvol.nopwrite testpool/testvol.nopwrite@a testpool/clone exited 1

@behlendorf
Contributor

behlendorf commented Aug 28, 2017

@prakashsurya it hasn't been 100% reliable and we've opened #5510 to track the known failure. We see a failure maybe 1/50 runs on the 32-bit tester but haven't had a chance to completely root cause and fix the issue. Let me run it through again and see if it's reproducible; below is the link to the previous test failures so we don't lose track of it.

http://build.zfsonlinux.org/builders/Ubuntu%2014.04%20i686%20%28TEST%29/builds/3556

@sempervictus
Contributor

A bunch of zloop later, still haven't blown it up. Going to push to some metal today, see how it fares on real world ops.

@sempervictus
Contributor

This is promising - two personal production systems now running with this PR, send/recv working, zvols being used for KVM instances. Contended writes are a lot better - a double-digit reduction in iowait with multiple VMs running FIO. Thank you @prakashsurya

@prakashsurya
Member Author

appreciate the feedback; thanks @sempervictus.

@behlendorf behlendorf mentioned this pull request Aug 29, 2017
@behlendorf
Contributor

behlendorf commented Aug 30, 2017

@prakashsurya it looks like there's still a failure on the 32-bit testers which needs to be run down. I ran the tests a few times to make sure and it's reproducible. The test case zpool_clear_001_pos appears to cause the issue and I was able to pull the following stack out of the ZTS logs. Unfortunately, the exact ASSERT we hit was lost.

VERIFY3(size != 0) failed (0 != 0)
PANIC at zfs_vnops.c:1014:zfs_get_data()
Showing stack for process 5466
CPU: 0 PID: 5466 Comm: z_vdev_file Tainted: P           OX 3.13.0-117-generic #164-Ubuntu

Call Trace:
 [<c16633c3>] schedule+0x23/0x60
 [<d5606485>] spl_panic+0xc5/0xe0 [spl]
 [<d5ddbac2>] zfs_get_data+0x1b2/0x570 [zfs] 
 [<d5dea586>] zil_commit+0x1266/0x1fd0 [zfs] 
 [<d5dd32dc>] zfs_fsync+0x6c/0xd0 [zfs]
 [<d5e061af>] zpl_fsync+0x7f/0x130 [zfs]
 [<c11a6f9f>] vfs_fsync+0x2f/0x40
 [<d560591a>] vn_fsync+0x3a/0xd0 [spl]
 [<d5d914d0>] vdev_file_io_fsync+0x30/0x50
 [<d5603997>] taskq_thread+0x267/0x5a0 [spl] 
 [<c1076fdb>] kthread+0x9b/0xb0

[edited] I was able to reproduce this issue by running the zpool_clear test group. Updated the stack trace to include the ASSERT hit.

@prakashsurya
Member Author

Thanks for the report. Briefly looking at the assertion, I don't know why that'd be occurring; I'll have to dig more into it and maybe set up an environment to reproduce it locally. It appears as though the write record passed into zfs_get_data has an lr_length of zero, but I don't immediately know how/why that would occur (nor why it'd only occur on the 32-bit system). This needs more investigation.

@sempervictus
Contributor

@prakashsurya: saw a similar error in a long and laggy send/recv. Not sure if it's the patch or a very slow recv host though - the hung task didn't crash hosts. The original 64b P4 in there is not loving it. Shows how far chips have come in a decade, I guess.

@prakashsurya
Member Author

While running some tests of this patch in a VM, I managed to trip the following assertion:

[ 1806.416279] VERIFY3(itx->itx_wr_state == WR_INDIRECT) failed (4294967295 == 0)
[ 1806.416315] PANIC at zil.c:1506:zil_lwb_commit()
[ 1806.416333] Showing stack for process 28365
[ 1806.416336] CPU: 5 PID: 28365 Comm: fio Tainted: P           OE   4.4.0-93-generic #116-Ubuntu
[ 1806.416337] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
[ 1806.416338]  0000000000000286 16a1c125c5c4eec7 ffff887cd21678c0 ffffffff813f9f83
[ 1806.416341]  ffffffffc136d454 00000000000005e2 ffff887cd21678d0 ffffffffc02d6fd2
[ 1806.416342]  ffff887cd2167a58 ffffffffc02d70a5 ffff887b665d75d0 ffff887c00000030
[ 1806.416344] Call Trace:
[ 1806.416350]  [<ffffffff813f9f83>] dump_stack+0x63/0x90
[ 1806.416362]  [<ffffffffc02d6fd2>] spl_dumpstack+0x42/0x50 [spl]
[ 1806.416366]  [<ffffffffc02d70a5>] spl_panic+0xc5/0x100 [spl]
[ 1806.416428]  [<ffffffffc127ab1e>] ? zio_rewrite+0x6e/0x70 [zfs]
[ 1806.416469]  [<ffffffffc126bd90>] ? zil_lwb_write_open+0x520/0x520 [zfs]
[ 1806.416509]  [<ffffffffc126b96f>] ? zil_lwb_write_open+0xff/0x520 [zfs]
[ 1806.416551]  [<ffffffffc126bd90>] ? zil_lwb_write_open+0x520/0x520 [zfs]
[ 1806.416591]  [<ffffffffc1271daa>] zil_commit+0x14ba/0x1d90 [zfs]
[ 1806.416635]  [<ffffffffc1260d6d>] zfs_write+0xe0d/0x1020 [zfs]
[ 1806.416677]  [<ffffffffc128b391>] zpl_write_common_iovec+0x91/0x130 [zfs]
[ 1806.416717]  [<ffffffffc128b4e1>] zpl_iter_write+0xb1/0xf0 [zfs]
[ 1806.416721]  [<ffffffff8120ec0b>] new_sync_write+0x9b/0xe0
[ 1806.416723]  [<ffffffff8120ec76>] __vfs_write+0x26/0x40
[ 1806.416724]  [<ffffffff8120f5f9>] vfs_write+0xa9/0x1a0
[ 1806.416726]  [<ffffffff81210465>] SyS_pwrite64+0x95/0xb0
[ 1806.416729]  [<ffffffff81039d03>] ? fpu__restore+0xf3/0x150
[ 1806.416734]  [<ffffffff818431f2>] entry_SYSCALL_64_fastpath+0x16/0x71

this was running on a VM with 8 SSDs, 32 vCPUs, 512GB of RAM, and 512 FIO threads performing synchronous writes; I'm not yet sure what the root cause for this is.

@sempervictus
Contributor

sempervictus commented Sep 5, 2017 via email

@prakashsurya
Member Author

prakashsurya commented Sep 5, 2017

OK, that's good to know. I can reproduce this every time I run this workload on this system, so I hope to root cause it sometime this week (it's a shame not having mdb at my disposal).

@openzfs openzfs deleted a comment from codecov bot Sep 23, 2017
@sempervictus
Contributor

We've had this running for a few weeks, including a few production hosts backing Cinder with PCIe SSDs for SLOG on spinning pools. Writes are definitely snappier and the stalls are greatly reduced under concurrent load. If there's no blocker outstanding, I think this should get merged for wider eval.

@kernelOfTruth
Contributor

kernelOfTruth commented Oct 7, 2017

Just to understand this a bit better: this doesn't apply with no dedicated ZIL device connected, right? [from sempervictus' comment, dedicated devices appear to be needed to benefit from the improvements]

In any case those lower latencies and higher IOPS are impressive!

Thanks a lot!

@prakashsurya
Member Author

@kernelOfTruth The "ZIL" cannot be removed, so I think you meant "SLOG" right? If so, yes, this should improve latency even with no SLOG connected. My performance tests done on illumos/delphixos were done without an SLOG, and showed improved latencies; see here for more details.

@sempervictus There are two blockers which I haven't yet gotten around to fixing:

  1. I hit a panic when running my performance tests on an Ubuntu VM with an all SSD pool, and 32 vCPUs. See here.

  2. Brian hit a panic when running on a 32-bit Ubuntu VM. See here.

IMO, both of these should be investigated and fixed prior to landing.

@kernelOfTruth
Contributor

kernelOfTruth commented Oct 8, 2017

@prakashsurya yes, sorry, I meant a SLOG device; everything I was reading on the web only ever called it a "ZIL device"

[the following seems to be one article that uses the correct term https://www.ixsystems.com/blog/o-slog-not-slog-best-configure-zfs-intent-log/ ]

Thanks

@sempervictus
Contributor

sempervictus commented Oct 8, 2017

@prakashsurya: regarding the crash you experienced, I'm rather suspicious of your hypervisor. Type 2 HVs are known for doing "the wrong thing" when it comes to IO, specifically relating to synchronous semantics/commit order/other things sane people expect to work. VBox is most guilty of this, but any type 2 is suspect. Can you repro on ESXi?
As far as the failure on 32-bit goes, I'm glad to see some support for the platform, but the vast majority of real-world consumers are 64-bit. Maybe leave in the existing codepath, defined out only for 32-bit? Are newer 32-bit distros seeing this as well? Compiler version does make a difference...
On a related note - I've realized that running Arch Linux (rolling distro) I'm always getting the newest version of gcc (which makes for a nightmarish experience with randstruct and other plugins affecting my kernel and dkms modules). Apparently there's a benefit over the older versions found in more mainline distributions in terms of compatibility and performance. There's a good example of this on the kernel-hardening list in discussions around PIE performance impact. It might make sense to have a builder on such a distribution which can produce packages for other distros so we can compare them against natively built versions when we have reproducers (to track down the problem to the distro or code). It would also give us "current baseline" binaries for each tested commit, which may not hurt either.

@kernelOfTruth: Sorry, wasn't my intent to say it's required. We run plenty of SSD-only pools as well, but the majority of data serviced by our cinder-volume containers resides on mirrored spans with PCI-based SLOG mirrors and SAS SSD cache spans. That's pretty much the most abusive consistent consumer we have (clouds don't sleep, apparently), so I was referencing it as the high-water mark.

@kernelOfTruth
Contributor

@sempervictus no worries - now I know the internals a bit better :)

Makes sense to employ the change where it matters the most - ZFS and the hardware must be beaten pretty heavily in that usage scenario - that'll (as shown) most likely also reveal any existing, dormant issues (if applicable) 👍

@prakashsurya
Member Author

prakashsurya commented Oct 9, 2017

@sempervictus For my performance testing, and the panic I linked to, I was using VMware for the hypervisor. It was the same server/hypervisor that I used for the tests that I ran on illumos.

$ vmware -vl
VMware ESXi 6.0.0 build-3620759
VMware ESXi 6.0.0 Update 2
$ uname -a
VMkernel zfsperfesx1 6.0.0 #1 SMP Release build-3620759 Mar  3 2016 18:41:52 x86_64 x86_64 x86_64 ESXi

@behlendorf
Contributor

@prakashsurya if you get a chance can you rebase this one more time on master. I'd like to see if the 32-bit fix from #6734, which was merged today, has any impact of the i686 test results.

@prakashsurya
Member Author

@behlendorf done.

@behlendorf
Contributor

That's a bit better, no ASSERT. Those all look like known spurious test failures, which suggests the root cause for this issue may be aggravated by a low memory situation. I'll send the build through a few more times to see if we always get the same set of test case failures.

@prakashsurya
Member Author

To make sure I understand, you're saying the 32-bit failure was due to an unrelated bug (i.e. not in this patch) that's been fixed? (pending re-running the tests a few more times)

@behlendorf
Contributor

Not exactly. The 32-bit bug, which was recently fixed, caused the ARC to miscalculate the amount of available free memory because it wasn't excluding free highmem pages. As a result the ARC wasn't shrinking when it needed to, which put low-memory stress on the rest of the system. This definitely resulted in OOM events and possibly memory allocation failures. How that ended up causing the ASSERT isn't clear.

However, since resolving that issue I've been unable to reproduce the ASSERT reported above in any of my 32-bit testing. The additional buildbot runs (4 so far) also only showed known spurious unrelated ZTS failures. And in fact we did get one entirely clean test run.

My suspicion is that the flaw isn't 32-bit specific but was simply easier to reproduce there due to this other problem. The ASSERT probably could happen on an x86_64 system too given the right circumstances. But since we don't actually know the root cause, that's all really just guesswork.

@sempervictus @prakashsurya I'd be open to merging this for broader testing, assuming we're no longer able to reproduce either of the two above ASSERTs, and we can verify that merging this doesn't increase the frequency of those unrelated ZTS test failures.

@prakashsurya prakashsurya requested review from nedbass and removed request for nedbass October 16, 2017 19:53
@sempervictus
Contributor

Does this PR change on-disk structures at all in the ZIL? Trying to import a pool which suffered a crash using master-only builds results in:

[Sun Nov 12 01:51:55 2017] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [dmu_objset_find:6907]
[Sun Nov 12 01:51:55 2017] Modules linked in: serpent_sse2_x86_64 serpent_generic lrw glue_helper ablk_helper cryptd jc42 nls_iso8859_1 nls_cp437 vfat fat ext4 crc16 jbd2 fscrypto mbcache iTCO_wdt iTCO_vendor_support gpio_ich ppdev ttm drm_kms_helper intel_powerclamp drm coretemp intel_cstate input_leds evdev i2c_algo_bit fb_sys_fops led_class joydev syscopyarea mousedev mac_hid sysfillrect usbmouse usbkbd sysimgblt psmouse uio_pdrv_genirq pcspkr uio i2c_i801 parport_pc i2c_smbus ioatdma i7core_edac parport button edac_core shpchp i5500_temp ipmi_ssif dca lpc_ich acpi_cpufreq sch_fq_codel 8021q garp mrp iscsi_scst scst_local scst dlm dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c crc32c_generic kvm_intel kvm irqbypass ipmi_devintf ipmi_si ipmi_msghandler br_netfilter bridge stp llc ip_tables x_tables
[Sun Nov 12 01:51:55 2017]  zfs(POE) zunicode(POE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) zlib_deflate xts gf128mul algif_skcipher af_alg dm_crypt dm_mod sd_mod mlx4_en hid_generic usbhid hid uas usb_storage uhci_hcd serio_raw atkbd libps2 crc32c_intel mlx4_core ahci devlink libahci ehci_pci ehci_hcd usbcore e1000e mpt3sas usb_common ptp raid_class scsi_transport_sas pps_core i8042 serio fjes
[Sun Nov 12 01:51:55 2017] CPU: 0 PID: 6907 Comm: dmu_objset_find Tainted: P           OEL  4.9.61 #1
[Sun Nov 12 01:51:55 2017] Hardware name: Intel Thurley/Greencity, BIOS GEMAV200 01/28/2011
[Sun Nov 12 01:51:55 2017] task: ffff8807eb65ee00 task.stack: ffffc9003e910000
[Sun Nov 12 01:51:55 2017] RIP: 0010:[<ffffffffa0626649>]  [<ffffffffa0626649>] zil_claim_log_record+0x19/0x60 [zfs]
[Sun Nov 12 01:51:55 2017] RSP: 0000:ffffc9003e913a38  EFLAGS: 00000246
[Sun Nov 12 01:51:55 2017] RAX: 0000000000000000 RBX: ffffc90036d86efe RCX: 0000000002305788
[Sun Nov 12 01:51:55 2017] RDX: 0000000000000000 RSI: ffffc90036d86efe RDI: ffff880813d56800
[Sun Nov 12 01:51:55 2017] RBP: ffffc9003e913a58 R08: ffff8807e0324b58 R09: 00000001802e001c
[Sun Nov 12 01:51:55 2017] R10: 0100000000000000 R11: 0100000000000000 R12: ffff880813d56800
[Sun Nov 12 01:51:55 2017] R13: 0000000000000000 R14: 0000000002305788 R15: ffffffffffffffff
[Sun Nov 12 01:51:55 2017] FS:  0000000000000000(0000) GS:ffff88081fa00000(0000) knlGS:0000000000000000
[Sun Nov 12 01:51:55 2017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sun Nov 12 01:51:55 2017] CR2: 00000061b0b41000 CR3: 0000000001d5a000 CR4: 00000000000006f0
[Sun Nov 12 01:51:55 2017] Stack:
[Sun Nov 12 01:51:55 2017]  0000000000000000 ffffffffa0626630 ffffc90036d86efe 00000002540ee182
[Sun Nov 12 01:51:55 2017]  ffffc9003e913c68 ffffffffa06280b8 ffffc90001800190 ffffc9003e913aec
[Sun Nov 12 01:51:55 2017]  ffffc9003e913af8 ffffffffa0626630 ffffc90037d97000 0000000000000000
[Sun Nov 12 01:51:55 2017] Call Trace:
[Sun Nov 12 01:51:55 2017]  [<ffffffffa0626630>] ? zil_claim_log_block+0x90/0x90 [zfs]
[Sun Nov 12 01:51:55 2017]  [<ffffffffa06280b8>] zil_parse+0x688/0x910 [zfs]
[Sun Nov 12 01:51:55 2017]  [<ffffffffa0626630>] ? zil_claim_log_block+0x90/0x90 [zfs]
[Sun Nov 12 01:51:55 2017]  [<ffffffffa06265a0>] ? zil_replaying+0x70/0x70 [zfs]
[Sun Nov 12 01:51:55 2017]  [<ffffffffa0628d91>] zil_check_log_chain+0xe1/0x1a0 [zfs]
[Sun Nov 12 01:51:55 2017]  [<ffffffffa0574e11>] dmu_objset_find_dp_impl+0x151/0x400 [zfs]
[Sun Nov 12 01:51:55 2017]  [<ffffffffa05750e6>] dmu_objset_find_dp_cb+0x26/0x40 [zfs]
[Sun Nov 12 01:51:55 2017]  [<ffffffffa014cca1>] taskq_thread+0x2b1/0x4d0 [spl]
[Sun Nov 12 01:51:55 2017]  [<ffffffff810a0160>] ? wake_up_q+0x90/0x90
[Sun Nov 12 01:51:55 2017]  [<ffffffffa014c9f0>] ? task_done+0xa0/0xa0 [spl]
[Sun Nov 12 01:51:55 2017]  [<ffffffff81095ddd>] kthread+0xfd/0x120
[Sun Nov 12 01:51:55 2017]  [<ffffffff81095ce0>] ? kthread_parkme+0x40/0x40
[Sun Nov 12 01:51:55 2017]  [<ffffffff8180c5e7>] ret_from_fork+0x37/0x50

@sempervictus
Contributor

Good news: while my pool is dead as a doornail, the cause did not come from here. I rebuilt and retested with all the PRs/commits in my test stack atop master and seem to have localized this to the Datto work around scrubs and the likely abandoned crypt fmt fix branch. I'll keep a long-running zloop on this build to try and work out any kinks, but as far as my issues go, I'm 99% sure this didn't cause them; it just happened to be in play when the fan was hit.

@behlendorf
Contributor

@sempervictus no ZIL format changes in this PR or the scrub work. If you do have any suspicions about the current scrub patches please let me know now since those will likely be merged this week.

@behlendorf
Contributor

behlendorf commented Nov 16, 2017

@prakashsurya I was looking at the ztest ASSERT from CentOS 6 in zil_lwb_write_issue. Specifically, it ASSERT'ed because dmu_tx_assign returned ERESTART due to a suspended pool. While far from likely, there are a few other ways ERESTART can be returned here.

In all of these rare cases it looks like the right thing to do is abort the TX and return NULL. What do you think about removing this ASSERT and updating the comment accordingly? Incidentally, this isn't a new issue introduced by this PR, but since we've been focusing on the ZIL we may just be more aware of it.

error == 5 (0x55 == 0x5)
ASSERT at ../../module/zfs/zil.c:1304:zil_lwb_write_issue()/sbin/ztest[0x407c13]
/lib64/libpthread.so.0(+0xf7e0)[0x7f8f078707e0]
/lib64/libc.so.6(gsignal+0x35)[0x7f8f074ff495]
/lib64/libc.so.6(abort+0x175)[0x7f8f07500c75]
/lib64/libzpool.so.2(+0x162c33)[0x7f8f0868ac33]
/lib64/libzpool.so.2(+0x1669b8)[0x7f8f0868e9b8]
/lib64/libzpool.so.2(zil_commit+0x9fe)[0x7f8f0869119e]
/sbin/ztest[0x408a87]
/sbin/ztest[0x408ddd]
/lib64/libpthread.so.0(+0x7aa1)[0x7f8f07868aa1]
/lib64/libc.so.6(clone+0x6d)[0x7f8f075b5bcd]
child died with signal 6
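
A stubbed-out sketch of the handling proposed above (the types and *_stub calls are stand-ins so it compiles on its own; the real change would live in zil_lwb_write_issue() and use dmu_tx_abort()): on any assignment error, give up on issuing this lwb and hand back NULL, presumably letting the caller stall the ZIL pipeline the same way it already does when block allocation fails.

#include <errno.h>
#include <stddef.h>

#ifndef ERESTART
#define ERESTART 85     /* fallback for platforms that don't define it */
#endif

/* Stand-ins for the real ZFS types and calls; illustration only. */
typedef struct dmu_tx { int t_err; } dmu_tx_t;
typedef struct lwb { dmu_tx_t *lwb_tx; } lwb_t;

static int  dmu_tx_assign_stub(dmu_tx_t *tx) { return (tx->t_err); }
static void dmu_tx_abort_stub(dmu_tx_t *tx)  { (void) tx; }

/*
 * Instead of ASSERT-ing that the assignment succeeded, tolerate the rare
 * ERESTART (e.g. a suspended pool): abort the tx and return NULL so the
 * caller treats it like any other "could not issue this lwb" case.
 */
static lwb_t *
lwb_write_issue_sketch(lwb_t *lwb)
{
        int error = dmu_tx_assign_stub(lwb->lwb_tx);

        if (error != 0) {
                dmu_tx_abort_stub(lwb->lwb_tx);
                return (NULL);
        }
        /* ... fill in the block pointer and issue the zio here ... */
        return (lwb);
}

int
main(void)
{
        dmu_tx_t tx = { ERESTART };
        lwb_t lwb = { &tx };

        return (lwb_write_issue_sketch(&lwb) == NULL ? 0 : 1);
}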

@behlendorf
Contributor

@sempervictus I believe @prakashsurya's fix in e93b150 could explain your corrupt ZIL record.

@sempervictus
Contributor

sempervictus commented Nov 17, 2017 via email

@behlendorf
Contributor

@sempervictus I believe so, under an unlikely set of circumstances. You'd need to write a ZIL record which included a dva which was damaged due to the memory it was stored in being freed and reused, as described in e93b150. Then you'd have to replay that damaged ZIL record on mount. You'd need to be really, really unlucky, but I think it could happen.

@prakashsurya
Member Author

Now that I have a working implementation, I was able to run the same performance tests that I used on illumos, to test this PR on Ubuntu 16.04. See here for the results.

@behlendorf we can discuss the results in detail after the holiday, but the SSD numbers are a little concerning. I've copied the graphs of the percentage change in latency between master and this PR below (these are also contained in the Jupyter notebooks I linked to above):

Using HDDs -- each fio thread submitting writes as fast as possible:

[graph: max-rate-submit-hdd-latency]

Using HDDs -- each fio thread submitting 64 writes per second:

[graph: fixed-rate-submit-hdd-latency]

Using SSDs -- each fio thread submitting writes as fast as possible:

[graph: max-rate-submit-ssd-latency]

Using SSDs -- each fio thread submitting 64 writes per second:

[graph: fixed-rate-submit-ssd-latency]

@sempervictus
Contributor

sempervictus commented Nov 18, 2017 via email

@sempervictus
Contributor

@prakashsurya: the pool which keeps getting "murdolated" in #6856 is a 5-member RAIDZ1 pool consisting of 250G Samsung EVO 850s on a SAS HBA. A 5-wide 2-way mirror span running its disks and cache from the same HBA (SLOGs are M.2s mirrored on PCI for that pool) is unaffected during these events, as is a 4-member RAID1 1T EVO 850 pool (but that's seeing no IO, and is on separate datapaths internally). It's getting killed, repeatedly, on a build consisting of nothing but master (at time of build) and this PR. I just refreshed the system with the same thing built a few hours ago with your updates along with those that went into master, destroyed the unimportable pool (again), rebuilt it, reseeded it with the VMs and containers it needed from their last known good snap, and lost the thing again in a matter of days. Something's rotten in the state of Denmark, and I'm apparently downwind.
I can't say for sure that this PR is causing that failure, but whatever IS going on, it's requiring manual recovery efforts in a place about as automated as Megatron's bowels, at a somewhat less-than-optimal time :-). Be glad to keep going through this till Monday, but I'll have to get a master branch build at that point if it keeps happening and we don't know why (an identical system is still running an older revision of this PR, as are many others, with none of the stack traces related to this unfortunate set of events).

@prakashsurya
Member Author

@sempervictus For my tests, I was not using an SLOG. "Using" simply means the disks used by the zpool itself; i.e. I had pools with 1 to 8 HDDs, and pools with 1 to 8 SSDs (no redundancy, no SLOG).

As far as your failures go: you're still seeing problems with the latest code of this PR, with my follow-up fixes? Are the issues seen on ZIL replay (i.e. initial mount), or after the FS/dataset is mounted?

@sempervictus
Contributor

@prakashsurya: I've had to pull this PR out of the stack because it seems to be causing two distinct issues:
1 - ARC, for some reason, starts regulating itself despite sysctl settings. I have no idea how in the world this is happening from this PR, but dropping it from the stack restores normal behavior.
2 - With the rebased code, a RAIDZ1 pool consisting entirely of SSDs went bad on me once. It's running with this PR, currently, and sending its deltas hourly to another system (hoping to "validate in flight"). I am thinking that the issues, both times, were caused by IO from ZVOLs backing control nodes for OpenStack (MySQL, Rabbit, etc, in KVM instances on raw ZVOLs). That particular pool also serves an LXC container with Zenoss doing all the same MySQL and Rabbit stuff, along with a ton of file writes via the ZPL as opposed to the ZVOL path. Errors always seem to start from ZVOL ops though, so the ZPL IO may simply be adding contention. The same pool went bad before the rebase as well, so I'm not sure if the cause of the error was what you fixed, or what I'm now thinking #6294 is designed to fix (#6888). The reason for this logic is that what I saw last night didn't have this PR in it, on a wholly separate system, in another location. It was just a master build, and too much IO to a ZVOL killed it in that specific manner. I do however suspect that the write improvements here are causing those failures to amplify as opposed to the "serialized failure" without the optimizations here, potentially explaining my bad pools. Do scrubs find DVA errors?

@behlendorf
Contributor

@sempervictus were you able to confirm that removing this PR from your stack resolved the two issues you mentioned?

Do scrubs find DVA errors?

It depends on exactly what's wrong with the DVA, but in general, yes.

@dweeezil
Contributor

I've been loosely following this issue, along with #6874, and it seems that the bad blkptrs are being written to the pool on systems which do have ECC memory. In that case, it might be useful to put some calls to zfs_blkptr_verify() in the write path, potentially even in zio_create() (when bp isn't null) to try to catch the condition under which they're being written.

@behlendorf
Contributor

@dweeezil good thought. For debugging it might be useful to additionally verify the blkptr in zio_ready and zio_done.

@prakashsurya
Member Author

@behlendorf I rebased this onto the latest master, and squashed to a single commit. The only failure I see is the following, from the 32-bit tester:

Test: /usr/share/zfs/zfs-tests/tests/functional/vdev_zaps/vdev_zaps_004_pos (run as root) [00:03] [FAIL]
23:16:12.42 ASSERTION: Per-vdev ZAPs are transferred properly on attach/detach
23:16:12.56 SUCCESS: zpool create -f testpool loop0
23:16:15.42 SUCCESS: zpool attach testpool loop0 loop1
23:16:15.81 zdb: can't open 'testpool': No such device or address

Is this a known flaky test? This doesn't seem related to my changes at first glance.

@behlendorf
Contributor

@prakashsurya thanks! That's a known spurious failure; I've resubmitted that build.

Authored by: Prakash Surya <[email protected]>
Reviewed by: Brad Lewis <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Approved by: Dan McDonald <[email protected]>
Ported-by: Prakash Surya <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: openzfs/openzfs@5d95a3a
Contributor

@behlendorf behlendorf left a comment


I've rerun @prakashsurya's performance tests on a different system and those results are more in line with what was observed on Illumos. This change improves ZIL performance for the majority of workloads. The one exception, which can be further investigated after the merge, is a penalty for single-threaded applications issuing sync IO at the maximum rate. During all of my testing I did not observe any functional issues. This LGTM.

@codecov

codecov bot commented Dec 5, 2017

Codecov Report

Merging #6566 into master will decrease coverage by <.01%.
The diff coverage is 93.86%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #6566      +/-   ##
=========================================
- Coverage    75.2%   75.2%   -0.01%     
=========================================
  Files         296     298       +2     
  Lines       95200   95521     +321     
=========================================
+ Hits        71599   71833     +234     
- Misses      23601   23688      +87
Flag Coverage Δ
#kernel 74.6% <87.07%> (+0.3%) ⬆️
#user 67.51% <91.19%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7b34070...9948347. Read the comment docs.

@behlendorf behlendorf merged commit 1ce23dc into openzfs:master Dec 5, 2017

Nasf-Fan pushed a commit to Nasf-Fan/zfs that referenced this pull request Jan 29, 2018
Nasf-Fan pushed a commit to Nasf-Fan/zfs that referenced this pull request Feb 13, 2018
@scineram scineram mentioned this pull request Jul 27, 2018
FransUrbo pushed a commit to FransUrbo/zfs that referenced this pull request Apr 28, 2019