ZFS big write performance hit upgrading from 2.1.4 to 2.1.5 or 2.1.6 #14009

Open
ppwaskie opened this issue Oct 10, 2022 · 13 comments
Labels
Status: Stale (no recent activity for issue), Type: Regression (indicates a functional regression)

Comments

@ppwaskie

System information

Type Version/Name
Distribution Name Gentoo
Distribution Version Rolling
Kernel Version 5.19.14-gentoo-x86_64, 5.15.72-gentoo-x86_64
Architecture x86_64
OpenZFS Version 2.1.6 or 2.1.5

Describe the problem you're observing

I've been running ZFS 2.1.4 for quite some time on my main ZFS array, a RAIDz3 pool with a very large dataset (85TB online). On Gentoo, I can only run a 5.15.x or older kernel with this version; moving to a 5.18 or 5.19 kernel requires upgrading to ZFS 2.1.6 so the module builds against the newer kernel. When I try this, write performance drops from 100-150 MB/s on 5.15 with ZFS 2.1.4 (testing with emerge -a =sys-kernel/gentoo-sources-5.10.144) to about 100 kB/s on 5.19.14 with ZFS 2.1.6.

I've tried ZFS 2.1.5 and 2.1.6 with a 5.15.72 kernel, and had the exact same performance regression.

The big issue is that ZFS 2.1.4 has now been removed from the main Gentoo tree after an emerge --sync, so I can't revert my installed 2.1.6.

Describe how to reproduce the problem

Upgrade an existing host to ZFS 2.1.5 or 2.1.6, then write out a larger package with lots of small files (e.g. a Linux kernel source package) and observe write performance drop by roughly three orders of magnitude.
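
A rough timing test along these lines reproduces the kind of workload described; the pool, dataset, and tarball paths below are only examples, not taken from this report:

    # minimal sketch of a reproduction run (paths are examples)
    cd /tank/scratch                           # any dataset on the affected pool
    time tar xf /tmp/linux-5.19.14.tar.xz      # unpack many small files, similar to an emerge
    zpool iostat -v tank 5                     # in a second shell: watch write throughput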

Include any warning/errors/backtraces from the system logs

I see nothing indicating anything is going wrong. Nothing in dmesg, nothing in syslogs, and zpool status is clean.

Rebooting into a 5.15 kernel with ZFS 2.1.4 on the exact same array returns the expected performance.

@ppwaskie ppwaskie added the Type: Defect (incorrect behavior, e.g. crash, hang) label Oct 10, 2022
@ryao ryao added the Type: Regression (indicates a functional regression) label and removed the Type: Defect (incorrect behavior, e.g. crash, hang) label Oct 10, 2022
@ryao
Contributor

ryao commented Oct 10, 2022

Would you try ZFS master via the 9999 ebuild and see if the issue is present there too?

As long as you do not run a zpool upgrade $pool command, it should be safe to go to ZFS master and then back to 2.1.4.
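
On Gentoo that is roughly the following; the keyword-file name is only a convention, and the pool name is an example:

    # sketch: accept the live (9999) ebuilds, then rebuild
    echo "sys-fs/zfs **"      >> /etc/portage/package.accept_keywords/zfs
    echo "sys-fs/zfs-kmod **" >> /etc/portage/package.accept_keywords/zfs
    emerge -a =sys-fs/zfs-9999 =sys-fs/zfs-kmod-9999

    # before going back to 2.1.4, confirm no new pool features were enabled
    zpool get all tank | grep 'feature@'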

@satarsa

satarsa commented Oct 11, 2022

The big issue is that ZFS 2.1.4 has now been removed from the main Gentoo tree after an emerge --sync, so I can't revert my installed 2.1.6.

Actually, you can. You could clone the official Gentoo repo from https://gitweb.gentoo.org/repo/gentoo.git/ as a local repository, check it out at the last commit where zfs-kmod-2.1.4-r1 had not yet been dropped (I believe that would be 33344d7dd6b44bd93c17485d77d60c0e25ef71ee), and locally mask everything >=2.1.5.
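
A sketch of that procedure, assuming a local repo path and that /etc/portage/package.mask is a directory (adjust both to your setup):

    # sketch of the local-repo downgrade path (paths are examples)
    git clone https://gitweb.gentoo.org/repo/gentoo.git /var/db/repos/gentoo-local
    git -C /var/db/repos/gentoo-local checkout 33344d7dd6b44bd93c17485d77d60c0e25ef71ee
    # register gentoo-local in /etc/portage/repos.conf with a higher priority than ::gentoo

    # mask the newer versions so portage keeps 2.1.4
    printf '%s\n' '>=sys-fs/zfs-2.1.5' '>=sys-fs/zfs-kmod-2.1.5' >> /etc/portage/package.mask/zfs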

@ppwaskie
Author

@satarsa thanks for that. And @ryao I connected with one of the Gentoo maintainers for ZFS offline, and he provided me with some instructions on how to use the 9999 ebuild along with bisecting between 2.1.4 and 2.1.5. I’m happy to try and find the commit where the perf regression showed up, at least for my ZFS setup.

I honestly didn't think this would get so much activity so soon after I opened the issue! I'm currently not at home where this server is, but I'll try to run some of these bisect ops while I'm away this week. Worst case, I can get this nailed down this coming weekend.

All of the support is greatly appreciated!!

@ryao
Contributor

ryao commented Oct 11, 2022

I did not expect you to bisect it, but if you do, that would be awesome. I should be able to figure this out quickly if you identify the bad patch through a bisect.
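
For reference, the bisect itself looks roughly like this; the build/install/test step in the middle is whatever your usual ZFS build procedure is:

    # sketch of bisecting the regression between the release tags
    git clone https://github.com/openzfs/zfs.git && cd zfs
    git bisect start zfs-2.1.5 zfs-2.1.4     # <bad> <good>
    # at each step: build and install the module, reload it, rerun the slow
    # write test, then mark the result:
    git bisect good     # or: git bisect bad
    git bisect reset    # return to the original HEAD once the first bad commit is reported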

@scineram

@ryao From the release notes, only #13405 looks like it could really impact general performance.

@ppwaskie
Author

I haven't started bisecting yet, but here is more info on the system/setup where I'm seeing this issue:

  • Intel Xeon Scalable system, Skylake Platinum, 2 sockets, 112 cores (with SMT enabled)
  • 128GB RAM
  • 13 x 10TB Seagate Exos drives in RAIDz3
  • 2 x 1TB Intel NVMe SSDs. Half of each is split between the log (SLOG) and cache (L2ARC) devices; the other half of each forms a RAID-1 mirror for the host's root filesystem.

So I do have many cores running in the system. In that RAIDz3 pool I have many datasets carved out, with about 31TB used in total. Most of it is video streaming content for Plex, so not lots of tiny files.

I hope to have more info once I can coordinate with home and bisect on the live system.

@ppwaskie
Author

ppwaskie commented Oct 16, 2022

Apologies for the delay on this. I was finally able to get some time on the box and bisect this.

This is the offending commit that is killing write performance on my system:

9f6943504aec36f897f814fb7ae5987425436b11 is the first bad commit
commit 9f6943504aec36f897f814fb7ae5987425436b11
Author: Brian Behlendorf <[email protected]>
Date:   Tue Nov 30 10:38:09 2021 -0800

    Default to zfs_dmu_offset_next_sync=1
    
    Strict hole reporting was previously disabled by default as a
    performance optimization.  However, this has lead to confusion
    over the expected behavior and a variety of workarounds being
    adopted by consumers of ZFS.  Change the default behavior to
    always report holes and force the TXG sync.
    
    Reviewed-by: Matthew Ahrens <[email protected]>
    Reviewed-by: Tony Hutter <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>
    Upstream-commit: 05b3eb6d232009db247882a39d518e7282630753
    Ref: #13261
    Closes #12746

 man/man4/zfs.4   |  8 ++++----
 module/zfs/dmu.c | 12 ++++++++----
 2 files changed, 12 insertions(+), 8 deletions(-)

I've taken this a step further: while running the build that contains this patch, I turned off that tunable:

# echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

I then re-tested immediately afterwards, and the issue went away. I went from about 100 kB/s write performance back to 150 MB/s (roughly three orders of magnitude).

UPDATE: I went ahead and built the 2.1.6 ebuilds, and confirmed I still had this issue. I then turned off the same tunable, and the performance issue went away.
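
For anyone else applying the workaround, it can be made persistent with a module option; the file name below is a common convention rather than anything required, and you may need to regenerate the initramfs if your setup loads zfs from it:

    # persist the workaround across reboots
    echo "options zfs zfs_dmu_offset_next_sync=0" >> /etc/modprobe.d/zfs.conf
    # verify the live value after the module (re)loads
    cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync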

Hope this helps inform how to deal with this upstream.

@ryao
Contributor

ryao commented Oct 16, 2022

Nice find.

@rincebrain
Contributor

I should warn you: turning that off means files that are actually sparse will sometimes be treated as dense if the holes haven't synced out yet, IIRC, so if that's a use case you care about, you may be sad.

Of course, when you're handing the data to ZFS with compression on, it will recover the sparseness one way or another; it's just a question of whether you unnecessarily copied some zeroes only to throw them away. So if this works for you, great, just be aware that it adds extra I/O overhead if you come looking for performance bottlenecks again.

@amotin
Member

amotin commented Oct 18, 2022

I don't think it's great to allow a regular unprivileged user to force, or depend on, pool TXG commits. There should be a better solution.

@amotin
Member

amotin commented Oct 18, 2022

I think at the very least the code could be optimized to not even consider committing the TXG if the file is below a certain size, especially if it is below one block, which means it can't have holes unless it is one big hole. If I understood correctly and the workload is updating a Linux source tree, then I would guess most of the source files fit within one block.
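
A quick way to sanity-check that guess against a real tree (dataset name and source path are only examples):

    # how many files in the tree fit within one record?
    zfs get recordsize tank/gentoo                     # default recordsize is 128K
    find /usr/src/linux -type f -size -128k | wc -l    # files smaller than one 128K block
    find /usr/src/linux -type f | wc -l                # total files, for comparison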

@thesamesam
Contributor

thesamesam commented Mar 11, 2023

See also #14512 and #14594. #13368 may or may not help.

behlendorf pushed a commit that referenced this issue Mar 14, 2023
`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks
reflect all writes, i.e. when there are no dirty data blocks.  To ensure
this, if the target dnode is dirty, they wait for the open txg to be
synced, so we can call them "stabilizing operations".  If they cause
txg_wait_synced often, it can be detrimental to performance.

Typically, a group of files are all modified, and then SEEK_DATA/HOLE
are performed on them.  In this case, the first SEEK does a
txg_wait_synced(), and subsequent SEEKs don't need to wait, so
performance is good.

However, if a workload involves an interleaved metadata modification,
the subsequent SEEK may do a txg_wait_synced() unnecessarily.  For
example, if we do a `read()` syscall to each file before we do its SEEK.
This applies even with `relatime=on`, when the `read()` is the first
read after the last write.  The txg_wait_synced() is unnecessary because
the SEEK operations only care that the structure of the tree of indirect
and data blocks is up to date on disk.  They don't care about metadata
like the contents of the bonus or spill blocks.  (They also don't care
if an existing data block is modified, but this would be more involved
to filter out.)

This commit changes the behavior of SEEK_DATA/HOLE operations such that
they do not call txg_wait_synced() if there is only a pending change to
the bonus or spill block.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by:  Alexander Motin <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #13368 
Issue #14594 
Issue #14512 
Issue #14009
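
As a rough illustration of the interleaved pattern the commit message describes (paths are examples, and whether cp detects holes via SEEK_DATA/SEEK_HOLE depends on the coreutils version):

    # each read() after the last write can dirty the dnode (atime in the bonus
    # block), so the following copy's SEEK_DATA/SEEK_HOLE may txg_wait_synced()
    for f in /tank/src/*.c; do
        head -c 1 "$f" > /dev/null    # first read after the write updates atime
        cp "$f" /tank/dst/            # hole detection may hit the stabilizing wait
    done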
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Mar 14, 2023 (same commit message as above)
behlendorf pushed a commit that referenced this issue Mar 15, 2023 (same commit message as above)
lundman pushed commits to openzfsonwindows/openzfs that referenced this issue Mar 17, 2023 (same commit message as above)

stale bot commented Mar 13, 2024

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale (no recent activity for issue) label Mar 13, 2024