
severe performance regression on virtual disk migration for qcow2 on ZFS with ZFS 2.1.5 #14594

Open
devZer0 opened this issue Mar 7, 2023 · 15 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@devZer0

devZer0 commented Mar 7, 2023

Distribution Name | Proxmox PVE
Distribution Version | 7.3
Kernel Version | 5.15.39-2-pve and later
Architecture | x86_64
OpenZFS Version | 2.1.5

Describe the problem you're observing

On recent Proxmox releases, which ship ZFS 2.1.5 as part of the kernel package, there is a significant slowdown when moving an empty qcow2 virtual disk file from an HDD-based pool to any other pool.

It seems this issue is related to atime updates.

The problem goes away when setting atime=off, or atime=on together with relatime=on.
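As a workaround sketch (the dataset name hddpool is taken from this report; substitute the dataset actually backing the VM images):

```shell
# Inspect the current settings, then either disable atime updates
# entirely, or keep atime semantics but rate-limit updates:
zfs get atime,relatime hddpool
zfs set atime=off hddpool
# or:
zfs set atime=on hddpool
zfs set relatime=on hddpool
```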

Describe how to reproduce the problem

Update an older Proxmox installation running the 5.15.39-1-pve kernel to a recent Proxmox version (pve-no-subscription or pvetest repository), then move a qcow2 virtual disk off an HDD-based pool.

Include any warning/errors/backtraces from the system logs

see discussion/analysis at this thread:

https://forum.proxmox.com/threads/weird-disk-write-i-o-pattern-on-source-disks-when-moving-virtual-disk.123639/post-538583

start of thread:
https://forum.proxmox.com/threads/weird-disk-write-i-o-pattern-on-source-disks-when-moving-virtual-disk.123639/

Not sure what change in ZFS could cause this behaviour, maybe #13338? @rincebrain?

5.15.39-1-pve with zfs 2.1.4:
time qm disk move 100 scsi0  nvmepool
real    0m19.974s

zpool iostat -r hddpool:
 
hddpool       sync_read    sync_write    async_read    async_write      scrub         trim
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512            35      0      0      0      1      0    316      0     10      0      0      0
1K            307      0    144      0     54      0    329      5     41      0      0      0
2K             33      0      0      0      0      0    252     56      2     12      0      0
4K              9      0      0      0      0      2    335     77      0     12      0      0
8K            213      0     12      0      2      5      0    153     12      7      0      0
16K             0      0      0      0      0      5      0     71      0     20      0      0
32K             0      0      0      0      0      2      0      6      0     15      0      0
64K            16      0     32      0      0      0      0      0      0      2      0      0
128K            6      0      0      0      0      0      0      0      0      5      0      0
256K            0      0      0      0      0      0      0      0      0      3      0      0
512K            0      0      0      0      0      0      0      0      0      3      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0
----------------------------------------------------------------------------------------------
5.15.39-2-pve with zfs 2.1.5:
time qm disk move 100 scsi0  nvmepool
real    1m10.214s

zpool iostat -r hddpool:

hddpool       sync_read    sync_write    async_read    async_write      scrub         trim
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512            33      0      0      0      3      0  13.2K      0      7      0      0      0
1K            324      0  5.94K      0     61      0  15.9K      0    171      1      0      0
2K             47      0      0      0      0      5  10.0K  2.62K     11      9      0      0
4K             16      0      0      0      0      3  22.9K  2.94K      8      8      0      0
8K            216      0     12      0      2      4     20  3.73K      1     13      0      0
16K             2      0      0      0      0      7    238  3.04K      0      8      0      0
32K             0      0      0      0      0      0      0    111      0      8      0      0
64K            16      0     32      0      0      0      0     87      0      0      0      0
128K            1      0      0      0      0      0    217      7      0      2      0      0
256K            0      0      0      0      0      0      0     44      0      2      0      0
512K            0      0      0      0      0      0      0      0      0      4      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0
----------------------------------------------------------------------------------------------
5.15.39-2-pve with zfs 2.1.5 and atime=on/relatime=on:
time qm disk move 100 scsi0  nvmepool
real    0m3.365s


zpool iostat -r hddpool:

hddpool       sync_read    sync_write    async_read    async_write      scrub         trim
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512            38      0      0      0      0      0    169      0      5      0      0      0
1K             77      0     80      0     64      0    198      7    113      3      0      0
2K             17      0      0      0      0      3    133     13      4     20      0      0
4K             11      0      0      0      0      7    171     46      4      5      0      0
8K            216      0     12      0      2      5      0     52      5      6      0      0
16K             1      0      0      0      0      5      0     33      0      5      0      0
32K             1      0      0      0      0      1      0      0      0     11      0      0
64K            16      0     32      0      0      0      0      0      0      4      0      0
128K            1      0      0      0      0      0      0      0      0      3      0      0
256K            0      0      0      0      0      0      0      0      0      4      0      0
512K            0      0      0      0      0      0      0      0      0      4      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0
----------------------------------------------------------------------------------------------
@devZer0 devZer0 added the Type: Defect Incorrect behavior (e.g. crash, hang) label Mar 7, 2023
@devZer0
Author

devZer0 commented Mar 7, 2023

With the Linux 6.1 kernel and ZFS 2.1.9 it seems even slower than with 2.1.5 (1m27s, vs. 3.18s with relatime=on).

@rincebrain
Contributor

Pathology in common with #14512 maybe?

@rincebrain
Contributor

(Replying here because I don't think commenting on that bug that this one seems unrelated is reasonable after I linked it here.)

It might be unrelated, you'd have to see where you're burning your time, but my speculation went something like "disk images often contain large sparse areas" => "this is a known edge case where trying to manipulate sparse areas on things being regularly updated can cause problems, maybe it's causing problems here too".
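For illustration, the qcow2-style sparseness in question can be reproduced with a plain sparse file (a sketch for any Linux with GNU coreutils; file names are temporaries, not from this report):

```shell
# Create a file that is almost entirely hole, like a freshly
# pre-allocated disk image, then copy it hole-aware.
set -e
src=$(mktemp); dst=$(mktemp)
truncate -s 100M "$src"        # 100 MiB apparent size, no data blocks
printf 'data' >> "$src"        # a few real bytes at the end
# GNU cp probes source holes (via SEEK_DATA/SEEK_HOLE where supported)
# and skips them instead of copying zeroes:
cp --sparse=always "$src" "$dst"
du -k "$src" "$dst"            # allocated size: only a few KiB each
stat -c '%s %n' "$src" "$dst"  # apparent size: 104857604 bytes each
rm -f "$src" "$dst"
```

A hole-aware reader issues one SEEK_DATA/SEEK_HOLE pair per extent instead of reading 100 MiB of zeroes, which is exactly the access pattern qemu-img uses on the source image.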

You could try flipping the tunable zfs_dmu_offset_next_sync to 0 to see if the behavior improves. It's not dangerous; it just makes things behave as though there are no holes whenever a forced TXG sync would have been required to get a correct view of the holes, so you might end up reading/writing a bunch of zeroes instead of jumping over them if something was using SEEK_HOLE/SEEK_DATA.
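The tunable can be flipped at runtime via the standard module-parameter path on Linux (a sketch; requires root and the zfs module loaded):

```shell
# Current value (1 = force a txg sync before reporting holes on a
# dirty dnode; this is the default):
cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
# Report no holes instead of forcing a sync:
echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
# Persisting via modprobe options is the usual convention (the file
# name here is an assumption, not from this thread):
echo 'options zfs zfs_dmu_offset_next_sync=0' >> /etc/modprobe.d/zfs.conf
```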

@devZer0
Author

devZer0 commented Mar 8, 2023

I straced the qemu-img process, but it did not reveal anything useful besides the fact that I can see a lot of lseek calls, with every seek apparently causing one or more atime updates.

this is a known edge case where trying to manipulate sparse areas on things being regularly updated can cause problems,

You did notice that this is not related to WRITE access in any way, but only READ access? It also happens when the virtual machine is powered off, so there is no process writing to the file itself.

Apparently, simply reading a metadata-preallocated qcow2 file causes a massive amount of atime updates, and how massive also seems to depend on the layout of the qcow2 file.

When the file is moved back and forth, the problem is gone. Apparently, "qemu-img convert" does not preserve what "qemu-img create" set up initially.

I found this issue after copying a virtual machine from an older cluster to a newer one; moving that file with the Proxmox GUI (i.e. qemu-img) from HDD to SSD was then slower than the copy via scp over the network.
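To see where the time goes, a syscall summary of the conversion is one way to confirm the lseek storm (a sketch; assumes strace and qemu-img are installed, and the image paths are placeholders):

```shell
# Per-syscall counts and cumulative time for the copy:
strace -f -c -e trace=lseek,read,pwrite64 \
    qemu-img convert -p -f qcow2 -O qcow2 /hddpool/vm.qcow2 /nvmepool/vm.qcow2
# Or capture the raw SEEK_HOLE/SEEK_DATA probing and count it:
strace -f -e trace=lseek -o /tmp/lseek.log \
    qemu-img convert -f qcow2 -O qcow2 /hddpool/vm.qcow2 /nvmepool/vm.qcow2
grep -c SEEK_HOLE /tmp/lseek.log
```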

@rincebrain
Contributor

Hypothetically, it could be something like, you do something to access the VM image while it's idle (reading, not writing, just to be entirely clear), it dirties the file because of the needed atime update, and consequently you end up with the aforementioned feature triggering on SEEK_HOLE/SEEK_DATA and forcing a txg sync because it notices the file is dirty, nevermind in what way.

@devZer0
Author

devZer0 commented Mar 8, 2023

You could try flipping the tunable zfs_dmu_offset_next_sync to 0 to see if the behavior improves

It does! It restores performance to pre-2.1.5 behaviour! (20s instead of 1m20s)

@devZer0
Author

devZer0 commented Mar 8, 2023

Hypothetically, it could be something like, you do something to access the VM image while it's idle (reading, not writing, just to be entirely clear), it dirties the file because of the needed atime update, and consequently you end up with the aforementioned feature triggering on SEEK_HOLE/SEEK_DATA and forcing a txg sync because it notices the file is dirty, nevermind in what way.

After looking at bb8526e, I think it's exactly as you say!

I have no clue how this can be resolved in a sane way.

Is there a way to check whether dirtying the file was "just" an atime update (which would not be worth forcing a txg sync)?

Reading through https://man7.org/linux/man-pages/man2/open.2.html, I think O_NOATIME is not an option for open() in qemu-img, since it requires the caller to own the file or hold CAP_FOWNER.
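Whether a given read actually updates atime (and thus dirties the inode) depends on the mount/dataset options, which can be probed generically (a sketch; under noatime the timestamps simply will not move):

```shell
# Write a file, wait, read it, and compare access times.
f=$(mktemp)
printf 'x' > "$f"
before=$(stat -c %X "$f")   # atime, seconds since the epoch
sleep 1
cat "$f" > /dev/null        # a pure read
after=$(stat -c %X "$f")
echo "atime before=$before after=$after"
rm -f "$f"
```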

@devZer0
Author

devZer0 commented Mar 8, 2023

@ryao
Contributor

ryao commented Mar 8, 2023

9f69435 is the commit in 2.1.x that changed our behavior. It was first in 2.1.5.

@behlendorf
Contributor

There's actually an open PR to optimize this, #13368. There's just some outstanding feedback which needs to be addressed before it can be merged.

@rincebrain
Contributor

9f69435 is the commit in 2.1.x that changed our behavior. It was first in 2.1.5.

It goes back further and more wind-y than that, I think.

First, there was #11900, where you could get inconsistent hole data if you looked too fast, as I recall. Oopsie. So the logic was modified, though the tunable predates that, and we got #12724. But that had a flaw, so we got #12745, and #12746 because the argument was that if we effectively didn't report holes without that in most cases, the functionality was likely to bitrot and break strangely one day.

And now here we are, with data that is no longer inconsistent, but behavior that performs pathologically.

I'm curious to see whether #13368 will let us avoid this penalty in most cases in practice.

@devZer0
Author

devZer0 commented Mar 14, 2023

Apparently the patch from Matthew Ahrens has been approved: #13368 (review)

behlendorf pushed a commit that referenced this issue Mar 14, 2023
`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks
reflect all writes, i.e. when there are no dirty data blocks.  To ensure
this, if the target dnode is dirty, they wait for the open txg to be
synced, so we can call them "stabilizing operations".  If they cause
txg_wait_synced often, it can be detrimental to performance.

Typically, a group of files are all modified, and then SEEK_DATA/HOLE
are performed on them.  In this case, the first SEEK does a
txg_wait_synced(), and subsequent SEEKs don't need to wait, so
performance is good.

However, if a workload involves an interleaved metadata modification,
the subsequent SEEK may do a txg_wait_synced() unnecessarily.  For
example, if we do a `read()` syscall to each file before we do its SEEK.
This applies even with `relatime=on`, when the `read()` is the first
read after the last write.  The txg_wait_synced() is unnecessary because
the SEEK operations only care that the structure of the tree of indirect
and data blocks is up to date on disk.  They don't care about metadata
like the contents of the bonus or spill blocks.  (They also don't care
if an existing data block is modified, but this would be more involved
to filter out.)

This commit changes the behavior of SEEK_DATA/HOLE operations such that
they do not call txg_wait_synced() if there is only a pending change to
the bonus or spill block.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by:  Alexander Motin <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #13368 
Issue #14594 
Issue #14512 
Issue #14009
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Mar 14, 2023 (same commit message as above)
behlendorf pushed a commit that referenced this issue Mar 15, 2023 (same commit message as above)
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Mar 17, 2023 (same commit message as above)
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Mar 17, 2023 (same commit message as above)
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Mar 17, 2023 (same commit message as above)
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Mar 17, 2023 (same commit message as above)
@devZer0
Author

devZer0 commented Mar 18, 2023

just for reference: #14641

@ryao
Contributor

ryao commented Mar 18, 2023

apparently patch from matthew ahrens has been approved: #13368 (review)

It has been merged. Is the problem still there?

@ssergiienko

Given that the fix (#13368) was reverted in #14761, the question seems relevant again: is the degradation still there?
Or did it go away with some other optimization? Or is there another ticket that will address this problem?


5 participants