severe performance regression on virtual disk migration for qcow2 on ZFS with ZFS 2.1.5 #14594
Comments
With the Linux 6.1 kernel and ZFS 2.1.9 it seems it's even slower than with 2.1.5 (1m27s vs. 3.18s with relatime=on).
Pathology in common with #14512 maybe?
(Replying here, because I don't think it's reasonable to comment on that bug saying this one seems unrelated after I linked it here.) It might be unrelated; you'd have to see where you're burning your time. But my speculation went something like: "disk images often contain large sparse areas" => "this is a known edge case where trying to manipulate sparse areas on things being regularly updated can cause problems, maybe it's causing problems here too". You could try flipping the tunable.
I did strace the qemu-img process, but it did not reveal anything usable besides the fact that I can see a lot of lseek calls, with every seek apparently causing one or more atime updates.
You did notice that this is not related to WRITE access in any way, but only READ access? It also happens when the virtual machine is powered off, so there is no process writing to the file itself. Apparently, simply reading the metadata-preallocated qcow2 file causes a massive amount of atime updates, and just how massive also seems to depend on the layout of the qcow2 file. When the file is moved back and forth, the problem is gone; apparently "qemu-img convert" does not preserve what "qemu-img create" set up initially.

I found this issue after I copied a virtual machine from an older cluster to a newer one, and moving that file with the Proxmox GUI (i.e. qemu-img) from HDD to SSD was then slower than the copy via scp over the network.
Hypothetically, it could be something like this: you do something to access the VM image while it's idle (reading, not writing, just to be entirely clear), it dirties the file because of the needed atime update, and consequently you end up with the aforementioned feature triggering on SEEK_HOLE/SEEK_DATA and forcing a txg sync because it notices the file is dirty, never mind in what way.
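In syscall terms, that hypothesis would look something like the sketch below (illustrative C only, nothing taken from qemu or ZFS; the comments mark where the speculated behavior would kick in):

```c
/* Sketch of the hypothesized trigger sequence (not a measured repro):
 * a plain read() dirties the dnode via the atime update, and the very
 * next hole probe then has to wait for the open txg to sync.
 */
#define _GNU_SOURCE             /* for SEEK_DATA */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <vm-image>\n", argv[0]);
        return 1;
    }
    char buf[4096];
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    read(fd, buf, sizeof(buf));   /* first read after last write: atime update dirties the dnode */
    lseek(fd, 0, SEEK_DATA);      /* dirty dnode: ZFS may force a txg_wait_synced() here */

    close(fd);
    return 0;
}
```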
It does! It restores performance to pre-2.1.5 behaviour! (20s instead of 1m20s)
After looking at bb8526e, I think it's exactly as you describe! I have no clue how this can be resolved in a sane way. Is there a way to check whether dirtying the file was "just" an atime update (which is not worth forcing a txg sync for)? Reading through https://man7.org/linux/man-pages/man2/open.2.html, I think O_NOATIME is not an option for open() in qemu-img.
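For context on why: per open(2), O_NOATIME must be requested by the opening process itself, and it fails with EPERM unless the caller owns the file or has CAP_FOWNER, so it cannot simply be imposed on qemu-img from outside. A minimal sketch of the syscall semantics (not qemu code):

```c
/* Minimal illustration of the O_NOATIME constraint: the flag only works
 * if the caller's effective UID owns the file or the caller has
 * CAP_FOWNER; otherwise open() fails with EPERM.
 */
#define _GNU_SOURCE             /* O_NOATIME is Linux-specific */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY | O_NOATIME);
    if (fd < 0 && errno == EPERM) {
        /* Not the file owner and no CAP_FOWNER: retry without the flag. */
        fprintf(stderr, "O_NOATIME refused, falling back\n");
        fd = open(argv[1], O_RDONLY);
    }
    if (fd < 0) { perror("open"); return 1; }
    close(fd);
    return 0;
}
```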
For reference: https://www.mail-archive.com/[email protected]/msg427170.html
9f69435 is the commit in 2.1.x that changed our behavior. It first shipped in 2.1.5.
There's actually an open PR to optimize this: #13368. There's just some outstanding feedback that needs to be addressed before it can be merged.
It goes back further, and is windier, than that, I think. First there was #11900, where you could get inconsistent hole data if you looked too fast, as I recall. Oopsie. So the logic was modified (though the tunable predates that), and we got #12724. But that had a flaw, so we got #12745, and then #12746, because the argument was that if we effectively didn't report holes in most cases without it, the functionality was likely to bitrot and break strangely one day. And now here we are: not inconsistent data, but pathologically performing behavior. I'm curious to see whether #13368 will mean we can avoid this penalty in most cases in practice.
Apparently the patch from Matthew Ahrens has been approved: #13368 (review)
`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks reflect all writes, i.e. when there are no dirty data blocks. To ensure this, if the target dnode is dirty, they wait for the open txg to be synced, so we can call them "stabilizing operations". If they cause txg_wait_synced often, it can be detrimental to performance.

Typically, a group of files are all modified, and then SEEK_DATA/HOLE are performed on them. In this case, the first SEEK does a txg_wait_synced(), and subsequent SEEKs don't need to wait, so performance is good. However, if a workload involves an interleaved metadata modification, the subsequent SEEK may do a txg_wait_synced() unnecessarily. For example, if we do a `read()` syscall to each file before we do its SEEK. This applies even with `relatime=on`, when the `read()` is the first read after the last write.

The txg_wait_synced() is unnecessary because the SEEK operations only care that the structure of the tree of indirect and data blocks is up to date on disk. They don't care about metadata like the contents of the bonus or spill blocks. (They also don't care if an existing data block is modified, but this would be more involved to filter out.)

This commit changes the behavior of SEEK_DATA/HOLE operations such that they do not call txg_wait_synced() if there is only a pending change to the bonus or spill block.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #13368
Issue #14594
Issue #14512
Issue #14009
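For readers skimming the thread, here is a minimal sketch (illustrative C, taken neither from the patch nor from qemu) of the data/hole walk that tools like qemu-img perform, i.e. the "stabilizing operations" the commit message describes; each lseek() in the loop is a point where ZFS may have to wait for a txg sync if the file is dirty:

```c
/* Walk a file's data/hole layout with lseek(SEEK_DATA/SEEK_HOLE).
 * Build: cc -o holemap holemap.c   Run: ./holemap <file>
 */
#define _GNU_SOURCE             /* for SEEK_DATA / SEEK_HOLE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    off_t end = lseek(fd, 0, SEEK_END);
    off_t pos = 0;
    while (pos < end) {
        off_t data = lseek(fd, pos, SEEK_DATA);   /* start of next data extent */
        if (data < 0)
            break;                                /* ENXIO: only a trailing hole remains */
        off_t hole = lseek(fd, data, SEEK_HOLE);  /* end of that data extent */
        if (hole < 0) { perror("SEEK_HOLE"); break; }
        printf("data: %lld..%lld\n", (long long)data, (long long)hole);
        pos = hole;
    }
    close(fd);
    return 0;
}
```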
Just for reference: #14641
It has been merged. Is the problem still present?
| Type | Version/Name |
| --- | --- |
| Distribution Name | Proxmox PVE |
| Distribution Version | 7.3 |
| Kernel Version | 5.15.39-2-pve and later |
| Architecture | x86_64 |
| OpenZFS Version | 2.1.5 |
Describe the problem you're observing
On recent Proxmox releases, which ship with ZFS 2.1.5 as part of the kernel package, there is a significant slowdown when moving an empty qcow2 virtual disk file from an HDD-based pool to any other pool.

This issue seems to be related to atime updates: the problem goes away when setting atime=off, or atime=on together with relatime=on.
Describe how to reproduce the problem
Update an older Proxmox installation running the 5.15.39-1-pve kernel to a recent Proxmox version (pve-no-subscription or pvetest repo), then move a qcow2 virtual disk from an HDD-based pool to another pool.
Include any warning/errors/backtraces from the system logs
See the discussion/analysis in this thread:
https://forum.proxmox.com/threads/weird-disk-write-i-o-pattern-on-source-disks-when-moving-virtual-disk.123639/post-538583
Start of the thread:
https://forum.proxmox.com/threads/weird-disk-write-i-o-pattern-on-source-disks-when-moving-virtual-disk.123639/
Not sure what change in ZFS could cause this behaviour; maybe #13338? @rincebrain?