zfs_bclone_wait_dirty=1 broken for files with unallocated blocks at the end #15994

rrevans · 2024-03-14T03:22:09Z

System information

Type	Version/Name
Distribution Name	Fedora
Distribution Version	39
Kernel Version	6.6.8-200.fc39.x86_64
Architecture	x86_64
OpenZFS Version	Build from head (`8f2f6cd` DEBUG)

Describe the problem you're observing

With zfs_bclone_enabled=1 and zfs_bclone_wait_dirty=1, copying a file with unallocated blocks at the end gets stuck in the kernel forever. During this time, the kernel is also forcing txg syncs in an infinite loop.

This behavior is also observed in #15933.

(The same underlying issue also means that bclone always fails for files with unallocated blocks at the end if zfs_bclone_wait_dirty=0.)

Describe how to reproduce the problem

$ echo 1 > /sys/module/zfs/parameters/zfs_bclone_enabled
$ echo 1 > /sys/module/zfs/parameters/zfs_bclone_wait_dirty
$ cd /testpool
$ truncate --size=256M src
$ zpool sync
$ cp --reflink=always src dst

cp never finishes and is stuck in an uninterruptible state unresponsive to both SIGINT and SIGQUIT.

Setting zfs_bclone_wait_dirty=0 while cp is still running causes cp to finish immediately with error cp: failed to clone 'dst' from 'src': Resource temporarily unavailable

Include any warning/errors/backtraces from the system logs

Nothing immediately in dmesg or dbgmsg, but during the failure /proc/spl/kstat/zfs/testpool/txgs shows that zfs is generating a lot of empty txgs:

$ tail -30 /proc/spl/kstat/zfs/testpool/txgs
1571017  393315256830     C     0            0            0            0        0        701860       6090         25220        77480       
1571018  393315958690     C     0            0            0            0        0        704610       6090         25140        77340       
1571019  393316663300     C     0            0            0            0        0        695860       6800         25400        106120      
1571020  393317359160     C     0            0            0            0        0        719080       6120         25310        88020       
1571021  393318078240     C     0            0            0            0        0        710381       6100         25260        78150       
1571022  393318788621     C     0            0            0            0        0        710010       6100         25780        78240       
1571023  393319498631     C     0            0            0            0        0        693820       6130         25420        79990       
1571024  393320192451     C     0            0            0            0        0        701620       6090         24460        77400       
1571025  393320894071     C     0            0            0            0        0        701870       6100         25250        78390       
1571026  393321595941     C     0            0            0            0        0        696431       6080         25560        77490       
1571027  393322292372     C     0            0            0            0        0        699960       6070         25330        77490       
1571028  393322992332     C     0            0            0            0        0        702420       6220         25210        77350       
1571029  393323694752     C     0            0            0            0        0        695030       6160         25730        97120       
1571030  393324389782     C     0            0            0            0        0        710290       6080         25340        87720       
1571031  393325100072     C     0            0            0            0        0        709801       6070         25430        78120       
1571032  393325809873     C     0            0            0            0        0        697140       6240         24630        77370       
1571033  393326507013     C     0            0            0            0        0        696710       6080         25130        78150       
1571034  393327203723     C     0            0            0            0        0        700170       6110         24730        77050       
1571035  393327903893     C     0            0            0            0        0        703480       6140         25691        78570       
1571036  393328607373     C     0            0            0            0        0        698401       6110         25590        78960       
1571037  393329305774     C     0            0            0            0        0        699180       6090         25290        77460       
1571038  393330004954     C     0            0            0            0        0        703890       6090         25230        77220       
1571039  393330708844     C     0            0            0            0        0        694670       6080         25520        96280       
1571040  393331403514     C     0            0            0            0        0        707651       6080         25380        87750       
1571041  393332111165     C     0            0            0            0        0        711130       6250         25450        77950       
1571042  393332822295     C     0            0            0            0        0        705860       6120         24780        83430       
1571043  393333528155     C     0            0            0            0        0        745090       6140         34510        96930       
1571044  393334273245     C     0            0            0            0        0        790700       6180         32650        106500      
1571045  393335063945     S     0            0            0            0        0        784211       6140         32690        0           
1571046  393335848156     O     0            0            0            0        0        0            0            0            0

Root cause

#15842 adds logic to wait for sync when encountering dirty blocks implemented as syncing when dmu_read_l0_bps returns EAGAIN, but the logic is broken. cc @behlendorf

Normally dmu_read_l0_bps returns EAGAIN for dirty blocks. However it also returns EAGAIN whenever db->db_blkptr == NULL. This normally occurs for newly-written blocks not-yet-allocated, but it also occurs for sparse, unallocated blocks beyond the end of a fully-synced object. (More specifically, this occurs for any of the conditions that cause dbuf_findbp to return ENOENT when holding the dbuf.)

In this situation, zfs_clone_range tries to force a sync when zfs_bclone_wait_dirty=1, but syncing does not allocate any blocks since none are actually dirty. Then the next attempt runs into the same condition and syncs again in an infinite loop. Setting zfs_bclone_wait_dirty=0 breaks the loop and returns an error to cp.

This is trivially reproducible by creating an empty sparse file, as seen by zdb:

$ zdb -vvvvvv -O testpool src
obj=128 dataset=testpool path=/src type=19 bonustype=44

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       128    1   128K   128K      0     512   128K    0.00  ZFS plain file (K=inherit) (Z=inherit=lz4)
                                               184   bonus  System attributes
	dnode flags: USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
	dnode maxblkid: 0
	uid     1000
	gid     1000
	atime	Wed Mar 13 23:17:42 2024
	mtime	Wed Mar 13 23:17:42 2024
	ctime	Wed Mar 13 23:17:42 2024
	crtime	Wed Mar 13 23:17:42 2024
	gen	2368119
	mode	100644
	size	268435456
	parent	34
	links	1
	pflags	840800000004
	xattr	129
Indirect blocks:

Note that despite being a 256MiB file size according to ZPL metadata, the actual on-disk object is still a 1-level object with dn_maxblkid == 0 and no indirect blocks which is sufficient to trigger the db_blkptr == NULL case upon dbuf_hold.

The text was updated successfully, but these errors were encountered:

- When reading L0 block pointers handle buffers without ones and without dirty records as a holes. Those appear when dnode size was increased, but the end was never written, so there are no new indirection levels to store the pointers. It makes no sense to return EAGAIN here, since sync won't create new indirection levels until there will be actual writes. - When cloning blocks always set destination logical birth time to the current TXG. Otherwise if we are cloning over existing data, newly created holes may not be properly replicated later. While there, stop messing with physical birth time, it was already copied as part of the pointer and should already be zero for holes. Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes: openzfs#15994

amotin · 2024-03-18T18:48:54Z

This seems to be completely unrelated issue from #15933. #16007 should fix this.

- When reading L0 block pointers handle buffers without ones and without dirty records as a holes. Those appear when dnode size was increased, but the end was never written, so there are no new indirection levels to store the pointers. It makes no sense to return EAGAIN here, since sync won't create new indirection levels until there will be actual writes. - When cloning blocks set destination hole logical birth time to the current TXG. Otherwise if we are cloning over existing data, newly created holes may not be properly replicated later. While there, stop messing with physical birth time, it was already copied as part of the pointer and should already be zero for holes. Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes: openzfs#15994

- When reading L0 block pointers handle buffers without ones and without dirty records as a holes. Those appear when dnode size was increased, but the end was never written, so there are no new indirection levels to store the pointers. It makes no sense to return EAGAIN here, since sync won't create new indirection levels until there will be actual writes. - When cloning blocks set destination hole logical birth time to the current TXG. Otherwise if we are cloning over existing data, newly created holes may not be properly replicated later. Use BP_SET_BIRTH() when possible to not replicate its logic. Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes: openzfs#15994

- When reading L0 block pointers handle buffers without ones and without dirty records as a holes. Those appear when dnode size was increased, but the end was never written, so there are no new indirection levels to store the pointers. It makes no sense to return EAGAIN here, since sync won't create new indirection levels until there will be actual writes. - When cloning blocks set destination hole logical birth time to the current TXG. Otherwise if we are cloning over existing data, newly created holes may not be properly replicated later. Use BP_SET_BIRTH() when possible to not replicate its logic. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15994 Closes openzfs#16007

- When reading L0 block pointers handle buffers without ones and without dirty records as a holes. Those appear when dnode size was increased, but the end was never written, so there are no new indirection levels to store the pointers. It makes no sense to return EAGAIN here, since sync won't create new indirection levels until there will be actual writes. - When cloning blocks set destination hole logical birth time to the current TXG. Otherwise if we are cloning over existing data, newly created holes may not be properly replicated later. Use BP_SET_BIRTH() when possible to not replicate its logic. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes #15994 Closes #16007

- When reading L0 block pointers handle buffers without ones and without dirty records as a holes. Those appear when dnode size was increased, but the end was never written, so there are no new indirection levels to store the pointers. It makes no sense to return EAGAIN here, since sync won't create new indirection levels until there will be actual writes. - When cloning blocks set destination hole logical birth time to the current TXG. Otherwise if we are cloning over existing data, newly created holes may not be properly replicated later. Use BP_SET_BIRTH() when possible to not replicate its logic. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#15994 Closes openzfs#16007

rrevans added the Type: Defect Incorrect behavior (e.g. crash, hang) label Mar 14, 2024

rrevans mentioned this issue Mar 14, 2024

some copied files are still corrupted (chunks replaced by zeros) #15933

Closed

rincebrain mentioned this issue Mar 16, 2024

Add a kill switch for SEEK_HOLE/SEEK_DATA #16000

Closed

13 tasks

amotin mentioned this issue Mar 18, 2024

BRT: Fix holes cloning #16007

Closed

13 tasks

behlendorf closed this as completed in a89d209 Mar 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

zfs_bclone_wait_dirty=1 broken for files with unallocated blocks at the end #15994

zfs_bclone_wait_dirty=1 broken for files with unallocated blocks at the end #15994

rrevans commented Mar 14, 2024 •

edited

Loading

amotin commented Mar 18, 2024

zfs_bclone_wait_dirty=1 broken for files with unallocated blocks at the end #15994

zfs_bclone_wait_dirty=1 broken for files with unallocated blocks at the end #15994

Comments

rrevans commented Mar 14, 2024 • edited Loading

System information

Describe the problem you're observing

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

Root cause

amotin commented Mar 18, 2024

rrevans commented Mar 14, 2024 •

edited

Loading