VERIFY3(0 == remove_reference(hdr, ((void *)0), tag)) failed (0 == 1) #6881

Open · jgottula opened this issue Nov 18, 2017 · 8 comments

Labels: Bot: Not Stale (override for the stale bot), Component: Encryption (native encryption feature), Type: Defect (incorrect behavior, e.g. crash, hang)

@jgottula
Contributor

Skip down a couple sections if you don't care about any of this backstory; it might be relevant to the problem, or it might not, I don't know.

Background information

I have a ZFS filesystem containing disk image files which I update every week and then take a snapshot, so that I have a weekly backup history stored up.

I had some disk corruption on this pool's 4-disk raidz1 vdev during a period in which I was running it on only 3 disks, so zpool status -v knows of a couple of snapshots that contain corrupted data. I'm fairly certain that there are also a few more snapshots affected than just what zpool status -v reports, due to shared blocks between the snapshots.

Basically, this pool is barely hanging on, with at least one very marginal drive. I've spent over 3 months and numerous hours now attempting to get all of my data off of this doomed pool via zfs send/zfs recv, something that is still not fully possible due to EINVAL-related problems with zfs recv that I've described in other issue reports here.

I also cannot really "simply swap out the bad disks", because whenever I attempt a zpool replace, it gets stuck at a certain point, never completes, and never detaches the old device. (Oh, and this pool now also spends its time perpetually resilvering over and over, apparently due to the stuck-replacement problem. Or some other bug. I don't know.)

Here's what this wonderful pool looks like:

  pool: pool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Nov 15 03:43:32 2017
        7.94T scanned out of 10.5T at 37.2M/s, 20h10m to go
        118G resilvered, 75.52% done
config:

        NAME                        STATE     READ WRITE CKSUM
        pool                        DEGRADED     0     0    22
          raidz1-0                  DEGRADED     0     0    50
            replacing-0             DEGRADED     0     0     2
              JGPool1               ONLINE       0     0     0
              14172011574593492267  UNAVAIL      0     0     0  was /dev/disk/by-partlabel/JGPoolY
            JGPoolX                 ONLINE       1     0     0  (resilvering)
            replacing-2             DEGRADED     0     0 78.4M
              6328974533181069670   UNAVAIL      0     0     0  was /dev/disk/by-partlabel/JGPool3
              JGPoolZ               ONLINE       0     0     0  (resilvering)
            JGPool4                 ONLINE       0     0     0
        logs
          mirror-1                  ONLINE       0     0     0
            ZIL_Samsung             ONLINE       0     0     0
            ZIL_Corsair             ONLINE       0     0     0
        cache
          L2ARC_Samsung             ONLINE       0     0     0
          L2ARC_Corsair             ONLINE       0     0     0

What I was doing when the panics happened

I've resigned myself to the inevitability that I'll never be able to zfs send/zfs recv my data off of this pool and onto a new one before my drives have fully disintegrated into scrap. Since this collection of historical backups relies extremely heavily on CoW space savings between snapshots (it's about 2.77 TiB compressed, but would be well over 7x larger otherwise), I decided to write up some scripts and manually port over the data to the new pool in a manner that maintains the CoW space savings by only writing the changed parts.
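(For illustration only, the general shape of that approach is roughly the sketch below. It assumes rsync is used for the in-place delta copy and uses made-up pool and dataset names; it is not my actual script.)

#!/bin/sh
# Rough sketch: replay each weekly snapshot of the images onto the new pool,
# rewriting only the blocks that changed so that CoW sharing between the
# destination snapshots is preserved. Paths and dataset names are made up.
SRC=/oldpool/images/.zfs/snapshot
DST=/newpool/images

for snap in $(ls "$SRC" | sort); do
    # --inplace + --no-whole-file make rsync update only the regions of the
    # destination file that actually differ instead of rewriting it wholesale.
    rsync -a --inplace --no-whole-file "$SRC/$snap/" "$DST/"
    zfs snapshot "newpool/images@$snap"
done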

The details of my actual scripts aren't relevant; what is relevant here is that, as a precondition to doing this manual transfer process, I ran a bunch of md5sum processes under GNU parallel to get a hash of each disk image at each snapshot (roughly as sketched below). Most of this went fine.
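A minimal sketch of that kind of invocation (paths are hypothetical; each snapshot exposes the image read-only under the dataset's hidden .zfs/snapshot directory):

# Hash the disk image as it appears in every snapshot, a few at a time.
# md5sum prints the full path, so the output records which snapshot each
# hash belongs to. Paths here are hypothetical.
printf '%s\n' /oldpool/images/.zfs/snapshot/*/disk.img \
    | parallel -j 4 md5sum > /tmp/image-hashes.txt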

For reference, zpool status -v knows of data corruption in snapshots 20170327 and 20170529.

The md5sum process that attempted to read the disk image in the 20170327 snapshot errored out with EIO, as expected.

However, the md5sum processes that attempted to read the disk image in the 20170515, 20170522, 20170529, and 20170605 snapshots got stuck in state D (uninterruptible sleep) forever, presumably due to the task-thread panics that are the subject of this report.

Interestingly, the system is still entirely usable (i.e. other I/O initiated after the problem doesn't get stuck); but those md5sum processes are forever stuck in state D, and the couple of z_rd_int_7 and z_rd_int_0 kernel threads that had the panics are now idle, doing nothing.

The task thread panics themselves

Two panics, separated by approximately 2 seconds.

Nov 15 10:02:41 jgpc kernel: VERIFY3(0 == remove_reference(hdr, ((void *)0), tag)) failed (0 == 1)
Nov 15 10:02:41 jgpc kernel: PANIC at arc.c:3748:arc_buf_destroy()
Nov 15 10:02:41 jgpc kernel: Showing stack for process 1033
Nov 15 10:02:41 jgpc kernel: CPU: 2 PID: 1033 Comm: z_rd_int_7 Tainted: P      D    O    4.13.6-1 #1
Nov 15 10:02:41 jgpc kernel: Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A99X EVO, BIOS 1708 04/10/2013
Nov 15 10:02:41 jgpc kernel: Call Trace:
Nov 15 10:02:41 jgpc kernel:  dump_stack+0x62/0x7a
Nov 15 10:02:41 jgpc kernel:  spl_dumpstack+0x55/0x60 [spl]
Nov 15 10:02:41 jgpc kernel:  spl_panic+0xd2/0x120 [spl]
Nov 15 10:02:41 jgpc kernel:  ? debug_smp_processor_id+0x2a/0x30
Nov 15 10:02:41 jgpc kernel:  ? spl_kmem_cache_alloc+0xc1/0x710 [spl]
Nov 15 10:02:41 jgpc kernel:  ? spl_kmem_cache_alloc+0x70/0x710 [spl]
Nov 15 10:02:41 jgpc kernel:  ? buf_cons+0x84/0x90 [zfs]
Nov 15 10:02:41 jgpc kernel:  arc_buf_destroy+0x143/0x150 [zfs]
Nov 15 10:02:41 jgpc kernel:  dbuf_read_done+0x99/0xf0 [zfs]
Nov 15 10:02:41 jgpc kernel:  arc_read_done+0x1e9/0x470 [zfs]
Nov 15 10:02:41 jgpc kernel:  zio_done+0x420/0xe48 [zfs]
Nov 15 10:02:41 jgpc kernel:  zio_execute+0x9d/0xf8 [zfs]
Nov 15 10:02:41 jgpc kernel:  taskq_thread+0x299/0x4b8 [spl]
Nov 15 10:02:41 jgpc kernel:  ? wake_up_q+0x90/0x90
Nov 15 10:02:41 jgpc kernel:  ? zio_reexecute+0x3a0/0x3a0 [zfs]
Nov 15 10:02:41 jgpc kernel:  kthread+0x134/0x150
Nov 15 10:02:41 jgpc kernel:  ? taskq_thread_should_stop+0x88/0x88 [spl]
Nov 15 10:02:41 jgpc kernel:  ? kthread_create_on_node+0x50/0x50
Nov 15 10:02:41 jgpc kernel:  ret_from_fork+0x22/0x30

Nov 15 10:02:43 jgpc kernel: VERIFY3(0 == remove_reference(hdr, ((void *)0), tag)) failed (0 == 2)
Nov 15 10:02:43 jgpc kernel: PANIC at arc.c:3748:arc_buf_destroy()
Nov 15 10:02:43 jgpc kernel: Showing stack for process 946
Nov 15 10:02:43 jgpc kernel: CPU: 2 PID: 946 Comm: z_rd_int_0 Tainted: P      D    O    4.13.6-1 #1
Nov 15 10:02:43 jgpc kernel: Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A99X EVO, BIOS 1708 04/10/2013
Nov 15 10:02:43 jgpc kernel: Call Trace:
Nov 15 10:02:43 jgpc kernel:  dump_stack+0x62/0x7a
Nov 15 10:02:43 jgpc kernel:  spl_dumpstack+0x55/0x60 [spl]
Nov 15 10:02:43 jgpc kernel:  spl_panic+0xd2/0x120 [spl]
Nov 15 10:02:43 jgpc kernel:  ? debug_smp_processor_id+0x2a/0x30
Nov 15 10:02:43 jgpc kernel:  ? spl_kmem_cache_alloc+0xc1/0x710 [spl]
Nov 15 10:02:43 jgpc kernel:  ? spl_kmem_cache_alloc+0x70/0x710 [spl]
Nov 15 10:02:43 jgpc kernel:  ? buf_cons+0x84/0x90 [zfs]
Nov 15 10:02:43 jgpc kernel:  arc_buf_destroy+0x143/0x150 [zfs]
Nov 15 10:02:43 jgpc kernel:  dbuf_read_done+0x99/0xf0 [zfs]
Nov 15 10:02:43 jgpc kernel:  arc_read_done+0x1e9/0x470 [zfs]
Nov 15 10:02:43 jgpc kernel:  zio_done+0x420/0xe48 [zfs]
Nov 15 10:02:43 jgpc kernel:  zio_execute+0x9d/0xf8 [zfs]
Nov 15 10:02:43 jgpc kernel:  taskq_thread+0x299/0x4b8 [spl]
Nov 15 10:02:43 jgpc kernel:  ? wake_up_q+0x90/0x90
Nov 15 10:02:43 jgpc kernel:  ? zio_reexecute+0x3a0/0x3a0 [zfs]
Nov 15 10:02:43 jgpc kernel:  kthread+0x134/0x150
Nov 15 10:02:43 jgpc kernel:  ? taskq_thread_should_stop+0x88/0x88 [spl]
Nov 15 10:02:43 jgpc kernel:  ? kthread_create_on_node+0x50/0x50
Nov 15 10:02:43 jgpc kernel:  ret_from_fork+0x22/0x30

The stuck md5sum processes

All of the stuck md5sum processes have the following stack:

# cat /proc/$STUCK_PID/stack
[<ffffffff8109b6a9>] io_schedule+0x29/0x50
[<ffffffffa062fbc8>] cv_wait_common+0xb0/0x120 [spl]
[<ffffffffa062fcc3>] __cv_wait_io+0x2b/0x30 [spl]
[<ffffffffa09ceff2>] zio_wait+0xf2/0x198 [zfs]
[<ffffffffa0912847>] dmu_buf_hold_array_by_dnode+0x167/0x480 [zfs]
[<ffffffffa0913f39>] dmu_read_uio_dnode+0x49/0xf0 [zfs]
[<ffffffffa0914038>] dmu_read_uio_dbuf+0x58/0x78 [zfs]
[<ffffffffa09b6f85>] zfs_read+0x135/0x3f0 [zfs]
[<ffffffffa09d8f30>] zpl_read_common_iovec+0x80/0xc0 [zfs]
[<ffffffffa09d97c9>] zpl_iter_read+0xa9/0xf0 [zfs]
[<ffffffff81222e3b>] __vfs_read+0xcb/0x128
[<ffffffff81224152>] vfs_read+0xa2/0x148
[<ffffffff8122586d>] SyS_read+0x55/0xb8
[<ffffffff8164aaa4>] entry_SYSCALL_64_fastpath+0x17/0x98
[<ffffffffffffffff>] 0xffffffffffffffff
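(For reference, this is how to enumerate every D-state process and grab its kernel stack in one go; nothing ZFS-specific, just /proc, and it needs root.)

# List all processes in uninterruptible sleep (state D) and dump their
# in-kernel stacks; /proc/<pid>/stack is readable by root only.
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    echo "=== PID $pid ($(cat /proc/$pid/comm 2>/dev/null)) ==="
    cat /proc/"$pid"/stack
done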

System information

Type                  Version/Name
Distribution Name     Arch Linux
Distribution Version  n/a
Linux Kernel          4.13.6
Architecture          x86_64
ZFS Version           cdc15a7
SPL Version           openzfs/spl@0cefc9d
@kylebur

kylebur commented Dec 8, 2017

I don't have a solution yet, but I also see the same bug. This is on a 36-bay system populated with 8 TB disks.

# zpool status
  pool: rzpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Dec 8 08:45:18 2017
        1.78G scanned out of 39.7T at 2.29M/s, (scan is slow, no estimated time)
        116M resilvered, 0.00% done
config:

NAME                          STATE     READ WRITE CKSUM
rzpool                        DEGRADED     0     0   450
  raidz3-0                    ONLINE       0     0 1.20K
    scsi-35000c50092bcebfc    ONLINE       0     0     0
    scsi-35000c50092bd7315    ONLINE       0     0     0
    scsi-35000c50092bd382e    ONLINE       0     0     0
    scsi-35000c50092bcb8d2    ONLINE       0     0     0
    scsi-35000c50092bca1e7    ONLINE       0     0     1
    scsi-35000c50092bd28d7    ONLINE       0     0     0
    scsi-35000c50092bd45b1    ONLINE       0     0     0
    scsi-35000c50092bdc4db    ONLINE       0     0     0
    scsi-35000c50092bd4ba1    ONLINE       0     0     1
    scsi-35000c50092bd42ea    ONLINE       0     0     0
    scsi-35000c50092bd7396    ONLINE       0     0     0
    scsi-35000c50092bcff8a    ONLINE       0     0     0  (resilvering)
  raidz3-1                    ONLINE       0     0    73
    scsi-35000c50092bdefbb    ONLINE       0     0     1
    scsi-35000c50092bc6763    ONLINE       0     0     0  (resilvering)
    scsi-35000c50092bd86a9    ONLINE       0     0     1
    scsi-35000c50092bd83fe    ONLINE       0     0     0
    scsi-35000c50092bd0fd2    ONLINE       0     0     1
    scsi-35000c50092bc37f3    ONLINE       0     0     0
    scsi-35000c50092bd9ec5    ONLINE       0     0     0
    scsi-35000c50092bd12bc    ONLINE       0     0     0  (resilvering)
    scsi-35000c50092be8904    ONLINE       0     0     2
    scsi-35000c50092bdd9ba    ONLINE       0     0     0
    scsi-35000c50092bccac5    ONLINE       0     0     0
    scsi-35000c50092bed185    ONLINE       0     0     0
  raidz3-2                    DEGRADED     0     0   504
    scsi-35000c50092bdf8ba    ONLINE       0     0     0
    scsi-35000c50092bcd23b    ONLINE       0     0     0
    scsi-35000c50092bc583c    ONLINE       0     0     0
    replacing-3               DEGRADED     0     0     0
      1335565945269893853     UNAVAIL      0     0     0  was /dev/sdae1/old
      scsi-35000c50092bebb1a  ONLINE       0     0     0  (resilvering)
    scsi-35000c50092bc6c6c    ONLINE       0     0     0  (resilvering)
    scsi-35000c50092bcfbbe    ONLINE       0     0     0
    scsi-35000c50092bcd906    ONLINE       0     0     0
    scsi-35000c50092bd88ba    ONLINE       0     0     0  (resilvering)
    scsi-35000c50092bddd7d    ONLINE       0     0     0  (resilvering)
    scsi-35000c50092be1910    ONLINE       0     0     0
    scsi-35000c50092bdfd56    ONLINE       0     0     0
    scsi-35000c50092bd53a7    ONLINE       0     0     0

errors: 2927 data errors, use '-v' for a list

In the dmesg:

[71413.662463] VERIFY3(0 == remove_reference(hdr, ((void *)0), tag)) failed (0 == 1)
[71413.662470] PANIC at arc.c:3076:arc_buf_destroy()
[71413.662472] Showing stack for process 19023
[71413.662476] CPU: 3 PID: 19023 Comm: z_rd_int_7 Tainted: P O 4.14.3 #1
[71413.662477] Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0 12/17/2015
[71413.662478] Call Trace:
[71413.662491]  dump_stack+0x63/0x89
[71413.662505]  spl_dumpstack+0x44/0x50 [spl]
[71413.662512]  spl_panic+0xc9/0x110 [spl]
[71413.662591]  ? vdev_disk_io_start+0x4d9/0x6c0 [zfs]
[71413.662596]  ? __slab_free+0x9b/0x2c0
[71413.662599]  ? __slab_free+0x9b/0x2c0
[71413.662637]  ? arc_available_memory+0x2e/0xd0 [zfs]
[71413.662677]  arc_buf_destroy+0x11c/0x130 [zfs]
[71413.662717]  dbuf_read_done+0x91/0xf0 [zfs]
[71413.662753]  arc_read_done+0x17e/0x2d0 [zfs]
[71413.662815]  zio_done+0x311/0xcb0 [zfs]
[71413.662819]  ? kfree+0x133/0x180
[71413.662825]  ? spl_kmem_free+0x35/0x40 [spl]
[71413.662884]  zio_execute+0x8f/0xf0 [zfs]
[71413.662892]  taskq_thread+0x2af/0x530 [spl]
[71413.662897]  ? wake_up_q+0x80/0x80
[71413.662900]  kthread+0x109/0x140
[71413.662906]  ? taskq_thread_spawn+0x50/0x50 [spl]
[71413.662908]  ? kthread_park+0x60/0x60
[71413.662912]  ret_from_fork+0x25/0x30

This was followed by many more of the same messages.

I started out with an Areca SAS RAID card and initially thought the problem was with the card, so I replaced it with an LSI card, but I still see the same issue.

I was initially using the latest CentOS 7 kernel with the CentOS 7 zfsonlinux packages, but yesterday I installed kernel 4.14.3 with spl-0.7.3 (patched to change vfs_read and vfs_write to kernel_read and kernel_write) and zfs-0.7.3. I still see the same error.

If anyone on the zfsonlinux project has an idea what is going on here, I would very much appreciate it!

@ewwhite

ewwhite commented Aug 14, 2018

Any movement on this issue?

@stale

stale bot commented Aug 25, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the "Status: Stale" label Aug 25, 2020
@stale stale bot closed this as completed Nov 23, 2020
@ahrens added the "Type: Defect" label and removed the "Status: Stale" label Sep 14, 2021
@ahrens reopened this Sep 14, 2021
@ahrens
Member

ahrens commented Sep 14, 2021

This issue was raised at the September 2021 OpenZFS Leadership meeting as a potentially serious bug that has fallen through the cracks. It may relate to https://www.illumos.org/issues/14003#note-1

@stale

stale bot commented Sep 16, 2022

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the "Status: Stale" label Sep 16, 2022
@behlendorf added the "Bot: Not Stale" label and removed the "Status: Stale" label Sep 16, 2022
@rincebrain added the "Component: Encryption" label Nov 14, 2023
@gwr
Contributor

gwr commented Dec 7, 2023

This bug is still out there, but it's apparently hard to reproduce. We've seen it "in the wild".
See also: https://www.illumos.org/issues/15790 which also looks like a close match.
There is now an analysis and a reproducer at https://www.illumos.org/issues/14003
A fix has also been posted under that issue.

@neurotensin

neurotensin commented Feb 5, 2024

I can confirm this issue is still relevant: I just hit it. I might suggest a cause, though. First, my details:
5.15.0-25-generic #25-Ubuntu SMP Wed Mar 30 15:54:22 UTC 2022
ii libzfs4linux 2.1.2-1ubuntu3 amd64 OpenZFS filesystem library for Linux - general support
ii libzpool5linux 2.1.2-1ubuntu3 amd64 OpenZFS pool library for Linux
ii zfs-dkms 2.1.2-1ubuntu3 all OpenZFS filesystem kernel modules for Linux
ii zfs-zed 2.1.2-1ubuntu3 amd64 OpenZFS Event Daemon
ii zfsutils-linux 2.1.2-1ubuntu3 amd64 command-line tools to manage OpenZFS filesystems

And the error:

[1351503.757506] INFO: task z_rd_int_1:7824 blocked for more than 362 seconds.
[1351503.757510] Tainted: P O 5.15.0-25-generic #25-Ubuntu
[1351503.757512] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1351503.757529] task:z_rd_int_1 state:D stack: 0 pid: 7824 ppid: 2 flags:0x00004000
[1351503.757531] Call Trace:
[1351503.757533]  <TASK>
[1351503.757535] __schedule+0x23d/0x590
[1351503.757540] schedule+0x4e/0xb0
[1351503.757543] spl_panic+0xe7/0xe9 [spl]
[1351503.757551] ? do_raw_spin_unlock.constprop.0+0x9/0x10 [zfs]
[1351503.757701] ? __raw_spin_unlock.constprop.0+0x9/0x10 [zfs]
[1351503.757815] ? zfs_zevent_post+0x186/0x1b0 [zfs]
[1351503.757876] ? cityhash4+0x90/0xa0 [zcommon]
[1351503.757880] ? buf_hash+0x18/0x20 [zfs]
[1351503.757919] arc_buf_destroy+0xe5/0xf0 [zfs]
[1351503.757961] arc_read_done+0x224/0x530 [zfs]
[1351503.758002] zio_done+0x3ad/0xe50 [zfs]
[1351503.758075] zio_execute+0x92/0xe0 [zfs]
[1351503.758149] taskq_thread+0x21d/0x400 [spl]
[1351503.758156] ? wake_up_q+0x90/0x90
[1351503.758159] ? zio_execute_stack_check.constprop.0+0x10/0x10 [zfs]
[1351503.758232] ? param_set_taskq_kick+0xf0/0xf0 [spl]
[1351503.758238] kthread+0x12a/0x150
[1351503.758240] ? set_kthread_struct+0x50/0x50
[1351503.758242] ret_from_fork+0x22/0x30
[1351503.758245]  </TASK>

and the console:

kernel:[1351131.586643] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
kernel:[1351131.586646] PANIC at arc.c:3842:arc_buf_destroy()

This was a device hanging off USB-C that developed a fault on an mdraid array I am running (I'm in the middle of upgrading this laptop). I had to reseat the cable (pull it out, plug it back in), so it is highly likely that device errors in the interim caused this issue.

The mdraid, BTW, recovered with no issues (AFAIK), but the fault would appear to be a definite trigger event.

@cyphar
Contributor

cyphar commented Mar 28, 2024

I also just ran into something very similar. This is with regular spinning rust drives connected via SATA (using a SAS expander) and there were no other IO-related errors in the kernel log. Since it was mentioned in some other issue, I use ZFS encryption.

From SMART monitoring it seems that ata-ST10000VN0008-2PJ103_ZS515AYN is showing an increase in raw read errors, but the panic, along with the fact that attempts to run zpool scrub result in a hung zpool process (and running zpool status also hangs), seems to indicate there might be more going on. Also, the device itself passes a SMART health check (though I'm sure those aren't the most reliable things in the world...).
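(For anyone wanting to check the same counters, something like the following covers the relevant SMART attributes; the /dev/disk/by-id path is an assumption based on the device name in the pool layout below.)

# Attributes of interest on the suspect drive (path inferred from zpool status).
smartctl -A /dev/disk/by-id/ata-ST10000VN0008-2PJ103_ZS515AYN \
    | grep -Ei 'raw_read_error|reallocated|pending|offline_uncorrect'

# Overall health self-assessment, plus a short self-test.
smartctl -H /dev/disk/by-id/ata-ST10000VN0008-2PJ103_ZS515AYN
smartctl -t short /dev/disk/by-id/ata-ST10000VN0008-2PJ103_ZS515AYN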

[2530221.641600] VERIFY3(remove_reference(hdr, hdr) > 0) failed (0 > 0)
[2530221.641627] PANIC at arc.c:6610:arc_write_done()
[2530221.641645] Showing stack for process 21525
[2530221.641658] CPU: 8 PID: 21525 Comm: z_wr_int_0 Tainted: P           O      N 5.14.21-150500.55.49-default #1 SLE15-SP5 a8d284bf25556592b1cf60fe7a08b3519f64459a
[2530221.641690] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450 Pro4, BIOS P1.50 11/05/2018
[2530221.641710] Call Trace:
[2530221.641722]  <TASK>
[2530221.641735]  dump_stack_lvl+0x45/0x5b
[2530221.641758]  spl_panic+0xc8/0x100 [spl af95f7395f3eb9a449f705221058fe2fb998d52b]
[2530221.641798]  ? kmem_cache_free+0x3c0/0x410
[2530221.641816]  ? srso_return_thunk+0x5/0x10
[2530221.641832]  ? kmem_cache_free+0x3c0/0x410
[2530221.641851]  ? spl_kmem_cache_free+0x110/0x1e0 [spl af95f7395f3eb9a449f705221058fe2fb998d52b]
[2530221.641883]  ? srso_return_thunk+0x5/0x10
[2530221.641898]  ? put_cpu_partial+0xb2/0xc0
[2530221.641914]  ? srso_return_thunk+0x5/0x10
[2530221.641928]  ? kmem_cache_free+0x3c0/0x410
[2530221.641946]  ? srso_return_thunk+0x5/0x10
[2530221.641961]  ? spl_kmem_cache_free+0x110/0x1e0 [spl af95f7395f3eb9a449f705221058fe2fb998d52b]
[2530221.641996]  arc_write_done+0x42b/0x4d0 [zfs 251ddbcbed81ad8aaff15de0483e32050921dab8]
[2530221.642223]  zio_done+0x3f1/0xfb0 [zfs 251ddbcbed81ad8aaff15de0483e32050921dab8]
[2530221.642485]  zio_execute+0x86/0xf0 [zfs 251ddbcbed81ad8aaff15de0483e32050921dab8]
[2530221.642736]  taskq_thread+0x266/0x450 [spl af95f7395f3eb9a449f705221058fe2fb998d52b]
[2530221.642771]  ? wake_up_q+0x90/0x90
[2530221.642793]  ? zio_reexecute+0x3e0/0x3e0 [zfs 251ddbcbed81ad8aaff15de0483e32050921dab8]
[2530221.643125]  ? task_done+0xa0/0xa0 [spl af95f7395f3eb9a449f705221058fe2fb998d52b]
[2530221.643168]  kthread+0x156/0x180
[2530221.643187]  ? set_kthread_struct+0x50/0x50
[2530221.643208]  ret_from_fork+0x22/0x30
[2530221.643238]  </TASK>
% uname -a
Linux navi 5.14.21-150500.55.49-default #1 SMP PREEMPT_DYNAMIC Sun Feb 11 17:48:15 UTC 2024 (36baf2f) x86_64 x86_64 x86_64 GNU/Linux
% sudo zfs version
zfs-2.2.3-1
zfs-kmod-2.2.3-1
% sudo zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 1 days 21:21:26 with 0 errors on Tue Mar 26 19:21:41 2024
remove: Removal of vdev 1 copied 3.59T in 9h27m, completed on Sat Feb 12 23:50:24 2022
        7.70M memory used for removed device mappings
config:

        NAME                                   STATE     READ WRITE CKSUM
        tank                                   ONLINE       0     0     0
          mirror-0                             ONLINE       0     0     0
            ata-TOSHIBA_HDWG11A_70P0A0Q8FATG   ONLINE       0     0     0
            ata-ST10000VN0008-2PJ103_ZS515AYN  ONLINE       0     0     1
          mirror-2                             ONLINE       0     0     0
            ata-TOSHIBA_HDWG11A_70P0A0Y4FATG   ONLINE       0     0     0
            ata-ST10000VN0008-2PJ103_ZS515P7F  ONLINE       0     0     0

errors: 2 data errors, use '-v' for a list

EDIT: Weirdly, after a reboot the warning was still there in status, but there was no CKSUM error listed for the problem drive, and after running zpool status -v no "problem files" were listed and the warning in status disappeared... I'm running a scrub now...
