PANIC: zfs: accessing past end of object #12078
One of my backup machines is now in a panic-reboot cycle with this error. In single-user mode ZFS checks out, passes scrubs, etc. In multi-user mode it reports I/O errors, pool corruption, bad disks, and other nasties. Three other systems with the same config do not exhibit these errors yet. One other system is connected to the same JBOD but uses different zpools.

```
--- panic message ---
fffffe00bd33e900 genunix:vcmn_err+42 ()
--- cut here ---
```
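In case it helps others triage a similar report, the single-user-mode health check amounts to something like the following sketch (pool name `tank` is a placeholder):

```sh
# Scrub the pool and inspect the result; a clean pass suggests the
# data at rest is intact and the problem is on the write path.
zpool scrub tank
zpool status -v tank

# Optionally clear transient error counters before retesting multi-user.
zpool clear tank
```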
I'm seeing the same issue on Manjaro with 5.14 kernel and ZFS 2.1.1:
Kernel and ZFS packages:
The same thing happens with Linux 5.13 (ZFS 2.1.1-1). Dataset properties:
Pool properties:
This started occurring about 2 weeks ago with no kernel or ZFS package updates. The only change I have made recently was to switch compression from lz4 to zstd.
@jm650 I see the same stack trace on SmartOS and wrote about it in the newsgroup: https://smartos.topicbox.com/groups/smartos-discuss/Td34fab84c5051dbd/panic-zfs-accessing-past-end-of-object. It seems to be related to some write operation at shell logon.
I've also just seen a similar trace on 5.15 with 2.1.2:
I'll see if I can replicate it, but I don't think it's going to be easy...
Invested some time to look into this again today. I only have remote access to the server via IPMI, but I updated one of the USB sticks (which this server still boots from) to the latest platform image to make sure this isn't an already-resolved issue (mounting a new image via the IPMI virtual ISO and copying it from the ISO to the USB stick). That didn't change anything. I then booted into root/root mode and looked at the affected dataset:
which showed it to be part of a filesystem/VM which I didn't need (looking it up in zones/config). Assuming it was only accessed during startup due to it being listed in …, I changed its mountpoint so it wouldn't be touched. The system boots again now. (Note to others on SmartOS: when changing mountpoints in root/root mode, don't forget to reset them before rebooting.) Two outstanding things now:
I obviously have a large number of dumps in /cores I could provide (core.metadata, core.vmadm, and core.vmadmd) to anybody interested, as well as the theoretical ability (with some planning on my end) to put the machine back into the index and re-cause the issue.
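For reference, the mountpoint juggling described above might look roughly like this; a sketch only, with `zones/mydataset` standing in for the actual dataset name:

```sh
# Keep the suspect dataset from mounting while testing.
zfs set mountpoint=none zones/mydataset

# ...reboot and confirm the panic is gone...

# Restore the inherited mountpoint before returning the box to service,
# per the note above about root/root mode on SmartOS.
zfs inherit mountpoint zones/mydataset
```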
I never followed up on my previous report. After switching recordsize back to 128k and compression back to lz4, I took a snapshot and then did a send|receive into a new dataset on the same pool (fortunately I had enough space to do so). That was nearly 4 months ago, and the system has been perfectly stable since. This is on my daily-driver development laptop, which was hitting the panic multiple times per day. As far as I can tell, no data was ever harmed or corrupted. It seems that the data at rest was fine, and the issue only surfaced when trying to write.
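For anyone wanting to follow the same recovery path, a minimal sketch of that workaround, with `tank/data` as a placeholder dataset:

```sh
# Put the properties back to the values that were stable.
zfs set recordsize=128k tank/data
zfs set compression=lz4 tank/data

# Snapshot, then send|receive into a fresh dataset on the same pool
# so every block is rewritten under the stable settings.
zfs snapshot tank/data@migrate
zfs send tank/data@migrate | zfs receive tank/data-new

# After verifying the copy, swap the datasets.
zfs rename tank/data tank/data-old
zfs rename tank/data-new tank/data
```

Note the receive side needs enough free space for a full second copy until the old dataset is destroyed.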
I have been struggling with what seems to be the same issue. Much like @scotte, I changed my recordsize to something a bit larger and switched the compression from 'lz4' to 'zstd' a few days before the panic appeared. Since this all started, it seems like just triggering numerous reads (possibly with some writes mixed in, as the panic usually appears when launching applications) will cause it to panic and lock everything up. Oddly enough, I can still use the machine if I switch TTY and log in as root. This includes being able to access data on the pool. As root, it appears the only thing that will not execute successfully is a clean shutdown/restart. This is occurring on every kernel I've tried, and it makes no difference whether I use the dkms modules. I am running zfs 2.1.2. Here is my stacktrace:
Similar (maybe the same) issue for me, too: NixOS unstable, ZFS v2.1.3 => v2.1.4 (kernel 5.15 => 5.17), and I also changed the blocksize of some of the datasets to 1M. A minute or two after user login the system partially hung (I could switch VTs, and some terminals worked, but I couldn't log in or start new processes), and after another two-ish minutes I got a kernel panic. I can't identify the exact moment this started to happen, but it was neither immediately after the upgrade nor immediately after the block size change. Pool created with:
Just hit the same issue using zfs git from a few days ago and recent Arch (happens on 6.1 LTS and 6.2 kernels); it only started after I changed the recordsize.
Also hitting this issue after changing recordsize. Is there any known fix for this which doesn't involve migrating the dataset?
The recordsize property affects only new files. You may have a few files remaining which use a different recordsize or compression setting.
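Building on that: existing files keep the block size they were written with until they are rewritten. A hedged way to check a suspect file and force a rewrite in place (paths and the object number are illustrative):

```sh
# A file's ZFS object number is its inode number.
ls -i /tank/data/suspect.file        # prints e.g. "1234 suspect.file"

# Dump the object and check the "dblk" column for its data block size.
zdb -ddddd tank/data 1234

# Rewriting the file makes it adopt the dataset's current recordsize.
cp /tank/data/suspect.file /tank/data/suspect.file.new
mv /tank/data/suspect.file.new /tank/data/suspect.file
```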
Looks like I've been hitting this issue as well. System is Ubuntu 22.04:
On our box we have two users, one of whom systematically produces the panic (it happens when thunderbird downloads emails and starts writing to one of the inbox files). In my case I can't recall when each of those record sizes was set. It looks like this is "common enough"; I wonder whether it's easy to trigger "from scratch", or whether the files triggering the panic were created by a specific older version of openzfs and have some bad metadata somewhere.
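If anyone wants to attempt a from-scratch trigger, here is a sketch under the assumption that the bug needs old small-recordsize blocks overwritten after a recordsize bump (untested; file-backed pool so nothing real is at risk):

```sh
# Throwaway file-backed pool.
truncate -s 1G /tmp/tank.img
zpool create testpool /tmp/tank.img

# Write data under the old settings...
zfs create -o recordsize=128k -o compression=lz4 testpool/mail
dd if=/dev/urandom of=/testpool/mail/inbox bs=128k count=100

# ...then flip the properties and append, mimicking thunderbird
# appending new mail to an existing inbox file.
zfs set recordsize=1M testpool/mail
zfs set compression=zstd testpool/mail
dd if=/dev/urandom of=/testpool/mail/inbox bs=1M count=10 \
   oflag=append conv=notrunc

# Clean up when done.
zpool destroy testpool
```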
I can confirm I had this issue, strangely enough also particularly with thunderbird, on datasets with recordsize=1M. I reverted back to 128K and things are running fine. I have been using zstd for a very long time and don't believe this is related. Kernel 6.5.8-arch1-1-surface #1 SMP PREEMPT_DYNAMIC Sun, 22 Oct 2023 13:57:26 +0000 x86_64 GNU/Linux on Arch Linux (using zfs-dkms), zfs-2.2.0-1.
So I can trigger this at will, with minor changes, if indeed it is the same problem. Unrelated to everything, we are looking at setting:

So we make an equivalent change to:
Don't forget there is code to scale:

Then I run the following:
and it starts:
If I restore …

Since I can trigger it at will, let me know if there is anything you want to know. Sadly, I cannot kernel live-debug it until Apple releases a KDK for my specific OS version. But I sure can printf. Also, in the …
Describe the problem you're observing
Describe how to reproduce the problem
If I only knew... the process is run by a user in a container, every night at approx 3:00 am :( Still investigating. Ideas for bpftrace commands/etc. which could catch more information are welcome :)
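One hedged idea, assuming this message is emitted via OpenZFS's zfs_panic_recover() on Linux as it is in the upstream source: hook that function with a kprobe and log who hit it and from where, before the box goes down:

```sh
# Print the triggering process and the kernel stack whenever the
# "accessing past end of object" path fires (zfs module must be loaded).
bpftrace -e 'kprobe:zfs_panic_recover {
    printf("comm=%s pid=%d\n", comm, pid);
    printf("%s\n", kstack);
}'
```

Separately, my understanding is that the zfs_recover module parameter turns zfs_panic_recover() into a logged warning instead of a panic, which might keep the system alive long enough to collect data; treat that as a debugging aid, not a fix.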