grub fails with message "free magic broken" or "alloc magic broken" #2284
Comments
Reviewed the dosfstools changelog; nothing jumped out at me as a culprit. I mounted the boot partition from an affected and non-affected image:
|
Log from booting alpha. I went into the grub command line and set Edit: built new images today (qemu, qemu_uefi); they did not trigger this bug. Tested with the |
This reverts commit 7f058d6. Reverting because of bug 2284 [1] where grub will sometimes fail due to memory corruption. This is _not_ the cause of the bug, and the bug can even be reproduced with this reversion, but it seems to occur less when not using fat32. [1] coreos/bugs#2284
After about a week of digging I still haven't been able to find the problem. Here is what I did find:
|
@ajeddeloh That error comes from grub's memory manager code: https://github.com/coreos/grub/blob/master/grub-core/kern/mm.c#L205, so the problem seems to be memory corruption of the grub_mm_header_t. A web search for I suggest you put a hardware breakpoint on write to the grub_mm_header_t.magic address and see if you can figure out what is writing to it. You may need to find a qemu cpu emulation that supports hardware write breakpoints. |
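To illustrate the kind of check that produces a "magic is broken" failure, here is a minimal sketch of a header-tagging scheme. It is loosely modeled on the idea behind grub_mm_header_t, but the struct layout, the `sketch_*` names, and the magic constants are all hypothetical and are not taken from GRUB's source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag values; GRUB defines its own constants. */
#define SKETCH_FREE_MAGIC  0x46524545u
#define SKETCH_ALLOC_MAGIC 0x414c4c43u

/* Hypothetical, simplified block header. */
struct sketch_header {
    struct sketch_header *next; /* next block in the free list */
    size_t size;                /* block size, in allocator units */
    uint32_t magic;             /* tags the block as free or allocated */
};

enum sketch_status { SKETCH_OK, SKETCH_FREE_BROKEN, SKETCH_ALLOC_BROKEN };

/* Walk a NULL-terminated free list; every node must carry the free tag. */
enum sketch_status sketch_check_free_list(const struct sketch_header *head)
{
    for (const struct sketch_header *p = head; p != NULL; p = p->next) {
        if (p->magic != SKETCH_FREE_MAGIC)
            return SKETCH_FREE_BROKEN;
    }
    return SKETCH_OK;
}

/* A block handed back to the allocator must carry the allocated tag. */
enum sketch_status sketch_check_allocated(const struct sketch_header *h)
{
    return h->magic == SKETCH_ALLOC_MAGIC ? SKETCH_OK : SKETCH_ALLOC_BROKEN;
}
```

Under this scheme any stray write that lands on a header's `magic` field is only noticed the next time the allocator touches that block, which is why a write watchpoint on the field is a more direct way to catch the offending code than waiting for the check to fail.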
Thanks for the suggestion, I'll give that a try after the holidays. |
This is currently affecting Container Linux 1688.4.0 Stable. If you encounter it, revert to 1632.3.0 Stable. We are working to release a new stable version that works around this issue. |
Are the update servers disabled because of this by any chance? I created a new CL VM today from an older image and trying to force an update always results in the following:
On CL 1520.8.0. |
Hopefully will work around coreos/bugs#2284
For updates on the status of this bug in the stable channel, see the coreos-user thread. @andor44 Updates are now re-enabled and pointing at 1632.3.0. Machines that are already running 1688.4.0 will not automatically downgrade. |
Some info for debugging:
Getting symbols from modules:
|
Update: Stable 1688.5.3 is rolling out now with a working build. We still need to find and address the underlying cause. |
I've added some code to get primitive stacktraces in gdb. The issue appears to be that both gdb and qemu think they're not in protected mode, so they do things like backtraces wrong with 8 byte addresses. The backtrace code is based on https://stackoverflow.com/questions/24160995/gdb-backtrace-by-walking-frame-pointers https://gist.github.com/ajeddeloh/9b74fe9527afa614506c25f0442b056f |
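For readers unfamiliar with the technique, walking frame pointers can be sketched roughly as follows. This is not the linked gdb script; it is a hypothetical C illustration that operates on a captured memory image, and it assumes 32-bit x86 code built with frame pointers, where the word at `ebp` is the caller's saved `ebp` and the word at `ebp + 4` is the return address. The function name and parameters are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Collect return addresses by following saved frame pointers.
 * `mem` holds 32-bit words of captured stack memory, and `base` is the
 * address that mem[0] corresponds to. Returns the number of frames found.
 */
size_t sketch_backtrace(const uint32_t *mem, size_t words, uint32_t base,
                        uint32_t ebp, uint32_t *out, size_t max_frames)
{
    size_t n = 0;
    while (n < max_frames) {
        /* Stop on a null, unaligned, or out-of-range frame pointer. */
        if (ebp == 0 || ebp < base || (ebp - base) % 4 != 0)
            break;
        size_t idx = (ebp - base) / 4;
        if (idx + 1 >= words)
            break;
        out[n++] = mem[idx + 1];  /* return address of this frame */
        uint32_t next = mem[idx]; /* caller's saved frame pointer */
        if (next <= ebp)          /* callers live at higher addresses */
            break;
        ebp = next;
    }
    return n;
}
```

The `next <= ebp` check both terminates the walk at the outermost frame (saved pointer of zero) and guards against loops if the stack itself has been corrupted.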
The FAT32 requirement is only for EFI systems though, not BIOS right? FWIW for Fedora derivatives we currently default to either ext4 or XFS for Although this seems contradicted by:
|
@cgwalters grub still accesses |
While Flatcar Linux stable v1688.4.0 was being released, I became interested in this issue, so I tried to reproduce the bug with a Flatcar qemu image, following the steps described in this issue. Basically I ran However, I cannot reproduce the issue. I'm not seeing the message like I wondered if I needed to disable KVM acceleration as mentioned in the comment, but that was not it either: even after disabling KVM acceleration, I still cannot reproduce it. So I think there must be other factors that trigger this bug, so that it does not always happen. Or am I missing something? |
@dongsupark a given image will succeed or fail deterministically, but it is not yet clear what makes a given image succeed or fail. Your current Flatcar release may be in the non-failing set now, but later releases may (or may not) experience the same bug. For Container Linux this was quite sporadic until last month, when it started to affect most of the builds. |
@ajeddeloh I still haven't set up a dev env to reproduce this issue, but I want to share some thoughts. I'm also not that familiar with Container Linux, so please forgive me if some of my assumptions are wrong.
|
Yup, investigating that now. I've got some macros that emulate the functionality of the
I can't get a failing build locally, even checking out the
I (finally) got gdb + qemu working together properly and that |
I don't know how much it will help, but I've cherry-picked some patches to do better built-in backtraces into a branch based on FSF's current master branch, here: https://github.com/vathpela/grub2-fedora/tree/backtrace . It also adds a patch so that if you're running with debug=gdb it'll print the .text and .data addresses for kernel.exec and all the modules it loads, so you can use gdb with qemu -s and use the messages it prints to load local debuginfo from the build tree. (I haven't tested this branch, just cherry picked from a branch where this all works.) |
@lucab We've started to see this in subsequent builds of Flatcar. We're looking into it as well now. |
@vathpela thanks, at this point gdb+qemu is working pretty well. That script I linked earlier breaks on module load and loads the debug info accordingly. @blixtra I'm assuming flatcar's tooling is pretty similar to ours. If you modify the
Edit: |
@ajeddeloh I have used your GDB script but changed the watchpoint to This stack is the last stack before I get the "broken magic" error messages:
In frame 18, I see:
Could it be that it is reading from disk and writing the buffer to
aligned_buf could be allocated from |
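The hypothesis in this comment, that a disk read into a heap buffer ends up writing over a neighbouring allocator header, can be illustrated with a small simulation. This is only a sketch of the failure mode and not GRUB code: the arena layout, the `sketch_*` names, and the magic constant are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAGIC   0x46524545u /* hypothetical header tag */
#define SKETCH_PAYLOAD 16          /* bytes reserved for the read buffer */

/* A buffer immediately followed by the magic word of the next block's header. */
struct sketch_arena {
    uint8_t bytes[SKETCH_PAYLOAD + sizeof(uint32_t)];
};

void sketch_arena_init(struct sketch_arena *a)
{
    uint32_t magic = SKETCH_MAGIC;
    memset(a->bytes, 0, sizeof a->bytes);
    memcpy(a->bytes + SKETCH_PAYLOAD, &magic, sizeof magic);
}

/*
 * Simulate a read that copies `len` bytes into the buffer without checking
 * it against SKETCH_PAYLOAD. The length is clamped to the arena size only
 * so that the sketch itself stays well-defined.
 */
void sketch_read_into_buffer(struct sketch_arena *a, const uint8_t *src,
                             size_t len)
{
    if (len > sizeof a->bytes)
        len = sizeof a->bytes;
    memcpy(a->bytes, src, len);
}

/* Returns 1 if the neighbouring header's magic word is still intact. */
int sketch_magic_intact(const struct sketch_arena *a)
{
    uint32_t magic;
    memcpy(&magic, a->bytes + SKETCH_PAYLOAD, sizeof magic);
    return magic == SKETCH_MAGIC;
}
```

A read of exactly `SKETCH_PAYLOAD` bytes leaves the tag alone, while anything longer, or a read aimed at the wrong destination, overwrites it with file data, which would be consistent with the seemingly random values reported in the error message.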
@alban is that on an EFI system? |
Seemingly related coreos/grub#53 |
I manually went in and fixed the dest (both in gdb and by hex-editing |
@ajeddeloh I don't know if it is EFI or not (I just used an equivalent of |
Closed via coreos/coreos-overlay#3166. This will be in the next alpha (released this week). Unfortunately since we cannot update grub, reprovisioning will be necessary to pick up this change. |
While reprovisioning is necessary to pick up the code change involved here, we haven't observed a case of an update triggering this error, so we currently think that users will not need to reprovision. If you do encounter the We'll continue to investigate the full impact in #2400. |
Issue Report
Bug
Container Linux Version
1618.0.0 and newer
Environment
qemu, possibly others
Expected Behavior
Appending to the kernel command line via editing the grub menu does not prevent the system from booting.
Actual Behavior
Appending to the kernel command line prevents the system from booting. grub exits with the error message:
free magic is broken at 0x3cec8166: 0xcfc53e26
(actual addresses vary)
Reproduction Steps
coreos_production_qemu.sh script
^X to boot
free magic is broken at... message
Other Information
This was introduced with changing /boot to fat32. It may be a bug in our image creation or in grub itself.
It is present with our current alpha images, and in master. Sometimes the error message is repeated, especially when building off master, although that is probably coincidence.