Old bootloader versions don't boot new aarch64 6.2+ kernels #1441
Comments
Yeah. One option here I guess is to add cross-checks where at least rpm-ostree (or zincati) know how to query bootupd and block updates if it's too old. |
We found a last-minute issue on updating aarch64 nodes: coreos/fedora-coreos-tracker#1441 Let's cancel the rollout while we figure out how to address this.
I mean, ideally bootupd would update the bootloader automatically. How far are we from being able to do that? |
This is covered in https://github.com/coreos/bootupd/#questions-and-answers |
Well, it says "perhaps in the future bootupd will use some of those". If we prioritized doing the work, do we have the ability to implement it today, and how much effort would it be? |
Updating automatically is trivial, just a systemd unit that runs |
To flesh this out, this would be something like: attach metadata to each release about the "minimum bootloader version" and then enhance rpm-ostree to update the bootloader using bootupd if it detects that it's too old before finalizing an update? This is not something we want to do in a rush, but we may be in a situation where we can pin on an older kernel for now to give us time to do something like this. The risk I see is that we haven't been updating the bootloader at all so far, so now years of potential regressions could manifest all at once. Though that's a concern anyway whether users do it or we do it. And we know we want to eventually have automatic bootloader updates. |
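The "minimum bootloader version" idea above could be sketched roughly as follows. This is only an illustration: the `ReleaseMetadata` type, its `min_bootloader_version` field, and the step names are hypothetical, and the sketch assumes plain dotted-numeric versions rather than real RPM version strings. Neither rpm-ostree nor bootupd exposes anything like this today.

```python
from dataclasses import dataclass


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted numeric version such as '2.6' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class ReleaseMetadata:
    # Hypothetical per-release metadata; no such field exists today.
    version: str
    min_bootloader_version: str


def bootloader_too_old(installed: str, release: ReleaseMetadata) -> bool:
    """Return True when the installed bootloader predates the release's minimum."""
    return parse_version(installed) < parse_version(release.min_bootloader_version)


def plan_update(installed_bootloader: str, release: ReleaseMetadata) -> list[str]:
    """List the steps to take, in order, before an update is finalized."""
    steps = []
    if bootloader_too_old(installed_bootloader, release):
        # Assumption: the bootloader is refreshed first, then the update is finalized.
        steps.append("update-bootloader")
    steps.append("finalize-update")
    return steps
```

With this shape, a node whose bootloader is older than the release's minimum gets an extra step ahead of finalization, and an up-to-date node goes straight to finalizing.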
new investigation information: This is confirmed to be specific to the |
I was trying to gauge how far back you'd have to go to get a bootloader/EFI binary that wasn't compatible. In my random sampling here's what I found:
F36 is supposed to be able to update to F38 per Fedora policies so we should file a bug for that. Likely for the kernel. This also likely affects Silverblue/Kinoite/Sericea & IoT (and we don't have bootupd there yet unfortunately): fedora-silverblue/issue-tracker#120 |
Came up re coreos/fedora-coreos-tracker#1441 Co-authored-by: Benjamin Gilbert
I think this might not affect non-OSTree EFI systems because (IIUC) the bits get updated by the RPM on upgrade so we need to be careful about how we approach this. Justin Forbes is aware of the problem. It's also possible a fully updated F36 system can update to F38, but for FCOS we stopped building F36 when F37 came out.
Note that this is a limited failure scenario IIUC. The releases had to have been building and releasing for For IoT I sent an FYI email to their mailing list. |
The new 6.2 kernel can cause aarch64 systems originally installed on F36- to not boot. For now while we figure out the best path forward we'll ship the newest 6.1 kernel we can find, which just happens to be built against F37. See coreos/fedora-coreos-tracker#1441
Since this problem is introduced with 6.2 kernels my short term proposal for FCOS For our next steps, I discussed this briefly with @jlebon yesterday. Here are a few potential options for us:
Note that we only caught this in advance because Dusty had an aarch64 system that he manually updated before the rollout started. Thanks, Dusty, for doing that! ...but we shouldn't rely on it. How can we improve our upgrade testing to catch this case? |
I added myself as assignee for coreos/coreos-assembler#2519 to make it happen. |
Filed coreos/bootupd#440 |
Filed fedora-silverblue/issue-tracker#434 for Silverblue/Kinoite |
Also filed: coreos/bootupd#441 |
Some background info on the situation: The 6.2 aarch64 kernel is not compatible with the GRUB bootloader versions shipped with older releases of Fedora CoreOS. Because Fedora CoreOS does not routinely update GRUB, we must do so explicitly before switching to a 6.2 kernel. To do this, we'll ship a "barrier release" on each stream with a temporary systemd service that updates the bootloader. Our rollout system will force each node to install the barrier release before updating further, ensuring that all aarch64 machines update the bootloader. To reduce risk, non-aarch64 systems will not update the bootloader, but must still install the barrier release. However, there's a further complication for The |
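To illustrate how a barrier release forces every node through one particular version, here is a minimal sketch of the target selection. It is not the actual rollout implementation: `next_target` is a hypothetical helper, and apart from `37.20230322.2.0` the version numbers in the usage note are made up.

```python
from typing import Optional


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted release version into a tuple that sorts numerically."""
    return tuple(int(part) for part in text.split("."))


def next_target(current: str, releases: list[str], barriers: set[str]) -> Optional[str]:
    """Pick the next release a node is allowed to update to.

    A node has to land on the oldest barrier newer than its current
    version before going any further. With no barrier ahead, it may
    jump straight to the newest release. Returns None when up to date.
    """
    newer = sorted(
        (r for r in releases if parse_version(r) > parse_version(current)),
        key=parse_version,
    )
    if not newer:
        return None
    for release in newer:
        if release in barriers:
            return release
    return newer[-1]
```

For example, with releases `["37.20230303.2.0", "37.20230322.2.0", "38.20230414.2.0"]` and `37.20230322.2.0` marked as a barrier, a node on the first release is sent to the barrier rather than directly to the newest release. Only after installing the barrier (and running its temporary bootloader update service) can it move on.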
Since the barrier release for [1] shipped in 37.20230322.2.0 we can now unpin and ship the 6.2 kernel. [1] coreos/fedora-coreos-tracker#1441
The fix for this went into |
Checklist in #1441 (comment) complete. |
The fix for this went into |
It turns out that the fix doesn't work for systems with mirrored disks: #1485 |
On aarch64, kernel 6.2 won't boot with older versions of GRUB. In preparation for switching to the new kernel, add a systemd service that uses bootupd to update the bootloader on aarch64 systems. Revert this after the next barrier release. For coreos/fedora-coreos-tracker#1441.
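The architecture gate described in this commit message can be sketched as below. The real systemd unit is not shown in this thread, so the exact `bootupctl` subcommand it runs is an assumption here, and `bootloader_update_command` is a hypothetical helper that only builds the command without executing it.

```python
import platform


def bootloader_update_command(machine: str) -> list[str]:
    """Return the command to run for this architecture, or an empty list to skip."""
    if machine == "aarch64":
        # Assumption: the temporary service invokes bootupd's update subcommand.
        return ["bootupctl", "update"]
    # Other architectures install the release but leave the bootloader alone.
    return []


if __name__ == "__main__":
    command = bootloader_update_command(platform.machine())
    print(command if command else "bootloader update skipped on this architecture")
```

Keeping the decision in a function that takes the machine name as a parameter makes the gate easy to check without access to aarch64 hardware.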
This reverts commit 8ce6fd6. The testing-devel promotion with this has already happened in coreos#2315 and we have already shipped a `next` with the unit, so we can drop it now (before we execute the `next-devel`->`next` promotion) to prevent it from shipping in more than one release per stream. We can do this because we have barriers. Full context in coreos/fedora-coreos-tracker#1441
I just pro-actively updated my `t4g.medium` AWS instance to `38.20230310.1.0` and it didn't come back. Upon inspecting the serial console I see:
Pressing a key and selecting the older boot entry (thankfully I had console access) allowed me to re-connect with my system.
This system was provisioned a long time ago with `34.20210904.2.0` (`testing` stream; later moved over to the `next` stream to allow for earlier testing).
The problem here is that by default the bootloader on machines isn't updated so it keeps the one from when you first installed the machine. bootupd was created to solve this problem, but is still a work in progress so not widely used.
Here's what it shows on my system:
After updating the bootloader...
I am able to boot the system:
This is most likely due to recent changes for aarch64 kernels around EFI_ZBOOT, which we also think is the root cause for #1430.