Release Flatcar Container Linux Alpha 3480.0.0, Beta 3446.1.0, Stable 3374.2.2 #946

Closed
23 tasks done
sayanchowdhury opened this issue Jan 11, 2023 · 17 comments · Fixed by flatcar/baselayout#26

Comments

@sayanchowdhury
Member

sayanchowdhury commented Jan 11, 2023

Release Flatcar Container Linux Alpha 3480.0.0, Beta 3446.1.0, Stable 3374.2.2

The release of Flatcar Container Linux Alpha 3480.0.0, Beta 3446.1.0, Stable 3374.2.2 is planned for January 11th, 2023

Tasks

@dongsupark dongsupark changed the title Release Flatcar Container Linux Alpha 3480.0.0, Beta 3446.1.0, Stable 3374.2.2, LTS 3033.3.9 Release Flatcar Container Linux Alpha 3480.0.0, Beta 3446.1.0, Stable 3374.2.2 Jan 11, 2023
@schweinchendick

Alpha 3480.0.0 currently causes an install boot loop.
Coming from 3446.0.0, locksmithd issues a reboot and the machine tries to install the new system. It then fails and restores the previous alpha.
Investigation is ongoing, but this has worked for years.

@schweinchendick

Switching the machine to the beta channel: same loop.
Investigation is currently impossible because the machine writes no logs during the update, only in the "real" system.
More investigation will follow on Sunday during the planned maintenance window.

@jepio
Member

jepio commented Jan 12, 2023

Thanks for testing alpha & beta. We're trying to repro but so far this seems isolated. Any chance you see something happening on the serial console that would hint at what is going wrong?

@schweinchendick

I cannot (currently). On Sunday it will be no problem.
The machine is from around 2016 (the CoreOS days) and has been automatically migrated since then. Unverified assumption: the boot partition is too small. Currently 11MB are left. I'm running more tests on fresh installs, but it is too early to confirm anything.
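
To put a number on the free-space assumption above, a minimal sketch follows. The helper name `free_mib` is hypothetical, and `/boot` as the mount point is taken from later comments in this thread; adjust the path if the partition is mounted elsewhere.

```shell
#!/bin/sh
# free_mib PATH: print the free space, in whole MiB, of the filesystem
# that contains PATH. Uses the POSIX output format of df (-P) with
# 1024-byte blocks (-k), so the fourth column is "available" in KiB.
free_mib() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}
```

Calling `free_mib /boot` on the affected machine would show how much room is left for a new kernel image. The sketch only reports the number; it does not decide whether the space is sufficient.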

@pothos
Member

pothos commented Jan 12, 2023

Are you able to briefly log in with SSH? It would be great to get the journalctl output.
Yes, the boot partition problem has been on the radar for some time and I'll look into freeing up some larger chunks.

@schweinchendick

I can log in as long and as often as you like, but can't afford long reboots and failing installs for now. But as mentioned above there is no journalctl output for the time period in which the (failing) update takes place. The last entries are "systemd[1]: Reached target System Reboot. (...) systemd-journald[694]: Journal stopped", then silence, and after ~10 minutes "kernel: Linux version 5.15.81-flatcar (...)".

@pothos
Member

pothos commented Jan 12, 2023

Thanks for the quick response. So it doesn't seem to boot. The output of journalctl -u update-engine --no-pager would be good, to see if anything suspicious happened during download/installation of the new update.

@pothos
Member

pothos commented Jan 12, 2023

I wonder if it could be related to the change about running systemd-tmpfiles and flatcar-tmpfiles in the initrd. Sidenote: Currently it seems to be a mix of gracefully continuing and hard boot failures, depending on the type of error the root setup script encounters. Both approaches have pros and cons…

@schweinchendick

The logs since July 2022 always show "Update successfully applied, waiting to reboot.", even for this release. The only difference, and a meaningless one, is the ever-increasing "Payload Attempt Number = 61" for this release. But as I have stopped the update-engine it does not try again. Resuming on Sunday...

On fresh installs the update works flawlessly. It must be something isolated which I cannot pin down at the moment. I will report back as soon as I have full control over the machine.

@schweinchendick

We recorded a video of the boot sequence, available at https://lohmann.ml/sl/hausbesuch-wahnsinnig
The video starts with the shutdown sequence after successful installation. Then the machine boots right into the emergency shell without any visible errors, and after the timeout expires it just reboots into the old system.

During boot one can see the grub menu containing "CoreOS" entries, as this is a former CoreOS machine. Maybe this could be a problem, as the /boot partition is not upgraded at all. It still has all the files from 2016 when the machine was installed. I have never seen a CoreOS or Flatcar machine upgrading its /boot partition. Shouldn't that be done sometimes?

If you want I can try to provide an image of that machine without the confidential docker images and volumes for further analysis.

@jepio
Member

jepio commented Jan 16, 2023

Can you press enter on that console, type journalctl and find out what happens after the rootfs is mounted (about 2.0 seconds into the boot in that video)? That's when the system decides to not proceed with the boot and fail in the initrd.

A snapshot of that vm without any of the docker images/volumes would work too, but we would need all other state in the root filesystem.

@schweinchendick

Hitting Enter does not result in a real emergency shell. The system then enters a 15-second boot loop which can only be resolved by a (virtual) reset.
Therefore here is the machine disk image without the docker images and volumes: https://lohmann.ml/sl/gesamtjahr-rechtslage
Volume A contains the "good" system and Volume B contains this release, which shows the odd behavior.

@pothos
Member

pothos commented Jan 16, 2023

Thanks a lot!
Hitting Enter works for me; maybe it depends on the console setup. I used QEMU via serial console with console=ttyS0.
I have to apologize: my change to run systemd-tmpfiles not only on first boot introduced the regression, because it can't resolve the core group for some reason. Thanks a lot for reporting so quickly and preventing it from hitting Stable.

[    2.226858] systemd[1]: Starting Root filesystem setup...
[    2.243882] systemd-tmpfiles[503]: /sysroot/usr/lib/tmpfiles.d/baselayout-home.conf:1: Failed to resolve group 'core'.
[    2.245208] systemd-tmpfiles[503]: /sysroot/usr/lib/tmpfiles.d/baselayout-home.conf:2: Failed to resolve group 'core'.
[    2.246496] systemd-tmpfiles[503]: /sysroot/usr/lib/tmpfiles.d/baselayout-home.conf:3: Failed to resolve group 'core'.
[    2.247895] systemd-tmpfiles[503]: /sysroot/usr/lib/tmpfiles.d/baselayout-home.conf:4: Failed to resolve group 'core'.
[    2.249156] systemd-tmpfiles[503]: /sysroot/usr/lib/tmpfiles.d/baselayout-home.conf:5: Failed to resolve group 'core'.
[    2.250974] systemd[1]: initrd-setup-root.service: Main process exited, code=exited, status=65/DATAERR
[FAILED] Failed to start Root filesystem setup.
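
When the log is longer than this excerpt, the unresolved group names can be pulled out mechanically. The following is a minimal sketch, assuming the message wording shown in the lines above; the function name `unresolved_groups` is hypothetical, and `sed` and `sort` may not be present in every initrd environment.

```shell
#!/bin/sh
# unresolved_groups: read log text on stdin and print each distinct
# group name that appears in a "Failed to resolve group '<name>'" line.
unresolved_groups() {
    sed -n "s/.*Failed to resolve group '\([^']*\)'.*/\1/p" | sort -u
}
```

A possible invocation is `journalctl -b --no-pager | unresolved_groups`, or piping in a saved copy of the serial console output.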

@pothos
Member

pothos commented Jan 16, 2023

As a workaround you can do echo core:x:500: | sudo tee -a /etc/group once and then the update will boot.
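
Because the workaround is meant to be run only once, a guarded variant may be safer on machines where it is unclear whether the entry already exists. This is a minimal sketch; the function name `ensure_group` is hypothetical, and it assumes the target file ends with a newline.

```shell
#!/bin/sh
# ensure_group FILE NAME GID: append "NAME:x:GID:" to FILE unless a
# group called NAME is already defined there. Running it repeatedly
# leaves a single entry.
ensure_group() {
    file=$1
    name=$2
    gid=$3
    if grep -q "^${name}:" "$file"; then
        return 0
    fi
    printf '%s:x:%s:\n' "$name" "$gid" >> "$file"
}
```

Run as root, `ensure_group /etc/group core 500` would have the same effect as the one-liner above on a machine that lacks the entry, and would do nothing otherwise.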

I guess we could fix https://github.com/flatcar/baselayout/blob/flatcar-master/scripts/flatcar-tmpfiles to also copy the list of wanted users over if they don't exist, not only if the file doesn't exist.
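
The idea of copying over wanted entries that do not exist yet could look roughly like the sketch below. It is not the actual flatcar-tmpfiles code: the function name `merge_missing_entries` and both file arguments are hypothetical, and it assumes the target file already exists (the missing-file case is the one the script is described as handling today).

```shell
#!/bin/sh
# merge_missing_entries BASELINE TARGET: append to TARGET every line of
# BASELINE whose name (first colon-separated field) is not yet present
# in TARGET. Existing entries in TARGET are never modified.
merge_missing_entries() {
    baseline=$1
    target=$2
    # Collect the missing lines first, so TARGET is fully read before
    # anything is appended to it.
    missing=$(awk -F: '
        FILENAME == ARGV[1] { seen[$1] = 1; next }   # names in TARGET
        $1 != "" && !($1 in seen) { print; seen[$1] = 1 }
    ' "$target" "$baseline") || return 1
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing" >> "$target"
    fi
}
```

Matching on the name only means that a locally changed GID or member list is kept as-is, which seems the safer choice for long-lived machines like the one in this report.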

@schweinchendick

Actually I don't need a workaround right now. I would rather see it fixed in the next release so that everybody benefits. For me there is no real need to switch to this release. I'll skip it until the underlying issue is fixed.

@schweinchendick

On the cloned machine the workaround solved the problem. On the real machine I have halted the update-engine service and am waiting for a new release.


4 participants