Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CoreOS won't start, ends with audit log type 1131 #483

Closed
AlexGustafsson opened this issue May 18, 2020 · 11 comments
Closed

CoreOS won't start, ends with audit log type 1131 #483

AlexGustafsson opened this issue May 18, 2020 · 11 comments

Comments

@AlexGustafsson
Copy link

Backstory

I've been running CoreOS for only a few weeks, it's mostly been working great, but a breaking issue has popped up every now and then.

I've installed CoreOS on bare metal on four different types of consumer grade hardware. During the installation, this issue came up a lot, causing me to reinstall the OS several times on different machine types.

Since the installation, it has occurred during reboot twice for two different machine types.

Issue description

  1. It happens during boot (first boot after install, or a regular reboot)
  2. The boot process seems to come a long way, but gets stuck at a log line which is always (mostly) the same
  3. The line may be repeated, but nothing else is ever shown
  4. If this is the first reboot, the machine will have to be reinstalled
  5. If it is a "regular" boot, the machine can be reinstalled and will boot again

Here is a somewhat difficult-to-read image of one occurrence of the issue:

audit

A transcription of the last log line:

audit: type=1131 audit(1588149656.742:68): pid=1 uid=0 auid=4294967295 ses=4294967295 msg=´unit=systemd-rfkill comm="systems" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success´

Here is another yet more difficult-to-read image of another occurrence on another machine:

audit2

A transcription of the last log lines:

audit: type=1131 audit(1589731456.755:68): pid=1 uid=0 auid=4294967295  ses=4294967295  subj=system_u:system_r:init_t:s0 msg=´unit=emergency comm="systems" comm="systems" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success´

audit: type=1131 audit(1589731461.010:69): pid=1 uid=0 auid=4294967295  ses=4294967295  subj=system_u:system_r:init_t:s0 msg=´unit=systemd-rfkill comm="systems" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success´

Output of sudo rpm-ostree status on one of the machines (identical setup):

$ sudo rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● ostree://fedora:fedora/x86_64/coreos/stable
                   Version: 31.20200420.3.0 (2020-05-06T16:19:21Z)
                BaseCommit: b3fc3a3e8513d7e424d0ced1e2517484cb766d238951f2fdec3da2fed3522efb
              GPGSignature: Valid signature by 7D22D5867F2A4236474BF7B850CB390B3C3359C4
           LayeredPackages: glusterfs-server

  ostree://fedora:fedora/x86_64/coreos/stable
                   Version: 31.20200420.3.0 (2020-05-06T16:19:21Z)
                BaseCommit: b3fc3a3e8513d7e424d0ced1e2517484cb766d238951f2fdec3da2fed3522efb
              GPGSignature: Valid signature by 7D22D5867F2A4236474BF7B850CB390B3C3359C4
           LayeredPackages: glusterfs
@lucab
Copy link
Contributor

lucab commented May 19, 2020

Thanks for the report! I have a feeling that your machine may be spitting more output on the primary console, which is not the one you are looking at.

Can you please try tweaking the kernel args in grub to only keep the VGA console (as described here coreos/fedora-coreos-docs#74), and report back if there are some useful details?

@lucab
Copy link
Contributor

lucab commented May 19, 2020

Additionally, it looks like this machine has some further customizations (I see multiple deployments and some layered packages).
Can you please attach the Ignition config being used (redacting any secrets first)?

@AlexGustafsson
Copy link
Author

The following ignition file was used:

variant: fcos
version: 1.0.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - "ssh-ed25519 AAA..."

The following commands were used to install the packages seen in the status report above:

sudo rpm-ostree install glusterfs
# Realizing it was the wrong package
sudo rpm-ostree uninstall glusterfs
# Installing the correct package
sudo rpm-ostree install glusterfs-server

These commands were run manually after some time. The issue still occurred during installation, but I'm pretty sure the second time it occurred were after rebooting after having installed the glusterfs-server package. One of the affected machines, however had restarted fine after installing it. So there might not be too much of a correlation.

As for the VGA console, I'll try to get back ASAP if I get more helpful output. During boot, there is a lot of output which ends with something along the line of "switching to colored terminal", then most of the output is removed, leaving either the prompt of the SSH keys used, or the screen shown in the images above.

@AlexGustafsson
Copy link
Author

There were some splitting going on, just as you mentioned. I was able to get into the emergency mode by only leaving console=tty0 as the console option in the grub entry.

There is one failed unit:

ostree-remount.service

The only error during boot, via journalctl -xb
ostree

The last lines of journalctl -xb
audit

It seems to me that my audit type=1131 is merely an affect of the mount error.

@AlexGustafsson
Copy link
Author

AlexGustafsson commented May 23, 2020

I'll add that this time it happened at least twice in a row while rebooting, first during a "normal" reboot (I've been running a script to randomly restart servers throughout the day in order to test this), and then when rebooting after having changed the grub setting. Rebooting once more let me into the system. Without warnings or errors.

Would it be sensible to mitigate this by editing the systemd unit for the ostree mount and add a restart if it fails, instead of the current behavior discussed here ostreedev/ostree#1697 (I think)?

@lucab
Copy link
Contributor

lucab commented May 23, 2020

Thanks for the followup! I think this is another symptom of the same mount-race issue at #488.
If that is true, next testing release should be not suffering this anymore.

@AlexGustafsson
Copy link
Author

@lucab Do you happen to know when the next testing release is scheduled to arrive?

@dustymabe
Copy link
Member

@lucab Do you happen to know when the next testing release is scheduled to arrive?

You can follow along at coreos/fedora-coreos-streams#108. Note that this is the first time we're planning to switch over to Fedora 32, so it will be a big update.

If you just want to test to see if the problem goes away you can use a development build from https://builds.coreos.fedoraproject.org/browser. The latest testing-devel is 31.20200521.20.0.

@jlebon
Copy link
Member

jlebon commented May 25, 2020

Yes, this is the same issue as #488. Let's close this one and track the fix there.

@jlebon jlebon closed this as completed May 25, 2020
@dustymabe
Copy link
Member

Yes, this is the same issue as #488. Let's close this one and track the fix there.

Assuming the above statement is correct, the change to correct this problem landed in coreos/fedora-coreos-config#416.

The fix for this went into testing stream release 32.20200601.2.1. Please try out the new release and report issues.

@dustymabe dustymabe added status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. and removed status/pending-testing-release Fixed upstream. Waiting on a testing release. labels Jun 4, 2020
@dustymabe
Copy link
Member

The fix for this went into stable stream release 32.20200601.3.0.

@dustymabe dustymabe removed the status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. label Jun 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants