Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

systemd unit ldconfig.service fails to start #1493

Open
ader1990 opened this issue Jul 4, 2024 · 6 comments
Open

systemd unit ldconfig.service fails to start #1493

ader1990 opened this issue Jul 4, 2024 · 6 comments
Labels
kind/bug Something isn't working

Comments

@ader1990
Copy link

ader1990 commented Jul 4, 2024

Description

systemd unit ldconfig.service fails to start sometimes on the first boot or the subsequent reboots, but it cannot be reproduced reliably.
ldconfig.service runs before the switchroot during the initrd stage and is more cumbersome to properly reproduce it. [UPDATE] ldconfig.service runs after switchroot.

This is an issue that I have seen in the wild for a while, usually after rebooting a Flatcar instance on ARM64. From what I know, this issue does not affect the functionality of the Flatcar instance.

This issue has made a recurrence recently in the Github Actions. From the GitHub Actions:

  L1: "  "
    L2: " Error: _raid.go:245: could not reboot machine: machine __a8565d1f-ee8d-4220-9dbb-92c8f030eefb__ failed basic checks: some systemd units failed:"
    L3: "??? ldconfig.service loaded failed failed Rebuild Dynamic Linker Cache"
    L4: "status: "
    L5: "journal:-- No entries --"
    L6: "harness.go:593: Found systemd unit failed to start (?[0;1;39mldconfig.service?[0m - Rebuild Dynamic Linker Cache.  ) on machine a8565d1f-ee8d-4220-9dbb-92c8f030eefb console_"
    L7: " "

Debugging this issue, I could get a warning message by running manually ldconfig -X, which had exit code 0 and error message: Message: /lib/ld.so.conf is not an ELF file - it has the wrong magic bytes ....
This issue might be due to the wrong path for the /lib/ld.so.conf and it might be related to this commit: flatcar/scripts@ba45a2b.
I do not know yet if the warning messsage and the systemd unit failure are related, but maybe the warning message sometimes is seen as error output and thus fails the unit.

The definition of the systemd unit:

systemctl cat ldconfig.service
# /usr/lib/systemd/system/ldconfig.service
#  SPDX-License-Identifier: LGPL-2.1-or-later
#
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

[Unit]
Description=Rebuild Dynamic Linker Cache
Documentation=man:ldconfig(8)

ConditionNeedsUpdate=|/etc
ConditionFileNotEmpty=|!/etc/ld.so.cache

DefaultDependencies=no
After=local-fs.target
Before=sysinit.target systemd-update-done.service
Conflicts=shutdown.target initrd-switch-root.target
Before=shutdown.target initrd-switch-root.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/ldconfig -X

Impact

The test framework needs to re-run the test, sometimes 3 or 4 times. In real world scenarios, because of the systemd unit failure, automation might break or sanitiy check tools might flag this issue.

Environment and steps to reproduce

Example test run that had to retry some of the Mantle tests:
https://github.com/flatcar/scripts/actions/runs/9777950641?pr=2089

Expected behavior

systemd unit ldconfig.service should not fail.

@ader1990 ader1990 added the kind/bug Something isn't working label Jul 4, 2024
@tormath1
Copy link
Contributor

tormath1 commented Jul 4, 2024

I had this issue once in a VM and I suspected a memory issue.

@ader1990
Copy link
Author

ader1990 commented Jul 5, 2024

I could reliably reproduce the issue by bisecting the RAM size of a qemu-arm64 vm on arm64. 256MB of RAM, the system OOMs and does not boot. 512 MB of RAM, the system runs fine. But on 384 MB of RAM, I got this in the journalctl logs:

Jul 05 09:22:09 localhost systemd-coredump[1517]: Process 1378 (ldconfig) of user 0 dumped core.

                                                  Stack trace of thread 1378:
                                                  #0  0x0000ffff8f15331c n/a (n/a + 0x0)
                                                  #1  0x0000ffff8f153308 n/a (n/a + 0x0)
                                                  #2  0x0000ffff8f145610 n/a (n/a + 0x0)
                                                  #3  0x0000ffff8f137f18 n/a (n/a + 0x0)
                                                  #4  0x0000ffff8f13ac08 n/a (n/a + 0x0)
                                                  #5  0x0000ffff8f13bf9c n/a (n/a + 0x0)
                                                  #6  0x0000ffff8f138b40 n/a (n/a + 0x0)
                                                  #7  0x0000ffff8f13dfc0 n/a (n/a + 0x0)
                                                  #8  0x0000ffff8f13e2e4 n/a (n/a + 0x0)
                                                  #9  0x0000ffff8f1391b0 n/a (n/a + 0x0)
                                                  ELF object binary architecture: AARCH64
Jul 05 09:22:09 localhost systemd[1]: Started systemd-userdbd.service - User Database Manager.
Jul 05 09:22:09 localhost systemd[1]: ldconfig.service: Main process exited, code=dumped, status=6/ABRT
Jul 05 09:22:09 localhost systemd[1]: ldconfig.service: Failed with result 'core-dump'.
Jul 05 09:22:09 localhost systemd[1]: Failed to start ldconfig.service - Rebuild Dynamic Linker Cache.

ader1990 added a commit to flatcar/mantle that referenced this issue Jul 5, 2024
ader1990 added a commit to flatcar/mantle that referenced this issue Jul 5, 2024
@jepio
Copy link
Member

jepio commented Jul 5, 2024

ldconfig runs after switch-root:

$ journalctl -b0 | grep -Ee 'switch-root|ldconfig'
Jul 05 09:50:26 localhost systemd[1]: Reached target initrd-switch-root.target - Switch Root.
Jul 05 09:50:26 localhost systemd[1]: Starting initrd-switch-root.service - Switch Root...
Jul 05 09:50:27 localhost systemd[1]: initrd-switch-root.service: Deactivated successfully.
Jul 05 09:50:27 localhost systemd[1]: Stopped initrd-switch-root.service - Switch Root.
Jul 05 09:50:27 localhost systemd[1]: Stopped target initrd-switch-root.target - Switch Root.
Jul 05 09:50:27 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=initrd-switch-root comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 05 09:50:27 localhost audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=initrd-switch-root comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 05 09:50:28 localhost systemd[1]: Starting ldconfig.service - Rebuild Dynamic Linker Cache...
Jul 05 09:50:29 localhost ldconfig[1230]: /sbin/ldconfig: /lib/ld.so.conf is not an ELF file - it has the wrong magic bytes at the start.
Jul 05 09:50:29 localhost systemd[1]: Finished ldconfig.service - Rebuild Dynamic Linker Cache.
Jul 05 09:50:29 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='unit=ldconfig comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

The segfault due to low memory may be different than other failure modes. I debugged the coredump for an occurance of this in tests once, and saw that it had to do with parsing /var/cache/ldconfig/aux-cache - something about a mismatch between device major/minor + inode in cache and on disk. I couldn't make sense of this any further at the time.

@ader1990
Copy link
Author

ader1990 commented Jul 5, 2024

plot1
@jepio I added the systemd-analyze plot to see the proper picture with ldconfig starting after switch-root. I do not understand why this condition is also present in the ldconfig.service definition: Before=shutdown.target initrd-switch-root.target.

@ader1990
Copy link
Author

ader1990 commented Jul 5, 2024

Added a png for the systemd plot, in case the svg does not work.
plot1

@chewi
Copy link
Contributor

chewi commented Aug 20, 2024

I doubt this is memory-related. I just saw this on a local 2GB QEMU instance. It seems random, it worked many other times.

I think the /lib/ld.so.conf warning is unrelated. I had previously considered this an odd location to put this. I thought about suggesting /usr/etc/ld.so.conf instead as the bfd linker uses this by default when building, falling back to /etc/ld.so.confg.

krnowak added a commit to flatcar/scripts that referenced this issue Oct 16, 2024
This is in hope to either fix or pinpoint the cause of
flatcar/Flatcar#1493.
krnowak added a commit to flatcar/scripts that referenced this issue Oct 16, 2024
This is in hope to either fix or pinpoint the cause of
flatcar/Flatcar#1493.
krnowak added a commit to flatcar/scripts that referenced this issue Oct 16, 2024
This is in hope to either fix or pinpoint the cause of
flatcar/Flatcar#1493.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working
Projects
Status: 📝 Needs Triage
Development

No branches or pull requests

4 participants