privileged podman ps broken after reboot #22159
Assigned to @giuseppe as he wrote the original code ... mind taking a look? If you don't have time, unassign yourself.
Actually, it looks like a PR was provided: #22160
I don't disagree with your patch, but I see you discussing how reboots are breaking Podman, and that absolutely should not be a problem.
Right, as best I can tell, the patch I made is probably only for the case of … To clarify: a reboot appears to cause the issue in my environment because systemd is going wild with SIGKILL on my containers.

I did some more digging. The error message I see appears to come from … I do see that this has been changed by this PR and is in v4.4 and newer. If I'm reading it right, the linked PR says:
I've figured out how to make systemd stop SIGKILL-ing my containers: add … I guess I'll still get into this situation if my …

Also, for what it's worth, this error doesn't prevent launching/inspecting new containers, just the ps functionality, but that makes it hard to know what you have running.

Given that this happens on Ubuntu 22.04, that the distro-released version of podman available there is v3.4.4 (which I understand is rather old at this point), and that there appear to have been at least some PRs/changes to the codepath resulting in the error I'm seeing, I'm ok if you want to close this issue. I also don't have a newer environment where I can test with the latest podman to try to reproduce this error, nor an easy way to get a newer version of podman onto my older hosts, so I think this is as far as I can get debugging/troubleshooting it.

A couple of final questions, just to make sure I didn't miss something obvious:
We should be wiping container state on a reboot, though? Systemd SIGKILL shouldn't matter; when we detect a reboot, we should reset container state to a sane value automatically, unless podman's tmpdir is not a tmpfs.
Sync is basically an escape hatch for things having gone really wrong. It should not be mandatory at all. I'll defer to @giuseppe on the Delegate question.
I did find some other discussion here on GitHub before opening this issue, and confirmed that the runroot is on tmpfs (…).
For this, we actually care about tmpdir - e.g. from …

The fourth line, tmp dir, is presently pointing at a completely nonsensical path. We're discussing improving this behavior a lot in #22141 (actively reporting bad configs to users in 5.0, and refusing to even start if an unhandled reboot is detected in 5.1), which should alleviate this for good in the future.
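One quick way to check this on an affected host is the sketch below; it assumes the stock containers.conf default of `/run/libpod` for the engine `tmp_dir`, so adjust the path if your configuration overrides it:

```console
# Show which filesystem backs podman's tmp dir; FSTYPE should be tmpfs.
# /run/libpod is the containers.conf [engine] tmp_dir default, assumed here.
$ findmnt --target /run/libpod
```

Roughly speaking, podman notices a reboot by the disappearance of state it keeps under that directory, so if it lives on a persistent filesystem the post-reboot refresh never happens.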
Ah, I have a …
Full `podman ps --log-level=trace --sync` output
it creates two sub-cgroups in the current cgroup. Delegate is necessary because it tells systemd that the service can modify cgroups, as is the case with Podman when using …
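To make the Delegate suggestion concrete, a drop-in along these lines is one way to set it (an illustrative sketch only: the drop-in path is an example, and `podman-unit.service` is just the unit name used in the reproduction steps below):

```ini
# /etc/systemd/system/podman-unit.service.d/delegate.conf  (example path)
# Delegate=yes tells systemd that this service manages its own sub-cgroups,
# so systemd leaves the cgroups Podman/crun create under the unit alone.
[Service]
Delegate=yes
```

A `systemctl daemon-reload` and a restart of the unit are needed before a drop-in like this takes effect.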
The code in question no longer exists, and 3.4.4 is way too old for us to support, so closing.
Issue Description
A running container killed with SIGKILL, and with the network namespace/tmpfs removed (say, because of a reboot), results in podman getting stuck in a broken state. In our environment, this seems to reliably happen after rebooting with a container running (systemd seems to SIGKILL our containers, and the reboot cleans up the network namespaces/tmpfs state files). After the host is in a broken state, running `podman ps --sync` or similar commands results in an error like `error joining network namespace for container CONTAINERID`. Additionally, `podman inspect` reports that the container is still running. I would expect podman to discover this and recover in a sane manner; currently the only fix we have is to run `podman rm -f <broken container id>`.
Some details that might be relevant/differ from stock deployment:

- A newer crun (crun version 1.14.1, commit: de537a7965bfbe9992e2cfae0baeb56a08128171), because we run a 6.5.x kernel and the stock crun is broken with regards to symlinks.

I believe I tracked down (at least part of) the underlying problem, and it appears to still be present in `main`. Even if it's not my specific problem, it's definitely incorrect code and should be fixed. I am not yet able to compile and test a new version to confirm whether it fixes my issue. The issue is with this block of code:

podman/libpod/oci_conmon_common.go, lines 235 to 250 in 2aad385
The problem appears to be the assumption that `cmd.Start()` will return a non-nil error if the program exits abnormally with some stderr output, which is what `cmd.Run()` will do, but `cmd.Start()` never will. `cmd.Start()` simply reports whether the program could be started, and it returns before the program has finished. So we never check the stderr output and update the container to exited if it's actually dead; instead we charge ahead, leading to the failure we see later.

c554672 is where this error was introduced (~6 years ago).
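As a standalone sketch of that difference (not Podman code; the shell command and messages are made up purely to illustrate the os/exec behavior):

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

func main() {
	// A command that launches fine but then exits non-zero with stderr output,
	// standing in for a runtime invocation that dies after startup.
	cmd := exec.Command("sh", "-c", "echo boom >&2; exit 1")
	var stderr bytes.Buffer
	cmd.Stderr = &stderr

	// Start only reports whether the process could be launched; it returns
	// immediately and says nothing about how the process eventually exits.
	fmt.Println("Start error:", cmd.Start()) // prints: Start error: <nil>

	// Only Wait (or Run, which is Start followed by Wait) surfaces the
	// non-zero exit status, and only after it returns is stderr complete.
	fmt.Println("Wait error:", cmd.Wait()) // prints: Wait error: exit status 1
	fmt.Println("stderr:", stderr.String())
}
```

In other words, anything that wants to react to the child's exit status and stderr has to do so after `Wait()`/`Run()`, not after `Start()`.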
For my broken containers, `crun state` returns "no such file or directory" for the status file.

Steps to reproduce the issue
1. Reboot the host with a container running (or `systemctl kill podman-unit`)
2. Run `podman ps` (or `podman ps --sync`)

I did try to reproduce this in an easier manner; the following steps resulted in the same brokenness (a consolidated shell transcript follows the list):
1. `sudo podman run --rm -it --name test-container docker.io/library/bash:latest`
2. `ps ax | grep podman`, then kill the `conmon` and `podman run` processes using `kill -9 <PIDS>`
3. `ip netns del cni-...` (look in /run/netns to figure out the name)
4. `podman ps`
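The same sequence as a rough transcript (a sketch only: the PIDs and the `cni-...` namespace name vary per run, so substitute what you actually see on your host):

```console
# Shell 1: start a container (same command as above)
$ sudo podman run --rm -it --name test-container docker.io/library/bash:latest

# Shell 2: find and SIGKILL the conmon and podman run processes
$ ps ax | grep podman
$ sudo kill -9 <conmon PID> <podman run PID>

# Remove the container's network namespace (name found under /run/netns)
$ ls /run/netns
$ sudo ip netns del cni-<id>

# podman ps now fails with "error joining network namespace for container ..."
$ sudo podman ps
```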
Describe the results you received
Error messages and podman thinking the container is still alive:
Describe the results you expected
Podman to correctly discover that the container is no longer running, update its status appropriately, and not return any error messages (or possibly return a non-fatal error message).
podman info output
Podman in a container
No
Privileged Or Rootless
Privileged
Upstream Latest Release
No
Additional environment details
Additional information