podman rm: fails, and leaves everything hosed #15367
Here's a cannot-remove one, in remote ubuntu root, on August 2. Aha, and here it is in podman f36 root, confirming that it's not a remote-only bug.
Here's the
The
In runtime_ctr I see:
Could this be a situation where the c.lock is never Locked?
Here's one with possibly useful context (sys remote f36 rootless):
Then, immediately after that, kablooie:
Interesting. This one, sys remote ubuntu-2204 root, goes into the
Not sure if it's related at all, but I got a similar, likely storage-related, issue as well on openSUSE MicroOS:
adathor@vegas:~> podman pod --log-level debug rm -i -f hedgedoc
INFO[0000] podman filtering at log level debug
DEBU[0000] Called rm.PersistentPreRunE(podman pod --log-level debug rm -i -f hedgedoc)
DEBU[0000] Merged system config "/usr/share/containers/containers.conf"
DEBU[0000] Using conmon: "/usr/bin/conmon"
DEBU[0000] Initializing boltdb state at /home/podman_vol/home-vol/podman/containers/storage/libpod/bolt_state.db
DEBU[0000] systemd-logind: Unknown object '/'.
DEBU[0000] Using graph driver btrfs
DEBU[0000] Using graph root /home/podman_vol/home-vol/podman/containers/storage
DEBU[0000] Using run root /run/user/1000/containers
DEBU[0000] Using static dir /home/podman_vol/home-vol/podman/containers/storage/libpod
DEBU[0000] Using tmp dir /run/user/1000/libpod/tmp
DEBU[0000] Using volume path /home/podman_vol/home-vol/podman/containers/storage/volumes
DEBU[0000] Set libpod namespace to ""
DEBU[0000] [graphdriver] trying provided driver "btrfs"
DEBU[0000] Initializing event backend journald
DEBU[0000] Configured OCI runtime crun initialization failed: no valid executable found for OCI runtime crun: invalid argument
DEBU[0000] Configured OCI runtime runj initialization failed: no valid executable found for OCI runtime runj: invalid argument
DEBU[0000] Configured OCI runtime kata initialization failed: no valid executable found for OCI runtime kata: invalid argument
DEBU[0000] Configured OCI runtime runsc initialization failed: no valid executable found for OCI runtime runsc: invalid argument
DEBU[0000] Configured OCI runtime krun initialization failed: no valid executable found for OCI runtime krun: invalid argument
DEBU[0000] Using OCI runtime "/usr/bin/runc"
INFO[0000] Setting parallel job count to 13
DEBU[0000] Removing container 052bcd6e2d129159e7f8871310c06cc485a124bf52699db8ee41cb26e3c3e0fb
DEBU[0000] Cleaning up container 052bcd6e2d129159e7f8871310c06cc485a124bf52699db8ee41cb26e3c3e0fb
DEBU[0000] Network is already cleaned up, skipping...
DEBU[0000] Container 052bcd6e2d129159e7f8871310c06cc485a124bf52699db8ee41cb26e3c3e0fb storage is already unmounted, skipping...
DEBU[0000] Removing all exec sessions for container 052bcd6e2d129159e7f8871310c06cc485a124bf52699db8ee41cb26e3c3e0fb
DEBU[0000] Container 052bcd6e2d129159e7f8871310c06cc485a124bf52699db8ee41cb26e3c3e0fb storage is already unmounted, skipping...
INFO[0000] Storage for container 052bcd6e2d129159e7f8871310c06cc485a124bf52699db8ee41cb26e3c3e0fb already removed
DEBU[0000] Removing container 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b
DEBU[0000] Cleaning up container 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b
DEBU[0000] Failed to reset unit file: "Unit 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b.service not loaded."
DEBU[0000] Container 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b storage is already unmounted, skipping...
DEBU[0000] Removing all exec sessions for container 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b
DEBU[0000] Container 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b storage is already unmounted, skipping...
INFO[0000] Storage for container 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b already removed
ERRO[0000] Removing container 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b from pod ad97dd32896ba8af8f4a76144f9a6f6c26cab1b352ef47a11e91af48cbbdf2d4: error freeing lock for container 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b: no such file or directory
adathor@vegas:~> podman info
host:
  arch: amd64
  buildahVersion: 1.27.0
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.2-1.1.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.2, commit: unknown'
  cpuUtilization:
    idlePercent: 93.86
    systemPercent: 2.51
    userPercent: 3.64
  cpus: 4
  distribution:
    distribution: '"opensuse-microos"'
    version: "20220822"
  eventLogger: journald
  hostname: vegas
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.19.2-1-default
  linkmode: dynamic
  logDriver: journald
  memFree: 16723730432
  memTotal: 20513906688
  networkBackend: cni
  ociRuntime:
    name: runc
    package: runc-1.1.3-2.1.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.1.3
      commit: v1.1.3-0-ga916309fff0f
      spec: 1.0.2-dev
      go: go1.18.3
      libseccomp: 2.5.4
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.11-1.6.x86_64
    version: |-
      slirp4netns version 1.1.11
      commit: unknown
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.4
  swapFree: 0
  swapTotal: 0
  uptime: 0h 13m 0.00s
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.opensuse.org
  - registry.suse.com
  - docker.io
store:
  configFile: /home/adathor/.config/containers/storage.conf
  containerStore:
    number: 10
    paused: 0
    running: 7
    stopped: 3
  graphDriverName: btrfs
  graphOptions: {}
  graphRoot: /home/podman_vol/home-vol/podman/containers/storage
  graphRootAllocated: 10000830279680
  graphRootUsed: 247550713856
  graphStatus:
    Build Version: Btrfs v5.18.1
    Library Version: "102"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 28
  runRoot: /run/user/1000/containers
  volumePath: /home/podman_vol/home-vol/podman/containers/storage/volumes
version:
  APIVersion: 4.2.0
  Built: 1660176000
  BuiltTime: Thu Aug 11 07:00:00 2022
  GitCommit: ""
  GoVersion: go1.16.15
  Os: linux
  OsArch: linux/amd64
Here's another. @containers/podman-maintainers: since this is now in the wild, I'd say it's higher priority.
No
Here's another one with the
Interesting variation on remote f36-aarch64 root: "cannot remove container" but subsequent tests actually pass:
Another one like yesterday's, again on remote f36-aarch64 root, again just one
Here's the unlinkat/EBUSY one on f36 remote rootless
OK, f2f over; we need serious attention on this one.
f36 remote rootless again
Re-adding
Another one starting with
And here's another, maybe related, maybe not; it happens in
The symptoms look similar to #11594, don't they?
Sometimes, but not enough for me to combine the two issues. #11594 is a one-shot: it happens, and life goes on. This issue is a catastrophe: once it happens, podman never works again.
Potentially interesting correlation: #16154 (unmarshal blah blah, podman-remote only) seems to be highly correlated with this one. My gut feeling is that "unmarshal" is its own bug; that what's happening is that THIS issue (the forever-hose) is triggering, and sending an error that podman-remote cannot deal with. So the underlying error is actually this one. But please take this with a grain of salt.
This one is really hitting us hard. Last ten days:
Hard to know if this one (ubuntu root) is the hosed flake or the unlinkat/EBUSY one, but I'm going to add it here.
Another case where the first failure presents with "blah blah is mounted", then cascades from there into disaster:
Can we please, please get some attention on this? It's a bad one, flaking at least once per day. |
@containers/podman-maintainers PTAL
The issue is that the timesMounted, err := r.store.Mounted(ctr.ID) call below could be getting screwed up, or have a race condition.
It is possible to lose an unmount on a killed process. This counter is attempting to keep track of the number of podman mount XYZ invocations; after mounting three times, timesMounted would show 3. If for some reason we had a race where someone else was executing a mount while executing this code, it could cause issues. Or if a podman container cleanup failed and the number of mounts was not decremented.
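To make the failure mode above concrete, here is a minimal sketch of a mount reference counter (names and structure are illustrative, not c/storage's actual API). If a cleanup process is killed before it decrements the count, the counter stays above zero forever, and any later "is this still mounted?" check refuses to remove storage even though nothing is mounted:

```go
package main

import (
	"fmt"
	"sync"
)

// mountCounter loosely mimics a storage layer tracking how many times
// a container's filesystem has been mounted. Hypothetical type.
type mountCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

// Mount increments the reference count and returns the new value.
func (m *mountCounter) Mount(id string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.counts[id]++
	return m.counts[id]
}

// Unmount decrements the reference count; storage may only be removed
// once this reaches zero.
func (m *mountCounter) Unmount(id string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.counts[id] > 0 {
		m.counts[id]--
	}
	return m.counts[id]
}

func main() {
	mc := &mountCounter{counts: map[string]int{}}
	mc.Mount("ctr") // podman mount
	mc.Mount("ctr") // a second mount, e.g. from a cleanup process
	// If the second mounter is killed and never calls Unmount, the
	// count is stuck above zero: removal sees timesMounted > 0 and
	// fails, even though nothing is actually mounted anymore.
	fmt.Println(mc.counts["ctr"])
}
```

The same shape of bug also explains the "already unmounted, skipping" lines followed by a failed removal: the state machine and the counter are telling two different stories.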
Now seen in f37 gating tests, too |
...also can't remove a pod after (being lazy) attempting rm -f on a running pod.
@Johnpc123: to help you out, I was able to remove the pod that I borked like you did by removing it through the cockpit-podman web interface. It threw that error at me, but was able to remove it completely nonetheless, while
@mtrmac I think your recent c/storage fixes may have resolved this issue. It seems to happen only with remote, which somehow made me think about your in-process fixes. WDYT?
This is quite long, and might well mix different root causes:
In summary, nothing of the above is obviously related to the c/storage work I was doing. The container state cleanup (unlike layer state cleanup) is very minimal in c/storage. My hunch would be to follow the
Thanks for taking a look, @mtrmac!
@edsantiago fresh occurrences of the flake since #16886? |
@vrothberg it's hard to say: there have been a lot of flakes in the past few weeks, despite a low number of CI runs, and it'll take me half a day to evaluate and categorize them all. From a quick overview, though, I don't see the classic "everything hosed" signature. I'll keep my eyes open for it. PS: isn't today a holiday for you?
Thanks, Ed!
Not in France. But in the US, so please don't let me bother you today. There's plenty of time for that this year :) |
Here goes. I'm not 100% sure that these are the same bug, because they don't exhibit the "everything-hosed" symptom, but they do look like the same "podman rm fails" symptom. Is it possible that #16886 fixed the everything-hosed bug, but we're still seeing the "could not remove container, could not be stopped" bug?
The container lock is released before stopping/killing, which implies certain race conditions with, for instance, the cleanup process changing the container state to stopped, exited, or other states.

The (remaining) flakes seen in containers#16142 and containers#15367 strongly indicate a race in between the stopping/killing of a container and the cleanup process. To fix the flake, make sure to ignore invalid-state errors. An alternative fix would be to change `KillContainer` to not return such errors at all, but commit c77691f indicates an explicit desire to have these errors reported in the sig proxy.

[NO NEW TESTS NEEDED] as it's a race already covered by the system tests.

Fixes: containers#16142
Fixes: containers#15367
Signed-off-by: Valentin Rothberg <[email protected]>
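The fix described in the commit message above (tolerate invalid-state errors during removal, since the cleanup process may have stopped the container first) can be sketched roughly as follows. All names here are illustrative stand-ins, not libpod's actual functions or error values:

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel errors standing in for libpod's container-state errors
// (hypothetical; libpod's real errors live elsewhere).
var (
	ErrCtrStopped      = errors.New("container is already stopped")
	ErrCtrStateInvalid = errors.New("container state improper for operation")
)

// stopContainer simulates the race: by the time removal tries to stop
// the container, the cleanup process may already have done so.
func stopContainer(alreadyStopped bool) error {
	if alreadyStopped {
		return ErrCtrStopped
	}
	return nil
}

// removeContainer applies the fix: invalid-state errors from the stop
// are ignored, because a container that is already stopped is exactly
// the state removal wants to reach anyway.
func removeContainer(alreadyStopped bool) error {
	if err := stopContainer(alreadyStopped); err != nil {
		if errors.Is(err, ErrCtrStopped) || errors.Is(err, ErrCtrStateInvalid) {
			return nil // benign race: the cleanup process got there first
		}
		return err
	}
	return nil
}

func main() {
	// Removal succeeds whether or not cleanup won the race.
	fmt.Println(removeContainer(true) == nil, removeContainer(false) == nil)
}
```

The alternative mentioned in the commit message, suppressing such errors inside the kill path itself, would hide them from callers like the sig-proxy code that explicitly wants to see them, which is why the filtering happens at the removal call site instead.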
Another hard-to-isolate flake. I'm not convinced that it's truly podman-remote-only, because I've actually seen this one on my laptop. Two symptoms (possibly unrelated):
and
With the second one, once it happens, the entire system becomes completely unusable.
Here's an example of the cannot-remove one
Here's an example of the error-freeing one