The unlinkat/EBUSY flake is back #11594
Comments
The e2e Cleanup handlers run various 'rm' commands, but don't check the exit status of any. This masks real problems such as the unlinkat-EBUSY flake (containers#11594), which only triggers on container/pod rm. Solution: check exit status; a failure in cleanup will now be considered a failure in the test itself. Signed-off-by: Ed Santiago <[email protected]>
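For illustration, here is a minimal sketch of that idea, assuming a hypothetical helper and plain os/exec rather than the e2e suite's actual Ginkgo session helpers: a cleanup command that exits nonzero now fails the test instead of being ignored.

```go
// Minimal sketch, not the actual e2e code: surface a nonzero exit status from
// a cleanup command as a test failure instead of silently discarding it.
package e2e_test

import (
	"os/exec"
	"testing"
)

// cleanupContainers is a hypothetical helper name.
func cleanupContainers(t *testing.T, podmanBin string) {
	t.Helper()
	out, err := exec.Command(podmanBin, "rm", "-fa").CombinedOutput()
	if err != nil {
		// An unlinkat/EBUSY error during cleanup now fails the test itself.
		t.Fatalf("cleanup failed: %v\n%s", err, out)
	}
}
```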
This looks like a new issue, not related to the previous one we had in c/storage. It looks like a race when cleaning up |
This one instead looks like the one we had in storage :/ It could be an issue specific to ppc64le |
Here's something similar. The string "unlinkat" does not appear here (f33 root container) but it seems to have to do with pods and EBUSY, so I'll report here:
|
That's separate - it's a cgroups issue, Conmon is not exiting by the time we want to remove the pod's cgroup. |
Thanks. Filed #11946. |
And, back on topic, this is still happening. About 1-2 every three days. As I mentioned before, these are really tedious to look for and link, so yeah, I'll post links if desired, but I really hope someone just has an Aha moment and realizes how to fix it. |
Still happening:
So: both root and rootless. Fedora and Ubuntu. Local and remote. Always in |
Hmmm, I just managed to accidentally reproduce this on my laptop. I ^C'ed a while-forever loop that was doing
$ bin/podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
99cc76cd2fe8 docker.io/library/fedora:latest true 2 seconds ago Created charming_dirac
$ bin/podman rm -f -a <--- This paused for about 10s.
Error: error removing container 99cc76cd2fe854a86529081a519a90510243ab4681e3a79e1b7899e53827fb96 root filesystem: 2 errors occurred:
* unlinkat /home/esm/.local/share/containers/storage/overlay-containers/99cc76cd2fe854a86529081a519a90510243ab4681e3a79e1b7899e53827fb96/userdata/shm: device or resource busy
* unlinkat /home/esm/.local/share/containers/storage/overlay-containers/99cc76cd2fe854a86529081a519a90510243ab4681e3a79e1b7899e53827fb96/userdata/shm: device or resource busy
(Yes, two blank lines afterward). The pause surprised me: I did not expect to need |
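That reproduction suggests the container's userdata/shm mount was still held when rm ran. Below is a hedged diagnostic sketch, not part of podman, that checks whether a given path is still listed as a mount point in /proc/self/mountinfo:

```go
// Hedged diagnostic sketch (not part of podman): report whether a path such as
// .../overlay-containers/<id>/userdata/shm is still listed as a mount point,
// which would explain unlinkat returning EBUSY.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func isMountPoint(path string) (bool, error) {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		return false, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Field 5 (index 4) of each mountinfo line is the mount point.
		fields := strings.Fields(sc.Text())
		if len(fields) > 4 && fields[4] == path {
			return true, nil
		}
	}
	return false, sc.Err()
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: ismounted <path>")
		os.Exit(2)
	}
	mounted, err := isMountPoint(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("mounted:", mounted)
}
```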
A friendly reminder that this issue had no activity for 30 days. |
Just reproduced it again on my laptop, again unintentionally via ^C: ...
^C
$ bin/podman rm foo
Error: error removing container f512e6e9274b8ccb8d4b2715c482308a00c404b553ae81611146f339df107ba7 root filesystem: 2 errors occurred:
* unlinkat /home/esm/.local/share/containers/storage/overlay-containers/f512e6e9274b8ccb8d4b2715c482308a00c404b553ae81611146f339df107ba7/userdata/shm: device or resource busy
* unlinkat /home/esm/.local/share/containers/storage/overlay-containers/f512e6e9274b8ccb8d4b2715c482308a00c404b553ae81611146f339df107ba7/userdata/shm: device or resource busy
Based on the output of the loop, I'm pretty sure the ^C interrupted this command:
|
Reproduced this again in RHEL8 gating tests:
|
if the container deletion fails with the simple os.RemoveAll(path), then attempt again using system.EnsureRemoveAll(path). The difference is that system.EnsureRemoveAll(path) tries to unmount any mounts that are keeping the directory busy. It might help with: containers/podman#11594 Signed-off-by: Giuseppe Scrivano <[email protected]>
I was not able to reproduce locally even once, so this is just an attempt that hopefully can help dealing with the race we have seen here: containers/storage#1106 |
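A rough sketch of that fallback, under the assumption that the caller just has the container's storage directory path (the real change is in c/storage, see containers/storage#1106):

```go
// Sketch of the fallback described above, not the exact c/storage patch:
// try a plain os.RemoveAll first, and only on failure fall back to
// system.EnsureRemoveAll, which also attempts to unmount whatever is keeping
// the directory busy (such as a leaked userdata/shm mount).
package cleanup

import (
	"fmt"
	"os"

	"github.com/containers/storage/pkg/system"
)

func removeContainerDir(path string) error {
	if err := os.RemoveAll(path); err == nil {
		return nil
	}
	if err := system.EnsureRemoveAll(path); err != nil {
		return fmt.Errorf("removing %s: %w", path, err)
	}
	return nil
}
```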
A friendly reminder that this issue had no activity for 30 days. |
Still.
Please don't make me cite more examples: these are not easily captured by my flake logger, and it is super tedious for me to manually find and link to each one. |
A friendly reminder that this issue had no activity for 30 days. |
A friendly reminder that this issue had no activity for 30 days. |
Haven't seen this in, oh, five days. (Plus lots more in April and March and February; but like I said above, it's tedious to gather these). |
Seen today f36 remote rootless:
|
Still happening on root containers with podman 4.1.0 on EL7; happy to provide you with remote access:
|
Seen just now, ubuntu 2110 remote root:
This then caused a cascade of failures, where every test after that failed:
|
Seems to be much easier to trigger when running on the 3.10 kernel |
Okay, we've got a problem. This is blowing up today. New one, f36 remote rootless |
And another one. Now f36 remote root. |
And yet another one, sys remote fedora-36 rootless. @containers/podman-maintainers PTAL this one is getting ugly really fast. |
@giuseppe, if you find time, could you take a look? |
Another one: sys remote ubuntu root |
Another recent one: sys remote f36 rootless on August 8 |
In this one remote ubuntu root, the unlinkat/EBUSY flake triggers the error-freeing error which renders the system unusable (#15367) |
A friendly reminder that this issue had no activity for 30 days. |
Reminder @giuseppe @vrothberg |
This one could be the same one, or could be the rm-hose issue (#15367), I can't tell. podman ubuntu rootless, in
$ podman kube down /tmp/podman_bats.pH6C43/testpod.yaml
Pods stopped:
3cb812278cf782c71fb501a171658fac32c9fdbbba3055ec9d2a6ccbd003a2f8
Pods removed:
Error: 1 error occurred:
* removing container 10413d40f879e63f9d5802f8dc26343b8b6267020014e1ae0491b26c8eea33cc from pod 3cb812278cf782c71fb501a171658fac32c9fdbbba3055ec9d2a6ccbd003a2f8: removing container 10413d40f879e63f9d5802f8dc26343b8b6267020014e1ae0491b26c8eea33cc root filesystem: 1 error occurred:
* unlinkat /home/some25997dude/.local/share/containers/storage/overlay-containers/10413d40f879e63f9d5802f8dc26343b8b6267020014e1ae0491b26c8eea33cc/userdata/shm: device or resource busy
[ rc=125 (** EXPECTED 0 **) ] |
The function is being used in a number of places, notably container removal and cleanup. While container removal already loops over EBUSY, cleanup does not. To make sure that all callers of Unmount get a fair chance of unmounting cleanly, also loop there. I used the same values as containerd: 50 loops with 50ms sleeps. Context: containers/podman/issues/11594 Signed-off-by: Valentin Rothberg <[email protected]>
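A minimal sketch of that retry strategy, assuming a bare unix.Unmount call rather than the actual c/storage Unmount path; the 50 attempts and 50ms sleep come from the commit message above:

```go
// Retry the unmount while it keeps failing with EBUSY, using the
// containerd-inspired values mentioned above: 50 attempts, 50ms apart.
package cleanup

import (
	"errors"
	"time"

	"golang.org/x/sys/unix"
)

func unmountWithRetry(target string) error {
	var err error
	for i := 0; i < 50; i++ {
		if err = unix.Unmount(target, 0); err == nil || !errors.Is(err, unix.EBUSY) {
			return err
		}
		time.Sleep(50 * time.Millisecond)
	}
	return err
}
```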
Since #16159 merged, I am going to close it. Please reopen if I am mistaken. |
Background: we thought this had been fixed in containers/storage#926. Further background: #7139, #10454.
Symptom [source]:
Most often seen in "podman play kube with image data" test, with "podman pod rm -fa", but also seen in "podman create container with --uidmap and conmon PidFile accessible" test (with podman rm, not podman pod rm, and with one unlinkat line, not two). Happening on ubuntu and fedora.
Flakes began on 2021-08-31 and continue through today. The multiarch folks are also seeing it.
These flakes do not appear in my flake logs because they seem to be happening in common_test.go:Cleanup(), which (sigh) does not do exit-status checks.
Log links available if absolutely necessary, but please don't ask unless absolutely necessary: my flake checker doesn't catch these (see immediately above), so finding logs requires tedious manual effort.
What happened on August 30/31?