The unlinkat/EBUSY flake is back #11594
Comments
The e2e Cleanup handlers run various 'rm' commands, but don't check the exit status of any. This masks real problems such as the unlinkat-EBUSY flake (containers#11594), which only triggers on container/pod rm. Solution: check exit status; a failure in cleanup will now be considered a failure in the test itself. Signed-off-by: Ed Santiago <[email protected]>
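For illustration, here is a minimal sketch of that idea, assuming a hypothetical helper and plain os/exec rather than the e2e suite's actual Ginkgo session helpers: a cleanup command that exits nonzero now fails the test instead of being ignored.

```go
// Minimal sketch, not the actual e2e code: surface a nonzero exit status from
// a cleanup command as a test failure instead of silently discarding it.
package e2e_test

import (
	"os/exec"
	"testing"
)

// cleanupContainers is a hypothetical helper name.
func cleanupContainers(t *testing.T, podmanBin string) {
	t.Helper()
	out, err := exec.Command(podmanBin, "rm", "-fa").CombinedOutput()
	if err != nil {
		// An unlinkat/EBUSY error during cleanup now fails the test itself.
		t.Fatalf("cleanup failed: %v\n%s", err, out)
	}
}
```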
This looks like a new issue, not related to the previous one we had in c/storage. It looks like a race when cleaning up |
This one instead looks like the one we had in storage :/ It could be an issue specific to ppc64le |
Here's something similar. The string "unlinkat" does not appear here (f33 root container) but it seems to have to do with pods and EBUSY, so I'll report here:
|
That's separate - it's a cgroups issue, Conmon is not exiting by the time we want to remove the pod's cgroup. |
Thanks. Filed #11946. |
And, back on topic, this is still happening. About 1-2 every three days. As I mentioned before, these are really tedious to look for and link, so yeah, I'll post links if desired, but I really hope someone just has an Aha moment and realizes how to fix it. |
Still happening:
So: both root and rootless. Fedora and Ubuntu. Local and remote. Always in |
Hmmm, I just managed to accidentally reproduce this on my laptop. I ^C'ed a while-forever loop that was doing
$ bin/podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
99cc76cd2fe8 docker.io/library/fedora:latest true 2 seconds ago Created charming_dirac
$ bin/podman rm -f -a <--- This paused for about 10s.
Error: error removing container 99cc76cd2fe854a86529081a519a90510243ab4681e3a79e1b7899e53827fb96 root filesystem: 2 errors occurred:
* unlinkat /home/esm/.local/share/containers/storage/overlay-containers/99cc76cd2fe854a86529081a519a90510243ab4681e3a79e1b7899e53827fb96/userdata/shm: device or resource busy
* unlinkat /home/esm/.local/share/containers/storage/overlay-containers/99cc76cd2fe854a86529081a519a90510243ab4681e3a79e1b7899e53827fb96/userdata/shm: device or resource busy
(Yes, two blank lines afterward). The pause surprised me: I did not expect to need |
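That reproduction suggests the container's userdata/shm mount was still held when rm ran. Below is a hedged diagnostic sketch, not part of podman, that checks whether a given path is still listed as a mount point in /proc/self/mountinfo:

```go
// Hedged diagnostic sketch (not part of podman): report whether a path such as
// .../overlay-containers/<id>/userdata/shm is still listed as a mount point,
// which would explain unlinkat returning EBUSY.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func isMountPoint(path string) (bool, error) {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		return false, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Field 5 (index 4) of each mountinfo line is the mount point.
		fields := strings.Fields(sc.Text())
		if len(fields) > 4 && fields[4] == path {
			return true, nil
		}
	}
	return false, sc.Err()
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: ismounted <path>")
		os.Exit(2)
	}
	mounted, err := isMountPoint(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("mounted:", mounted)
}
```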
A friendly reminder that this issue had no activity for 30 days. |
Just reproduced it again on my laptop, again unintentionally via ^C: ...
^C
$ bin/podman rm foo
Error: error removing container f512e6e9274b8ccb8d4b2715c482308a00c404b553ae81611146f339df107ba7 root filesystem: 2 errors occurred:
* unlinkat /home/esm/.local/share/containers/storage/overlay-containers/f512e6e9274b8ccb8d4b2715c482308a00c404b553ae81611146f339df107ba7/userdata/shm: device or resource busy
* unlinkat /home/esm/.local/share/containers/storage/overlay-containers/f512e6e9274b8ccb8d4b2715c482308a00c404b553ae81611146f339df107ba7/userdata/shm: device or resource busy
Based on the output of the loop, I'm pretty sure the ^C interrupted this command:
|
Reproduced this again in RHEL8 gating tests:
|
if the container deletion fails with the simple os.RemoveAll(path), then attempt again using system.EnsureRemoveAll(path). The difference is that system.EnsureRemoveAll(path) tries to unmount any mounts that are keeping the directory busy. It might help with: containers/podman#11594 Signed-off-by: Giuseppe Scrivano <[email protected]>
I was not able to reproduce locally even once, so this is just an attempt that hopefully can help dealing with the race we have seen here: containers/storage#1106 |
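A rough sketch of that fallback, under the assumption that the caller just has the container's storage directory path (the real change is in c/storage, see containers/storage#1106):

```go
// Sketch of the fallback described above, not the exact c/storage patch:
// try a plain os.RemoveAll first, and only on failure fall back to
// system.EnsureRemoveAll, which also attempts to unmount whatever is keeping
// the directory busy (such as a leaked userdata/shm mount).
package cleanup

import (
	"fmt"
	"os"

	"github.com/containers/storage/pkg/system"
)

func removeContainerDir(path string) error {
	if err := os.RemoveAll(path); err == nil {
		return nil
	}
	if err := system.EnsureRemoveAll(path); err != nil {
		return fmt.Errorf("removing %s: %w", path, err)
	}
	return nil
}
```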
A friendly reminder that this issue had no activity for 30 days. |
Still.
Please don't make me cite more examples: these are not easily captured by my flake logger, and it is super tedious for me to manually find and link to each one. |
A friendly reminder that this issue had no activity for 30 days. |
A friendly reminder that this issue had no activity for 30 days. |
Haven't seen this in, oh, five days. (Plus lots more in April and March and February; but like I said above, it's tedious to gather these). |
Seen today f36 remote rootless:
|
Still happening on root containers with podman 4.1.0 on EL7; happy to provide you with remote access:
|
Seen just now, ubuntu 2110 remote root:
This then caused a cascade of failures, where every test after that failed:
|
Seems to be much easier to trigger when running on the 3.10 kernel |
Okay, we've got a problem. This is blowing up today. New one, f36 remote rootless |
And another one. Now f36 remote root. |
And yet another one, sys remote fedora-36 rootless. @containers/podman-maintainers PTAL this one is getting ugly really fast. |
@giuseppe, if you find time, could you take a look? |
Another one: sys remote ubuntu root |
Another recent one: sys remote f36 rootless on August 8 |
In this one remote ubuntu root, the unlinkat/EBUSY flake triggers the error-freeing error which renders the system unusable (#15367) |
A friendly reminder that this issue had no activity for 30 days. |
Reminder @giuseppe @vrothberg |
This one could be the same one, or could be the rm-hose issue (#15367), I can't tell. podman ubuntu rootless, in
$ podman kube down /tmp/podman_bats.pH6C43/testpod.yaml
Pods stopped:
3cb812278cf782c71fb501a171658fac32c9fdbbba3055ec9d2a6ccbd003a2f8
Pods removed:
Error: 1 error occurred:
* removing container 10413d40f879e63f9d5802f8dc26343b8b6267020014e1ae0491b26c8eea33cc from pod 3cb812278cf782c71fb501a171658fac32c9fdbbba3055ec9d2a6ccbd003a2f8: removing container 10413d40f879e63f9d5802f8dc26343b8b6267020014e1ae0491b26c8eea33cc root filesystem: 1 error occurred:
* unlinkat /home/some25997dude/.local/share/containers/storage/overlay-containers/10413d40f879e63f9d5802f8dc26343b8b6267020014e1ae0491b26c8eea33cc/userdata/shm: device or resource busy
[ rc=125 (** EXPECTED 0 **) ] |
The function is being used in a number of places, notably container removal and cleanup. While container removal already loops over EBUSY, cleanup does not. To make sure that all callers of Unmount get a fair chance of unmounting cleanly, also loop there. I used the same values as containerd: 50 loops with 50ms sleeps. Context: containers/podman/issues/11594 Signed-off-by: Valentin Rothberg <[email protected]>
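A minimal sketch of that retry strategy, assuming a bare unix.Unmount call rather than the actual c/storage Unmount path; the 50 attempts and 50ms sleep come from the commit message above:

```go
// Retry the unmount while it keeps failing with EBUSY, using the
// containerd-inspired values mentioned above: 50 attempts, 50ms apart.
package cleanup

import (
	"errors"
	"time"

	"golang.org/x/sys/unix"
)

func unmountWithRetry(target string) error {
	var err error
	for i := 0; i < 50; i++ {
		if err = unix.Unmount(target, 0); err == nil || !errors.Is(err, unix.EBUSY) {
			return err
		}
		time.Sleep(50 * time.Millisecond)
	}
	return err
}
```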
Since #16159 merged, I am going to close it. Please reopen if I am mistaken. |
Background: we thought this had been fixed in containers/storage#926. Further background: #7139, #10454.
Symptom [source]:
Most often seen in "podman play kube with image data" test, with "podman pod rm -fa", but also seen in "podman create container with --uidmap and conmon PidFile accessible" test (with podman rm, not podman pod rm, and with one unlinkat line, not two). Happening on ubuntu and fedora.
Flakes began on 2021-08-31 and continue through today. The multiarch folks are also seeing it.
These flakes do not appear in my flake logs because they seem to be happening in common_test.go:Cleanup(), which (sigh) does not do exit-status checks.
Log links available if absolutely necessary, but please don't ask unless absolutely necessary: my flake checker doesn't catch these (see immediately above), so finding logs requires tedious manual effort.
What happened on August 30/31?