unlinkat/EBUSY/hosed is back (Jan 2023) #17216

Closed · edsantiago opened this issue Jan 25, 2023 · 21 comments · Fixed by containers/storage#1724

Labels: flakes, kind/bug, locked - please file new issue/PR, remote


@edsantiago (Member):

Another variant of the "podman rm hoses everything" flake:

not ok 510 podman kube --network
...
$ podman-remote --url unix:/tmp/podman_tmp_X6wS pod rm -t 0 -f test_pod
Error: 2 errors occurred:
	* removing container 0e3da7bfb50c6d7fb4f92ea81c6470baa918abf7e06d577eca57c6c8f659200b from pod 0c72f802ee245ebd7058f5f53d38563817d7837b7a0541b15e7b6bcf0ca290ee: removing container 0e3da7bfb50c6d7fb4f92ea81c6470baa918abf7e06d577eca57c6c8f659200b root filesystem: 1 error occurred:
	* unlinkat /home/some26533dude/.local/share/containers/storage/overlay/0cf8c86cb0aa2c77d3bc9488ade58ca3cb56bc8183f98dc9090c822f58bc8192/merged: device or resource busy

	* removing container 4dabd961b7beefe2d9354d3ca3a6f3e298e4c4fe3fbf554d74938ef6b0115388 from pod 0c72f802ee245ebd7058f5f53d38563817d7837b7a0541b15e7b6bcf0ca290ee: a container that depends on container 4dabd961b7beefe2d9354d3ca3a6f3e298e4c4fe3fbf554d74938ef6b0115388 could not be removed: container state improper

All subsequent tests fail, sometimes with timeout, or with unmarshal errors, or with EBUSY.

I'm like 99% sure that this is the same as #17042 (the "crun: executable not found") flake, because it's happening in the same place and leaves the system hosed in a similar way, but it's easier to merge than to unmerge so I'm filing separately.

Seen in f37 remote rootless.

@edsantiago (Member Author):

Three instances in the last two days. Kind of alarming. One of them has a new symptom:

not ok 296 podman pod create --pod new: --hostname
...
# podman-remote --url unix:/tmp/podman_tmp_wYar run --rm --pod new:huvzvkueyo quay.io/libpod/testimage:20221018 hostname
Error: preparing container c30dad44a78ba7684a182d0870ccce915a99133c9a88188a9d0701ed459b26d0 for attach: starting some containers: internal libpod error
[ rc=126 (** EXPECTED 0 **) ]
(everything then goes all screwed up, with unlinkat/EBUSY forever)

@edsantiago (Member Author):

Here's another weird one: remote f36 root. Starts off as "crun: executable not found" (#17042), then has a "cannot unmarshal" error (#16154), and then finally goes into unlinkat/EBUSY and hosed forever.

edsantiago added the "remote" label Feb 13, 2023
@edsantiago (Member Author):

Another one: remote f36 root:

# podman-remote --url unix:/tmp/podman_tmp_LEuv stop -a -t 0
36ed2abdffa8cfc2591d825c028bac02e0abc1415e2a430de64a12ac1935c444
236d79205e569bad85683705a20901548c1ca5aa0335f3d4e18becf096e18e08
# podman-remote --url unix:/tmp/podman_tmp_LEuv pod rm -t 0 -f test_pod
Error: 2 errors occurred:
	* removing container 236d79205e569bad85683705a20901548c1ca5aa0335f3d4e18becf096e18e08 from pod 21a1d2c2cda13389ce4743e441de43fee6d99f8389534dd511a8972e1fd45a8e: removing container 236d79205e569bad85683705a20901548c1ca5aa0335f3d4e18becf096e18e08 root filesystem: 1 error occurred:
	* unlinkat /var/lib/containers/storage/overlay/ae077b0c51a733437ad97e9e0ea5df990e8ffb6ff9b370a69fbb1f4e7dd7a51d/merged: device or resource busy


	* removing container 36ed2abdffa8cfc2591d825c028bac02e0abc1415e2a430de64a12ac1935c444 from pod 21a1d2c2cda13389ce4743e441de43fee6d99f8389534dd511a8972e1fd45a8e: a container that depends on container 36ed2abdffa8cfc2591d825c028bac02e0abc1415e2a430de64a12ac1935c444 could not be removed: container state improper
[ rc=125 (** EXPECTED 0 **) ]

(all subsequent tests hosed).

So far, all instances I've seen are remote so I've added that label. But we've seen root & rootless, and also one on aarch64.

@edsantiago (Member Author):

Another one; this one happened earlier in the tests, and hosed everything after: remote f36 root

@edsantiago (Member Author):

Another one, remote f37 rootless.

@edsantiago (Member Author):

Another one, remote f36 root

@edsantiago (Member Author):

remote f36 root, in a slirp test.

@edsantiago (Member Author):

remote f37 root, in a v4.4 PR

edsantiago added the "kind/bug" label Mar 28, 2023
@edsantiago (Member Author):

Another of our most popular flakes

@edsantiago (Member Author):

Normally this flake happens in the kube play tests. Here's one (f37 remote root) where it happens much earlier:

not ok 303 podman pod - communicating via /dev/shm 
...
#  podman-remote --url unix:/tmp/podman_tmp_x0Na pod create --infra=true --name=podmxBpcLIQNf
Error: failed to make pod: building local pause image: deleting build container "fc8e2f816b4a219c1e3193ee9eef03769d3da89148eb7de4906f453396cc7505": 1 error occurred:
	* unlinkat /var/lib/containers/storage/overlay/ef6c45957caf87e9b7219c2f540b099088a6eec69640526c9d50b9c298504b74/merged: device or resource busy

Seems like this always happens with pod-related commands.

@edsantiago (Member Author):

I can't reproduce this one despite weeks of trying. What I have been able to do is analyze logs and figure out some common factors. Like:

  • journal: the crash is correlated with a journal entry that says container died, and it is a container that should not die abnormally; it's usually a podman run --rm one
  • server: the crash is correlated with these server log entries:
...level=warning msg="StopSignal SIGTERM failed to stop container XXX-infra in 10 seconds, resorting to SIGKILL"
...level=warning msg="Found incomplete layer \"SHA\", deleting it"

The "-infra" supports my hypothesis that this is pod-related: that is, even if the failing test has nothing to do with pods, the previous test usually does.

Here are the last two weeks of flakes:

...and, for convenience, here are links to a representative test failure (from today), its journal, and podman server log.

edsantiago added a commit to edsantiago/libpod that referenced this issue May 10, 2023
Several tweaks to see if we can track down containers#17216, the unlinkat-ebusy
flake:

 - teardown(): if a cleanup command fails, display it and its
   output to the debug channel. This should never happen, but
   it can and does (see containers#18180, dependent containers). We
   need to know about it.

 - selinux tests: use unique pod names. This should help when
   scanning journal logs.

 - many tests: add "-f -t0" to "pod rm"

And, several unrelated changes caught by accident:
 - images-commit-with-comment test: was leaving a stray image
   behind. Clean it up, and make a few more readability tweaks

 - podman-remote-group-add test: add an explicit skip()
   when not remote. (Otherwise, test passes cleanly on
   podman local, which is misleading)

 - lots of container cleanup and/or adding "--rm" to run commands,
   to avoid leaving stray containers

Signed-off-by: Ed Santiago <[email protected]>
@edsantiago (Member Author):

An interesting variation:

[+0748s] ok 286 podman --group-add without keep-groups 
[+0868s] not ok 287 podman --group-add keep-groups plus added groups 
...
         # # podman-remote --url unix:/tmp/podman_tmp_IIda images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
         # timeout: sending signal TERM to command ?/var/tmp/go/src/github.com/containers/podman/bin/podman-remote?

That is, it fails in setup, in a podman images command. podman-remote images is hanging. All subsequent tests fail, of course, because this is unlinkat-ebusy. The weird thing is, we don't actually see unlinkat-ebusy until more than five minutes later:

[+1372s] not ok 292 podman userns=keep-id in a pod
...
         # # podman-remote --url unix:/tmp/podman_tmp_IIda images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
         # Error: loading primary layer store data: 1 error occurred:
         # 	* deleting layer "caff11450e9c0fd34bf116aafa0d74e31fd191c9929ae7d4c4d32733f0c10cab": unlinkat /var/lib/containers/storage/overlay/caff11450e9c0fd34bf116aafa0d74e31fd191c9929ae7d4c4d32733f0c10cab/merged: device or resource busy

(and this one is not a hang/timeout, it's just the standard unlinkat-ebusy).

@edsantiago (Member Author):

I instrumented my hammer PR and am finding a wealth of data, much more than I can make sense of, but enough to worry me. I would like this bug to get escalated.

Start here, with a typical failure:

not ok 308 podman pod create --pod new: --hostname
...
# podman-remote --url unix:/tmp/podman_tmp_D3Sk run --rm --pod new:budtyln2np --hostname foobar quay.io/libpod/testimage:20221018 hostname
foobar
time="2023-05-17T19:32:22Z" level=error msg="Removing container a4ebd4afe500acfcab03f4f909aedb91319d1936cb0c26b9be025317998a07f0:
      looking up container \"a4ebd4afe500acfcab03f4f909aedb91319d1936cb0c26b9be025317998a07f0\":
      1 error occurred:
      * deleting layer \"e5fa1d8558746a8aa6dd4e1ce129f4bf39a2879eca48f28ee3247f2e140f01c3\":
        unlinkat /var/lib/containers/storage/overlay/e5fa1d8558746a8aa6dd4e1ce129f4bf39a2879eca48f28ee3247f2e140f01c3/merged:
        device or resource busy

Then go to the podman server logs for this run, and search in-page for e5fa1. Notice a lot of "incomplete layer" messages.

Then go to the journal log and look for e5fa1 again. Only one hit, but it's at 19:41:45, which is ten minutes after the 19:32:22 failure message above. I went back to 19:32:22 in the log and scanned, but could not find anything unusual. (I may not know what to look for).

Here's another failure, this time the string to look for is 709b0, and here's server log and journal log. This time, though, 709b0 does not appear in the journal. Journal actually ends eight minutes after the flake timestamp. I don't know what to make of this.

system test logs now include ps auxww --forest and mount output, in case that's helpful. (Didn't help me).

@mtrmac (Collaborator) commented May 18, 2023:

(Theory, “…/merged: device or resource busy”)

Actually the mount output was helpful for me (maybe because I’m just new to this): it shows that …/merged is mounted in a fairly usual way. That also indicates that the layer was correctly created, at least correctly enough for something to mount it — so we are dealing with the deletion path, not a layer that is “incomplete” because it is being created.

And when the deletion operation is first initiated (but not when a later process tries to recover from a failed deletion by re-trying the delete of a layer already marked incomplete), the code is supposed to handle attempts to delete a mounted layer gracefully, by unmounting it first.

AFAICS that should succeed (or report an error) unless mountpoints.json, or the in-memory data, reports that layer as not being mounted — we don’t unmount it in that case.

So my current theory is that the mount counts get out of sync. I don’t immediately see why or how, especially since there isn’t anything in the logs to suggest that anything involved has failed. (I can imagine it failing, though: the (un)mount and the mount-count update happen as two sequential steps, so an unfortunate process kill between them can leave the count permanently incorrect.)
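
To make that sequencing hazard concrete, here is a minimal toy sketch in Go; every name in it is invented for illustration, and none of this is the actual containers/storage code:

package main

import "fmt"

// toyLayerStore mimics only the bookkeeping: the kernel's view of the mount
// versus the count that gets persisted (think mountpoints.json) in a
// separate, later step.
type toyLayerStore struct {
	kernelMounted  bool // what the kernel actually has mounted
	persistedCount int  // what the next process will read from disk
}

// mountLayer performs the mount first and persists the new count second.
// killedBeforeSave simulates a SIGKILL landing between the two steps.
func (s *toyLayerStore) mountLayer(killedBeforeSave bool) {
	s.kernelMounted = true // step 1: the real mount(2) would happen here
	if killedBeforeSave {
		return // step 2 never runs; the on-disk count still says 0
	}
	s.persistedCount++ // step 2: write the updated count back to disk
}

func main() {
	s := &toyLayerStore{}
	s.mountLayer(true) // unlucky kill between mount and save

	// A later delete that trusts persistedCount sees 0, skips the unmount,
	// and then unlinkat(".../merged") fails with EBUSY.
	fmt.Printf("kernel mounted: %v, recorded mounts: %d\n",
		s.kernelMounted, s.persistedCount)
}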


(Theory, “incomplete layer”:)

The way layer deletion nowadays works is that we first persistently mark a layer “incomplete”, and then start deleting it.

The idea is that if deletion is interrupted, the “incomplete” flag will cause the next user of the store to finish deleting that layer.

Now, if the layer deletion was not interrupted but failed, there will still be a persistent “incomplete” flag, and the next user will still finish deleting that layer.

Except that if the deletion always fails, the “next user” will also fail to delete that layer, and will therefore just report that error and exit, doing nothing else. All following Podman/… processes will likewise try to delete the layer, generate the log entry, fail, and exit.

So, the bad news is that on such a failure, the store is permanently hosed and unusable for any other purpose, until the reason for the failure is fixed.

(In this case, that might mean “just” manually unmounting the …/merged location.)
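
For illustration only, here is a rough Go sketch of that mark-incomplete-then-delete behavior (hypothetical names and error strings, not the real containers/storage API), showing why a delete that keeps failing leaves every later process reporting the same error and exiting:

package main

import (
	"errors"
	"fmt"
	"os"
)

// errBusy stands in for the EBUSY returned while .../merged is still mounted.
var errBusy = errors.New("device or resource busy")

type toyLayer struct {
	id           string
	incomplete   bool // persisted flag: "deletion started; finish it on next load"
	stillMounted bool
}

// deleteLayer marks the layer incomplete before doing any destructive work,
// then tries to remove it. While the stale mount exists, removal fails and
// the incomplete flag stays set on disk.
func deleteLayer(l *toyLayer) error {
	l.incomplete = true
	if l.stillMounted {
		return fmt.Errorf("deleting layer %q: unlinkat .../%s/merged: %w", l.id, l.id, errBusy)
	}
	return nil
}

// loadStore mimics what every new process does on startup: finish deleting
// any layer still flagged incomplete. If that keeps failing, the process
// reports the error and exits, and the store stays unusable until the
// underlying cause (here, the stale mount) is fixed by hand.
func loadStore(layers []*toyLayer) error {
	for _, l := range layers {
		if l.incomplete {
			if err := deleteLayer(l); err != nil {
				return fmt.Errorf("loading primary layer store data: %w", err)
			}
		}
	}
	return nil
}

func main() {
	layers := []*toyLayer{{id: "e5fa1d85", incomplete: true, stillMounted: true}}
	if err := loadStore(layers); err != nil {
		fmt.Fprintln(os.Stderr, "Error:", err) // every subsequent process hits this same error
		os.Exit(1)
	}
}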


(Theory, both together:)

layerStore.Delete, in this order:

  • unmounts the layer if mount counts suggests that needs to happen
  • marks the layer as incomplete
  • then actually starts deleting things

So this suggests that the mount count was zero when layerStore.Delete started, but an overlay was actually mounted at that time, and that is what triggered the final “resource busy” failure when the actual deletion started.

So it’s not some rogue process keeping the filesystem busy that we forgot to terminate before deleting a layer; we got the mount count wrong before initiating the delete.
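
Putting the two theories together, here is a toy Go sketch of that ordering; the names are invented, and only the sequence of steps mirrors the description above:

package main

import (
	"errors"
	"fmt"
)

var errBusy = errors.New("unlinkat .../merged: device or resource busy")

// toyLayer carries the two facts that matter here: what the store believes
// (recorded mount count) and what the kernel knows (actually mounted).
type toyLayer struct {
	recordedMounts  int
	actuallyMounted bool
	incomplete      bool
}

// deleteLayer follows the order described above: unmount only if the recorded
// count says that is needed, then mark the layer incomplete, then remove it.
func deleteLayer(l *toyLayer) error {
	if l.recordedMounts > 0 { // step 1: unmount if our records say so
		l.actuallyMounted = false
		l.recordedMounts = 0
	}
	l.incomplete = true   // step 2: persist the "incomplete" flag
	if l.actuallyMounted { // step 3: the actual removal
		return errBusy // the overlay is still mounted, so removal fails
	}
	l.incomplete = false // deletion finished; flag cleared
	return nil
}

func main() {
	// The suspected state: the count says "not mounted", the kernel disagrees.
	l := &toyLayer{recordedMounts: 0, actuallyMounted: true}
	fmt.Println("delete:", deleteLayer(l))                     // fails with EBUSY
	fmt.Println("layer left marked incomplete:", l.incomplete) // true: the store is now wedged
}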


Looking at https://api.cirrus-ci.com/v1/artifact/task/5486372476157952/html/sys-remote-fedora-38-aarch64-root-host-sqlite.log.html#t--00308 , not all subsequent image-related operations after the first “resource busy” error are failing. Are the tests forcibly resetting the store? Or is this an even weirder state?

@edsantiago (Member Author):

That's a really great observation (your last sentence). My experience is that the first failure is not always the super-fatal one: sometimes one or two tests pass afterward. And then, in cleanup, podman images goes kerblooey and nothing ever works again. I don't understand that.

This might be a good time to log all the new flake instances (i.e. last 2 days) with my new debug instrumentation:

  • fedora-38 root host boltdb
    • 05-18 14:02 in [sys] podman selinux: containers in --no-infra pods do not share context
  • fedora-38-aarch64 root host sqlite
    • 05-17 15:44 in [sys] podman pod create --pod new: --hostname
  • rawhide root host sqlite

(removed sys remote because there has never been an instance of these that are not sys-remote).

@mtrmac (Collaborator) commented May 19, 2023:

I’ve read through the relevant code and filed containers/storage#1606 and containers/storage#1607 . Neither is an obvious smoking gun, but at least the latter suggests that we might have been failing earlier without much noise.

@TomSweeneyRedHat (Member):

We have a report from a customer in a BZ that we think may be related to this issue. For reference, that BZ is https://bugzilla.redhat.com/show_bug.cgi?id=2127347

edsantiago added a commit to edsantiago/libpod that referenced this issue May 22, 2023
Instrument system tests in hopes of tracking down containers#17216,
the unlinkat-ebusy-hosed flake.

Oh, also, timestamp.awk: timestamps have always been UTC, but
add a 'Z' to make it unambiguous.

Signed-off-by: Ed Santiago <[email protected]>
@edsantiago (Member Author):

We got a trigger with the new debug code. And, interestingly, this one is associated with an instance of #17042 (the "executable not found" flake):

# podman-remote [...] run --rm --pod new:9wubveiqvi quay.io/libpod/testimage:20221018 hostname
time="2023-05-23T19:26:13Z" level=error msg="Removing container 82b463...:
      cleaning up storage: removing container 82b463... root filesystem: 1 error occurred:
         * unlinkat /var/lib/containers/storage/overlay/615a0e7.../merged: device or resource busy\n\n"
Error: preparing container 82b463... for attach: runc: runc create failed: unable to start container process:
        exec: "hostname": executable file not found in $PATH: OCI runtime attempted to invoke a command that was not found

@edsantiago (Member Author):

Another instance of unlinkat-ebusy correlated with executable-not-found: debian root

@edsantiago (Member Author):

Just a reminder that this problem has not gone away. There does seem to be a suspicious correlation with #17042.

@edsantiago (Member Author):

Still our number-one flake. Here's one interesting variation (rawhide root):

[+0765s] ok 295 podman userns=keep-id
[+0768s] ok 296 podman userns=keep-id in a pod
[+0799s] ok 297 podman run --blkio-weight-device
[+0920s] not ok 298 podman --ipc=host
# (from function `run_podman' in file test/system/[helpers.bash, line 261](https://github.com/containers/podman/blob/2ebc9004f40c20916e94ad2e3538571b058c40b5/test/system/helpers.bash#L261),
         #  from function `basic_setup' in file test/system/[helpers.bash, line 66](https://github.com/containers/podman/blob/2ebc9004f40c20916e94ad2e3538571b058c40b5/test/system/helpers.bash#L66),
         #  from function `setup' in test file test/system/[helpers.bash, line 154](https://github.com/containers/podman/blob/2ebc9004f40c20916e94ad2e3538571b058c40b5/test/system/helpers.bash#L154))
         #   `basic_setup' failed
<+     > # # podman-remote --url unix:/tmp/podman_tmp_MwKu rm -t 0 --all --force --ignore
<+049ms> # # podman-remote --url unix:/tmp/podman_tmp_MwKu ps --all --external --format {{.ID}} {{.Names}}
<+049ms> # # podman-remote --url unix:/tmp/podman_tmp_MwKu images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
<+0120s> # timeout: sending signal TERM to command ?/var/tmp/go/src/github.com/containers/podman/bin/podman-remote?
<+006ms> # [ rc=124 ]
         # *** TIMED OUT ***

What I find interesting here is that the pod connection is TWO tests back, not one. That is: "userns=keep-id in a pod" is a pod-related test, and it passes; then "blkio-weight" also passes, but it has nothing to do with pods! (Reminder: unlinkat/ebusy seems strongly associated with play-kube and/or pods).

giuseppe added a commit to giuseppe/storage that referenced this issue Oct 4, 2023
Move the execution of RecordWrite() before the graphDriver Cleanup().
This addresses a longstanding issue that occurs when the Podman
cleanup process is forcibly terminated and, on some occasions, the
termination happens after the Cleanup() but before the change is
recorded.  As a result, the next user is not notified about the
change and will mount the container without the home directory
below (the infamous /var/lib/containers/storage/overlay mount).
Then, the next time the graphDriver is initialized, the home
directory is mounted on top of the existing mounts, causing some
containers to fail with ENOENT since all files are hidden, and
others to resist cleanup since their mount directory is covered by
the home directory mount.

Closes: containers/podman#18831
Closes: containers/podman#17216
Closes: containers/podman#17042

Signed-off-by: Giuseppe Scrivano <[email protected]>
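
A toy Go sketch of the ordering change described in the commit message above (RecordWrite and Cleanup are the names used in the commit; everything else is an invented stand-in, not the real containers/storage code):

package main

import "fmt"

// toyDriver stands in for the graph driver whose Cleanup() unmounts the
// overlay home directory.
type toyDriver struct{ homeMounted bool }

func (d *toyDriver) Cleanup() { d.homeMounted = false }

// toyStore stands in for the store; RecordWrite() persists "something
// changed" so the next process knows it must re-check the driver state.
type toyStore struct {
	driver      *toyDriver
	changeNoted bool
}

func (s *toyStore) RecordWrite() { s.changeNoted = true }

// shutdownOld mirrors the old order: Cleanup() first, RecordWrite() second.
// If the process is forcibly terminated between the two (killedBetween),
// the home directory is unmounted but the next user is never told.
func (s *toyStore) shutdownOld(killedBetween bool) {
	s.driver.Cleanup()
	if killedBetween {
		return
	}
	s.RecordWrite()
}

// shutdownFixed mirrors the commit: record the change before Cleanup(), so a
// kill between the two steps at worst makes the next user re-check state it
// did not strictly need to.
func (s *toyStore) shutdownFixed() {
	s.RecordWrite()
	s.driver.Cleanup()
}

func main() {
	s := &toyStore{driver: &toyDriver{homeMounted: true}}
	s.shutdownOld(true) // simulate the forcibly terminated cleanup process
	fmt.Printf("home unmounted: %v, next user notified: %v\n",
		!s.driver.homeMounted, s.changeNoted) // true, false: the next user has a stale view
}
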
github-actions bot added the "locked - please file new issue/PR" label Jan 4, 2024
github-actions bot locked as resolved and limited conversation to collaborators Jan 4, 2024