unlinkat/EBUSY/hosed is back (Jan 2023) #17216
Three instances in the last two days. Kind of alarming. One of them has a new symptom:
Here's another weird one: remote f36 root. Starts off as "crun: executable not found" (#17042), then has a "cannot unmarshal" error (#16154), and then finally goes into unlinkat/EBUSY and stays hosed forever.
Another one: remote f36 root:
(all subsequent tests hosed). So far, all instances I've seen are
Another one; this one happened earlier in the tests, and hosed everything after: remote f36 root
Another one, remote f37 rootless.
Another one, remote f36 root
remote f36 root, in a slirp test.
remote f37 root, in a v4.4 PR |
Another of our most popular flakes
Normally this flake happens in the
Seems like this always happens with pod-related commands.
I can't reproduce this one despite weeks of trying. What I have been able to do is analyze logs and figure out some common factors. Like:
The "-infra" supports my hypothesis that this is pod-related: that is, even if the failing test has nothing to do with pods, the previous test usually does. Here are the last two weeks of flakes:
...and, for convenience, here are links to a representative test failure (from today), its journal, and podman server log.
Several tweaks to see if we can track down containers#17216, the unlinkat-ebusy flake:

- teardown(): if a cleanup command fails, display it and its output to the debug channel. This should never happen, but it can and does (see containers#18180, dependent containers). We need to know about it.
- selinux tests: use unique pod names. This should help when scanning journal logs.
- many tests: add "-f -t0" to "pod rm"

And, several unrelated changes caught by accident:

- images-commit-with-comment test: was leaving a stray image behind. Clean it up, and make a few more readability tweaks.
- podman-remote-group-add test: add an explicit skip() when not remote. (Otherwise, the test passes cleanly on podman local, which is misleading.)
- lots of container cleanup and/or adding "--rm" to run commands, to avoid leaving stray containers.

Signed-off-by: Ed Santiago <[email protected]>
An interesting variation:
That is, it fails in
(and this one is not a hang/timeout, it's just the standard unlinkat-ebusy).
I instrumented my hammer PR and am finding a wealth of data, much more than I can make sense of, but enough to worry me. I would like this bug to get escalated. Start here, with a typical failure:
Then go to the podman server logs for this run, and search in-page for

Then go to the journal log and look for

Here's another failure, this time the string to look for is

system test logs now include
(Theory, “…/merged: device or resource busy”) Actually the

And when the deletion operation is initiated (but not when later trying to recover from a failed deletion of a layer marked incomplete, and re-trying the delete), the code is supposed to gracefully handle attempts to delete a mounted layer, by unmounting it. AFAICS that should succeed (or report an error) unless

So my current theory is that the mount counts get out of sync. I don’t immediately see why/how, especially since there isn’t anything in the logs to suggest that anything involved has failed. (I can imagine this failing — the sequential nature of ((un)mount, update mount count) means that an unfortunate process kill can get the count permanently incorrect.)

(Theory, “incomplete layer”:) The way layer deletion nowadays works is that we first persistently mark a layer “incomplete”, and then start deleting it. The idea is that if deletion is interrupted, the “incomplete” flag will cause the next user of the store to finish deleting that layer. Now, if the layer deletion was not interrupted but failed, there will still be a persistent “incomplete” flag, and the next user will still finish deleting that layer. Except — if the deletion always fails, then the “next user” will also fail to delete that layer, and that will also mean the “next user” just reports that error and exits, doing nothing else. And all following Podman/… processes will likewise try to delete the layer, generate the log entry, fail, and exit.

So, the bad news is that on such a failure, the store is permanently hosed and unusable for any other purpose, until the reason for the failure is fixed. (In this case, that might mean “just” manually unmounting the

(Theory, both together:)
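To make the “incomplete layer” theory above easier to follow, here is a minimal Go sketch of the flow it describes. All names (layer, deleteLayer, loadStore, and so on) are invented for illustration; this is not the containers/storage implementation. It only shows how a delete that can never succeed, combined with a persisted “incomplete” flag, makes every subsequent process retry, fail, and exit.

```go
// Hypothetical sketch (not the actual containers/storage code) of the
// "mark incomplete, then delete" flow described above, and of why a
// deletion that always fails wedges every subsequent process.
package main

import (
	"errors"
	"fmt"
)

type layer struct {
	id         string
	incomplete bool // persisted flag: deletion has started
	mountCount int  // persisted mount count
}

var errBusy = errors.New("device or resource busy")

// deleteLayer mimics the described algorithm: persist the "incomplete"
// flag first, unmount if the recorded mount count says the layer is
// mounted, then remove the directory tree.
func deleteLayer(l *layer) error {
	l.incomplete = true // persisted before any destructive step

	if l.mountCount > 0 {
		// graceful handling of a mounted layer: unmount it first
		if err := unmount(l); err != nil {
			return err
		}
	}

	// If the recorded count was wrong (really still mounted), the kernel
	// refuses the removal with EBUSY, like unlinkat(".../merged").
	return removeAll(l)
}

// loadStore is what every "next user" does: any layer still flagged
// incomplete must be deleted before the store can be used. If that
// delete keeps failing, every process logs the error and exits,
// which is the "hosed forever" state seen in the flake.
func loadStore(layers []*layer) error {
	for _, l := range layers {
		if l.incomplete {
			if err := deleteLayer(l); err != nil {
				return fmt.Errorf("finishing deletion of layer %s: %w", l.id, err)
			}
		}
	}
	return nil
}

func unmount(l *layer) error   { l.mountCount--; return nil } // placeholder
func removeAll(l *layer) error { return errBusy }             // simulate the flake

func main() {
	layers := []*layer{{id: "abc123", mountCount: 0, incomplete: true}}
	fmt.Println(loadStore(layers)) // every run fails the same way
}
```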
So this suggests that the mount count was zero when

So it’s not some rogue process keeping the filesystem busy that we forgot to terminate before deleting a layer; we got the mount count wrong before initiating the delete.

Looking at https://api.cirrus-ci.com/v1/artifact/task/5486372476157952/html/sys-remote-fedora-38-aarch64-root-host-sqlite.log.html#t--00308 , not all subsequent image-related operations after the first “resource busy” error are failing. Are the tests forcibly resetting the store? Or is this some even more weird state?
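Here is a similarly hedged sketch of the mount-count drift the earlier comment speculates about: the (un)mount step and the persisted-count update are sequential, so a process killed between them leaves the recorded count disagreeing with what the kernel actually has mounted. Again, the names are invented and this is not the real storage code.

```go
// Hypothetical sketch of the failure mode described above: the mount
// operation and the persisted mount-count update are two separate steps,
// so a process killed between them leaves the recorded count out of
// sync with the kernel's view. Invented names; not containers/storage.
package main

import "fmt"

type mountRecord struct {
	mounted bool // what the kernel actually sees
	count   int  // what the store has persisted
}

// mountLayer performs the two non-atomic steps. A kill landing between
// step 1 and step 2 leaves mounted=true with count still 0; a later
// delete then trusts count==0, skips the unmount, and the directory
// removal fails with EBUSY ("device or resource busy").
func mountLayer(r *mountRecord, killedBetweenSteps bool) {
	r.mounted = true // step 1: the actual mount
	if killedBetweenSteps {
		return // simulated forcible process kill
	}
	r.count++ // step 2: persist the new mount count
}

func main() {
	r := &mountRecord{}
	mountLayer(r, true)
	fmt.Printf("kernel sees mounted=%v, recorded count=%d\n", r.mounted, r.count)
	// Prints: kernel sees mounted=true, recorded count=0
}
```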
That's a really great observation (your last sentence). My experience is that the first failure is not always the super-fatal one: sometimes one or two tests pass afterward. And then, in cleanup,

This might be a good time to log all the new flake instances (i.e. last 2 days) with my new debug instrumentation:
(removed
I’ve read through the relevant code and filed containers/storage#1606 and containers/storage#1607. Neither is an obvious smoking gun, but at least the latter suggests that we might have been failing earlier without much noise.
We have a report from a customer in a BZ, and we think the problem in that BZ may be related to this one. For reference, the BZ is https://bugzilla.redhat.com/show_bug.cgi?id=2127347
Instrument system tests in hopes of tracking down containers#17216, the unlinkat-ebusy-hosed flake.

Oh, also, timestamp.awk: timestamps have always been UTC, but add a 'Z' to make it unambiguous.

Signed-off-by: Ed Santiago <[email protected]>
We got a trigger with the new debug code. And, interestingly, this one is associated with an instance of #17042 (the "executable not found" flake):
Another instance of unlinkat-ebusy correlated with executable-not-found: debian root
Just a reminder that this problem has not gone away. There does seem to be a suspicious correlation with #17042.
Still our number-one flake. Here's one interesting variation (rawhide root):
What I find interesting here is that the pod connection is TWO tests back, not one. That is: "userns=keep-id in a pod" is a pod-related test, and it passes; then "blkio-weight" also passes, but it has nothing to do with pods! (Reminder: unlinkat/ebusy seems strongly associated with play-kube and/or pods).
Move the execution of RecordWrite() before the graphDriver Cleanup().

This addresses a longstanding issue that occurs when the Podman cleanup process is forcibly terminated and, on some occasions, the termination happens after the Cleanup() but before the change is recorded. As a result, the next user is not notified about the change and will mount the container without the home directory below (the infamous /var/lib/containers/storage/overlay mount).

Then, the next time the graphDriver is initialized, the home directory is mounted on top of the existing mounts, causing some containers to fail with ENOENT since all their files are hidden, while others cannot be cleaned up since their mount directory is covered by the home directory mount.

Closes: containers/podman#18831
Closes: containers/podman#17216
Closes: containers/podman#17042

Signed-off-by: Giuseppe Scrivano <[email protected]>
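As a rough illustration of why this reordering matters, here is a hedged Go sketch (invented names, not the actual containers/storage code) contrasting the old cleanup-then-record order with the fixed record-then-cleanup order when the process is forcibly terminated partway through:

```go
// Minimal sketch of the ordering change described in the commit message
// above. Names (store, cleanup, recordWrite) are invented for
// illustration; this is not the real containers/storage code.
package main

import "fmt"

type store struct {
	driverMounted bool // stands in for the overlay "home" mount
	writeRecorded bool // stands in for the persisted last-write marker
}

// shutdownOld mirrors the problematic order: if the process is killed
// after cleanup() but before recordWrite(), the driver state changed
// but other users are never told, so they reuse stale state.
func (s *store) shutdownOld(killedBetween bool) {
	s.cleanup()
	if killedBetween {
		return // simulated forcible termination
	}
	s.recordWrite()
}

// shutdownNew mirrors the fix: record the write first, so a kill after
// that point can at worst produce a spurious notification, never a
// missing one.
func (s *store) shutdownNew(killedBetween bool) {
	s.recordWrite()
	if killedBetween {
		return
	}
	s.cleanup()
}

func (s *store) cleanup()     { s.driverMounted = false } // graphDriver Cleanup()
func (s *store) recordWrite() { s.writeRecorded = true }  // persist "something changed"

func main() {
	old := &store{driverMounted: true}
	old.shutdownOld(true)
	fmt.Printf("old order: mount torn down=%v, change recorded=%v\n",
		!old.driverMounted, old.writeRecorded) // torn down but unrecorded -> stale view

	fixed := &store{driverMounted: true}
	fixed.shutdownNew(true)
	fmt.Printf("new order: mount torn down=%v, change recorded=%v\n",
		!fixed.driverMounted, fixed.writeRecorded) // recorded, so the next user re-checks
}
```

With the old order, a kill in that window leaves the state changed but unrecorded, which is exactly the "next user mounts over stale state" scenario the commit describes; with the new order the worst case is an extra, harmless notification.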
Another variant of the "podman rm hoses everything" flake:
All subsequent tests fail, sometimes with a timeout, sometimes with unmarshal errors, sometimes with EBUSY.
I'm like 99% sure that this is the same as #17042 (the "crun: executable not found") flake, because it's happening in the same place and leaves the system hosed in a similar way, but it's easier to merge than to unmerge so I'm filing separately.
Seen in f37 remote rootless.