It's baaaack: podman images: Error: top layer [...] not found in layer tree #8148
Comments
Can you remove the bad image? |
No:
$ podman rmi a7a3
Error: 1 error occurred:
* top layer 1f832d5208105d5dde3f814d391ff7b4ddb557fd3bdbcb79418906242772dc73 of image a7a37f74ff864eec55891b64ad360d07020827486e30a92ea81d16459645b26a not found in layer tree
$ podman rmi -a -f
Error: 3 errors occurred:
* top layer 1f832d5208105d5dde3f814d391ff7b4ddb557fd3bdbcb79418906242772dc73 of image a7a37f74ff864eec55891b64ad360d07020827486e30a92ea81d16459645b26a not found in layer tree
* top layer 1f832d5208105d5dde3f814d391ff7b4ddb557fd3bdbcb79418906242772dc73 of image a7a37f74ff864eec55891b64ad360d07020827486e30a92ea81d16459645b26a not found in layer tree
* unable to delete all images, check errors and re-run image removal if needed
$ podman rmi 1f8
Error: 1 error occurred:
* unable to find a name and tag match for 1f8 in repotags: no such image
I assume that removing |
@nalind Can't we get this rm to work? |
Uh-oh. I just had a flake in one of my PRs with exactly the same symptom: |
It looks like you can edit $HOME/.local/share/containers/storage/overlay-images/images.json and remove the bad image from the json file and get your images working again. |
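For anyone who would rather script that edit than do it by hand, here is a minimal sketch. It assumes `images.json` is a JSON array of objects that each carry an `"id"` field; that layout is an assumption, not something stated in this thread. `drop_image` is a hypothetical helper name. It keeps a `.bak` copy and should only be run while no podman process is using the storage.

```python
import json
import os
import shutil


def drop_image(images_json, id_prefix):
    """Remove image records whose "id" starts with id_prefix.

    Assumes images_json holds a JSON array of objects with an "id" key
    (an assumption about the file layout). Returns the removed IDs.
    """
    with open(images_json) as f:
        images = json.load(f)

    removed = [img["id"] for img in images
               if img.get("id", "").startswith(id_prefix)]
    kept = [img for img in images
            if not img.get("id", "").startswith(id_prefix)]

    if removed:
        # Keep a backup of the original file before touching it.
        shutil.copy2(images_json, images_json + ".bak")
        # Write the new content to a temp file, flush it to disk, then
        # swap it into place so a crash never leaves a half-written file.
        tmp = images_json + ".tmp"
        with open(tmp, "w") as f:
            json.dump(kept, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, images_json)
    return removed
```

Usage would look something like `drop_image(os.path.expanduser("~/.local/share/containers/storage/overlay-images/images.json"), "a7a37f74")`, with the ID prefix taken from the error message.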
A friendly reminder that this issue had no activity for 30 days. |
This is an important issue for us to fix, or at least have a simple way of cleaning this up, other than destroying all containers. |
Just had to manually remove literally every |
Seeing the following when trying to list images built with
Ended up just |
Could you give us the exact steps to recreate? |
My above issue with |
Ok closing, reopen if it happens again |
Hey, I just stumbled upon this: I tried running the docker.io node-red container on a Raspberry Pi 4 with Ubuntu 20.10 and podman 2.0.6 installed from the default apt repos.
Things I did:
I resolved this by doing the fix from above, but I'll leave this here for future reference. |
I too just ran across this error @nalind (hi!)
Not sure if this helps but
and
|
I think I broke it with a Ctrl-C mid Note the make target is called
|
I'll take this one @nalind 👍 |
Internally, Podman constructs a tree of layers in containers/storage to quickly compute relations among layers and hence images. To compute the tree, we intersect all local layers with all local images. So far, lookup errors have been fatal which has turned out to be a mistake since it seems fairly easy to cause storage corruptions, for instance, when killing builds. In that case, a (partial) image may list a layer which does not exist (anymore). Since the errors were fatal, there was no easy way to clean up and many commands were erroring out. To improve usability, turn the fatal errors into warnings that guide the user into resolving the issue. In this case, a `podman system reset` may be the appropriate way for now. [NO TESTS NEEDED] because I have no reliable way to force it. [1] containers#8148 (comment) Signed-off-by: Valentin Rothberg <[email protected]>
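To make the idea in that commit message concrete, here is a rough illustration of a layer tree that warns instead of failing when an image points at a missing top layer. This is not Podman's code (Podman is written in Go); the function name `build_layer_tree` and the input shapes (a dict of layer ID to parent ID, and a dict of image ID to top layer ID) are hypothetical and chosen only for the sketch.

```python
import logging

logger = logging.getLogger("layertree")


def build_layer_tree(layers, images):
    """Build a layer tree and attach images to their top layers.

    layers: dict mapping layer_id -> parent_id (None for a base layer).
    images: dict mapping image_id -> top_layer_id.
    Returns (tree, skipped) where skipped lists images whose top layer
    is missing. Missing layers produce warnings, not exceptions.
    """
    tree = {lid: {"parent": parent, "children": [], "images": []}
            for lid, parent in layers.items()}

    # Link every layer to its parent node.
    for lid, parent in layers.items():
        if parent is None:
            continue
        if parent in tree:
            tree[parent]["children"].append(lid)
        else:
            logger.warning("parent %s of layer %s not found in layer tree",
                           parent, lid)

    # Intersect images with layers; a dangling top layer is only logged.
    skipped = []
    for image_id, top in images.items():
        node = tree.get(top)
        if node is None:
            logger.warning(
                "top layer %s of image %s not found in layer tree; "
                "storage may be corrupted", top, image_id)
            skipped.append(image_id)
            continue
        node["images"].append(image_id)
    return tree, skipped
```

With the corrupted image only skipped, commands that walk the tree can still list or remove the remaining images.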
@vrothberg thanks for the patch. EDIT: went with removing the faulty image from
(as a side note, I produced that by interrupting a |
Thanks for the report, @martinetd! I will take a look at the
We have been talking about this internally and @giuseppe has plans to address that. @giuseppe, did you tackle the fsync issues already? |
Here's a PR to address the reported error in The error shouldn't be fatal. @martinetd, if possible, could you try out the PR and see if that fully resolves your issue? |
Thanks! |
The storage can easily be corrupted when a build or pull process (or any process *writing* to the storage) has been killed. The corruption surfaces in Podman reporting that a given layer could not be found in the layer tree. Those errors must not be fatal but only logged, such that the image removal may continue. Otherwise, a user may be unable to remove an image. [NO TESTS NEEDED] as I do not yet have a reliable way to cause such a storage corruption. Reported-in: containers#8148 (comment) Signed-off-by: Valentin Rothberg <[email protected]>
I'm less concerned about a fsck mode than about atomic updates, if they are possible at all. In this case the error seems to be more of an ordering problem: the metadata (JSON image descriptions) seems to be written before the image, so there's a window during which Bad Things Happen™ if the pull operation is interrupted. I'd like to fix these if possible, so the image always points either to the old image or the new image. There will potentially be dangling files until a fsck command is run, but that feels less important to me (especially if pulling the same image again will use the same names, so ultimately in my case the dangling files would mostly fix themselves). I'm not familiar with the data structure so I might just be saying something stupid that doesn't make sense for the overlayfs driver, but I feel it should be possible with what I've seen of it so far. |
without a "fsck mode" we would need to have a "do a full syncfs before writing the metadata mode". I am fine with that as long as it is not the default, as a syncfs can be very expensive. Causing such corruptions unfortunately is very easy. Locally I could reproduce it just by forcefully powering off a VM a few seconds after the image pull is completed |
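The ordering discussed in the last two comments (layer data durable first, metadata published last) could be sketched as follows. This is only an illustration under assumptions: the directory layout (`layers/` next to `images.json`) and the name `publish_image` are hypothetical, and it uses a per-file `fsync` rather than `syncfs` because the Python standard library only offers the system-wide `os.sync()`. It says nothing about how containers/storage actually writes its files.

```python
import json
import os


def fsync_dir(path):
    """Flush a directory so new or renamed entries in it are durable."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def publish_image(storage_dir, layer_id, payload, image_record):
    """Write layer data, flush it, and only then update the metadata."""
    layers_dir = os.path.join(storage_dir, "layers")
    os.makedirs(layers_dir, exist_ok=True)

    # 1. Layer data first, flushed before anything refers to it.
    layer_path = os.path.join(layers_dir, layer_id)
    with open(layer_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    fsync_dir(layers_dir)
    fsync_dir(storage_dir)

    # 2. Metadata last: build the new list in a temp file, flush it,
    #    then rename it over the old file in one atomic step.
    meta_path = os.path.join(storage_dir, "images.json")
    try:
        with open(meta_path) as f:
            images = json.load(f)
    except FileNotFoundError:
        images = []
    images.append(image_record)
    tmp_path = meta_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(images, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, meta_path)
    fsync_dir(storage_dir)
    return meta_path
```

If the process is killed before the rename, the old metadata is still intact and the worst case is a dangling layer file, which matches the "old image or new image, never in between" goal described above.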
Full fs What follows is longer than I had initially planned, but it's a subject I care about in general so please bear with me a few minutes... In this case there seem to be multiple distinct problems:
That's definitely more work than a full syncfs call, but it also doesn't disturb other workloads, and now I've seen dnf doesn't even bother with fsync I'm not sure if it's needed at all in our case because of the next problem:
Thanks! |
package managers usually call
I wasn't aware of such issue. If it happens without a powerdown, then it seems like something we should fix now without any sync involved. Do you have a reproducer for it?
this is way slower than a single syncfs. Long term, I'd like that containers/storage has the equivalent of: ostreedev/ostree#49 As you can see, usually a syncfs is much faster than calling fsync on each file (in fact both yum and dpkg perform quite badly in this area) |
I now get a different error:
I reproduced it with minimal data, please try with attached tar (github won't let me attach tar, but it's small enough so base64-encoded it as lazy way out):
(ugh, it's another error now... Well, I guess that's still worth looking at)
I just hit ^C during a podman pull, just reproduced on 3.0.1 after some (no output skipped, just added new lines before prompt for readability ; localhost:5000 is a local registry from docker.io/library/registry:2 ):
interestingly the last ^C shows after the next prompt, with podman pull having had an exit status of 0, so I'm not sure if it was leftover from the run before that didn't go well with the fresh pull...?
uh, not necessarily. Well, it all depends on the workload of your machine: syncfs has a terrible cost on some workloads (from my previous life on HPC systems with way too much scratch data). Syncing there is actively harmful to whatever else happens on the machine. While it might be slightly more costly for podman to do what dpkg does, I don't believe the overhead is that bad (does ostree do what dpkg does by triggering write then calling fsync when it's over? the initial sync should be almost free and by the time you call the real one there should be nothing left to flush for most files so that'll be cheap as well) and you're not hanging a production machine for a couple of minutes every time they do a pull. I've seen that, and never want to see any syncfs on these servers ever again. Anyway, let's look at the interrupted errors first. |
@martinetd, in case there are still some things left, could you open a new issue? It's not sustainable to tackle too many things at once in an already closed issue. A dedicated one helps to keep track of things and backport them if needed. |
Looks like a new variant of #7444, but am opening a new PR.
I seem to have gotten myself into a similar state again:
Smoking gun seems to be: running BATS tests while also watching Tech Talk (heavy bandwidth hog). Test was pulling a huge image (quay.io/libpod/fedora:31), got interrupted by timeout, everything exploded after that. Nothing works any more. rootless, master @ 01f642f