podman stop, after kube play: Storage for container xxx has been removed #19702
The list so far:
|
New correlated symptom seen in f38 root:
This is in #17831 with @giuseppe's #19760 cherrypicked. Could be coincidence. I can't look into it now. |
I'm giving up on this: I am pulling the stderr-on-teardown checks from my flake-check PR. It's too much, costing me way too much time between this and #19721. Until these two are fixed, I can't justify the time it takes me to sort through these flakes. FWIW, here is the catalog so far:
Seen in: int podman fedora-37/fedora-38/rawhide root/rootless container/host boltdb/sqlite |
I looked into it. We could demote the log from error to info as other code locations do, but I want to understand how this can happen. @edsantiago, let me know if this is urgent. I can do a change without fully understanding what's going on. |
More background in cabe134 |
@vrothberg thank you, this is not urgent from my perspective. The background is: @Luap99 and I would really like to enable checks in CI for spurious warnings. We can't actually do that, because there are soooooooo many, but every few months I try and see what new ones have cropped up. This one and #19721 are, I think, new warnings since the last time I ran checks. "New", to me, means that something changed recently, and me whining loudly might trigger a memory in someone. Of course, priority may change if this starts showing up in the field. |
I stared at the code for a long time and tried to reproduce, but have not been successful so far. There's clearly a bug somewhere, because Podman should not attempt to unmount the root FS of a container when it's not mounted (anymore). I'd like to get to the root of that. What I find curious: @edsantiago, we don't see the error log in the system tests, do we? |
It's a warning message that doesn't affect exit status, and we don't check for those in system tests. |
...and it's high time that I do something about that (checking for warnings in system tests). I have a proof-of-concept, it's working nicely, but I now need to go through the dozens of failures looking for which are bugs and which are genuinely ok. That will be next week. TTFN. |
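For reference, a minimal sketch of what such a warning check could look like; the helper name and patterns below are illustrative guesses, not the actual proof-of-concept:
#!/bin/bash
# Hypothetical sketch: run a podman command and fail if anything that looks
# like a logrus warning or error shows up on stderr (matching the
# "level=warning"/"level=error" lines quoted elsewhere in this issue).
run_podman_checked() {
    local errfile rc
    errfile=$(mktemp)
    podman "$@" 2> "$errfile"
    rc=$?
    if grep -Eq 'level=(warning|error)' "$errfile"; then
        echo "unexpected warning/error from 'podman $*':" >&2
        cat "$errfile" >&2
        rm -f "$errfile"
        return 1
    fi
    rm -f "$errfile"
    return $rc
}
# Example usage:
run_podman_checked run -d --name demo quay.io/libpod/testimage:20221018 sleep 100
run_podman_checked stop -t0 demo
run_podman_checked rm demo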
And, now that we're checking for warnings in system tests, here we are (f38 rootless):
(I don't know why the |
@edsantiago, I am under the impression that we only get this error in the context of pods (and |
@vrothberg TBH I have no idea. I tend to look at flakes and let my hindbrain look for common patterns, only delving deep when necessary. Here my brain noticed "kube play"; it will take deliberate effort to poke deeper. It's now on my TODO list. |
Thanks, @edsantiago! I took a look at the code before PTO but could not find any obvious issue. With cabe134 also giving no clear indication of how this can happen, I think we're in for a treasure hunt. Re: pods: #4033 mentions --pod=xxx as well, which increases my confidence that we need a closer look at the container-removal code inside a pod. There are a number of conditionals (also impacting locking) which may reveal what's going on. |
@vrothberg in looking at the flake lists above, I noticed this one, container clone, which has nothing to do with kube. I wrote a reproducer (a sketch of its general shape follows this comment), ran it, and bam, in about ten minutes:
# while :;do /tmp/foo2.sh || break;done
...
ca7fff1bbc5ad7643201687a4898619bf158f2a09b01299c1ae614fc687e5ac7
1147f28dc5aba39d7646ce7dff672b55cf0c74c78dce92ed2315b18321a745e2
9300e5454658f3d4d0d522c98f65eebfe31ef0a07e55cd55574227191f88c371
ca7fff1bbc5ad7643201687a4898619bf158f2a09b01299c1ae614fc687e5ac7
ca7fff1bbc5ad7643201687a4898619bf158f2a09b01299c1ae614fc687e5ac7
9300e5454658f3d4d0d522c98f65eebfe31ef0a07e55cd55574227191f88c371
1147f28dc5aba39d7646ce7dff672b55cf0c74c78dce92ed2315b18321a745e2
FAILED
time="2023-09-18T12:29:25-04:00" level=error msg="IPAM error: failed to get ips for container ID ca7fff1bbc5ad7643201687a4898619bf158f2a09b01299c1ae614fc687e5ac7 on network podman"
time="2023-09-18T12:29:25-04:00" level=error msg="IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman"
time="2023-09-18T12:29:25-04:00" level=error msg="tearing down network namespace configuration for container ca7fff1bbc5ad7643201687a4898619bf158f2a09b01299c1ae614fc687e5ac7: netavark: open container netns: open /run/netns/netns-be34b3e0-1ebe-20e7-54c4-1cdbc8b1546c: IO error: No such file or directory (os error 2)"
time="2023-09-18T12:29:25-04:00" level=error msg="Unable to clean up network for container ca7fff1bbc5ad7643201687a4898619bf158f2a09b01299c1ae614fc687e5ac7: \"unmounting network namespace for container ca7fff1bbc5ad7643201687a4898619bf158f2a09b01299c1ae614fc687e5ac7: failed to unmount NS: at /run/netns/netns-be34b3e0-1ebe-20e7-54c4-1cdbc8b1546c: no such file or directory\""
time="2023-09-18T12:29:25-04:00" level=error msg="Storage for container ca7fff1bbc5ad7643201687a4898619bf158f2a09b01299c1ae614fc687e5ac7 has been removed" Because I was sloppy & lazy, I don't know if the errors are coming from the I am 99% confident that this is correlated with #19721 (I see the correlation in my flake logs too) but I can't begin to guess what the connection is. |
With
# /tmp/foo3.sh
...
e57c422616e5467aed877756aaae65eabb17782e0c62c9972064090ea9ca5aa8
1dc64b57e7fa8c6055667d5c4cb928d4288ceb1a6bd4b66ecbc9ecce85553a6b
1deab6842e13f58a7db6f025c9f04c05437bedddb98cbebf03feed3ab13e6e49
e57c422616e5467aed877756aaae65eabb17782e0c62c9972064090ea9ca5aa8
FAILED IN STOP
time="2023-09-18T13:14:55-04:00" level=error msg="IPAM error: failed to get ips for container ID e57c422616e5467aed877756aaae65eabb17782e0c62c9972064090ea9ca5aa8 on network podman"
time="2023-09-18T13:14:55-04:00" level=error msg="IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman"
time="2023-09-18T13:14:55-04:00" level=error msg="tearing down network namespace configuration for container e57c422616e5467aed877756aaae65eabb17782e0c62c9972064090ea9ca5aa8: netavark: open container netns: open /run/netns/netns-75483bef-09ef-d471-4538-1dca691fc819: IO error: No such file or directory (os error 2)"
time="2023-09-18T13:14:55-04:00" level=error msg="Unable to clean up network for container e57c422616e5467aed877756aaae65eabb17782e0c62c9972064090ea9ca5aa8: \"unmounting network namespace for container e57c422616e5467aed877756aaae65eabb17782e0c62c9972064090ea9ca5aa8: failed to unmount NS: at /run/netns/netns-75483bef-09ef-d471-4538-1dca691fc819: no such file or directory\""
time="2023-09-18T13:14:55-04:00" level=error msg="Storage for container e57c422616e5467aed877756aaae65eabb17782e0c62c9972064090ea9ca5aa8 has been removed"
# bin/podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1dc64b57e7fa quay.io/libpod/testimage:20221018 sh 6 minutes ago Created exciting_chaum
1deab6842e13 quay.io/libpod/testimage:20221018 sh 6 minutes ago Created exciting_chaum-clone
e57c422616e5 quay.io/libpod/testimage:20221018 sh 6 minutes ago Exited (0) 6 minutes ago exciting_chaum-clone1 |
It finally failed with plain
...
65c2dfb97b929bdcb69a37ec26990f03ef814b0f668fe9d3c88847a511e6288e
1a1ca5ade903fdcb212ee27449e91b29ebc5ca768fe36f03bab1e8341bbb3081
65c2dfb97b929bdcb69a37ec26990f03ef814b0f668fe9d3c88847a511e6288e
6a6a1673966acb0c6afa809675419f1589d1e189acb5e5358262f789ff299232
FAILED IN STOP
time="2023-09-18T16:46:34-04:00" level=error msg="IPAM error: failed to get ips for container ID 65c2dfb97b929bdcb69a37ec26990f03ef814b0f668fe9d3c88847a511e6288e on network podman"
time="2023-09-18T16:46:34-04:00" level=error msg="IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman"
time="2023-09-18T16:46:34-04:00" level=error msg="tearing down network namespace configuration for container 65c2dfb97b929bdcb69a37ec26990f03ef814b0f668fe9d3c88847a511e6288e: netavark: open container netns: open /run/netns/netns-85a4b447-1c5b-7d70-50dd-1796ad8e5fe5: IO error: No such file or directory (os error 2)"
time="2023-09-18T16:46:34-04:00" level=error msg="Unable to clean up network for container 65c2dfb97b929bdcb69a37ec26990f03ef814b0f668fe9d3c88847a511e6288e: \"unmounting network namespace for container 65c2dfb97b929bdcb69a37ec26990f03ef814b0f668fe9d3c88847a511e6288e: failed to unmount NS: at /run/netns/netns-85a4b447-1c5b-7d70-50dd-1796ad8e5fe5: no such file or directory\""
time="2023-09-18T16:46:34-04:00" level=error msg="Storage for container 65c2dfb97b929bdcb69a37ec26990f03ef814b0f668fe9d3c88847a511e6288e has been removed" |
This shows that we try to clean up twice, which cannot work. |
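One way to poke at that double-cleanup hypothesis from the CLI, assuming podman container cleanup drives the same teardown path; whether this reproduces the exact messages above is an open question:
# Hypothetical experiment, not a confirmed reproducer: drive the cleanup path
# twice by hand for one container and watch stderr.  After podman stop the
# automatic cleanup may already have run, so the explicit calls below are
# additional teardown attempts on an already-cleaned container.
cid=$(podman run -d quay.io/libpod/testimage:20221018 sleep 100)
podman stop -t0 "$cid"
podman container cleanup "$cid" || true
podman container cleanup "$cid" || true
podman rm "$cid"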
The past two weeks. Mostly in #17831 except for the one in
Seen in: int/sys fedora-37/fedora-38/fedora-39 root/rootless boltdb/sqlite |
is this still happening? 8ac2aa7 could have fixed it |
Last seen Oct 11, and a quick check shows that some of these failures included #20299, so I'm reluctant to close just yet. "Quick check" means scrolling to top of error log, clicking
Seen in: int+sys podman fedora-38+fedora-39β+rawhide root+rootless host boltdb+sqlite |
Today, with up-to-date main. f38 root. |
There is a potential race condition where we see a message about a removed container, which could instead be caused by a container that is not mounted; this change should clarify which one is causing it. Also, if the container does not exist, just warn the user instead of reporting an error; there is not much the user can do about it.
Fixes: containers#19702
[NO NEW TESTS NEEDED]
Signed-off-by: Daniel J Walsh <[email protected]>
Seeing this often when I cherrypick #18442
e.g. f38 root. Twice.