Rebooted, dangling file in /var/lib/cni/networks/podman prevents container starting #3759
Comments
This usually happens when another container has taken the IP that container was using. Do you have any other containers running using that IP?
…On Thu, Aug 8, 2019, 04:15 space88man wrote:
*Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)*
/kind bug
*Description*
Container won't start after system reboot with:
ERRO[0000] Error adding network: failed to allocate for range 0: requested IP address 10.88.0.20 is not available in range set 10.88.0.1-10.88.255.254
ERRO[0000] Error while adding pod to CNI network "podman": failed to allocate for range 0: requested IP address 10.88.0.20 is not available in range set 10.88.0.1-10.88.255.254
Error: unable to start container "freeswitch-init_1": error configuring network namespace for container 7690a7bc4960b76799d906dbc07a32d174a5e75581886349d3604d455e093cf7: failed to allocate for range 0: requested IP address 10.88.0.20 is not available in range set 10.88.0.1-10.88.255.254
*Steps to reproduce the issue:*
1. Upgraded Fedora 30 host, rebooted, tried to start a container
*Describe the results you received:*
INFO[0000] Found CNI network podman (type=bridge) at /etc/cni/net.d/87-podman-bridge.conflist
DEBU[0000] Made network namespace at /var/run/netns/cni-76dce0ef-599f-b625-7710-a7eeef4159e5 for container 7690a7bc4960b76799d906dbc07a32d174a5e75581886349d3604d455e093cf7
INFO[0000] Got pod network &{Name:freeswitch-init_1 Namespace:freeswitch-init_1 ID:7690a7bc4960b76799d906dbc07a32d174a5e75581886349d3604d455e093cf7 NetNS:/var/run/netns/cni-76dce0ef-599f-b625-7710-a7eeef4159e5 PortMappings:[] Networks:[podman] NetworkConfig:map[podman:{IP:10.88.0.20}]}
INFO[0000] About to add CNI network cni-loopback (type=loopback)
DEBU[0000] overlay: mount_data=nodev,metacopy=on,lowerdir=/var/lib/containers/storage/overlay/l/Z6S3A3PN6PO5DCD4E3FZZOFLO6:/var/lib/containers/storage/overlay/l/II26Q5SUSAWGOVHLKY3MKC7GNU,upperdir=/var/lib/containers/storage/overlay/8cc72886f3925230caa200d3bce8b24599897b0c64ca5a6a4d6d56871f118d4d/diff,workdir=/var/lib/containers/storage/overlay/8cc72886f3925230caa200d3bce8b24599897b0c64ca5a6a4d6d56871f118d4d/work,context="system_u:object_r:container_file_t:s0:c362,c945"
DEBU[0000] mounted container "7690a7bc4960b76799d906dbc07a32d174a5e75581886349d3604d455e093cf7" at "/var/lib/containers/storage/overlay/8cc72886f3925230caa200d3bce8b24599897b0c64ca5a6a4d6d56871f118d4d/merged"
DEBU[0000] Created root filesystem for container 7690a7bc4960b76799d906dbc07a32d174a5e75581886349d3604d455e093cf7 at /var/lib/containers/storage/overlay/8cc72886f3925230caa200d3bce8b24599897b0c64ca5a6a4d6d56871f118d4d/merged
INFO[0000] Got pod network &{Name:freeswitch-init_1 Namespace:freeswitch-init_1 ID:7690a7bc4960b76799d906dbc07a32d174a5e75581886349d3604d455e093cf7 NetNS:/var/run/netns/cni-76dce0ef-599f-b625-7710-a7eeef4159e5 PortMappings:[] Networks:[podman] NetworkConfig:map[podman:{IP:10.88.0.20}]}
INFO[0000] About to add CNI network podman (type=bridge)
ERRO[0000] Error adding network: failed to allocate for range 0: requested IP address 10.88.0.20 is not available in range set 10.88.0.1-10.88.255.254
ERRO[0000] Error while adding pod to CNI network "podman": failed to allocate for range 0: requested IP address 10.88.0.20 is not available in range set 10.88.0.1-10.88.255.254
DEBU[0000] Network is already cleaned up, skipping...
DEBU[0000] unmounted container "7690a7bc4960b76799d906dbc07a32d174a5e75581886349d3604d455e093cf7"
DEBU[0000] Cleaning up container 7690a7bc4960b76799d906dbc07a32d174a5e75581886349d3604d455e093cf7
DEBU[0000] Network is already cleaned up, skipping...
DEBU[0000] Container 7690a7bc4960b76799d906dbc07a32d174a5e75581886349d3604d455e093cf7 storage is already unmounted, skipping...
ERRO[0000] unable to start container "freeswitch-init_1": error configuring network namespace for container 7690a7bc4960b76799d906dbc07a32d174a5e75581886349d3604d455e093cf7: failed to allocate for range 0: requested IP address 10.88.0.20 is not available in range set 10.88.0.1-10.88.255.254
*Describe the results you expected:*
Container starts
*Additional information you deem important (e.g. issue happens only occasionally):*
*Output of podman version:*
Version: 1.4.4
RemoteAPI Version: 1
Go Version: go1.12.7
OS/Arch: linux/amd64
*Output of podman info --debug:*
debug:
  compiler: gc
  git commit: ""
  go version: go1.12.7
  podman version: 1.4.4
host:
  BuildahVersion: 1.9.0
  Conmon:
    package: podman-1.4.4-4.fc30.x86_64
    path: /usr/libexec/podman/conmon
    version: 'conmon version 1.0.0-dev, commit: 164df8af4e62dc759c312eab4b97ea9fb6b5f1fc'
  Distribution:
    distribution: fedora
    version: "30"
  MemFree: 7664599040
  MemTotal: 8340746240
  OCIRuntime:
    package: runc-1.0.0-93.dev.gitb9b6cc6.fc30.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.0.0-rc8+dev
      commit: e3b4c1108f7d1bf0d09ab612ea09927d9b59b4e3
      spec: 1.0.1-dev
  SwapFree: 4294963200
  SwapTotal: 4294963200
  arch: amd64
  cpus: 4
  hostname: containers.localdomain
  kernel: 5.2.5-200.fc30.x86_64
  os: linux
  rootless: false
  uptime: 7m 27.03s
registries:
  blocked: null
  insecure: null
  search:
  - docker.io
  - registry.fedoraproject.org
  - quay.io
  - registry.access.redhat.com
  - registry.centos.org
store:
  ConfigFile: /etc/containers/storage.conf
  ContainerStore:
    number: 3
  GraphDriverName: overlay
  GraphOptions:
  - overlay.mountopt=nodev,metacopy=on
  GraphRoot: /var/lib/containers/storage
  GraphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  ImageStore:
    number: 6
  RunRoot: /var/run/containers/storage
  VolumePath: /var/lib/containers/storage/volumes
*Additional environment details (AWS, VirtualBox, physical, etc.):*
No; this was after a graceful reboot.
@mccv1r0 Can you take a look at this? It looks like CNI didn't clean up address reservations after a clean reboot.
@mheon the runtime is solely responsible for calling CNI DEL for every container that is no longer running but has not had DEL called on it. I know that Podman/CRI-O keep a cache of containers somewhere on-disk. When that cache is reconciled with what is actually running at CRI-O/Podman startup, they need to call CNI DEL on every container in that cache list that is not currently running, to allow the network plugin to clean up.
@dcbw Is this something we could configure to put these files on a tmpfs, so that we don't have to clean up after a reboot?
@rhatdan you have no idea what kind of information the network plugin has to clean up, so you have to tell the network plugin to clean up. Which is CNI DEL. If your container is no longer running, and it wasn't given a CNI DEL, you must call CNI DEL to clean up.
@mheon @rhatdan you have a container database:
presumably that doesn't get blown away on restart. So the next time podman runs (for any pod), it will need to reconcile that database list with what's actually running, prune out the old stuff, and call CNI DEL on those that aren't running. Or something like that.
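To make that concrete, here is a minimal sketch of such a reconcile pass using the libcni API; this is not Podman's actual code, and the cache record and the isRunning helper are hypothetical placeholders:
package netcleanup

import (
    "context"

    "github.com/containernetworking/cni/libcni"
)

// cachedContainer is the minimal per-container record a runtime keeps on disk.
type cachedContainer struct {
    ID    string // container ID recorded before the reboot
    NetNS string // netns path; almost certainly stale after a reboot
}

// reconcile calls CNI DEL for every cached container that is no longer
// running, letting the plugins release IPAM reservations (the files under
// /var/lib/cni/networks) and tear down any iptables rules.
func reconcile(ctx context.Context, cni *libcni.CNIConfig, list *libcni.NetworkConfigList,
    cache []cachedContainer, isRunning func(id string) bool) error {
    for _, c := range cache {
        if isRunning(c.ID) {
            continue // still alive; leave its network state alone
        }
        rt := &libcni.RuntimeConf{
            ContainerID: c.ID,
            NetNS:       c.NetNS, // may no longer exist; plugins must tolerate that on DEL
            IfName:      "eth0",
        }
        if err := cni.DelNetworkList(ctx, list, rt); err != nil {
            return err
        }
    }
    return nil
}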
We do actually blow away most database state on reboot, on the assumption that none of it survived the reboot - what was running no longer is, what was mounted no longer is. It may be possible to work a CNI DEL into the process of refreshing the state post-reboot - I'll investigate.
Alright, this bit promises to be complicated. We wipe container state very early in the refresh process, because we can't retrieve containers otherwise - the state is not sane because of the restart, so all of our validation fails. The runtime never touches a state with the CNI result available after a reboot - it's gone by the time we have the ability to actually get containers. We can change this to preserve the cached CNI result, but some bits of what we pass to OCICNI as part of
Hmmm. If the network namespace path isn't strictly mandatory - I think we can probably call OCICNI's
Should be fixed by #4086
CNI expects that a DELETE be run before re-creating container networks. If a reboot occurs quickly enough that containers can't stop and clean up, that DELETE never happens, and Podman currently wipes the old network info and thinks the state has been entirely cleared. Unfortunately, that may not be the case on the CNI side. Some things - like IP address reservations - may not have been cleared. To solve this, manually re-run CNI Delete on refresh. If the container has already been deleted this seems harmless. If not, it should clear lingering state. Fixes: containers#3759 Signed-off-by: Matthew Heon <[email protected]>
When two or more containers are started at the same time, the later ones always fail:
ERRO[0000] Error adding network: failed to allocate for range 0: 10.88.1.230 has been allocated to 5a920437e66254079842109be75ba0f082ad85761f017e1673c527827abeb91c, duplicate allocation is not allowed
Can you provide a reproducer for this? What Podman version are you on?
CentOS 7, same error: after a graceful reboot some containers don't start.
Steps to reproduce:
I think your log trace cut off some longer lines - in particular, the actual error message from CNI is missing. Can you try again?
@mheon Updated the log and steps to reproduce without a reboot. Podman version 1.6.4
podman version
podman info
Basically, I thought that when systemd starts all the containers together, CNI network initialization creates a race condition. But now I think the runtime environment sometimes isn't handled correctly. In any case, repeatedly creating the chain in the iptables nat table is a race.
As a quick fix, I hacked together a CNI init unit (the unit files are quoted in the reply below).
/etc/systemd/system/podman-init.service
/etc/systemd/system/container-test-01.service.d/after.conf
I need to verify if this still occurs on 2.0; if it does, we can add a lock
for network initialization to prevent us from running CNI concurrently.
…On Sat, Jul 25, 2020, 07:54 Alex Gluck wrote:
For a quick fix, I hacked CNI init.
/etc/systemd/system/podman-init.service
[Unit]
Description=Podman hack for init CNI before start containers
Documentation=man:podman-run
[Service]
ExecStartPre=/usr/bin/podman run --rm --name podman-init -d -p 65500:65500/udp centos:centos7 sleep 30
ExecStart=/usr/bin/sleep 10
Type=simple
[Install]
WantedBy=multi-user.target
/etc/systemd/system/container-test-01.service.d/after.conf
[Unit]
After=podman-init.service
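Since each podman invocation is a separate process, the lock mheon mentions would have to be cross-process rather than an in-memory mutex. A rough sketch of that idea, not Podman's actual fix; the lock path is made up for illustration:
package netlock

import (
    "os"

    "golang.org/x/sys/unix"
)

// withCNILock runs fn while holding an exclusive flock, so two containers
// starting at the same moment cannot race inside the CNI plugins
// (duplicate IPAM allocations, doubled iptables chains, and so on).
func withCNILock(fn func() error) error {
    f, err := os.OpenFile("/run/podman-cni.lock", os.O_CREATE|os.O_RDWR, 0o600)
    if err != nil {
        return err
    }
    defer f.Close()
    if err := unix.Flock(int(f.Fd()), unix.LOCK_EX); err != nil {
        return err
    }
    defer unix.Flock(int(f.Fd()), unix.LOCK_UN)
    return fn()
}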
I am seeing that exact error message as root. The problem occurs after a reboot. My CI labs just got 2.0.5 a couple of days ago, but according to @sunyitao this won't fix it, unfortunately. I have, however, implemented the workaround suggested by @AlexGluck and created a small
@rhatdan This seems like a good candidate for systemd-tmpfiles? It's a directory not on a tmpfs that we want wiped after each reboot.
Sure, what is the path?
All subfolders of
I think we also need to prevent files named
Won't CNI recreate them if they are removed?
Nice. OK, safe to remove everything in those directories then.
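For reference, a tmpfiles.d rule along these lines would express that; this is an illustrative sketch, and the rule that was actually shipped may differ:
# /usr/lib/tmpfiles.d/podman.conf (illustrative)
# D creates the directory if missing and empties it on removal runs;
# the ! restricts the line to boot, so periodic cleanup leaves it alone.
D! /var/lib/cni/networks 0755 root root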
CNI sometimes leaves network information in /var/lib/cni/networks when the system crashes or containers do not shut down properly. This PR will clean up these leftover files, so that container engines will get a clean environment when the system reboots. Related to: containers#3759 Signed-off-by: Daniel J Walsh <[email protected]>
Still occurs on Fedora 33, podman 2.2.1.
Temporarily fixed by mounting
Also happens with podman 3.0.0-0.1.rc1.fc32 (rebuilt src.rpm from koji). Downgraded to podman-2.2.1-1.fc32.x86_64 and had to remove the files by hand. I have a custom network:
What triggered the problem was stopping and starting a container. The same thing locked the IP of another container too. The downgrade did not help, and neither did a reboot, so I had to remove the files by hand, and then the containers started again.
Could you check if you have this file?
Is it supposed to be deleting the content of the /var/lib/cni/networks directory?
I only got the following:
You may also install podman in a container with
Since the current version doesn't contain
I guess the fix is in podman 3.0
It is:
So my case is different; it happened without rebooting. I guess I have to open a separate issue if I can reproduce it again...
It was working for me the last time I tried it.
Probably more for the RHEL 8 bug tracker, but I'll post this here nonetheless. This problem exists in RHEL 8: if you are trying to start two or more containers using systemctl, none will start due to this issue, so I was putting:
in each of the systemctl start scripts.
Podman 3.0 will be in the RHEL 8.4 release.
Error: unable to start container 95bfae788dd9dc358a9b0c1e7513c42cc39a7a040e7df75e8023bb7d9b74fe9d: error configuring network namespace for container 95bfae788dd9dc358a9b0c1e7513c42cc39a7a040e7df75e8023bb7d9b74fe9d: failed to allocate for range 0: 172.16.16.37 has been allocated to 95bfae788dd9dc358a9b0c1e7513c42cc39a7a040e7df75e8023bb7d9b74fe9d, duplicate allocation is not allowed
This has been fixed upstream. There are presently no plans to backport the fix to v3.0. You can work around it by mounting a tmpfs on
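The path was cut off above; assuming it is /var/lib/cni/networks, the directory named in the issue title and the tmpfiles discussion, the workaround is an fstab entry along these lines:
# /etc/fstab: keep CNI IPAM reservations on a tmpfs so they vanish on reboot
tmpfs  /var/lib/cni/networks  tmpfs  rw,nosuid,nodev,mode=0755  0  0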
I'm still facing this issue in 3.0:
As @mheon pointed out, this is fixed in podman 4.
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
I did a systemctl reboot; encountered an orphan /var/lib/cni/networks/podman/10.88.0.20 file which prevented the container from starting.
Steps to reproduce the issue:
Upgraded Fedora 30 host, rebooted, tried to start a container
Describe the results you received:
Describe the results you expected:
Container starts
Additional information you deem important (e.g. issue happens only occasionally):
Output of podman version:
Output of podman info --debug:
Additional environment details (AWS, VirtualBox, physical, etc.):
In /var/lib/cni/networks/podman I have:
10.88.0.20: contains the ID of the container
last_reserved_ip.0: contains 10.88.0.20
lock
The guard file 10.88.0.20 must have been left behind from the reboot. After deleting the file, the container could start.
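For anyone on an affected version, the manual cleanup described above amounts to the following; paths and names are taken from this report, so substitute the IP from your own error message:
# List the IPAM reservations CNI left behind after the reboot
ls /var/lib/cni/networks/podman/
# 10.88.0.20  last_reserved_ip.0  lock

# Remove the stale reservation, then start the container again
rm /var/lib/cni/networks/podman/10.88.0.20
podman start freeswitch-init_1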