
Errors in log using vxlan backend under CPU load #414

Closed
ppontryagin opened this issue Mar 3, 2016 · 7 comments

ppontryagin commented Mar 3, 2016

I see the following messages in the journal when CPU usage is above 60%:

Mar 02 21:36:58 ip-10-0-2-240.eu-central-1.compute.internal sdnotify-proxy[2093]: E0302  21:36:58.118918 00001 device.go:218] Failed to receive from netlink: no buffer space available
Mar 02 21:36:59 ip-10-0-2-240.eu-central-1.compute.internal sdnotify-proxy[2093]: E0302  21:36:59.215630 00001 device.go:218] Failed to receive from netlink: no buffer space available
Mar 02 21:37:00 ip-10-0-2-240.eu-central-1.compute.internal sdnotify-proxy[2093]: E0302 21:37:00.316524 00001 device.go:218] Failed to receive from netlink: no buffer space available
Mar 02 21:37:01 ip-10-0-2-240.eu-central-1.compute.internal sdnotify-proxy[2093]: E0302 21:37:01.465853 00001 device.go:218] Failed to receive from netlink: no buffer space available

The error comes from here:
https://github.com/coreos/flannel/blob/7cb8d6b9a80632828e9569033d411361afee1816/backend/vxlan/device.go#L218

Do I need to tweak some parameters?

I have already tried tweaking the following:

sudo sysctl -w net.ipv4.tcp_rmem="10240 87380 12582912"
sudo sysctl -w net.ipv4.tcp_wmem="10240 87380 12582912"

sudo sysctl -w net.ipv4.tcp_rmem="102400 873800 125829120"
sudo sysctl -w net.ipv4.tcp_wmem="102400 873800 125829120"

sudo sysctl -w net.core.wmem_max="125829120"
sudo sysctl -w net.core.rmem_max="125829120"

sudo sysctl -w net.ipv4.tcp_window_scaling="1"
sudo sysctl -w net.ipv4.tcp_timestamps="1"

sudo sysctl -w net.ipv4.tcp_sack="1"
sudo sysctl -w net.core.netdev_max_backlog="5000"

sudo sysctl -w net.ipv4.udp_mem="102400 873800 125829120"

sudo sysctl -w net.ipv4.udp_rmem_min="10240"
sudo sysctl -w net.ipv4.udp_wmem_min="10240"

sudo sysctl -w net.core.rmem_default=524280
sudo sysctl -w net.core.rmem_max=524280
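
As far as I understand, the TCP/UDP sysctls above may not apply here: the netlink socket flannel reads from takes its default receive buffer from net.core.rmem_default (capped by net.core.rmem_max), not from the tcp_* or udp_* knobs. Assuming that is the relevant buffer, a quick check of the current defaults would be:

sudo sysctl net.core.rmem_default net.core.rmem_max

If rmem_default is still at the stock value (typically around 200 KB), raising it seems like the setting most likely to matter for this error.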
tomdee changed the title from "Errors in log" to "Errors in log using vxlan backend under CPU load" on Jun 14, 2016

tomdee commented Jun 14, 2016

I'm interested to hear if anyone else has hit this or if anyone has a repro for it.


macb commented Feb 6, 2017

We've also run into this. The box is generally at less than 5% idle CPU, so it could indeed be CPU related.


xalex84 commented Apr 5, 2017

Hi, we are also encountering the same issue.
Since we don't see the problem on all the boxes, we are analysing it to understand whether it correlates with specific pods hosted on specific boxes.

Flanneld v0.7.0
Etcd2 2.3.7
Kubernetes v1.5.2

Average CPU load of the nodes is 25%.


x8k commented Apr 6, 2017

Same problem.

Apr 06 06:47:10 node204 flanneld[2584]: E0406 06:47:10.852782    2584 device.go:222] Failed to receive from netlink: no buffer space available
Apr 06 06:51:44 node204 flanneld[2584]: E0406 06:51:44.384812    2584 device.go:222] Failed to receive from netlink: no buffer space available
Apr 06 06:56:01 node204 flanneld[2584]: E0406 06:56:01.500101    2584 device.go:222] Failed to receive from netlink: no buffer space available
Apr 06 07:04:31 node204 flanneld[2584]: E0406 07:04:31.445611    2584 device.go:222] Failed to receive from netlink: no buffer space available

Ubuntu 16.04 on baremetal x86_64
Flanneld v0.6.2 and Flanneld v0.7.0

It is not related to the pods: nodes without pods produce the errors, while nodes with pods do not.

Node without pod and with errors:
%Cpu(s): 0.3 us, 0.8 sy, 0.0 ni, 98.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
Node without pod and without errors:
%Cpu(s): 19.1 us, 2.0 sy, 0.0 ni, 77.2 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st

It seems to be a (too small) default buffer size on the netlink socket that handles communication between the kernel and the userspace process.
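
One way to check this directly, assuming the drops happen on flanneld's netlink socket, is to look at the per-socket netlink counters the kernel exposes:

cat /proc/net/netlink

The Rmem column shows the current receive-buffer usage of each netlink socket, and the Drops column counts messages the kernel could not queue because the buffer was full; a non-zero Drops value for flanneld's socket would line up with the "no buffer space available" errors.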


zihaoyu commented Jul 16, 2017

Seeing the same thing in #779. Our cluster is about 250 to 300 nodes.


zihaoyu commented Jul 17, 2017

core@ip-10-72-148-29 ~ $ ifconfig flannel.1
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 8951
        inet 10.6.171.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::6c89:a7ff:febd:578f  prefixlen 64  scopeid 0x20<link>
        ether 6e:89:a7:bd:57:8f  txqueuelen 0  (Ethernet)
        RX packets 56700389  bytes 74290145890 (69.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 105960931  bytes 79748645904 (74.2 GiB)
        TX errors 0  dropped 22871 overruns 0  carrier 0  collisions 0

Packet drops are seen on minions in a ~270-node cluster.
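
If it helps correlate the drops with the log errors, the same counters can be watched without ifconfig:

ip -s link show flannel.1

The "dropped" field on the TX line there should match the count ifconfig reports.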


x8k commented Aug 29, 2017

Just for information: these parameters make the log entries go away by increasing the netlink buffer.

net.core.rmem_default = 524280
net.ipv4.udp_rmem_min = 10240
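
In case it is useful, one way to make those two values persist across reboots on a sysctl.d-based distro (the file name below is just illustrative):

cat <<'EOF' | sudo tee /etc/sysctl.d/90-flannel-netlink.conf
net.core.rmem_default = 524280
net.ipv4.udp_rmem_min = 10240
EOF
sudo sysctl --system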
