
Errors in log using vxlan backend under CPU load #414

Closed
ppontryagin opened this issue Mar 3, 2016 · 7 comments

ppontryagin commented Mar 3, 2016

I see the following messages in the journal when CPU usage is above 60%:

Mar 02 21:36:58 ip-10-0-2-240.eu-central-1.compute.internal sdnotify-proxy[2093]: E0302  21:36:58.118918 00001 device.go:218] Failed to receive from netlink: no buffer space available
Mar 02 21:36:59 ip-10-0-2-240.eu-central-1.compute.internal sdnotify-proxy[2093]: E0302  21:36:59.215630 00001 device.go:218] Failed to receive from netlink: no buffer space available
Mar 02 21:37:00 ip-10-0-2-240.eu-central-1.compute.internal sdnotify-proxy[2093]: E0302 21:37:00.316524 00001 device.go:218] Failed to receive from netlink: no buffer space available
Mar 02 21:37:01 ip-10-0-2-240.eu-central-1.compute.internal sdnotify-proxy[2093]: E0302 21:37:01.465853 00001 device.go:218] Failed to receive from netlink: no buffer space available

The error comes from here:
https://github.com/coreos/flannel/blob/7cb8d6b9a80632828e9569033d411361afee1816/backend/vxlan/device.go#L218

Do I need to tweak some parameters?

I have already tried tweaking the following:

sudo sysctl -w net.ipv4.tcp_rmem="10240 87380 12582912"
sudo sysctl -w net.ipv4.tcp_wmem="10240 87380 12582912"

sudo sysctl -w net.ipv4.tcp_rmem="102400 873800 125829120"
sudo sysctl -w net.ipv4.tcp_wmem="102400 873800 125829120"

sudo sysctl -w net.core.wmem_max="125829120"
sudo sysctl -w net.core.rmem_max="125829120"

sudo sysctl -w net.ipv4.tcp_window_scaling="1"
sudo sysctl -w net.ipv4.tcp_timestamps="1"

sudo sysctl -w net.ipv4.tcp_sack="1"
sudo sysctl -w net.core.netdev_max_backlog="5000"

sudo sysctl -w net.ipv4.udp_mem="102400 873800 125829120"

sudo sysctl -w net.ipv4.udp_rmem_min="10240"
sudo sysctl -w net.ipv4.udp_wmem_min="10240"

sudo sysctl -w net.core.rmem_default=524280
sudo sysctl -w net.core.rmem_max=524280
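
As far as I understand, the TCP/UDP sysctls above may not apply here: the netlink socket flannel reads from takes its default receive buffer from net.core.rmem_default (capped by net.core.rmem_max), not from the tcp_* or udp_* knobs. Assuming that is the relevant buffer, a quick check of the current defaults would be:

sudo sysctl net.core.rmem_default net.core.rmem_max

If rmem_default is still at the stock value (typically around 200 KB), raising it seems like the setting most likely to matter for this error.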
tomdee changed the title from "Errors in log" to "Errors in log using vxlan backend under CPU load" on Jun 14, 2016

tomdee commented Jun 14, 2016

I'm interested to hear if anyone else has hit this or if anyone has a repro for it.


macb commented Feb 6, 2017

We've also run into this. The box is generally at less than 5% idle CPU, so it could indeed be CPU related.


xalex84 commented Apr 5, 2017

Hi, we are also encountering the same issue.
Since we don't see the problem on all the boxes, we are analysing it to understand whether it correlates with specific pods hosted on specific boxes.

Flanneld v0.7.0
Etcd2 2.3.7
Kubernetes v1.5.2

Average CPU load of the nodes is 25%.


x8k commented Apr 6, 2017

Same problem.

Apr 06 06:47:10 node204 flanneld[2584]: E0406 06:47:10.852782    2584 device.go:222] Failed to receive from netlink: no buffer space available
Apr 06 06:51:44 node204 flanneld[2584]: E0406 06:51:44.384812    2584 device.go:222] Failed to receive from netlink: no buffer space available
Apr 06 06:56:01 node204 flanneld[2584]: E0406 06:56:01.500101    2584 device.go:222] Failed to receive from netlink: no buffer space available
Apr 06 07:04:31 node204 flanneld[2584]: E0406 07:04:31.445611    2584 device.go:222] Failed to receive from netlink: no buffer space available

Ubuntu 16.04 on baremetal x86_64
Flanneld v0.6.2 and Flanneld v0.7.0

It is not related to the pods: nodes without pods produce the errors, while nodes with pods do not.

Node without pod and with errors:
%Cpu(s): 0.3 us, 0.8 sy, 0.0 ni, 98.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
Node without pod and without errors:
%Cpu(s): 19.1 us, 2.0 sy, 0.0 ni, 77.2 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st

It seems to be a (too small) default buffer size on the netlink socket that handles communication between the kernel and the userspace process.
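
One way to check this directly, assuming the drops happen on flanneld's netlink socket, is to look at the per-socket netlink counters the kernel exposes:

cat /proc/net/netlink

The Rmem column shows the current receive-buffer usage of each netlink socket, and the Drops column counts messages the kernel could not queue because the buffer was full; a non-zero Drops value for flanneld's socket would line up with the "no buffer space available" errors.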


zihaoyu commented Jul 16, 2017

Seeing the same thing in #779. Our cluster is about 250 to 300 nodes.


zihaoyu commented Jul 17, 2017

core@ip-10-72-148-29 ~ $ ifconfig flannel.1
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 8951
        inet 10.6.171.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::6c89:a7ff:febd:578f  prefixlen 64  scopeid 0x20<link>
        ether 6e:89:a7:bd:57:8f  txqueuelen 0  (Ethernet)
        RX packets 56700389  bytes 74290145890 (69.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 105960931  bytes 79748645904 (74.2 GiB)
        TX errors 0  dropped 22871 overruns 0  carrier 0  collisions 0

Packet drops are seen on minions in a ~270-node cluster.
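
If it helps correlate the drops with the log errors, the same counters can be watched without ifconfig:

ip -s link show flannel.1

The "dropped" field on the TX line there should match the count ifconfig reports.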


x8k commented Aug 29, 2017

Just for information: these parameters make the log entries go away by increasing the netlink buffer.

net.core.rmem_default = 524280
net.ipv4.udp_rmem_min = 10240
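
In case it is useful, one way to make those two values persist across reboots on a sysctl.d-based distro (the file name below is just illustrative):

cat <<'EOF' | sudo tee /etc/sysctl.d/90-flannel-netlink.conf
net.core.rmem_default = 524280
net.ipv4.udp_rmem_min = 10240
EOF
sudo sysctl --system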
