MTU issue with IPIP and host-networking #1709

Closed
squeed opened this issue Feb 26, 2018 · 27 comments

@squeed

squeed commented Feb 26, 2018

The problem: When using Calico on Kubernetes with some host-networking pods, the Linux MTU cache results in unreachability.

This is due to IP masquerading of traffic to destinations outside the ClusterCIDR, combined with services running in host networking.

Setup Details:

Consider a Kubernetes cluster running Calico. Accordingly, the Calico daemon is running on every node and configures the calico device (tunl0) with an IP within that node's PodCIDR. PodCIDRs are chosen from the ClusterCIDR of 10.2.0.0/16.

Because Calico uses ip-in-ip encapsulation, all of the pods (and the tunl0 interface) have an MTU of 1480.

  • Host A (10.1.1.50) is a Kubernetes node. The calico daemon has set up the tunnel and given it a Calico IP of 10.2.0.1
    • Pod A1 is on host networking, so has a PodIP of 10.1.1.10
    • Pod A2 has a Calico IP, so has a PodIP of 10.2.0.2
  • Host B (10.1.1.51) has a single pod:
    • Pod B1 uses Calico, and has a PodIP of 10.2.1.2

The problem:

  1. Pod B1 opens a connection to pod A1 (on host networking). A TCP SYN is sent to 10.1.1.50 (the HostIP of host A) with an MSS of 1460 (the pod's eth0 MTU of 1480 less TCP overhead).
  2. Host B masquerades the source IP to that of its outgoing interface, 10.1.1.51.
  3. Host A sees a SYN from 10.1.1.51 with a TCP MSS of 1460. It stores 1480 in its route cache's MTU field for 10.1.1.51. The connection proceeds normally and is closed.
  4. Pod B1 opens a connection to pod A2. A SYN is sent to 10.2.0.2, and the connection is established over the ipip tunnel.
  5. A2 tries to send a large response, which is broken into 1480-byte packets. The DF bit is set, since this is TCP. The packets leave the pod and go to the host.
  6. Host A tries to encapsulate each packet, adding 20 bytes of overhead.
  7. The packet, now 1500 bytes, exceeds the cached MTU of 1480 for its destination IP, 10.1.1.51, and is dropped. Linux does not generate an ICMP "Packet too big" message.

In other words, packets over 1460 bytes in size will be silently dropped for all pods between A and B.
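
For anyone who wants to observe this, a minimal check (using the example addresses above; the interface name is illustrative) is to look at host A's exception cache for host B's IP before and after the host-networked connection from B1:

# on host A (10.1.1.50), before any connection from pod B1:
ip route get 10.1.1.51
# 10.1.1.51 dev eth0 src 10.1.1.50
#     cache

# after pod B1 (masqueraded to 10.1.1.51) has talked to the host-networked pod A1,
# the same command should show the cached exception created in step 3, e.g.:
#     cache expires ...sec mtu 1480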

@fxpester

I solved this by lowering Calico's MTU:
calicoctl config set --raw=felix IpInIpMtu 1450
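
(That is the pre-v3 calicoctl syntax. If I understand the v3 resource model correctly, the equivalent is patching the default FelixConfiguration; treat the field name below as an assumption and verify it against the docs for your Calico version.)

# Calico v3.x equivalent (sketch; ipipMTU field assumed from the v3 FelixConfiguration reference)
calicoctl patch felixconfiguration default -p '{"spec":{"ipipMTU":1450}}'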

@unicell
Contributor

unicell commented Feb 28, 2018

@squeed I guess that's why the default Calico manifest uses 1440 for the IPIP MTU.

https://github.com/projectcalico/calico/blob/v3.0.3/v3.0/getting-started/kubernetes/installation/hosted/kubeadm/1.7/calico.yaml#L230-L232

I'm taking the v3.0 calico.yaml spec as an example. I wish there were documentation somewhere explaining why those settings were chosen; otherwise people might be hitting the same issue as yours.
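
(For reference, the linked lines set the MTU through a Felix environment variable on the calico-node DaemonSet. A quick way to check what a running cluster is actually using, assuming the manifest follows that layout, is:)

kubectl -n kube-system get daemonset calico-node -o yaml | grep -A1 FELIX_IPINIPMTU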

@squeed
Author

squeed commented Mar 1, 2018

It doesn't matter what the MTU is, because whatever value the pods use will end up stored in the other host's cache for that host's IP.

As an experiment, I set the pod's MTU to be 1460, while the MTU of the tunl0 was 1480. Because of the masquerading, the route cache used the lower value:

core@master1 ~ $ ip route get 10.1.1.50
10.1.1.50 dev ens3 src 10.1.1.10 uid 500 
    cache expires 323sec mtu 1460 

Both IPs are on normal 1500-byte interfaces; the MTU cache "should" show 1500.

@detiber

detiber commented May 24, 2018

If Linux is not sending the ICMP messages needed for pmtu discovery, then is it a matter of ensuring the ip_no_pmtu_disc and/or ip_forward_use_pmtu sysctls are set properly?
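
(For anyone checking those knobs, both can be read in one go; the values in the comments are what I'd expect on a stock kernel, so verify on your distro:)

sysctl net.ipv4.ip_no_pmtu_disc net.ipv4.ip_forward_use_pmtu
# net.ipv4.ip_no_pmtu_disc = 0      (0 = PMTU discovery stays enabled)
# net.ipv4.ip_forward_use_pmtu = 0  (0 = the forwarding path ignores cached PMTU values)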

@squeed
Author

squeed commented May 25, 2018

The problem is more subtle; it is about managing PMTU correctly. The issue is that the same IP address (due to the masquerade) has a variable MTU, which compounds with its use as a tunnel endpoint.

I haven't tried disabling PMTU entirely. That might work, but it would almost certainly cause more problems :-)

@tmjd
Member

tmjd commented Jun 11, 2018

@squeed What kernel version were you using when you were testing this? I've been trying to reproduce what you were seeing and have not been able to yet. I attempted to check the MTU cache values like you showed and was unable to. After some googling to figure out why I could not get any MTU cache output, I looked at the ip-route man page, which says "Starting with Linux kernel version 3.6, there is no routing cache for IPv4 anymore." (Hence my question about kernel version.)

@squeed
Author

squeed commented Jun 11, 2018

@tmjd it was a recent kernel version, since I was running CoreOS stable; I don't have it offhand. I'll spin up another cluster and try to repro.

So, recent Linux kernels don't have a route-cache, that's true (they just have an efficient prefix-tree). However, they do maintain something called the "exception cache," where they store things like MTU overrides. So we're still hitting that path.

@tmjd
Member

tmjd commented Jun 11, 2018

Is there anything special you did to get cache output from ip route get...? I've tried both coreos (1576.4.0) and Ubuntu 16.04 and both produce output like the following when using the commands you were suggesting.

core@k8s-node-02 ~ $ ip route get 172.18.18.102
172.18.18.102 dev eth1 src 172.18.18.103 uid 500 
    cache 

I've also tried using netstat -eCr and get no cache information. (I've also tried the commands with sudo in case it was a permissions issue.)

What is your testing environment? So far I've tried in GCE using Ubuntu and a local Vagrant setup with Coreos.

@squeed
Author

squeed commented Jun 11, 2018

ip route get <dest> will only show an MTU if there is an exception for that individual destination.

My testing environment is the CoreOS Tectonic installer running on a few VirtualBox machines. Nothing particularly special.

@whereisaaron
Contributor

whereisaaron commented Jun 11, 2018

I came across this post when solving a recent AWS+CoreOS+k8s issue. This sounded like a different, Calico-specific issue, but now that @squeed mentions CoreOS, it could be related to my issue, which I documented and resolved over on the most excellent kube-aws project. Although I focus on the VPC-level issues there, I also noticed it causes Calico and Flannel to end up with mismatched configurations.

kubernetes-retired/kube-aws#1349

CoreOS 1745.3.1 and 1745.4.0 include a networkd bug that causes problems for clusters with mixed instance types (e.g. T2 and M3/4/5). This is fixed in 1745.5.0 (stable).

All the 'current' AWS instance types support jumbo frames (MTU = 9001). This is set via DHCP; however, networkd in these CoreOS versions fails to apply it, leaving the instances at their default MTU. While T2 instances support MTU=9001, they appear to default to MTU=1500. This leaves you with different nodes in the cluster having different MTUs.

Clients of TCP load balancers will get PMTU errors, thinking the PMTU is 8951 or 1500 when it is actually 1450. You'll tend to get MTU-related hangs or disconnections when connections head to T2 worker nodes, due to the incorrect MTU.

If you have T2 nodes for your control plane and you upgrade to these versions (1745.3.1 and 1745.4.0), you'll likely see all your workers go 'NotReady' and appear to stop reporting state to the controllers via the API load balancer. In reality the controller MTU has suddenly gone from 9001 to 1500, and it takes a while for the load balancer and worker nodes to work this out. In my experience the workers recover in about 10 minutes.
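
(A quick sketch for spotting this kind of mismatch; interface names vary by instance type and image, so eth0 here is an assumption:)

# on each node: what MTU did the primary interface end up with?
ip link show eth0 | grep -o 'mtu [0-9]*'

# temporary workaround on an affected node until the fixed Container Linux release is rolled out
sudo ip link set dev eth0 mtu 9001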

@dimm0

dimm0 commented Jun 11, 2018

In my cluster I'm trying to figure out how to set a different MTU for different nodes with Calico in the CNI config. Is there a way to do that at all?

@squeed
Author

squeed commented Jun 12, 2018

@whereisaaron @dimm0 this issue isn't about the MTU of the underlying interface (though that is an interesting problem). This is specifically about the design of Calico causing inconsistent MTU caching and unreachability within the overlay network. I do want to make sure this particular issue doesn't become a dumping ground for all kinds of MTU weirdness.

@dimm0

dimm0 commented Jun 12, 2018

Some other people think that's the same issue I'm having (#2026), but yeah, I agree.

@saumoh
Contributor

saumoh commented Jun 15, 2018

@squeed Could you try to recreate this issue with the latest CoreOS stable?
I tried with the following version but could not reproduce the scenario where host A sets an MTU of 1460 for host B in the route "cache":

$ cat /etc/lsb-release 
DISTRIB_ID="Container Linux by CoreOS"
DISTRIB_RELEASE=1745.6.0
DISTRIB_CODENAME="Rhyolite"
DISTRIB_DESCRIPTION="Container Linux by CoreOS 1745.6.0 (Rhyolite)"
core@k8s-master ~ $ uname -r
4.14.48-coreos-r1

@hekonsek

I've just run into the same issue. It seems that starting the Kubernetes NGINX Ingress Controller in host-network mode causes the same problems. In my case, lowering the tunl0 MTU from 1440 to 1300 did the job and solved the problem.
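
(If it helps anyone, the value can be tried out on a live node before changing the Calico configuration. This is only a quick experiment, and Felix may well reset a hand-edited value, so make the permanent change through the Calico MTU settings discussed above.)

# temporary, per-node experiment only; Felix may revert it
sudo ip link set dev tunl0 mtu 1300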

@hekonsek

hekonsek commented Jun 22, 2018

In case somebody wants to reproduce the bug: I deployed my Kubernetes cluster on Scaleway's Fedora 28 with the latest Kubespray, then deployed the ingress controller using the Helm chart (https://github.com/kubernetes/charts/tree/master/stable/nginx-ingress) with the controller.hostNetwork option set to true.

Then you can just deploy any pod exposing a REST endpoint that generates output larger than the MTU. If you try to curl the pod's endpoint you will see the client waiting forever for a response. Sniffing the network traffic confirms that the client receives only part of the response and then waits for the rest.
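
Roughly, that boils down to something like the following (Helm 2 syntax; the release name and test URL are placeholders, not taken from my actual setup):

# ingress controller on the host network (placeholder release name)
helm install stable/nginx-ingress --name mtu-test --set controller.hostNetwork=true

# then curl any endpoint whose response is larger than the MTU; the client hangs waiting for the tail of the body
curl -v http://<ingress-host>/<large-response-endpoint>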

@anjuls

anjuls commented Jul 6, 2018

@hekonsek I am also facing this intermittent problem in a 12-node prod cluster. CoreDNS is working, but the ingress controller and dashboard can't talk to the Kubernetes service. I didn't face this issue in a small cluster of 4 nodes. I will try changing the MTU and see if it works.

@hekonsek

hekonsek commented Jul 6, 2018

@anjuls In my case it was a 3-node cluster.

@anjuls

anjuls commented Jul 13, 2018

@hekonsek I managed to fix my cluster.

  • I switched to the latest Calico version, v3.1.3.
  • I also moved my etcd outside Kubernetes instead of using the Calico etcd. I have 3 master nodes and found that the calico-etcd instances were not clustered; the YAML doesn't support HA, so all three etcd instances were running independently, causing random network-related problems in the cluster.
    Now all services are working fine.

@ieugen

ieugen commented Jul 15, 2018

@hekonsek: I'm having the same issues with a similar setup: a 1+3 node cluster on top of a WireGuard VPN using the Calico CNI. The k8s version is 1.11, installed with kubeadm. All nodes run Debian Stretch.

I managed to reproduce it by making a packet capture; in Wireshark I "followed" the TCP stream and saw the size of the data. In my case it is 1868.
Any response (request?) of 1868 bytes or more causes a gateway timeout on ingress-nginx.
To reproduce it in my case, I saved the Wireshark data and used curl:

curl -X POST --data @1867-bytes-of-data-work.log https://test-svc.example.com   # this works
curl -X POST --data @1868-bytes-of-data-fail.log https://test-svc.example.com   # this fails

I'm kind of a noob in the networking area, so my question is: how can I determine the proper MTU value in my case, when traffic is also tunnelled via the WireGuard VPN? I've found [1], which talks about similar issues.
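
For the MTU arithmetic question: as far as I understand it, each layer of encapsulation just subtracts its header overhead from the link below it, so with the interfaces shown further down the numbers work out roughly like this (the WireGuard overhead figure is an approximation):

eth0 MTU                      1500
WireGuard overhead (IPv4)    ~  60   -> wg0 could be up to ~1440; the 1420 default leaves room for IPv6 endpoints
IPIP overhead                   20   -> tunl0 over wg0 should be at most 1420 - 20 = 1400
pod/veth MTU                         -> should match tunl0, i.e. at most 1400 (the 1440/1500 values shown below are too big)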

A second point that I would like to raise is that this issue should be mentioned in the Calico installation documentation.

This is what my interfaces look like: I have different MTU values for WireGuard, Calico and the tunnel.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
3: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
5: cali9cafa0a893e@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
6: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1440 qdisc noqueue state UNKNOWN group default qlen 1
    link/ipip 0.0.0.0 brd 0.0.0.0
    inet 192.168.3.1/32 brd 192.168.3.1 scope global tunl0
7: caliad17b2e6582@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
9: califcc50f7010f@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 

[1] StreisandEffect/streisand#1089

@boranx

boranx commented Oct 16, 2018

We are facing the same issues. Any update on this?

bossjones added a commit to bossjones/bosslab-playbooks that referenced this issue Mar 8, 2019
bossjones added a commit to bossjones/bosslab-playbooks that referenced this issue Mar 9, 2019
@sampatms reopened this Feb 15, 2020
@squeed
Author

squeed commented Oct 5, 2020

I believe this series of kernel changes will fix this: https://www.mail-archive.com/[email protected]/msg345225.html

@ocherfas

In my case, running ip route flush cache on the machine that drops the packets solved it temporarily. After a day the problem seems to come back.

@Davidrjx

Any update on this issue? I ran into a similar problem on VMs, with the calico-node pod not becoming ready; it looks like:

...
 Warning  Unhealthy  20s (x382 over 63m)  kubelet, k-node-master  (combined from similar events): Readiness probe failed: Threshold time for bird readiness check:  30s
calico/node is not ready: BIRD is not ready: BGP not established with 10.246.*.12,10.246.*.13
2020-12-16 02:15:33.430 [INFO][5396] readiness.go 88: Number of node(s) with BGP peering established = 0
...

@enginious-dev

@Davidrjx try executing:
sudo ufw allow 179/tcp comment "Calico networking (BGP)"
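
(A quick way to confirm that port 179 was the problem, assuming calicoctl is installed on the node; <peer-ip> is a placeholder for one of the masked peer addresses above:)

# is the BGP port reachable from this node?
nc -zv <peer-ip> 179

# does Calico now report the BGP sessions as Established?
sudo calicoctl node status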

@Davidrjx

Davidrjx commented Apr 6, 2021

@Davidrjx try executing:
sudo ufw allow 179/tcp comment "Calico networking (BGP)"

Thanks, and sorry for the late reply.

@gaopeiliang
Contributor

gaopeiliang commented Aug 25, 2021

I set the tunl0 and veth MTU to 1480 and the host device MTU to 1500, with /proc/sys/net/ipv4/ip_no_pmtu_disc = 0. One day the network path changed and a "fragmentation needed, MTU 1330" ICMP error was received.

The route cache was updated:

10.200.40.21 via 10.200.114.1 dev bond0.114 src 10.200.114.198
cache expires 597sec mtu 1330

but the tunl0 IPIP route was not updated:

172.17.248.241 via 10.200.40.21 dev tunl0 src 172.17.84.128
cache expires 455sec mtu 1480

so containers still send packets sized for an MTU of 1480, and big packets get dropped.

I changed the tunl0 pmtudisc attribute with ip tunnel change tunl0 mode ipip pmtudisc,

and then the tunl0 IPIP route was updated:
172.17.248.241 via 10.200.40.21 dev tunl0 src 172.17.84.128
cache expires 455sec mtu 1310

Why does Calico not set pmtudisc when setting up IPIP devices?
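
(For anyone checking whether their tunnel already has that flag, the detailed link output shows it; the exact wording may differ between iproute2 versions:)

ip -d link show tunl0
# look for "pmtudisc" (rather than "nopmtudisc") in the ipip details line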

@caseydavenport closed this as not planned (won't fix, can't repro, duplicate, stale) Apr 16, 2024