MTU issue with IPIP and host-networking #1709
Comments
I solved this by lowering Calico's MTU.
@squeed I guess that's why the default Calico manifest uses that setting; I'm taking the v3.0 calico.yaml spec as an example. I wish there were documentation somewhere stating why those settings were chosen, otherwise people might hit the issue you describe.
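For anyone landing here, a minimal sketch of how one might lower the MTU Calico gives to pods. The key names are assumptions based on the self-hosted manifests (veth_mtu appears in newer calico.yaml versions; older ones hard-code an mtu field in the CNI config), not something confirmed in this thread:

```sh
# Sketch only: lower the pod/veth MTU in a manifest-based install.
kubectl -n kube-system patch configmap calico-config \
  --type merge -p '{"data":{"veth_mtu":"1440"}}'

# Inspect the CNI config rendered on a node to confirm the value:
grep -i mtu /etc/cni/net.d/10-calico.conflist

# Existing pods keep their old MTU; restart calico-node (and recreate pods)
# for the change to take effect.
kubectl -n kube-system rollout restart daemonset/calico-node
```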
It doesn't matter what the MTU is, because whatever value the pods have will be stored in the hosts' cache for that host. As an experiment, I set the pod's MTU to 1460 while the MTU of tunl0 was 1480. Because of the masquerading, the route cache used the lower value:
Both IPs are on normal 1500-byte interfaces. The MTU cache "should" show 1500.
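A small, hedged example of how the cached per-destination MTU can be inspected; the addresses are placeholders and the exact output depends on the kernel and iproute2 version:

```sh
# Ask the kernel what it would use toward a given destination; if a PMTU
# exception has been learned, an "mtu ..." attribute shows up in the output.
ip route get 10.2.30.5
# e.g. 10.2.30.5 via 10.2.64.1 dev tunl0 src 10.2.64.5 cache expires 594sec mtu 1460

# Clear cached exceptions while experimenting:
ip route flush cache
```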
If Linux is not sending the ICMP messages needed for PMTU discovery, then is it a matter of ensuring the ip_no_pmtu_disc and/or ip_forward_use_pmtu sysctls are set properly?
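For reference, a sketch of checking the two sysctls named above; the values shown are the usual defaults, not a recommendation from this thread:

```sh
# 0 means PMTU discovery is performed normally.
sysctl net.ipv4.ip_no_pmtu_disc
# When set to 1, forwarded traffic also honours cached PMTU values
# (the default is 0, i.e. forwarding uses the interface MTU).
sysctl net.ipv4.ip_forward_use_pmtu
# Example of flipping the forwarding behaviour for a test:
sysctl -w net.ipv4.ip_forward_use_pmtu=1
```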
The problem is more subtle; Linux is managing PMTU correctly. The problem is that the same IP address (due to the masquerade) has a variable MTU. This compounds with its use as a tunnel endpoint. I haven't tried disabling PMTU entirely. That might work, but it almost certainly causes more problems :-)
@squeed What kernel version were you using when you were testing this? I've been trying to reproduce what you were seeing and have not been able to yet. I attempted to check the MTU cache values like you showed and was unable to. After some googling to figure out why I could not get any MTU cache output, I looked at the
@tmjd it was a recent kernel version, since I was running CoreOS stable. I don't have it off-hand. I'll spin up another cluster and try and repro. So, recent Linux kernels don't have a route-cache, that's true (they just have an efficient prefix-tree). However, they do maintain something called the "exception cache," where they store things like MTU overrides. So we're still hitting that path.
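To illustrate the distinction being made here, a hedged example: on kernels since 3.6 the old IPv4 route cache listing is empty, but per-destination exceptions still show up via ip route get (the address is a placeholder):

```sh
# The classic cache listing prints nothing on recent kernels:
ip route show cache
# ...but a learned MTU override (a "route exception") is still visible per destination:
ip route get 10.2.30.5
```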
Is there anything special you did to get cache output from
I've also tried using . What is your testing environment? So far I've tried GCE using Ubuntu and a local Vagrant setup with CoreOS.
My testing environment is the CoreOS Tectonic installer running on a few VirtualBox machines. Nothing particularly special.
I came across this post when solving a recent AWS+CoreOS+k8s issue. This sounded like a different, Calico-specific issue. But now that @squeed mentions CoreOS, this could be related to my issue, which I documented and resolved over on the most excellent kubernetes-retired/kube-aws#1349
In my cluster I'm trying to figure out how to set a different MTU for different nodes with Calico in the CNI config. Is there a way to do that at all?
@whereisaaron @dimm0 this issue isn't about the MTU of the underlying interface (though that is an interesting problem). This is specifically about the design of Calico causing inconsistent MTU caching and unreachability within the overlay network. I do want to make sure this particular issue doesn't become a dumping ground for all kinds of MTU weirdness.
Some other people think that's the same issue I'm having (#2026), but yeah, I agree.
@squeed Could you try to recreate this issue with the latest
I've just run into the same issue. It seems that starting the Kubernetes Nginx Ingress Controller in network=host mode causes the same problems. In my case, lowering the tunl0 MTU from 1440 to 1300 did the job and solved the problem.
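A rough sketch of the kind of change described above; this is a per-node, non-persistent tweak, and 1300 is just the value that happened to work for this commenter:

```sh
# Lower the IPIP tunnel MTU on one node (lost on reboot / calico-node restart):
ip link set dev tunl0 mtu 1300
ip link show tunl0 | grep mtu
# For a lasting fix the MTU should instead be set in the Calico/CNI configuration,
# so pod veth interfaces get a matching (or smaller) MTU as well.
```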
In case somebody wants to reproduce the bug: I deployed my Kubernetes cluster on Scaleway's Fedora 28 with the latest Kubespray, then deployed the ingress controller using the Helm chart (https://github.com/kubernetes/charts/tree/master/stable/nginx-ingress). Then you can just deploy any pod exposing a REST endpoint and generate output larger than the MTU. If you try to curl the pod endpoint you will see the client waiting forever for a response. Sniffing network traffic confirms that the client receives only part of the response and then waits for the rest.
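A hedged way to confirm an MTU black hole like this without deploying anything: send DF-marked pings just under and over the suspected path MTU (addresses and sizes below are illustrative):

```sh
# 28 bytes of IP+ICMP headers are added to the -s payload size.
ping -c 3 -M do -s 1412 10.233.64.7   # 1440-byte packets: expected to succeed
ping -c 3 -M do -s 1472 10.233.64.7   # 1500-byte packets: fail if the path MTU is lower
# tracepath reports the discovered PMTU hop by hop:
tracepath 10.233.64.7
```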
@hekonsek I am also facing this intermittent problem in a 12-node prod cluster. CoreDNS is working but the ingress controller and dashboard can't talk to the Kubernetes svc. I didn't face this issue in a small cluster of 4 nodes. I will try changing the MTU and see if it works.
@anjuls In my case it was a 3-node cluster.
@hekonsek I managed to fix my cluster.
@hekonsek: I'm having the same issues with a similar setup: a 1+3 node cluster on top of a WireGuard VPN using the Calico CNI. The k8s version is 1.11, installed with kubeadm. All nodes run Debian Stretch. I managed to reproduce it by making a packet capture; in Wireshark I "followed" the TCP stream and saw the size of the data. In my case it is 1868.
I'm kind of a noob in the networking area, so my question is: how can I determine the proper MTU value in my case, when I also tunnel traffic via the WireGuard VPN? I've found [1], which talks about similar issues. A second point I would like to raise is that this issue should be mentioned in the Calico installation documentation. This is how my interfaces look: I have different MTU values for WireGuard, Calico and the tunnel.
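Not authoritative, but a back-of-the-envelope budget for this kind of stacked-tunnel setup, assuming common overheads (roughly 60-80 bytes for WireGuard, 20 bytes for IP-in-IP); the peer IP below is a placeholder:

```sh
# 1500 (physical) - 80 (WireGuard worst case) = 1420 for wg0
# 1420 (wg0)      - 20 (IPIP)                 = 1400 for tunl0 and pod interfaces
# Verify empirically across the VPN with DF-marked pings:
ping -c 3 -M do -s 1372 10.0.1.2    # 1372 + 28 header bytes = 1400
```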
We are facing the same issues. Any update on this?
I believe this series of kernel changes will fix this: https://www.mail-archive.com/[email protected]/msg345225.html
In my case, running
Any update on this issue? I ran into a similar problem on VMs, with the running calico-node pod not ready, which looks like
@Davidrjx try executing
Thanks, and sorry for the late reply.
I set the tunl0 and veth MTU to 1480, the host device MTU to 1500, and /proc/sys/net/ipv4/ip_no_pmtu_disc = 0. One day the network link changed and a "fragmentation needed, MTU 1330" ICMP error was sent, so the route cache was updated (10.200.40.21 via 10.200.114.1 dev bond0.114 src 10.200.114.198), but the tunl0 ipip route was not updated (172.17.248.241 via 10.200.40.21 dev tunl0 src 172.17.84.128). The containers then still sent packets with an MTU of 1480, and big packets were dropped. I changed the tunl0 pmtudisc attribute with "ip tunnel change tunl0 mode ipip pmtudisc", and then the tunl0 ipip route was updated. Why does Calico not set pmtudisc when setting up the ipip device?
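For clarity, the commands this comment describes, reconstructed as a hedged sketch (the address comes from the comment above; output depends on the kernel):

```sh
# Let the IPIP device participate in path MTU discovery so it reacts to
# ICMP "fragmentation needed" errors:
ip tunnel change tunl0 mode ipip pmtudisc
# Afterwards the cached route toward the remote pod should reflect the lower MTU:
ip route get 172.17.248.241
```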
The problem: When using Calico on Kubernetes with some host-networking pods, the Linux MTU cache results in unreachability.
This is due to IP masquerading when accessing destinations outside the ClusterCIDR, along with services running in host networking.
Setup Details:
Consider a Kubernetes cluster running Calico. Accordingly, the Calico daemon is running on every node and configures the calico device (tunl0) with an IP within that node's PodCIDR. PodCIDRs are chosen from the ClusterCIDR of 10.2.0.0/16. Because Calico uses ip-in-ip encapsulation, all of the pods (and the tunl0 interface) have an MTU of 1480.
The problem:
In other words, packets over 1460 bytes in size will be silently dropped for all pods between A and B.
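A hedged reconstruction of the sequence the report describes, pieced together from the comments above; the host names and addresses are hypothetical:

```sh
# Hypothetical hosts: A = 192.168.1.11, B = 192.168.1.12; pod/tunl0 MTU 1480.
# 1. A host-networking process on B exchanges traffic with a pod on A; because of
#    masquerading, the pod is reached via A's node IP.
# 2. When B sends full-size (1500-byte) packets, A must forward them into a
#    1480-MTU pod interface and replies with ICMP "frag needed, MTU 1480",
#    so B caches a PMTU exception of 1480 for A's node IP:
ip route get 192.168.1.11
#    192.168.1.11 dev eth0 src 192.168.1.12 cache mtu 1480
# 3. A's node IP is also the IPIP tunnel endpoint, so encapsulated pod-to-pod
#    traffic from B to A is capped at 1480 outer bytes, i.e. 1460 bytes of
#    inner packet, and anything larger is silently dropped.
```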