Disable tx and rx offloading on VXLAN interfaces #1282

aojea · 2020-04-13T09:55:30Z

Description

Seems that some kernel versions have issues with VXLAN checksum
offloading, causing that flannel stop to work on some scenarios
where the traffic is encapsulated, but the checksum is wrong and is
discarded by the receiver.

A known workaround that works is disabling offloading on the flannel
interface:

ethtool --offload flannel.1 rx off tx off

This PR force to disable it always on VXLAN interfaces

Todos

Tests
Documentation
Release note

Release Note

disable rx and tx checksum offloading on VXLAN interfaces

aojea · 2020-04-13T09:57:41Z

Fixes #1282

Seems that some kernel versions have issues with VXLAN checksum offloading, causing that flannel stop to work on some scenarios where the traffic is encapsulated, but the checksum is wrong and is discarded by the receiver. A known workaround that works is disabling offloading on the flannel interface: ethtool --offload flannel.1 rx off tx off Flannel disables tx and rx offloading on VXLAN interfaces.

aojea · 2020-04-13T15:09:38Z

ethtool code from https://github.com/ligato/vpp-agent/blob/master/plugins/linux/ifplugin/linuxcalls/ethtool_linuxcalls.go

Capitrium · 2020-04-13T18:09:43Z

@aojea I just tested this on my dev cluster, still running into #1243. Seems like this might be more related to #1279, but I'm not seeing that on my cluster so I can't be sure.

aojea · 2020-04-14T12:58:59Z

@Capitrium seems that other people that has tested the patch has it working,
maybe you need to restart flannel so the interfaces are recreated and the workaround is applied?

aojea · 2020-04-14T12:59:36Z

ping @rajatchopra :-)

Miouge1 · 2020-04-14T13:10:32Z

@aojea any idea on which kernel versions are affected?

aojea · 2020-04-14T13:27:08Z

no idea, but this issue created an explosion of issues opened against kubernetes and related projects, it was also discussed in the sig-network mailing list https://groups.google.com/d/msg/kubernetes-sig-network/JxkTLd4M8WM/EW8O1E0PAgAJ
The information is scattered in multiple places sorry 😄

Capitrium · 2020-04-14T20:04:39Z

@aojea Yeah, looks like it was an issue with configuration or stale pods on my end - rebuilt and redeployed, seems like it's working now.

I was still seeing networking issues with pods on one node after deploying the patch and had to kill the node, but I was doing a fair amount of testing with different kube-proxy/kube-router/flannel versions and probably broke something else in the process. Most existing nodes and all new nodes are working properly. 👍

hakman · 2020-04-16T09:02:53Z

Very cool @aojea. This create a lot of discussions in Kops project also. Really happy to see that there is a solution :).

rajatchopra

Good work @aojea
Thanks.

Who is tracking the 'real' fix in the kernel? @dcbw Any idea?
Is there a RedHat bug on it?
This fix, while great, is only patchwork to hide the actual issue.

rajatchopra · 2020-04-21T13:07:08Z

@aojea I am not sure if we should disable rx/tx checksum always. A config parameter maybe? This is a temporary hack, right?

aojea · 2020-04-21T14:17:33Z

Who is tracking the 'real' fix in the kernel? @dcbw Any idea?

@dcbw posted more info here kubernetes/kubernetes#88986 (comment)
there is a patch upstream in the kernel

@aojea I am not sure if we should disable rx/tx checksum always. A config parameter maybe? This is a temporary hack, right?

@rajatchopra I've considered that, but

weave has disabled it in all veth interfaces 4 years ago
Don't rely on setting ethtool tx off on guest interfaces weaveworks/weave#1255 (comment)
this is configured only on the vxlan interfaces, and vxlan RFC recommends to not set the UDP checksum ... that means that checksum offloading is not really needed if it's 0
https://tools.ietf.org/html/rfc7348

   -  UDP Checksum: It SHOULD be transmitted as zero.  When a packet
     is received with a UDP checksum of zero, it MUST be accepted
     for decapsulation.  Optionally, if the encapsulating end point
     includes a non-zero UDP checksum, it MUST be correctly
     calculated across the entire packet including the IP header,
     UDP header, VXLAN header, and encapsulated MAC frame.  When a
     decapsulating end point receives a packet with a non-zero
     checksum, it MAY choose to verify the checksum value.  If it
     chooses to perform such verification, and the verification
     fails, the packet MUST be dropped.  If the decapsulating
     destination chooses not to perform the verification, or
     performs it successfully, the packet MUST be accepted for
     decapsulation.

and despite we make it configurable we have to disable offloading by default or people will keep hitting the bug ... is it worth the effort? I personally don't think nobody is going to enable it afterward and this will protect flannel from future breakages or odd scenarios like the described in the weave issue 🤷

rikatz · 2020-04-21T18:45:14Z

@aojea any idea on which kernel versions are affected?

Kernel 3.10.0 that's used in RHEL7/CentOS 7 with minor variations (but I've checked and this happens in CentOS from the 7.0 to 7.7 and with the latest RH provided kernel)

hakman · 2020-04-22T08:38:52Z

@rajatchopra any plans for a bugfix release containing this?
The Kops 1.17 release it almost ready and would like to know if we should wait for this or pick it up in the next release. Thanks! :)
kubernetes/kops#8614

hakman · 2020-05-01T01:39:55Z

The "proper" fix for this was accepted by netfilter.

The comment says that the problem seems to be there from day one, although it was probably not visible before UDP tunnels were implemented.
This makes this fix the best solution for now for anyone wanting to upgrade to k8s 1.17 and use flannel.

zhangguanzhang · 2020-05-27T02:29:15Z

Oddly enough, if I run a vxlan using a binary flannel file, it will work, whereas if I deploy it using kubeadm, the SVC access will timeout in 63 seconds

zhangguanzhang · 2020-05-28T06:55:41Z

I did a control group and found the cause of the problem，the os is Centos 7.6

deploy type	flannel version	flannel is running in pod?	will 63 sec delay?
kubeadm	v0.11.0	yes	yes
kubeadm	v0.12.0	yes	yes
kubeadm	v0.11.0	no	yes
kubeadm	v0.12.0	no	yes
ansible	v0.11.0	yes	no
ansible	v0.12.0	yes	no
ansible	v0.11.0	no	no
ansible	v0.12.0	no	no

The control group above proved that flannel was not the source of the problem, so I made a second control group：

type(kubeadm or ansible)	kube-proxy version	kube-proxy is running in pod?	will 63 sec delay?
kubeadm	v1.17.5	yes	yes
kubeadm	v1.17.5	no	no
kubeadm	v1.16.9	yes	no
kubeadm	v1.16.9	no	no
ansible	v1.17.5	yes	yes
ansible	v1.17.5	no	no

so，If the kube-proxy is running in pod, It will trigger the kernel bug,I found this by comparing submissions on github
kubernetes/kubernetes@fed5823
so I hack the docker images, there is the Dockerfile

FROM k8s.gcr.io/kube-proxy:v1.17.5
RUN rm -f /usr/sbin/iptables && 
    clean-install iptables

after I set the images use the hack images , it never causing 63 second delays in vxlan mode.

@danwinship Please take a look at why iptables is not installed in the docker image

aojea · 2020-05-28T11:05:52Z

@zhangguanzhang impressive work

iptables is installed in that image, however, due to another bug, it has to use the latest version and use an script to detect if it should use the nft or legacy backend

# Install latest iptables package from buster-backports
RUN echo deb http://deb.debian.org/debian buster-backports main >> /etc/apt/sources.list; \
    apt-get update; \
    apt-get -t buster-backports -y --no-install-recommends install iptables

so, do you think that the iptables version is the trigger?

jhohertz · 2020-05-28T11:28:06Z

I do. The recent release of kubernetes 1.16.10 is also affecting 1.16 for the first time so I studied the diffs yesterday. The only thing I see is that there is an iptables container build and the version was bumped from 11.x to 12.x, which 1.17 also did.

zhangguanzhang · 2020-05-28T11:32:05Z

The problem, in terms of docker image modification, is with iptables

danwinship · 2020-05-28T11:59:17Z

The iptables packaging bump triggered the --random-fully bug because the old (11.x) kube images had a version of iptables that was too old to support --random-fully, while the new (12.x) images, in addition to having the whole legacy-vs-nft wrapper thing, also have a much newer version of iptables that does support --random-fully. Unfortunately it turns out that just because the iptables binary supports it doesn't mean the kernel supports it. There's a PR to fix this. But anyway, this is a completely separate problem from the vxlan offload bug.

[EDIT:
Narrator: It wasn't a completely separate problem.
]

jhohertz · 2020-05-28T12:29:55Z

Yet the workaround often suggested seems to be disabling offload, and it does have an effect on the issue

MarkRose · 2020-05-28T13:59:11Z

Stable kernels 5.6.13, 5.4.41, 4.19.123, 4.14.181 and later have the checksum patch included.

zhangguanzhang · 2020-05-28T14:38:00Z

@MarkRose Which jkernels in Centos?

jhohertz · 2020-05-28T14:54:55Z

Trying to test a flatcar image w/ one of those kernels, but issues @ quay making for a fun day...

jhohertz · 2020-05-28T20:10:19Z

Just going to summarise what I know about this for sure now:

kubernetes updating ip-tables to 12.x breaks it (nothing else networking related in the 1.16.10 diff)
The systemd workaround for disabling vxlan on flannel.1 works ok in superficial testing
The kernel patch, now in the stable trees, does nothing to help matters (latest flatcar has a sufficiently recent kernel as of a couple days ago and was used to test.)

Which says, that while it's related to vxlan offload in some manner (because the workaround seems to work...), it is not that exact kernel bug (because it makes no difference when tested).

Looking to understand this random-fully thing now.

danwinship · 2020-05-28T20:38:46Z

@jhohertz and just to confirm, you're running RHEL/CentOS 7 with kernel 3.10.something?

jhohertz · 2020-05-28T20:56:09Z

Not here, I have been using coreos/flatcar container linux

rikatz · 2020-05-28T22:22:11Z

@danwinship just as a follow up from our sig-network meeting, I'll post the tests results from disabling --random-fully scenarios in kube-proxy and flannel, and also the generated rules for each scenario.

This is the same environment I could use to confirm that disabling tx offload solves (CentOS 7 + Flannel 0.11), but in this case I'm enabling tx offload again in everything:

Before each test, iptables rules are cleaned in the hosts so flannel and kube-proxy recreate them
In NodePort scenarios, I'm calling my own IP:Port as part of the cluster to reach a Pod running outside of it

Scenario 1 - Both kube-proxy and flannel with MASQUERADE rules created with --random-fully

NodePort: Exactly 63s
ClusterIP: Exactly 63s

IPTables rules containing --random-fully:

-A POSTROUTING -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
-A POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE --random-fully

Scenario 2 - Original kube-proxy with random-fully and flannel recompiled without it
NodePort: Exactly 63s
ClusterIP: Exactly 63s

IPTables rules containing --random-fully:

-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE --random-fully

Scenario 3 - kube-proxy without random-fully and original flannel with random-fully enabled
NodePort: 0.03s
ClusterIP: 0.007s

IPTables rules containing --random-fully:

-A POSTROUTING -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
-A POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully

Scenario 4 - kube-proxy and flannel without random-fully
NodePort: 0.005s
ClusterIP: 0.03s

IPTables rules containing --random-fully: NONE

This way, we can almost for sure say that the insertion of --random-fully in the MASQUERADE rules triggered an already existing Kernel Bug in vxlan part :)

I'll post this also in the original issue at k/k repo, and please let me know if I can help with any further test

rikatz · 2020-06-15T14:56:37Z

Also I've seen some discussion here about iptables-legacy x nft and this is starting to hit CentOS 8 users: kubernetes/kubernetes#91331

I was going to ask what's the effort and gain of putting the iptables-wrapper to work here but as noted by the README it seems it wouldn't solve the CentOS 8 case :/

aojea · 2020-06-16T07:56:27Z

/close

This is no longer needed thanks to @danwinship 👏
kubernetes/kubernetes#92035

hakman · 2020-06-16T08:02:38Z

@aojea will kubernetes/kubernetes#92035 be cherry-picked to older k8s releases?

aojea · 2020-06-16T08:17:10Z

@aojea will kubernetes/kubernetes#92035 be cherry-picked to older k8s releases?

it should be backported to the supported releases, just ping the author in the PR if he can do it, if he can not the process is described here https://github.com/kubernetes/community/blob/master/contributors/devel/sig-release/cherry-picks.md

hakman · 2020-06-16T08:38:13Z

Thanks @aojea and @danwinship 😄

Fluxay-666 · 2023-05-09T05:23:12Z

Thanks, I used the latest version v0.21.5 plus this PR patch to successfully fix the 63-second connection delay problem.

aojea mentioned this pull request Apr 13, 2020

Bare Metal K8S 63 Second Service Routing Delay - when accessing service via ClusterIP, or ExternalIP kubernetes/kubernetes#88986

Closed

This was referenced Apr 13, 2020

ClusterIP services not accessible when using flannel CNI from host machines in Kubernetes #1243

Closed

Update Flannel manifests, install script and version (0.12) + fix tests scripts kubernetes-sigs/kubespray#5937

Merged

aojea force-pushed the vxlancsum branch from aa95bdd to 44dec0e Compare April 13, 2020 15:07

aojea changed the title ~~Make VXLAN checksum configurable~~ Disable tx and rx offloading on VXLAN interfaces Apr 13, 2020

andersosthus mentioned this pull request Apr 16, 2020

Remove support for Canal and the vxlan Flannel backend kubernetes/kops#8614

Closed

gamer22026 mentioned this pull request Apr 16, 2020

TCP offloading on vxlan.calico adaptor causing 63 second delays in VXLAN communications node->nodeport or node->clusterip:port. projectcalico/calico#3145

Closed

justinsb mentioned this pull request Apr 20, 2020

Fix net.bridge setting for Flannel on CentOS 7 kubernetes/kops#8381

Merged

rajatchopra approved these changes Apr 21, 2020

View reviewed changes

floryut mentioned this pull request Apr 28, 2020

Fix default value for standalone tests kubernetes-sigs/kubespray#6043

Merged

nemonik mentioned this pull request Apr 30, 2020

Issue with built in load balancer k3s-io/k3s#1216

Closed

brandond mentioned this pull request Apr 30, 2020

Formally add support for CentOS 7 k3s-io/k3s#1371

Closed

champtar mentioned this pull request May 12, 2020

DNS timeout in 2.13.0 with flannel/canal on rhel7 kubernetes-sigs/kubespray#6115

Closed

skamboj mentioned this pull request May 19, 2020

Additional 1s latency in host -> service IP -> pod when upgrading from 1.15.3 -> 1.18.1 on RHEL 8.1 kubernetes/kubernetes#90854

Closed

zhangguanzhang mentioned this pull request May 28, 2020

v1.17+ vxlan mode 63 sec delay kubernetes/kubernetes#91519

Closed

aojea closed this Jun 16, 2020

aojea deleted the vxlancsum branch June 16, 2020 07:56

raider444 mentioned this pull request Jul 3, 2020

Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: net/http: TLS handshake timeout cert-manager/cert-manager#2602

Closed

This was referenced Jul 9, 2020

Sonobuoy conformance test fails on Centos/RHEL k3s-io/k3s#1960

Closed

Master issue for tracking flannel vxlan failures k3s-io/k3s#2013

Closed

rpudlowski93 mentioned this pull request Jul 20, 2020

AWS RedHat - cluster networking issues/lags using canal and flannel plugins hitachienergy/epiphany#1072

Closed

couloum mentioned this pull request Nov 25, 2020

Additional 1s latency in host -> service IP -> pod with iptables kubernetes/kubernetes#96868

Closed

leodotcloud mentioned this pull request Dec 1, 2020

dnsPolicy in hostNetwork not working as expected kubernetes/kubernetes#87852

Closed

inspuradmin mentioned this pull request Dec 17, 2020

access service nodeport too slow(about 63s) inspursoft/board#911

Open

This comment was marked as spam.

Sign in to view

Disable tx and rx offloading on VXLAN interfaces #1282

Disable tx and rx offloading on VXLAN interfaces #1282

Conversation

aojea commented Apr 13, 2020 • edited Loading

Description

Todos

Release Note

aojea commented Apr 13, 2020

aojea commented Apr 13, 2020

Capitrium commented Apr 13, 2020

aojea commented Apr 14, 2020

aojea commented Apr 14, 2020

Miouge1 commented Apr 14, 2020

aojea commented Apr 14, 2020

Capitrium commented Apr 14, 2020

hakman commented Apr 16, 2020

rajatchopra left a comment

Choose a reason for hiding this comment

rajatchopra commented Apr 21, 2020

aojea commented Apr 21, 2020 • edited Loading

rikatz commented Apr 21, 2020

hakman commented Apr 22, 2020

hakman commented May 1, 2020 • edited Loading

zhangguanzhang commented May 27, 2020 • edited Loading

zhangguanzhang commented May 28, 2020 • edited Loading

aojea commented May 28, 2020

jhohertz commented May 28, 2020

zhangguanzhang commented May 28, 2020

danwinship commented May 28, 2020 • edited Loading

jhohertz commented May 28, 2020

MarkRose commented May 28, 2020

zhangguanzhang commented May 28, 2020

jhohertz commented May 28, 2020

jhohertz commented May 28, 2020 • edited Loading

danwinship commented May 28, 2020

jhohertz commented May 28, 2020

rikatz commented May 28, 2020

rikatz commented Jun 15, 2020

aojea commented Jun 16, 2020

hakman commented Jun 16, 2020

aojea commented Jun 16, 2020

hakman commented Jun 16, 2020

This comment was marked as spam.

Fluxay-666 commented May 9, 2023

aojea commented Apr 13, 2020 •

edited

Loading

aojea commented Apr 21, 2020 •

edited

Loading

hakman commented May 1, 2020 •

edited

Loading

zhangguanzhang commented May 27, 2020 •

edited

Loading

zhangguanzhang commented May 28, 2020 •

edited

Loading

danwinship commented May 28, 2020 •

edited

Loading

jhohertz commented May 28, 2020 •

edited

Loading