Super slow access to service IP from host (& host-networked pods) with Flannel CNI #1245

Closed
fengye87 opened this issue Jan 18, 2020 · 18 comments

@fengye87

Ref: kubernetes/kubernetes#87233 (comment)

The k/k maintainers believe this is a Flannel issue, so re-posting here.

ormergi pushed a commit to ormergi/kubevirtci that referenced this issue Jan 29, 2020
Flannel has compatibility issues with k8s-1.17 flannel-io/flannel#1245.
Deploy the calico plugin instead, also for better performance.

Signed-off-by: Or Mergi <[email protected]>
kubevirt-bot pushed a commit to kubevirt/kubevirtci that referenced this issue Feb 3, 2020
* Deploy Calico pod network plugin on k8s-1.17

Flannel has compatibility issues with k8s-1.17 flannel-io/flannel#1245.
Deploy the calico plugin instead, also for better performance.

The calico.yaml file is copied from Calico's documentation and no changes should be made to it.

Signed-off-by: Or Mergi <[email protected]>

* CNI manifest file names and kubernetes versions map

This map will correlate the k8s version with the plugin we
would like to deploy.

Signed-off-by: Or Mergi <[email protected]>

* Separate cni selection logic from provision scripts.

cli.sh: create the /tmp/scripts directory in the VM and copy cni-map.sh.
cnis-map.sh maps the k8s version to the cni manifest file name to use.

node01.sh and provision.sh use cnis-map.sh to resolve the right cni manifest to use.

Signed-off-by: Or Mergi <[email protected]>
@mikebryant

I think I've been hitting this issue yesterday/today

Some tests I was doing, from one host (not in a container)

I've just swapped to the host-gw backend and everything's working normally

flannel: 0.11.0
kubernetes: 1.17.2, installed using kubeadm
on a baremetal switched network.
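
For reference, switching backends is just a change to net-conf.json in the kube-flannel ConfigMap; a minimal sketch, assuming the default 10.244.0.0/16 pod network used elsewhere in this thread:

  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "host-gw"
      }
    }

The kube-flannel pods need to be restarted afterwards so they pick up the new backend, and host-gw assumes all nodes share an L2 segment, which matches the switched bare-metal setup above.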

@mariusgrigoriu

Something we noticed is that the number of conntrack insert_failed was dramatically higher while running kube 1.17.
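
To check this on your own nodes, the conntrack statistics counters can be inspected with something like the following (assumes conntrack-tools is installed):

  conntrack -S                       # per-CPU stats; watch the insert_failed field
  cat /proc/net/stat/nf_conntrack    # same counters straight from the kernel (in hex)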

@thibautvincent

We experienced the same issue today and fixed it with @mikebryant's solution. Is there a permanent solution on the way?

@MansM

MansM commented Feb 19, 2020

@tomdee, as you are the last remaining maintainer, who should I ping/tag to get this looked at?

@tkislan

tkislan commented Mar 25, 2020

Just FYI, this is not related only to 1.17. Because of the issues here, I tried to downgrade from 1.17.3 to 1.16.8, but got the same result.
First of all, the route from the service CIDR to the cni0 interface gateway is missing, so I had to add it manually for the service address to even resolve:

ip route add 10.96.0.0/12 via 10.244.3.1

And after that, even traceroute is super slow

traceroute <service>.<namespace>.svc.cluster.local
traceroute to <service>.<namespace>.svc.cluster.local (10.106.49.44), 30 hops max, 38 byte packets
 1  10.244.3.1 (10.244.3.1)  3097.057 ms !H  3097.946 ms !H  3119.540 ms !H
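
A quick way to compare against what flannel and the bridge actually installed on the node is to dump the relevant routes, e.g.:

  ip route show | grep -E 'cni0|flannel'
  ip route get 10.106.49.44    # which interface/gateway a service IP would use

(10.106.49.44 is just the service IP from the traceroute above; substitute your own.)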

@mariusgrigoriu

Just curious, how many folks experiencing this issue are using hyperkube?

@mengmann

mengmann commented Apr 13, 2020

I'm having this issue with the vxlan backend with both flannel 0.11 and 0.12 as well.
Affected kubernetes versions: 1.16.x, 1.17.x and 1.18.x.

Finally, setting up a static route on my nodes to the service network through the cni0 interface helped me instantly:
ip route add 10.96.0.0/12 dev cni0

OS: CentOS 7
install method: kubeadm
underlying platform: VirtualBox 6
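
To confirm the static route is actually being used for service traffic (assuming the default 10.96.0.0/12 service CIDR mentioned above):

  ip route get 10.96.0.1    # should now report the route via dev cni0

Note that a route added with ip route add is not persistent, so it needs to go into the node's network configuration to survive a reboot.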

@pytimer

pytimer commented Apr 17, 2020

Finally setting up a static route on my nodes to service network through cni0 interface helped me instantly:
ip route add 10.96.0.0/12 dev cni0

Fixed this problem by using @mengmann's solution on kubernetes v1.17.2.

@blueabysm

I think I've been hitting this issue yesterday/today

Some tests I was doing, from one host (not in a container)

I've just swapped to the host-gw backend and everything's working normally

flannel: 0.11.0
kubernetes: 1.17.2, installed using kubeadm
on a baremetal switched network.

Exactly the same issue here

@skamboj

skamboj commented May 7, 2020

I think I've been hitting this issue yesterday/today
Some tests I was doing, from one host (not in a container)

I've just swapped to the host-gw backend and everything's working normally
flannel: 0.11.0
kubernetes: 1.17.2, installed using kubeadm
on a baremetal switched network.

Exactly the same issue here

Not sure if it's the same issue, but we noticed an additional delay of 1 second when upgrading from kubernetes 1.15.3 to 1.18.1. We seem to have traced the problem to the --random-fully flag introduced by this PR. See the issue here
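
To check whether the NAT rules on a node carry that flag, something along these lines works:

  iptables -t nat -S | grep -- --random-fully    # lists any MASQUERADE rules using the flag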

@blueabysm

I think I've been hitting this issue yesterday/today
Some tests I was doing, from one host (not in a container)

I've just swapped to the host-gw backend and everything's working normally
flannel: 0.11.0
kubernetes: 1.17.2, installed using kubeadm
on a baremetal switched network.

Exactly the same issue here

Not sure if its the same issue but we noticed an additional delay of 1 second when upgrading from kubernetes 1.15.3 to 1.18.1. We seem to trace the problem to the --random-fully flag introduced by this PR. See the issue here

I'm currently working with kubernetes 1.17.3 (some nodes on 1.17.4). Fortunately there are not many apps running on my newly built cluster, so I migrated them this week and changed the network fabric to calico according to this article. Now everything works perfectly. 😄

@Raven888888

@rbrtbnfgl What do you think about this issue?

I am experiencing the same slowness when accessing services external to the cluster (flannel cni).
Pinging "google.com" from inside a pod sometimes results in

bad address

but occasionally succeeds after 15 seconds...

I am almost at the point of switching to calico as it claims to solve the problem...

@rbrtbnfgl
Contributor

Which version of Flannel are you using? This is a very old issue; I think it would be better if you created a new one with your setup config. It could be a problem with the UDP checksum (#1679)
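
If it is the vxlan checksum-offload problem, a commonly used diagnostic (not a permanent fix) is to turn off tx checksum offload on the flannel vxlan interface and see whether latency recovers:

  ethtool -K flannel.1 tx-checksum-ip-generic off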

@Raven888888

@rbrtbnfgl
docker.io/rancher/mirrored-flannelcni-flannel:v0.20.2, which I believe is already the latest. The only thing I have changed is:

net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan",
        "VNI" : 4096,
        "Port": 4789
      }
    }

Because my cluster is a mix of Windows (worker only) and Linux nodes, the VNI and port numbers need to change.

I did try running

 iptables -t nat -I FLANNEL-POSTRTG -m mark --mark 0x4000/0x4000 -j RETURN

on all the master and worker nodes (no reboot or service restarts, just the command on its own), but it still does not resolve the issue.
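
It is worth verifying that the inserted rule is actually present and matching traffic, for example:

  iptables -t nat -L FLANNEL-POSTRTG -n -v --line-numbers    # the pkts counter shows whether the RETURN rule is being hit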

The tricky bit is that the problem is not consistent. I'd say 80% of the time ping will either give "bad address" or take more than 15 s to resolve; 20% of the time it works reasonably well.

NB: pinging an IP always works; it is the DNS resolution, or the communication from the pod to the DNS resolver via flannel, that seems to be causing the issue.

@rbrtbnfgl
Contributor

rbrtbnfgl commented Jan 3, 2023

Is the issue only with the pods on the Windows nodes, or also with the ones on Linux?

@Raven888888

I have resolved my issue, although I am not sure how/why it caused issues in my cluster.

Root cause:
CoreDNS pods were all running on the same master node.
I suspect the installation step sequence may have messed it up. My steps:

  1. Install 1 Linux master node with HA and external etcd, then kubeadm init the cluster.
  2. By now it should already have CoreDNS running; since there is only 1 master node, all CoreDNS pods run on it.
  3. Install the Flannel CNI.
  4. Add 2 more Linux master nodes.
  5. Add 3 more Linux worker nodes.
  6. Observe the issue above on any of the worker nodes.

Solution:
Run the following from master node

kubectl -n kube-system rollout restart deployment coredns

Perhaps after step 3, coredns is meant to be refreshed/restarted?
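
After the rollout restart, something like the following shows where the CoreDNS pods landed (k8s-app=kube-dns is the label kubeadm applies to them by default):

  kubectl -n kube-system rollout status deployment coredns
  kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide    # pods should now be spread across nodes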

Anyway, thanks @rbrtbnfgl

@stale

stale bot commented Jul 3, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jul 3, 2023
stale bot closed this as completed Jul 24, 2023