In AWS, Azure, and GCP, we consistently see timeouts when curling the external load balancer for a kubernetes service that has externalTrafficPolicy: Cluster. As soon as we change it to Local, we see no issues.
There is some context that makes this particularly interesting:
This is in vxlan mode and eBPF is NOT enabled
We are migrating from flannel, and we do not see this issue when we migrate in-place. The issue only exists on fresh VMs with a clean install of calico (no flannel ever installed).
Debugging steps
We have tried calico v3.16.1, v3.16.8, and v3.18.0 to no avail
In our AWS load balancer, we see that only the nodes that have the service pods are marked as healthy; all others are unhealthy. When I tried ncat against these nodes on the health check port, I see that the nodes without service pods always time out and the nodes with service pods fail most of the time (presumably the only requests that succeed are those that hit the iptables rule that sends the request to that node's own pod).
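A minimal sketch of the kind of probe involved (the node address and health check port are placeholders, not the real values from our setup):
$ ncat -v -w 3 <node-ip> <healthcheck-port>
On nodes without service pods this times out every time; on nodes with service pods it only occasionally connects.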
tcpdump seems to suggest that the packet can be forwarded to a pod and can make it back to the node, but is then dropped.
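Roughly the kind of capture used (the port is a placeholder; vxlan.calico is the VXLAN device calico creates in this mode):
$ tcpdump -ni any port <healthcheck-port>
$ tcpdump -ni vxlan.calico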
$ iptables-save -c | grep DROP | grep -v '\[0:0\]'
:FORWARD DROP [823:49380]
[1008230:60493352] -A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
[168:6720] -A cali-fw-caliedc894ab6d7 -m comment --comment "cali:FgGMAMksDxmepY0G" -m conntrack --ctstate INVALID -j DROP
[1008230:60493352] -A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
When I curl the load balancer and watch the above command, :FORWARD DROP counts increase very quickly.
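To watch those counters while curling, something like this works:
$ watch -n1 "iptables-save -c | grep ':FORWARD'"
The packet/byte counts on the :FORWARD DROP policy line climb with each failed request.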
Expected Behavior
Starting up new VMs with fresh calico should not result in dropped packets for services that have externalTrafficPolicy: Cluster
Current Behavior
Starting up new VMs with fresh calico does result in dropped packets for services that have externalTrafficPolicy: Cluster when load balancer rules result in requests being sent to pods on other nodes
Possible Solution
Running iptables -P FORWARD ACCEPT does fix the issue, but I noticed that the flannel -> calico in-place migration cluster also has the FORWARD DROP policy, yet it never gets hit
Steps to Reproduce (for bugs)
Create a kubernetes LoadBalancer with externalTrafficPolicy: Cluster
Bring up fresh VMs on calico
curl the external LB or ncat node on service health check port
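A sketch of steps 1 and 3 with kubectl (the deployment name, port, and LB hostname are placeholders):
$ kubectl expose deployment <my-app> --type=LoadBalancer --port=80
$ kubectl patch svc <my-app> -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'
$ curl -v http://<lb-hostname>/
Cluster is the default policy, so the patch is mostly there to make the setting explicit.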
Your Environment
Calico version: v3.16.1 but tried v3.16.8 and v3.18.0
Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes v1.16
Operating System and version: CentOS 7
The issue was that we weren't setting --cluster-cidr on kube-proxy, so kube-proxy wasn't installing the FORWARD iptables rules for requests to/from our cluster CIDR. It worked on the flannel -> calico in-place migration cluster because the flannel daemon adds those rules when it starts.
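For anyone else hitting this, the fix is just making sure kube-proxy knows the cluster CIDR, either via its flag or its config (the CIDR below is only an example value, use your cluster's pod CIDR):
--cluster-cidr=192.168.0.0/16
or, in the KubeProxyConfiguration / kube-proxy ConfigMap:
clusterCIDR: "192.168.0.0/16"
With that set, kube-proxy adds its FORWARD accept rules for traffic to/from the pod CIDR, so the default DROP policy no longer catches the forwarded traffic.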
Is this something that calico-node should do for the sake of being robust? Or is that something that should be left to kube-proxy?
Hi @austintackaberry , thanks for bringing this up. I think our intention was to make sure that everybody sets the --cluster-cidr flag on kube-proxy. We also assumed most installers would set that kube-proxy flag for us, so we didn't handle it, but as you've pointed out, it can lead to some strange misunderstandings when upgrading from flannel. We'll see what we can do to either add the FORWARD rules ourselves or at least make it more obvious that the flag on kube-proxy is missing.
And of course if you have your own ideas on how it should be done, any PRs are welcome as well.
@fasaxc I don't think I have seen that problem, but it's good to know about it. Thanks for sharing, looks like an interesting issue
caseydavenport changed the title from "Requests to kubernetes load balancer time out when externalTrafficPolicy is Cluster" to "Requests to k8s load balancers time out when externalTrafficPolicy is Cluster and --cluster-cidr is not set" on Jan 10, 2022