
ServiceLB cannot be accessed via loopback when service ExternalTrafficPolicy=Local #7561

Closed
proligde opened this issue May 16, 2023 · 19 comments

@proligde

proligde commented May 16, 2023

K3s Version:
Migrating from v1.25.6+k3s1 to v1.25.7+k3s1

Node(s) CPU architecture, OS, and Version:
1 Node, x86, reproducible on different OS like Ubuntu 22.04 and Ubuntu on WSL

Cluster Configuration:
1-node local development cluster

Describe the bug:
We're using k3s as our local development environment platform, routing FQDNs to our dev machines using /etc/hosts entries pointing to 127.0.0.1.
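For example, an entry looks like this (the hostname is just a placeholder):

echo "127.0.0.1 myproject.dev.local" | sudo tee -a /etc/hosts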

This worked perfectly fine for years, until recently. I had to dig for quite a while before I was able to pinpoint it to upgrading k3s from v1.25.6+k3s1 to v1.25.7+k3s1: it stops working on 1.25.7 and works again after downgrading to 1.25.6. The problem also exists on the current 1.27.1.

Steps To Reproduce:
Install k3s and ingress controller like so and access it using 127.0.0.1

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.25.7+k3s1 K3S_KUBECONFIG_MODE="644" INSTALL_K3S_EXEC="--disable=traefik" sh -

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx

helm install -n kube-system --version '^4.0.0' --set 'controller.watchIngressWithoutClass=true' --set 'controller.service.externalTrafficPolicy=Local' nginx-ingress ingress-nginx/ingress-nginx

curl http://127.0.0.1

Expected behavior:
A 404 Message returned by nginx

Actual behavior:
Network timeout

However, the ports/services are in fact working perfectly fine via the node's LAN address (192.168....). They're just not reachable via 127.0.0.1 anymore, and I can't figure out why.

Additional info

  • I checked the corresponding changelog at https://github.com/k3s-io/k3s/releases/tag/v1.25.7%2Bk3s1 and found some changes regarding the servicelb, but at least to me none of them explained the behavior I'm seeing.

  • I exported the iptables rules while running 1.25.6 and while running 1.25.7 and tried to compare them, but I'm afraid my knowledge of iptables is not sufficient to assess whether the cause of the problem lies there.

Thanks a lot for any help in advance - Max

@brandond
Member

I suspect that it's related to the flannel version bump. I see that @rbrtbnfgl has assigned this to himself so I'll let him reply once he's had a chance to do some investigation.

@rbrtbnfgl
Contributor

rbrtbnfgl commented May 16, 2023

Yes, I want to do further investigation on it. The only difference between the Kube-router and flannel versions should be the iptables rule order, but those rules should only affect pod traffic, and this issue seems to be related to traffic from the node to localhost.

@rbrtbnfgl
Contributor

rbrtbnfgl commented May 23, 2023

I think the issue is related to externalTrafficPolicy=Local; if it's configured to Cluster it works. I have to check what changed between the two versions that caused Local to stop working with localhost.
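As a quick check on a repro setup, the policy can be flipped on the service and the curl repeated (the service name is assumed from the helm install above):

kubectl -n kube-system patch svc nginx-ingress-ingress-nginx-controller --type merge -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'
curl -v http://127.0.0.1/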

@rbrtbnfgl
Contributor

I don't know if this is the right fix or if this behaviour is the expected one. It could be related to this: https://github.com/k3s-io/k3s/blob/master/pkg/cloudprovider/servicelb.go#L556
Maybe localhost also needs to be added as a destination IP.

@brandond
Member

We haven't changed anything about how traffic gets into the svclb pods; that still relies on NodePort pods - the kubelet and portmap CNI plugin handle all of that.

@rbrtbnfgl
Contributor

rbrtbnfgl commented May 24, 2023

The behaviour of the CNI is the same on both versions: the packets directed to localhost:80 are masqueraded with the IP of the svclb-nginx-ingress-ingress-nginx-controller pod, and the packets arrive at that pod but are dropped by it.
Inspecting that pod, I noticed that it internally installs some iptables rules to redirect the traffic to the nginx-ingress-ingress-nginx-controller pod.
On 1.25.6

Defaulted container "lb-tcp-80" out of: lb-tcp-80, lb-tcp-443
+ trap exit TERM INT
+ echo 0.0.0.0/0
+ grep -Eq :
+ iptables -t filter -I FORWARD -s 0.0.0.0/0 -p TCP --dport 80 -j ACCEPT
+ echo 10.43.117.1
+ grep -Eq :
+ cat /proc/sys/net/ipv4/ip_forward
+ '[' 1 '==' 1 ]
+ iptables -t filter -A FORWARD -d 10.43.117.1/32 -p TCP --dport 80 -j DROP
+ iptables -t nat -I PREROUTING '!' -s 10.43.117.1/32 -p TCP --dport 80 -j DNAT --to 10.43.117.1:80
+ iptables -t nat -I POSTROUTING -d 10.43.117.1/32 -p TCP -j MASQUERADE
+ '[' '!' -e /pause ]
+ mkfifo /pause

On 1.25.7

Defaulted container "lb-tcp-80" out of: lb-tcp-80, lb-tcp-443
+ trap exit TERM INT
+ echo 0.0.0.0/0
+ grep -Eq :
+ iptables -t filter -I FORWARD -s 0.0.0.0/0 -p TCP --dport 80 -j ACCEPT
+ echo 10.1.1.4
+ grep -Eq :
+ cat /proc/sys/net/ipv4/ip_forward
+ '[' 1 '==' 1 ]
+ iptables -t filter -A FORWARD -d 10.1.1.4/32 -p TCP --dport 32545 -j DROP
+ iptables -t nat -I PREROUTING '!' -s 10.1.1.4/32 -p TCP --dport 80 -j DNAT --to 10.1.1.4:32545
+ iptables -t nat -I POSTROUTING -d 10.1.1.4/32 -p TCP -j MASQUERADE
+ '[' '!' -e /pause ]
+ mkfifo /pause

10.1.1.4 is the node IP. If I check the iptables rules matched internally by the pod in the 1.25.7 case, the drop rule is matched, namely:
iptables -t filter -A FORWARD -d 10.1.1.4/32 -p TCP --dport 32545 -j DROP
That rule is matched here because the port has changed, so the traffic no longer matches the default ACCEPT rule that uses --dport 80.
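For reference, the traces above come from the lb containers' startup logs and can be pulled with kubectl (the pod name suffix is a placeholder):

kubectl -n kube-system logs svclb-nginx-ingress-ingress-nginx-controller-xxxxx
kubectl -n kube-system logs svclb-nginx-ingress-ingress-nginx-controller-xxxxx -c lb-tcp-443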

@brandond
Member

brandond commented May 24, 2023

Yes, the difference comes from the ExternalTrafficPolicy, which was added in https://github.com/k3s-io/k3s/pull/6726/files#diff-38e1c51632f2d12566706b6d7f22cf82e1441ea19e27f787ff15e4b7b92dc197R566-R592

If the ExternalTrafficPolicy is set to local, the svclb pods target the host IP and NodePort to ensure that traffic only goes to local pods. If it is set to anything else, it targets the cluster service address and port.
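A quick way to see which destination a given lb container was configured with is to dump its environment (the pod name is a placeholder; the exact variable names may differ between versions):

kubectl -n kube-system exec svclb-nginx-ingress-ingress-nginx-controller-xxxxx -c lb-tcp-80 -- env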

Why would the drop rule only match if the connection is to the loopback address, instead of the node IP? Is it bypassing a port rewrite rule on the way in?

@rbrtbnfgl
Contributor

When the IP is the node IP, the kube-proxy rules masquerade the packets with the ingress pod IP and not the svclb one.

@brandond
Member

brandond commented May 24, 2023

Hmm. The intent of the FORWARD rules is to enforce LoadBalancerSourceRanges; I guess that ends up blocking local access that doesn't go through kube-proxy.
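For context, those source ranges come from the Service spec itself; setting them looks roughly like this (service name and CIDR are placeholders):

kubectl -n kube-system patch svc nginx-ingress-ingress-nginx-controller --type merge -p '{"spec":{"loadBalancerSourceRanges":["192.168.0.0/16"]}}'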

What's the source of the traffic in this case? Is it from the loopback address?

@rbrtbnfgl
Contributor

The source address is the IP of cni0

@brandond brandond moved this from New to Working in K3s Development May 25, 2023
@brandond brandond added this to the v1.27.3+k3s1 milestone May 25, 2023
@brandond brandond self-assigned this May 25, 2023
@brandond
Member

brandond commented May 25, 2023

Hmm OK. So in the case of targeting the local pod's node port for local traffic policy, the forward rules we put in place to enforce source ranges end up blocking access via localhost. I'll have to do some thinking on how to handle that.

Is it always the cni0 IP as the source, across all CNIs? or is this a flannel-specific thing?

@rbrtbnfgl
Contributor

It should be related to the portmap CNI. It changes the destination IP to the IP of the pod, and the Linux routing table decides to use the IP of cni0, which belongs to the same network as the pods.
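That routing decision can be seen on the node with ip route get (the pod IP here is a placeholder):

ip route get 10.42.0.10
# typically prints something like: 10.42.0.10 dev cni0 src 10.42.0.1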

@brandond
Member

brandond commented May 26, 2023

I just realized that I was overcomplicating the path here, and the core issue is just that the source and destination port are not the same when the ExternalTrafficPolicy is set to local, and I used the wrong port in the allow rule.

There appears to be a community PR to fix this at k3s-io/klipper-lb#54
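Roughly, the mismatch is that the ACCEPT rule keys on the original port while the DNAT rewrites the destination to the NodePort, so the allow rule would need to match the rewritten port instead, something like the following (ports taken from the trace above; the actual change is in the linked PR):

iptables -t filter -I FORWARD -s 0.0.0.0/0 -p TCP --dport 32545 -j ACCEPT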

@brandond brandond moved this from Working to To Test in K3s Development May 26, 2023
@brandond brandond moved this from To Test to Peer Review in K3s Development May 26, 2023
@brandond brandond changed the title ServiceLB/Ingress ports 80 and 443 timeout on host's 127.0.0.1 in k3s versions > v1.25.6+k3s1 ServiceLB cannot be accessed via loopback when service ExternalTrafficPolicy=Local May 31, 2023
@fmoral2 fmoral2 self-assigned this Jun 1, 2023
@fmoral2
Contributor

fmoral2 commented Jun 12, 2023

Validated on Version:

- k3s version v1.27.2+k3s-55db9b18 (55db9b18)

Environment Details

Infrastructure
Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:
Ubuntu

Cluster Configuration:
1 node

Config.yaml:

token: secret
write-kubeconfig-mode: 644
selinux: true
cluster-init: true

Steps to Repro the issue:

  1. Install k3s at the previous version
  2. Deploy a workload that should be reachable on local port 8080
  3. Check the connection

Steps to Validate the fix:

  1. Install k3s from the latest commit
  2. Deploy a workload that should be reachable on local port 8080
  3. Check that the connection succeeds

Issue Reproduction:

~$ k3s -v
    k3s version v1.27.2+k3s-0485a56f (0485a56f)


~$ k apply
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: ingresstest-deploy
      labels:
        app: ingresstest
    spec:
      selector:
        matchLabels:
          app: ingresstest
      template:
        metadata:
          labels:
            app: ingresstest
        spec:
          containers:
          - name: ingresstest
            image: ranchertest/mytestcontainer:unprivileged
            imagePullPolicy: Always
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: ingresstest-ingress-svc
      labels:
        app: ingresstest
    spec:
      externalTrafficPolicy: Local
      type: LoadBalancer
      ports:
      - port: 8080
        targetPort: 8080
        protocol: TCP
        name: http
      selector:
        app: ingresstest
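To confirm the service got an external IP before curling:

~$ k get svc ingresstest-ingress-svc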


~$ curl http://127.0.0.1:8080
    curl: (28) Failed to connect to 127.0.0.1 port 8080 after 129254 ms: Connection timed out




Issue Validation:



 ~$ k3s -v
    k3s version v1.27.2+k3s-55db9b18 (55db9b18)


~$ curl http://127.0.0.1:8080
    <!DOCTYPE html>
    <html>
    <head>
        <title>Welcome to nginx!</title>
        <style>
            html { color-scheme: light dark; }
            body { width: 35em; margin: 0 auto;
                font-family: Tahoma, Verdana, Arial, sans-serif; }
        </style>
    </head>
    <body>
    <h1>Welcome to nginx!</h1>
    <p>If you see this page, the nginx web server is successfully installed and
        working. Further configuration is required.</p>

    <p>For online documentation and support please refer to
        <a href="http://nginx.org/">nginx.org</a>.<br/>
        Commercial support is available at
        <a href="http://nginx.com/">nginx.com</a>.</p>

    <p><em>Thank you for using nginx.</em></p>
    </body>
    </html>



~$  k describe pod svclb-ingresstest-ingress-svc-467619a8-6sm9j -n kube-system | grep "klipper"
    Image:          rancher/klipper-lb:v0.4.4
    Image ID:       docker.io/rancher/klipper-lb@sha256:d6780e97ac25454b56f88410b236d52572518040f11d0db5c6baaac0d2fcf860
  Normal  Pulled     5m18s  kubelet            Container image "rancher/klipper-lb:v0.4.4" already present on machine

@fmoral2 fmoral2 closed this as completed Jun 12, 2023
@coopstah13

Can this be backported?

@brandond
Member

Backported to what? It was already backported to all branches that were active at the time the issue was closed. You can find the backport issues linked above.

@coopstah13

Sorry, I see that now. I read the last comment about it being verified on 1.27 and assumed that was the only place a fix had landed.

@coopstah13

I'm still facing this issue on 1.25.15, however.

@brandond
Member

brandond commented Nov 16, 2023

Please open a new issue and fill out the issue template, describing specifically what you're running into. The issue described here was resolved in #7718 - I suspect you're running into something different.

@k3s-io k3s-io locked as resolved and limited conversation to collaborators Nov 16, 2023