Enable public access to services on production cluster #16

Closed
larsks opened this issue Oct 13, 2022 · 13 comments · Fixed by OCP-on-NERC/nerc-ocp-config#146
Labels: help wanted (Extra attention is needed), openshift (This issue pertains to NERC OpenShift)

Comments

larsks commented Oct 13, 2022

We need to be able to expose services on the production cluster on a public address. This will probably require configuring a new ingress controller and attaching the worker nodes to a public VLAN.
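
As a rough sketch of the ingress-controller piece (the name, domain, and labels here are illustrative, not the actual NERC configuration), a secondary IngressController published through a LoadBalancer Service might look like:

# Hypothetical secondary ingress controller; MetalLB (or similar) would provide
# the address for the resulting LoadBalancer Service.
oc apply -f - <<'EOF'
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: public
  namespace: openshift-ingress-operator
spec:
  domain: apps-public.example.org            # placeholder public apps domain
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: External
  nodePlacement:
    nodeSelector:
      matchLabels:
        nerc.mghpcc.org/external-ingress: "true"   # illustrative node label
  routeSelector:
    matchLabels:
      public: "true"                         # only expose explicitly labeled routes
EOF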

joachimweyl commented Oct 18, 2022

The effort estimate doesn't reflect that this work requires going through multiple people to get the necessary access. It's important to keep in mind that the turnaround could take a long time.

larsks changed the title from "Enable public access to services production cluster" to "Enable public access to services on production cluster" on Oct 25, 2022
larsks added a commit to larsks/nerc-ocp-config that referenced this issue Oct 26, 2022
Install the metallb [1] operator on nerc-ocp-prod. Our immediate use case
for this operator is providing a public ip address to the public-facing
ingress service.

Part of: nerc-project/operations#16

[1]: https://metallb.universe.tf/
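
(For context, installing the MetalLB operator through OLM amounts to roughly the following; the channel and install namespace are assumptions, and the actual manifests live in nerc-ocp-config.)

oc create namespace metallb-system
oc apply -f - <<'EOF'
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: metallb-operator
  namespace: metallb-system
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: metallb-operator
  namespace: metallb-system
spec:
  channel: stable                    # assumed channel
  name: metallb-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
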
larsks commented Oct 28, 2022

Some resources that have been useful; leaving these here as notes to myself for now:

larsks commented Oct 28, 2022

Everything is set up and the configuration looks correct, but externally originated requests are getting dropped. I'm currently trying to figure out what's going on. As part of that, I've posted a message to an internal mailing list that I'm reproducing here because it summarizes where I am:


I've set up metallb on an OCP 4.11 cluster using OpenShiftSDN. The
cluster was installed on an internal network, but we want to expose
certain services at public addresses. Requests to addresses hosted by
MetalLB don't make it to their targets, and I haven't been able to
figure out where or why they're being dropped.


There are a set of nodes (identified by the label
nerc.mghpcc.org/external-ingress=true) that are connected to a
public network. The relevant interface configuration on these nodes
looks like:

8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 10:7d:1a:9c:7c:1d brd ff:ff:ff:ff:ff:ff
    inet 10.30.6.23/23 brd 10.30.7.255 scope global dynamic noprefixroute bond0
       valid_lft 541223sec preferred_lft 541223sec
    inet6 fe80::6d40:59df:9856:e6b0/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
9: bond0.2180@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 10:7d:1a:9c:7c:1d brd ff:ff:ff:ff:ff:ff
    inet 199.94.61.23/24 brd 199.94.61.255 scope global dynamic noprefixroute bond0.2180
       valid_lft 541222sec preferred_lft 541222sec

Here, bond0 has the primary address and bond0.2180 has the public
address. The main routing table looks like this:

default via 10.30.6.1 dev bond0 proto dhcp src 10.30.6.23 metric 300
10.30.6.0/23 dev bond0 proto kernel scope link src 10.30.6.23 metric 300
10.30.10.0/23 dev bond0.2173 proto kernel scope link src 10.30.10.23 metric 402
10.128.0.0/14 dev tun0 scope link
10.255.116.0/23 via 10.30.10.1 dev bond0.2173 proto dhcp src 10.30.10.23 metric 402
172.30.0.0/16 dev tun0
199.94.61.0/24 dev bond0.2180 proto kernel scope link src 199.94.61.23 metric 401

In order to support connections to the public address, we are using
policy-based routing with the following rules:

0:      from all lookup local
32764:  from 199.94.61.0/24 lookup main suppress_prefixlength 0
32765:  from 199.94.61.0/24 lookup 200
32766:  from all lookup main
32767:  from all lookup default

And the following routes in table 200:

 default via 199.94.61.1 dev bond0.2180

This all works: if I start a service directly on any of these nodes
(e.g., nc -l 199.94.61.23 80), I can connect to that service from the
public internet.
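
For reference, the rules and the table-200 route above correspond to
commands along these lines (how they are persisted on the nodes is not
shown here):

ip rule add priority 32764 from 199.94.61.0/24 lookup main suppress_prefixlength 0
ip rule add priority 32765 from 199.94.61.0/24 lookup 200
ip route add default via 199.94.61.1 dev bond0.2180 table 200

# verify:
ip rule show
ip route show table 200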

We have MetalLB installed on the cluster with the following
configuration:

apiVersion: metallb.io/v1beta1
kind: MetalLB
metadata:
  name: metallb
  namespace: metallb-system
spec:
  nodeSelector:
    nerc.mghpcc.org/external-ingress: "true"

We're using L2Advertisement mode with the following address pool:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: public
  namespace: metallb-system
spec:
  addresses:
  - 199.94.61.240/28
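
The L2Advertisement object itself isn't shown above; a minimal sketch
would look something like this (restricting announcements to the
external-ingress nodes is an assumption based on the MetalLB
nodeSelector):

oc apply -f - <<'EOF'
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: public
  namespace: metallb-system
spec:
  ipAddressPools:
  - public
  nodeSelectors:
  - matchLabels:
      nerc.mghpcc.org/external-ingress: "true"   # assumed: announce only from ingress nodes
EOF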

With this configuration in place, LoadBalancer-type Services acquire
an address from the expected network range, but we are unable to
connect to these services from outside the network. We've deployed a
simple web server pod and the following Service for testing:

apiVersion: v1
kind: Service
metadata:
  annotations:
    metallb.universe.tf/address-pool: public
    metallb.universe.tf/loadBalancerIPs: 199.94.61.241
  labels:
    app: example
  name: example
  namespace: default
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: http
  selector:
    app: example
  type: LoadBalancer

The public address is owned by node "wrk-13", and running tcpdump on
interface bond0.2180 on that node shows the incoming request:

14:59:21.627969 IP 108.7.73.199.45940 > 199.94.61.241.http: Flags
  [S], seq 2062582668, win 64240, options [mss 1460,sackOK,TS val
  358098238 ecr 0,nop,wscale 7], length 0

Enabling nft tracing shows us that the packet ultimately hits the
correct DNAT rule:

trace id 5c72f445 ip mangle PREROUTING packet: iif "bond0.2180" ether saddr 00:09:0f:09:00:22 ether daddr 10:7d:1a:9c:7c:1d ip saddr 108.7.73.199 ip daddr 199.94.61.241 ip dscp af21 ip ecn not-ect ip ttl 49 ip id 19136 ip length 60 tcp sport 58066 tcp dport 80 tcp flags == syn tcp window 64240
trace id 5c72f445 ip mangle PREROUTING rule meta l4proto tcp ip daddr 199.94.61.241 meta nftrace set 1 (verdict continue)
trace id 5c72f445 ip mangle PREROUTING verdict continue
trace id 5c72f445 ip mangle PREROUTING policy accept
trace id 5c72f445 ip nat PREROUTING packet: iif "bond0.2180" ether saddr 00:09:0f:09:00:22 ether daddr 10:7d:1a:9c:7c:1d ip saddr 108.7.73.199 ip daddr 199.94.61.241 ip dscp af21 ip ecn not-ect ip ttl 49 ip id 19136 ip length 60 tcp sport 58066 tcp dport 80 tcp flags == syn tcp window 64240
trace id 5c72f445 ip nat PREROUTING rule  counter packets 63312 bytes 6392581 jump KUBE-SERVICES (verdict jump KUBE-SERVICES)
trace id 5c72f445 ip nat KUBE-SERVICES rule meta l4proto tcp ip daddr 199.94.61.241  tcp dport 80 counter packets 3 bytes 180 jump KUBE-FW-OVNCXUYVQAG4IM75 (verdict jump KUBE-FW-OVNCXUYVQAG4IM75)
trace id 5c72f445 ip nat KUBE-FW-OVNCXUYVQAG4IM75 rule  counter packets 3 bytes 180 jump KUBE-MARK-MASQ (verdict jump KUBE-MARK-MASQ)
trace id 5c72f445 ip nat KUBE-MARK-MASQ rule counter packets 139 bytes 12432 meta mark set mark or 0x1  (verdict continue)
trace id 5c72f445 ip nat KUBE-MARK-MASQ verdict continue meta mark 0x00000001
trace id 5c72f445 ip nat KUBE-FW-OVNCXUYVQAG4IM75 rule  counter packets 3 bytes 180 jump KUBE-SVC-OVNCXUYVQAG4IM75 (verdict jump KUBE-SVC-OVNCXUYVQAG4IM75)
trace id 5c72f445 ip nat KUBE-SVC-OVNCXUYVQAG4IM75 rule  counter packets 3 bytes 180 jump KUBE-SEP-7IYK4YORJLH3EP2Z (verdict jump KUBE-SEP-7IYK4YORJLH3EP2Z)
trace id 5c72f445 ip nat KUBE-SEP-7IYK4YORJLH3EP2Z rule meta l4proto tcp   counter packets 3 bytes 180 dnat to 10.129.2.9:8080 (verdict accept)

And apparently that's where it stops. The address in that final DNAT
rule is the address of the pod hosting a simple web service, but the
packet never reaches the pod. Running tcpdump in the pod's network
namespace confirms that it never arrives.

If we locally run curl 10.129.2.9:8080 on any of the worker nodes, it
works as expected.

larsks commented Oct 28, 2022

I have also tried reaching out on the CoreOS Slack #forum-sdn channel.

larsks added the help wanted label on Oct 28, 2022
larsks commented Oct 28, 2022

I've tried tracing this through OVS with the following command:

ovs-appctl ofproto/trace br0 in_port=2,tcp,nw_src=108.7.73.199,nw_dst=10.129.2.9,tcp_dst=8080

That produces:

Flow: tcp,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=108.7.73.199,nw_dst=10.129.2.9,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=8080,tcp_flags=0

bridge("br0")
-------------
 0. ct_state=-trk,ip, priority 1000
    ct(table=0)
    drop
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 0.
     -> Sets the packet to an untracked state, and clears all the conntrack fields.

Final flow: unchanged
Megaflow: recirc_id=0,ct_state=-trk,eth,ip,in_port=2,nw_frag=no
Datapath actions: ct,recirc(0x1c204)

===============================================================================
recirc(0x1c204) - resume conntrack with default ct_state=trk|new (use --ct-next to customize)
===============================================================================

Flow: recirc_id=0x1c204,ct_state=new|trk,eth,tcp,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=108.7.73.199,nw_dst=10.129.2.9,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=8080,tcp_flags=0

bridge("br0")
-------------
    thaw
        Resuming from table 0
 0. ip,in_port=2, priority 200
    goto_table:30
30. priority 0
    goto_table:31
31. ip,nw_dst=10.128.0.0/14, priority 100
    goto_table:90
90. ip,nw_dst=10.129.2.0/23, priority 100, cookie 0x5340404c
    move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31]
     -> NXM_NX_TUN_ID[0..31] is now 0
    set_field:10.30.6.28->tun_dst
    output:1
     -> output to kernel tunnel

Final flow: recirc_id=0x1c204,ct_state=new|trk,eth,tcp,tun_src=0.0.0.0,tun_dst=10.30.6.28,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=0,tun_erspan_ver=0,gtpu_flags=0,gtpu_msgtype=0,tun_flags=0,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=108.7.73.199,nw_dst=10.129.2.9,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=8080,tcp_flags=0
Megaflow: recirc_id=0x1c204,ct_state=-rpl+trk,eth,ip,tun_id=0/0xffffffff,tun_dst=0.0.0.0,in_port=2,nw_src=64.0.0.0/2,nw_dst=10.129.2.0/23,nw_ecn=0,nw_frag=no
Datapath actions: set(tunnel(tun_id=0x0,dst=10.30.6.28,ttl=64,tp_dst=4789,flags(df|key))),2

That all seems fine: the packet gets directed to the tunnel interface, destined for 10.30.6.28, which is the node actually hosting the target pod.

larsks commented Oct 28, 2022

I have opened case https://access.redhat.com/support/cases/#/case/03349209 on this issue as well.

hpdempsey commented

Requested help from the network engineering group as well.

larsks commented Nov 1, 2022

The consensus is beginning to be that this behavior is a bug and that it should be resolved in the OVNKubernetes SDN driver, which is what we should be using in any case. The current cluster build (running OpenShiftSDN and 4.11 instead of OVNKubernetes and 4.10) was a mistake, so migrating was part of our plan anyway.

larsks commented Nov 1, 2022

I've initiated a migration of the cluster from the OpenShiftSDN driver to OVNKubernetes.
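
For reference, the documented migration is driven by patching the cluster network configuration; a heavily abbreviated sketch of the core steps (the full procedure also involves waiting for MachineConfig rollouts and rebooting nodes):

# start the migration (triggers the machine config rollout):
oc patch Network.operator.openshift.io cluster --type=merge \
  --patch '{"spec":{"migration":{"networkType":"OVNKubernetes"}}}'

# after the rollout completes, switch the cluster network type:
oc patch Network.config.openshift.io cluster --type=merge \
  --patch '{"spec":{"networkType":"OVNKubernetes"}}'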

larsks commented Nov 1, 2022

Post-OVNKubernetes migration behavior

We've completed the migration to OVNKubernetes, and it does not appear to have had any impact on the problem.

With these rules in place in the mangle table:

table ip mangle {
        chain PREROUTING {
                type filter hook prerouting priority mangle; policy accept;
                jump set_nftrace
        }

        chain INPUT {
                type filter hook input priority mangle; policy accept;
                jump set_nftrace
        }

        chain FORWARD {
                type filter hook forward priority mangle; policy accept;
                jump set_nftrace
        }

        chain OUTPUT {
                type route hook output priority mangle; policy accept;
                counter packets 115693 bytes 86504439 jump OVN-KUBE-ITP
                jump set_nftrace
        }

        chain POSTROUTING {
                type filter hook postrouting priority mangle; policy accept;
                jump set_nftrace
        }

        chain KUBE-IPTABLES-HINT {
        }

        chain KUBE-KUBELET-CANARY {
        }

        chain OVN-KUBE-ITP {
        }

        chain set_nftrace {
                meta l4proto tcp ip daddr 199.94.61.0/24 meta nftrace set 1
                meta l4proto tcp ip saddr 199.94.61.0/24 meta nftrace set 1
                tcp dport 9991 meta nftrace set 1
                tcp sport 9991 meta nftrace set 1
        }
}

If we make a request from a remote host to 199.94.61.241, we see the following output from nft monitor trace:

trace id c789e7c1 ip mangle set_nftrace packet: iif "bond0.2180" ether saddr 00:09:0f:09:00:22 ether daddr 10:7d:1a:9c:7c:1d ip saddr 108.7.73.199 ip daddr 199.94.61.241 ip dscp af21 ip ecn not-ect ip ttl 49 ip id 63993 ip length 60 tcp sport 45398 tcp dport 80 tcp flags == syn tcp window 64240
trace id c789e7c1 ip mangle set_nftrace rule meta l4proto tcp ip daddr 199.94.61.0/24 meta nftrace set 1 (verdict continue)
trace id c789e7c1 ip mangle set_nftrace verdict continue
trace id c789e7c1 ip mangle PREROUTING verdict continue
trace id c789e7c1 ip mangle PREROUTING policy accept
trace id c789e7c1 ip nat PREROUTING packet: iif "bond0.2180" ether saddr 00:09:0f:09:00:22 ether daddr 10:7d:1a:9c:7c:1d ip saddr 108.7.73.199 ip daddr 199.94.61.241 ip dscp af21 ip ecn not-ect ip ttl 49 ip id 63993 ip length 60 tcp sport 45398 tcp dport 80 tcp flags == syn tcp window 64240
trace id c789e7c1 ip nat PREROUTING rule counter packets 1144 bytes 86144 jump OVN-KUBE-ETP (verdict jump OVN-KUBE-ETP)
trace id c789e7c1 ip nat OVN-KUBE-ETP verdict continue
trace id c789e7c1 ip nat PREROUTING rule counter packets 1144 bytes 86144 jump OVN-KUBE-EXTERNALIP (verdict jump OVN-KUBE-EXTERNALIP)
trace id c789e7c1 ip nat OVN-KUBE-EXTERNALIP rule meta l4proto tcp ip daddr 199.94.61.241 tcp dport 80 counter packets 13 bytes 760 dnat to 172.30.61.190:80 (verdict accept)

Running tcpdump shows the same behavior as before; if we run:

tcpdump -nn -i any host 199.94.61.241 or host 172.30.61.190 or host 10.128.2.8

We see as output:

20:34:55.547753 eno1  In  IP0 (invalid)
20:34:55.547754 bond0 In  IP0 (invalid)
20:34:55.547755 bond0.2180 In  IP 108.7.73.199.34910 > 199.94.61.241.80: Flags [S], seq 3013248299, win 64240, options [mss 1460,sackOK,TS val 3071973530 ecr 0,nop,wscale 7], length 0

If we run tcpdump in the pod's network namespace, we never see the request arrive at the pod.

larsks commented Nov 1, 2022

I've updated the case with the information from the previous comment and left a note that @naved001 will be handling the case while I'm out.

larsks commented Nov 2, 2022

If you need network diagnostic tools like tcpdump, there is a netutils script in /usr/local/bin on all the nodes that will run this image. The image includes tcpdump, openvswitch, nft, etc., and the wrapper script sets up all the appropriate options and bind mounts so that everything works. It sets your working directory to /host/root (which is /root on the host) so that any files you create there persist across invocations.

I created this image because there is an issue with the toolbox command in 4.11, pulling the toolbox image requires authentication, and even when toolbox works it doesn't include things like the openvswitch commands.
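
Usage looks roughly like this (assuming the wrapper drops you into a shell inside the tool container, as described above):

# on a node, e.g. after oc debug node/<node> and chroot /host:
/usr/local/bin/netutils

# inside the container the usual tools are available, for example:
tcpdump -nn -i bond0.2180 host 199.94.61.241
nft list ruleset
ovs-vsctl show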

larsks commented Nov 11, 2022

Overview

We are deploying an OpenShift bare metal cluster for the New England Research Cloud (NERC). The cluster (nerc-ocp-prod) is deployed on a private, VPN-accessible network.

We would like to be able to explicitly expose some services on a public address range.

Broadly, our plan was:

  • Add a public network to the worker nodes as a VLAN interface
  • Configure MetalLB to provide addresses on this network
  • Configure a secondary Ingress controller that exposes the ingress service using a MetalLB-managed address

Network configuration

The primary network interface on the worker nodes is a bond interface, consisting of physical interfaces eno1 and eno2 (although eno2 is currently disconnected pending physical cabling in the data center). We are running OVNKubernetes, so bond0 is a member of the br-ex bridge, and the primary IP address lives on the bridge:

15: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 10:7d:1a:9c:7c:1d brd ff:ff:ff:ff:ff:ff
    inet 10.30.6.23/23 brd 10.30.7.255 scope global dynamic noprefixroute br-ex
       valid_lft 559149sec preferred_lft 559149sec
    inet6 fe80::26d6:4de9:617d:b73b/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

In addition to the primary interface, the worker nodes have multiple VLAN interfaces:

  • bond0.2173 provides connectivity to an external Ceph cluster:

    22: bond0.2173@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
        link/ether 10:7d:1a:9c:7c:1d brd ff:ff:ff:ff:ff:ff
        inet 10.30.10.23/23 brd 10.30.11.255 scope global dynamic noprefixroute bond0.2173
           valid_lft 558265sec preferred_lft 558265sec
    
  • bond0.2180 is the public network on the 199.94.61.0/24 range:

    26: bond0.2180@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
        link/ether 10:7d:1a:9c:7c:1d brd ff:ff:ff:ff:ff:ff
        inet 199.94.61.23/24 brd 199.94.61.255 scope global dynamic noprefixroute bond0.2180
           valid_lft 559220sec preferred_lft 559220sec
    

Routing

The main routing table on these nodes looks like this:

default via 10.30.6.1 dev br-ex proto dhcp src 10.30.6.23 metric 48
10.30.6.0/23 dev br-ex proto kernel scope link src 10.30.6.23 metric 48
10.30.10.0/23 dev bond0.2173 proto kernel scope link src 10.30.10.23 metric 401
10.128.0.0/14 via 10.130.0.1 dev ovn-k8s-mp0 mtu 1400
10.130.0.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.130.0.2
10.255.116.0/23 via 10.30.10.1 dev bond0.2173 proto dhcp src 10.30.10.23 metric 401
169.254.169.0/30 via 10.30.6.1 dev br-ex mtu 1400
169.254.169.3 via 10.130.0.1 dev ovn-k8s-mp0 mtu 1400
172.30.0.0/16 via 10.30.6.1 dev br-ex mtu 1400
199.94.61.0/24 dev bond0.2180 proto kernel scope link src 199.94.61.23 metric 402

The default route is via the 10.30.6.0/23 network. This will obviously cause issues for ingress traffic on the public network: it results in an asymmetric return path for these requests, which, depending on network policy, could result in the return traffic being blocked. We can see this if we create a listener and then attempt to connect to it from an external client:

  1. Create a listener:

    nc -l 199.94.61.23 30000
    
  2. Run tcpdump -nn -i any port 30000 on the host.

We see the requests coming in on bond0.2180, but we never send a reply:

14:50:05.314036 bond0.2180 In  IP 108.7.73.199.54646 > 199.94.61.23.30000: Flags [S], seq 3980403317, win 64240, options [mss 1460,sackOK,TS val 3589413438 ecr 0,nop,wscale 7], length 0
14:50:06.346148 bond0.2180 In  IP 108.7.73.199.54646 > 199.94.61.23.30000: Flags [S], seq 3980403317, win 64240, options [mss 1460,sackOK,TS val 3589414471 ecr 0,nop,wscale 7], length 0

In a typical situation, we would arrange for a symmetric return path by introducing some simple policy routing rules:

ip rule add priority 200 from 199.94.61.0/24 lookup 200
ip route add default via 199.94.61.1 table 200

This would cause traffic originating from a 199.94.61.0/24 address on the host to route via the appropriate network gateway. We can verify that this policy works in practice for services hosted directly on the worker nodes; if we repeat the earlier experiment, we see in our tcpdump output:

14:47:40.062955 bond0.2180 In  IP 108.7.73.199.40676 > 199.94.61.23.30000: Flags [S], seq 1041287990, win 64240, options [mss 1460,sackOK,TS val 3589268190 ecr 0,nop,wscale 7], length 0
14:47:40.063000 bond0.2180 Out IP 199.94.61.23.30000 > 108.7.73.199.40676: Flags [S.], seq 2134547490, ack 1041287991, win 28960, options [mss 1460,sackOK,TS val 2981122951 ecr 3589268190,nop,wscale 7], length 0

The request comes in on bond0.2180 and the reply goes out the same interface.

Problems with OpenShift-hosted services

We deployed MetalLB on the cluster in L2Advertisement mode with the following address pool configuration:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  labels:
    app.kubernetes.io/instance: cluster-scope-prod
    nerc.mghpcc.org/kustomized: "true"
  name: public
  namespace: metallb-system
spec:
  addresses:
  - 199.94.61.6/32
  - 199.94.61.240/28
  autoAssign: true

We deployed a simple web server and configured a LoadBalancer-type Service for it to test out MetalLB, using the following manifests:

apiVersion: v1
kind: Service
metadata:
  annotations:
    metallb.universe.tf/address-pool: public
    metallb.universe.tf/loadBalancerIPs: 199.94.61.241
  labels:
    app: example
  name: example
  namespace: default
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: http
  selector:
    app: example
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: example
  name: example
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - args:
        - --port
        - "9991"
        image: alpinelinux/darkhttpd
        name: darkhttpd
        ports:
        - containerPort: 9991
          name: http
          protocol: TCP

This resulted in:

$ kubectl get service example
NAME      TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
example   LoadBalancer   172.30.5.233   199.94.61.241   80:30463/TCP   8d

We found that we were unable to access the service at the 199.94.61.241 address.
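
For example, a check from an external host (a sketch of the kind of test we ran) looks like:

curl -m 5 -v http://199.94.61.241/
# the TCP connection never completes: the client retransmits SYNs until the timeout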

Investigation and diagnostics

We knew that requests were reaching the appropriate worker node; running tcpdump -nn -i any host 199.94.61.241 showed the incoming requests:

15:06:30.236609 bond0.2180 In  IP 108.7.73.199.50376 > 199.94.61.241.80: Flags [S], seq 3435618778, win 64240, options [mss 1460,sackOK,TS val 3557547309 ecr 0,nop,wscale 7], length 0
15:06:31.247616 bond0.2180 In  IP 108.7.73.199.50376 > 199.94.61.241.80: Flags [S], seq 3435618778, win 64240, options [mss 1460,sackOK,TS val 3557548314 ecr 0,nop,wscale 7], length 0
15:06:33.289695 bond0.2180 In  IP 108.7.73.199.50376 > 199.94.61.241.80: Flags [S], seq 3435618778, win 64240, options [mss 1460,sackOK,TS val 3557550362 ecr 0,nop,wscale 7], length 0

Running tcpdump inside the namespace of the darkhttpd pod showed that these requests were never reaching the pod:

# ps -fe |grep darkhttpd
100        13240   13226  0 01:44 ?        00:00:00 darkhttpd /var/www/localhost/htdocs --no-server-id --port 9991
# nsenter -t 13240 -n tcpdump -nn -i eth0
<no results>

Netfilter tracing

We enabled tracing of netfilter rules by injecting the following nftables configuration:

table ip mangle {
        chain PREROUTING {
                type filter hook prerouting priority mangle; policy accept;
                jump set_nftrace
        }

        chain INPUT {
                type filter hook input priority mangle; policy accept;
                jump set_nftrace
        }

        chain FORWARD {
                type filter hook forward priority mangle; policy accept;
                jump set_nftrace
        }

        chain OUTPUT {
                type route hook output priority mangle; policy accept;
                jump set_nftrace
        }

        chain POSTROUTING {
                type filter hook postrouting priority mangle; policy accept;
                jump set_nftrace
        }

        chain set_nftrace {
                meta l4proto tcp ip daddr 199.94.61.0/24 nftrace set 1
                meta l4proto tcp ip saddr 199.94.61.0/24 nftrace set 1
        }
}

With these rules in place, running nft monitor trace showed us that the request was traversing the host firewall in the expected fashion:

trace id 7a2b0e61 ip mangle set_nftrace packet: iif "bond0.2180" ether saddr 00:09:0f:09:00:22 ether daddr 10:7d:1a:9c:7c:1d ip saddr 108.7.73.199 ip daddr 199.94.61.241 ip dscp af21 ip ecn not-ect ip ttl 49 ip id 38949 ip length 60 tcp sport 36484 tcp dport 80 tcp flags == syn tcp window 64240
trace id 7a2b0e61 ip mangle set_nftrace rule meta l4proto tcp ip daddr 199.94.61.0/24 meta nftrace set 1 (verdict continue)
trace id 7a2b0e61 ip mangle set_nftrace verdict continue
trace id 7a2b0e61 ip mangle PREROUTING verdict continue
trace id 7a2b0e61 ip mangle PREROUTING policy accept
trace id 7a2b0e61 ip nat PREROUTING packet: iif "bond0.2180" ether saddr 00:09:0f:09:00:22 ether daddr 10:7d:1a:9c:7c:1d ip saddr 108.7.73.199 ip daddr 199.94.61.241 ip dscp af21 ip ecn not-ect ip ttl 49 ip id 38949 ip length 60 tcp sport 36484 tcp dport 80 tcp flags == syn tcp window 64240
trace id 7a2b0e61 ip nat PREROUTING rule counter packets 36414 bytes 2645033 jump OVN-KUBE-ETP (verdict jump OVN-KUBE-ETP)
trace id 7a2b0e61 ip nat OVN-KUBE-ETP verdict continue
trace id 7a2b0e61 ip nat PREROUTING rule counter packets 36414 bytes 2645033 jump OVN-KUBE-EXTERNALIP (verdict jump OVN-KUBE-EXTERNALIP)
trace id 7a2b0e61 ip nat OVN-KUBE-EXTERNALIP rule meta l4proto tcp ip daddr 199.94.61.241 tcp dport 80 counter packets 1616 bytes 95792 dnat to 172.30.5.233:80 (verdict accept)

That final rule was correctly modifying the destination of the incoming request to the address of the example Service:

ip nat OVN-KUBE-EXTERNALIP rule meta l4proto tcp ip daddr 199.94.61.241 tcp
dport 80 counter packets 1616 bytes 95792 dnat to 172.30.5.233:80 (verdict accept)

Simplifying the problem

Our original theory had been that this problem was specific to addresses managed by MetalLB, but we were able to demonstrate the same behavior using the service's NodePort address when attempting to access it over the public network. That is, given:

$ kubectl describe service example
Name:                     example
Namespace:                default
Labels:                   app=example
Annotations:              metallb.universe.tf/address-pool: public
Selector:                 app=example
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.30.5.233
IPs:                      172.30.5.233
LoadBalancer Ingress:     199.94.61.241
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  30463/TCP
Endpoints:                10.130.0.17:9991
Session Affinity:         None
External Traffic Policy:  Cluster

We were able to access the service at 10.30.6.23:30463 from other nodes in the cluster and from VPN-connected clients, but attempts to access 199.94.61.23:30463 from outside the cluster failed in the same way as attempts to access the load balancer address.
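
Concretely, the checks looked something like this (exact invocations are illustrative):

# from another cluster node or a VPN-connected client: succeeds
curl -s -o /dev/null -w '%{http_code}\n' http://10.30.6.23:30463/

# from an external host, via the public address on the same node: times out
curl -m 5 http://199.94.61.23:30463/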

This eliminated MetalLB as the source of the problem.

Packet, where are you?

Andrew Stoycos introduced us to the Packet, where are you? (pwru) tool, which uses eBPF to attach kprobes to several hundred kernel functions, allowing us to trace packets through the kernel network stack.

Running pwru, we were able to positively identify the issue as a routing problem. By running:

pwru --filter-dst-ip 199.94.61.241

We saw the following sequence in the function trace:

0xffff8ee122d1a500     35        [<empty>]      fib_validate_source
0xffff8ee122d1a500     35        [<empty>]    __fib_validate_source
0xffff8ee122d1a500     35        [<empty>]                kfree_skb

This sequence, with kfree_skb immediately following __fib_validate_source, means that fib_validate_source is rejecting the packet. We can confirm exactly what is happening with the following bpftrace program:

kprobe:fib_validate_source {
    $skb = (struct sk_buff*) arg0;
    @dev[tid] = (struct net_device*) arg5;
    @skb[tid] = $skb;
    @ipheader[tid] = ((struct iphdr *) ($skb->head + $skb->network_header));

}

kretprobe:fib_validate_source {
    $skb = @skb[tid];
    $ipheader = @ipheader[tid];
    $dev = @dev[tid];
    $version = $ipheader->version;

    // 0xe9051eac is 172.30.5.233 (the service address) in little endian
    // byte order.
    if ((uint32)$ipheader->daddr == 0xe9051eac) {
        printf("proto %d vers %d | %s:%s:%s -> (%d) %s\n",
            $ipheader->protocol,
            $version,
            $dev->name,
            ntop($ipheader->saddr),
            ntop($ipheader->daddr),
            retval,
            strerror(-retval));
    }

    delete(@dev[tid]);
    delete(@ipheader[tid]);
    delete(@skb[tid]);
}

END {
    clear(@ipheader);
    clear(@skb);
    clear(@dev);
}

If we make a request to the load balancer address while running the above program, we see:

proto 6 vers 4 | bond0.2180:108.7.73.199:172.30.5.233 -> (-18) Invalid cross-device link

That confirms that the request is being rejected by the kernel's reverse-path filter (rp_filter): -18 is -EXDEV, the value returned at the e_rpf label. The relevant logic is defined in the __fib_validate_source function:

  if (FIB_RES_DEV(res) == dev)
    dev_match = true;

  if (dev_match) {
    ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
    return ret;
  }
  if (no_addr)
    goto last_resort;
  if (rpf == 1)
    goto e_rpf;
  fl4.flowi4_oif = dev->ifindex;

  ret = 0;
  if (fib_lookup(net, &fl4, &res, FIB_LOOKUP_IGNORE_LINKSTATE) == 0) {
    if (res.type == RTN_UNICAST)
      ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
  }
  return ret;

last_resort:
  if (rpf)
    goto e_rpf;
  *itag = 0;
  return 0;

e_inval:
  return -EINVAL;
e_rpf:
  return -EXDEV;

Analysis

It turns out that, by default, the OpenShift nodes are configured with the net.ipv4.conf.all.rp_filter sysctl in "strict" mode: if the reverse path for an incoming packet (the route back to its source address) does not go out the same interface on which the packet arrived, the packet is dropped. But with the policy routing rules in place, why is this happening?

The problem comes down to the interaction between the NAT rules in the host firewall and the point at which the kernel makes its routing decisions. While the incoming request is entirely valid, it hits a DNAT rule in the PREROUTING chain transforming:

108.7.73.199 -> 199.94.61.241

Into:

108.7.73.199 -> 172.30.5.233

That means the reply is going to look like:

172.30.5.233 -> 108.7.73.199

In order for that to work, we need to perform some form of source NAT on the way out. Unfortunately, this doesn't happen until the POSTROUTING chain, which means that when the kernel makes its routing decision, the packet still carries its pre-SNAT source address and the policy rules we have in place don't apply. Recall that we were matching source addresses from the public range:

ip rule add priority 200 from 199.94.61.0/24 lookup 200

Because the reply has a source address of 172.30.5.233, this rule doesn't match. The source NAT doesn't happen until after the kernel has already made the routing decision.
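
One way to see the mismatch from the command line is with ip route get (a sketch; this is a closely related lookup rather than the exact check the kernel performs on the DNAT-ed packet):

# With the service address as the source, none of the policy rules match, so the
# reverse route resolves via br-ex rather than bond0.2180; while rp_filter is
# strict this lookup is rejected with EXDEV ("Invalid cross-device link"):
ip route get 108.7.73.199 from 172.30.5.233 iif bond0.2180

# With a source in the public range, the table-200 rule applies and the lookup
# resolves back out bond0.2180, which is why locally hosted listeners work:
ip route get 108.7.73.199 from 199.94.61.241 iif bond0.2180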

Solution

The workaround in this case is to configure rp_filter in "loose" mode. After making this change on the worker nodes, we are able to successfully access services exposed on public addresses.
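
On a node, the change itself is just a sysctl; making it persistent across reboots requires something like a MachineConfig or Tuned profile (the actual mechanism is part of the linked nerc-ocp-config change and is not shown here):

# 0 = off, 1 = strict, 2 = loose; the effective value for an interface is the
# higher of the "all" and per-interface settings.
cat /proc/sys/net/ipv4/conf/all/rp_filter     # 1 (strict) before the change
sysctl -w net.ipv4.conf.all.rp_filter=2       # switch to loose mode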

This does result in an asymmetric routing configuration; we now see:

15:33:09.263976 bond0.2180 In  IP 108.7.73.199.43148 > 199.94.61.241.80: Flags [S], seq 1421737879, win 64240, options [mss 1460,sackOK,TS val 3559146335 ecr 0,nop,wscale 7], length 0
15:33:09.266165 br-ex Out IP 199.94.61.241.80 > 108.7.73.199.43148: Flags [S.], seq 1706875023, ack 1421737880, win 26960, options [mss 1360,sackOK,TS val 3199984760 ecr 3559146335,nop,wscale 7], length 0
15:33:09.266311 bond0 Out IP 199.94.61.241.80 > 108.7.73.199.43148: Flags [S.], seq 1706875023, ack 1421737880, win 26960, options [mss 1360,sackOK,TS val 3199984760 ecr 3559146335,nop,wscale 7], length 0
15:33:09.266313 eno1  Out IP 199.94.61.241.80 > 108.7.73.199.43148: Flags [S.], seq 1706875023, ack 1421737880, win 26960, options [mss 1360,sackOK,TS val 3199984760 ecr 3559146335,nop,wscale 7], length 0

That is, the request comes in on bond0.2180, but the reply goes out br-ex.
