
VM access was blocked when eBPF dataplane used #6450

Closed
TrevorTaoARM opened this issue Jul 28, 2022 · 10 comments · Fixed by #6882

@TrevorTaoARM
Contributor

TrevorTaoARM commented Jul 28, 2022

When I enabled the Calico eBPF dataplane on a K8s cluster, the VMs on the eBPF-enabled node (VMs whose NICs are bridged onto the server's physical NIC) could no longer be reached over normal SSH.
When kube-proxy was restored and the eBPF dataplane disabled, SSH access to the VMs was restored as well.

Expected Behavior

Current Behavior

Possible Solution

Steps to Reproduce (for bugs)

The following script was used to enable eBPF dataplane:
#!/bin/bash
set -x

WORKDIR=$(pwd)
TMP_DIR=$(mktemp -d)
MARCH=$(uname -m)
CALICO_VERSION=${1:-3.23.2}

if [ "$MARCH" == "aarch64" ]; then ARCH=arm64;
elif [ "$MARCH" == "x86_64" ]; then ARCH=amd64;
else ARCH="unknown";
fi
echo ARCH=$ARCH

k8s_ep=$(kubectl get endpoints kubernetes -o wide | grep kubernetes | cut -d " " -f 4)
k8s_host=$(echo $k8s_ep | cut -d ":" -f 1)
k8s_port=$(echo $k8s_ep | cut -d ":" -f 2)

cat <<EOF > ${WORKDIR}/k8s_service.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: kubernetes-services-endpoint
  namespace: kube-system
data:
  KUBERNETES_SERVICE_HOST: "KUBERNETES_SERVICE_HOST"
  KUBERNETES_SERVICE_PORT: "KUBERNETES_SERVICE_PORT"
EOF
sed -i "s/KUBERNETES_SERVICE_HOST/${k8s_host}/" ${WORKDIR}/k8s_service.yaml
sed -i "s/KUBERNETES_SERVICE_PORT/${k8s_port}/" ${WORKDIR}/k8s_service.yaml
kubectl apply -f ${WORKDIR}/k8s_service.yaml

echo "Disable kube-proxy:"
kubectl patch ds -n kube-system kube-proxy -p '{"spec":{"template":{"spec":{"nodeSelector":{"non-calico": "true"}}}}}'

if [ ! -f /usr/local/bin/calicoctl ]; then
echo "No calicoctl, install now:"
curl -L https://github.com/projectcalico/calico/releases/download/v${CALICO_VERSION}/calicoctl-linux-${ARCH} -o ${WORKDIR}/calicoctl;
chmod +x ${WORKDIR}/calicoctl;
sudo cp ${WORKDIR}/calicoctl /usr/local/bin;
rm ${WORKDIR}/calicoctl
fi

echo "Enable eBPF:"
calicoctl patch felixconfiguration default --patch='{"spec": {"bpfEnabled": true}}' --allow-version-mismatch

echo "Enable Direct Server Return (DSR) mode (optional):"
#calicoctl patch felixconfiguration default --patch='{"spec": {"bpfExternalServiceMode": "DSR"}}'

Context

When I try to reach the VM (10.169.210.139), which is hosted on a server with the Calico eBPF dataplane enabled, from another server (10.169.242.130), only the first ping packet gets a reply; all subsequent ping packets are lost.

The conntrack dump on the Calico node shows the SSH connection from 10.169.242.130 to the VM (10.169.210.139):
# calico-node -bpf conntrack dump |grep "10.169.210.139"
2022-07-15 08:21:37.276 [INFO][13703] confd/maps.go 433: Loaded map file descriptor. fd=0x7 name="/sys/fs/bpf/tc/globals/cali_v4_ct2"
ConntrackKey{proto=6 10.169.242.130:61701 <-> 10.169.210.139:22} -> Entry{Type:0, Created:17278773931441431, LastSeen:17278777015499210, Flags: Data: {A2B:{Seqno:92691206 SynSeen:true AckSeen:true FinSeen:false RstSeen:false Whitelisted:true Opener:true Ifindex:2} B2A:{Seqno:959809259 SynSeen:true AckSeen:true FinSeen:false RstSeen:false Whitelisted:false Opener:false Ifindex:0} OrigDst:0.0.0.0 OrigPort:0 OrigSPort:0 TunIP:0.0.0.0}} Age: 3.143463957s Active ago 59.406178ms ESTABLISHED

Your Environment

  • Calico version: v3.23.2
  • Orchestrator version (e.g. kubernetes, mesos, rkt): K8s 1.22.1
  • Operating System and version: Ubuntu 20.04 focal Linux kernel 5.10.0
  • Link to your project (optional):
@caseydavenport
Member

CC @tomastigera

@lmm lmm added the area/bpf eBPF Dataplane issues label Aug 9, 2022
TrevorTaoARM added a commit to TrevorTaoARM/calico that referenced this issue Sep 5, 2022
When the Calico eBPF dataplane is enabled, a packet (e.g. an ICMP ping) arriving on the
physical interface and destined for a VM on the host (VMs are usually connected to the
host's physical interface through a macvtap/macvlan interface in bridge, VEPA or
passthrough mode) is falsely bypassed by the eBPF program here and never reaches the
target VM via the virtual (macvtap/macvlan) interface.

When a packet enters the traffic-control eBPF program, its destination address (daddr)
should be checked against the route map. If the route is unknown, the packet should be
treated as not destined for this system, so we should just let it pass by setting the
action to TC_ACT_OK, skipping all subsequent eBPF checks and processing.

For the same reason, traffic with an unknown route should not be checked against the FIB
via bpf_fib_lookup (in forward_or_drop()), because on some systems the lookup succeeds,
like this:
  <idle>-0       [088] d.s. 1810775.267240: bpf_trace_printk: enp9s0---I: Traffic is towards the host namespace, doing Linux FIB lookup
  <idle>-0       [088] d.s. 1810775.267243: bpf_trace_printk: enp9s0---I: FIB lookup succeeded - with neigh
  <idle>-0       [088] d.s. 1810775.267244: bpf_trace_printk: enp9s0---I: Got Linux FIB hit, redirecting to iface 2.
  <idle>-0       [088] d.s. 1810775.267245: bpf_trace_printk: enp9s0---I: Traffic is towards host namespace, marking with 0x3000000.
  <idle>-0       [088] d.s. 1810775.267247: bpf_trace_printk: enp9s0---I: Final result=ALLOW (0). Program execution time: 31307ns
  <idle>-0       [088] d.s. 1810775.267249: bpf_trace_printk: enp9s0---E: New packet at ifindex=2; mark=3000000
  <idle>-0       [088] d.s. 1810775.267250: bpf_trace_printk: enp9s0---E: Final result=ALLOW (3). Bypass mark bit set.

This is wrong: a packet carrying mark 0x3000000 in the egress direction is discarded by
the system.

On other systems, by contrast, the VM-access problem does not appear and the packet
passes through the eBPF program and reaches the target VM. That does not mean the
original behaviour is correct; it is only because the FIB lookup happens to fail there
(see the log below), so the packet is bypassed by the eBPF program with mark 0x1000000:
  <idle>-0       [014] ..s. 17619198.981285: 0: eno1np0--I: Traffic is towards the host namespace, doing Linux FIB lookup
  <idle>-0       [014] ..s. 17619198.981287: 0: eno1np0--I: FIB lookup failed (FIB problem): 7.
  <idle>-0       [014] ..s. 17619198.981287: 0: eno1np0--I: Traffic is towards host namespace, marking with 0x1000000.
  <idle>-0       [014] ..s. 17619198.981288: 0: eno1np0--I: Final result=ALLOW (0). Program execution time: 16040ns
so the wrong marking action shown above is skipped by accident.

Note that the Cilium eBPF implementation handles such unrelated traffic similarly:
        ep = lookup_ip4_endpoint(ip4);
https://github.com/cilium/cilium/blob/master/bpf/bpf_host.c#L571

and
        if (!from_host)
                return CTX_ACT_OK;
https://github.com/cilium/cilium/blob/master/bpf/bpf_host.c#L586

Here Cilium's notion of an endpoint plays the same role as the route in Calico eBPF.

This patch also fixes the issue
"VM access was blocked when eBPF dataplane used"
projectcalico#6450

Signed-off-by: trevor tao <[email protected]>
@TrevorTaoARM
Contributor Author

TrevorTaoARM commented Sep 6, 2022

I first met this issue on an arm64 platform, but it does not seem to occur on some other platforms or systems, e.g. certain x86 systems. I set bpfLogLevel to Debug and compared the eBPF log output of the two cases carefully:

  1. For arm64 platform:
      <idle>-0       [088] d.s. 1810775.267212: bpf_trace_printk: enp9s0---I: New packet at ifindex=2; mark=0
      <idle>-0       [088] d.s. 1810775.267213: bpf_trace_printk: enp9s0---I: No metadata is shared by XDP
      <idle>-0       [088] d.s. 1810775.267215: bpf_trace_printk: enp9s0---I: IP id=13695 s=aa9d0e5 d=aa9d287
      <idle>-0       [088] d.s. 1810775.267217: bpf_trace_printk: enp9s0---I: ICMP; type=8 code=0
      <idle>-0       [088] d.s. 1810775.267218: bpf_trace_printk: enp9s0---I: CT-1 lookup from aa9d0e5:0
      <idle>-0       [088] d.s. 1810775.267219: bpf_trace_printk: enp9s0---I: CT-1 lookup to   aa9d287:0
      <idle>-0       [088] d.s. 1810775.267221: bpf_trace_printk: enp9s0---I: CT-1 Hit! NORMAL entry.
      <idle>-0       [088] d.s. 1810775.267222: bpf_trace_printk: enp9s0---I: CT-1 result: 0x2003
      <idle>-0       [088] d.s. 1810775.267223: bpf_trace_printk: enp9s0---I: conntrack entry flags 0x100
      <idle>-0       [088] d.s. 1810775.267223: bpf_trace_printk: enp9s0---I: CT Hit
      <idle>-0       [088] d.s. 1810775.267224: bpf_trace_printk: enp9s0---I: Entering calico_tc_skb_accepted_entrypoint
      <idle>-0       [088] d.s. 1810775.267226: bpf_trace_printk: enp9s0---I: IP id=13695 s=aa9d0e5 d=aa9d287
      <idle>-0       [088] d.s. 1810775.267226: bpf_trace_printk: enp9s0---I: Entering calico_tc_skb_accepted
      <idle>-0       [088] d.s. 1810775.267227: bpf_trace_printk: enp9s0---I: src=aa9d0e5 dst=aa9d287
      <idle>-0       [088] d.s. 1810775.267228: bpf_trace_printk: enp9s0---I: post_nat=0:0
      <idle>-0       [088] d.s. 1810775.267228: bpf_trace_printk: enp9s0---I: tun_ip=0
      <idle>-0       [088] d.s. 1810775.267229: bpf_trace_printk: enp9s0---I: pol_rc=1
      <idle>-0       [088] d.s. 1810775.267230: bpf_trace_printk: enp9s0---I: sport=0
      <idle>-0       [088] d.s. 1810775.267230: bpf_trace_printk: enp9s0---I: flags=20
      <idle>-0       [088] d.s. 1810775.267231: bpf_trace_printk: enp9s0---I: ct_rc=3
      <idle>-0       [088] d.s. 1810775.267231: bpf_trace_printk: enp9s0---I: ct_related=0
      <idle>-0       [088] d.s. 1810775.267232: bpf_trace_printk: enp9s0---I: mark=0x1000000
      <idle>-0       [088] d.s. 1810775.267233: bpf_trace_printk: enp9s0---I: ip->ttl 64
      <idle>-0       [088] d.s. 1810775.267234: bpf_trace_printk: enp9s0---I: marking enp9_SKB_MARK_BYPASS
      <idle>-0       [088] d.s. 1810775.267235: bpf_trace_printk: enp9s0---I: IP id=13695 s=aa9d0e5 d=aa9d287
      <idle>-0       [088] d.s. 1810775.267235: bpf_trace_printk: enp9s0---I: FIB family=2
      <idle>-0       [088] d.s. 1810775.267236: bpf_trace_printk: enp9s0---I: FIB tot_len=0
      <idle>-0       [088] d.s. 1810775.267237: bpf_trace_printk: enp9s0---I: FIB ifindex=2
      <idle>-0       [088] d.s. 1810775.267237: bpf_trace_printk: enp9s0---I: FIB l4_protocol=1
      <idle>-0       [088] d.s. 1810775.267238: bpf_trace_printk: enp9s0---I: FIB sport=0
      <idle>-0       [088] d.s. 1810775.267238: bpf_trace_printk: enp9s0---I: FIB dport=0
      <idle>-0       [088] d.s. 1810775.267239: bpf_trace_printk: enp9s0---I: FIB ipv4_src=aa9d0e5
      <idle>-0       [088] d.s. 1810775.267240: bpf_trace_printk: enp9s0---I: FIB ipv4_dst=aa9d287
      <idle>-0       [088] d.s. 1810775.267240: bpf_trace_printk: enp9s0---I: Traffic is towards the host namespace, doing Linux FIB lookup
      <idle>-0       [088] d.s. 1810775.267243: bpf_trace_printk: enp9s0---I: FIB lookup succeeded - with neigh
      <idle>-0       [088] d.s. 1810775.267244: bpf_trace_printk: enp9s0---I: Got Linux FIB hit, redirecting to iface 2.
      <idle>-0       [088] d.s. 1810775.267245: bpf_trace_printk: enp9s0---I: Traffic is towards host namespace, marking with 0x3000000.
      <idle>-0       [088] d.s. 1810775.267247: bpf_trace_printk: enp9s0---I: Final result=ALLOW (0). Program execution time: 31307ns
      <idle>-0       [088] d.s. 1810775.267249: bpf_trace_printk: enp9s0---E: New packet at ifindex=2; mark=3000000
      <idle>-0       [088] d.s. 1810775.267250: bpf_trace_printk: enp9s0---E: Final result=ALLOW (3). Bypass mark bit set.

For other systems (x86 in this case), the log shows:

      <idle>-0       [014] ..s. 17619198.981271: 0: eno1np0--I: New packet at ifindex=2; mark=0
      <idle>-0       [014] ..s. 17619198.981271: 0: eno1np0--I: No metadata is shared by XDP
      <idle>-0       [014] ..s. 17619198.981272: 0: eno1np0--I: IP id=53367 s=aa9d0e5 d=aa9d27f
      <idle>-0       [014] ..s. 17619198.981273: 0: eno1np0--I: ICMP; type=8 code=0
      <idle>-0       [014] ..s. 17619198.981273: 0: eno1np0--I: CT-1 lookup from aa9d0e5:0
      <idle>-0       [014] ..s. 17619198.981274: 0: eno1np0--I: CT-1 lookup to   aa9d27f:0
      <idle>-0       [014] ..s. 17619198.981275: 0: eno1np0--I: CT-1 Hit! NORMAL entry.
      <idle>-0       [014] ..s. 17619198.981275: 0: eno1np0--I: CT-1 result: 0x2
      <idle>-0       [014] ..s. 17619198.981276: 0: eno1np0--I: conntrack entry flags 0x100
      <idle>-0       [014] ..s. 17619198.981276: 0: eno1np0--I: CT Hit
      <idle>-0       [014] ..s. 17619198.981277: 0: eno1np0--I: Entering calico_tc_skb_accepted_entrypoint
      <idle>-0       [014] ..s. 17619198.981277: 0: eno1np0--I: IP id=53367 s=aa9d0e5 d=aa9d27f
      <idle>-0       [014] ..s. 17619198.981278: 0: eno1np0--I: Entering calico_tc_skb_accepted
      <idle>-0       [014] ..s. 17619198.981278: 0: eno1np0--I: src=aa9d0e5 dst=aa9d27f
      <idle>-0       [014] ..s. 17619198.981279: 0: eno1np0--I: post_nat=0:0
      <idle>-0       [014] ..s. 17619198.981279: 0: eno1np0--I: tun_ip=0
      <idle>-0       [014] ..s. 17619198.981279: 0: eno1np0--I: pol_rc=1
      <idle>-0       [014] ..s. 17619198.981280: 0: eno1np0--I: sport=0
      <idle>-0       [014] ..s. 17619198.981280: 0: eno1np0--I: flags=20
      <idle>-0       [014] ..s. 17619198.981280: 0: eno1np0--I: ct_rc=2
      <idle>-0       [014] ..s. 17619198.981281: 0: eno1np0--I: ct_related=0
      <idle>-0       [014] ..s. 17619198.981281: 0: eno1np0--I: mark=0x1000000
      <idle>-0       [014] ..s. 17619198.981281: 0: eno1np0--I: ip->ttl 64
      <idle>-0       [014] ..s. 17619198.981282: 0: eno1np0--I: IP id=53367 s=aa9d0e5 d=aa9d27f
      <idle>-0       [014] ..s. 17619198.981283: 0: eno1np0--I: FIB family=2
      <idle>-0       [014] ..s. 17619198.981283: 0: eno1np0--I: FIB tot_len=0
      <idle>-0       [014] ..s. 17619198.981283: 0: eno1np0--I: FIB ifindex=2
      <idle>-0       [014] ..s. 17619198.981283: 0: eno1np0--I: FIB l4_protocol=1
      <idle>-0       [014] ..s. 17619198.981284: 0: eno1np0--I: FIB sport=0
      <idle>-0       [014] ..s. 17619198.981284: 0: eno1np0--I: FIB dport=0
      <idle>-0       [014] ..s. 17619198.981284: 0: eno1np0--I: FIB ipv4_src=aa9d0e5
      <idle>-0       [014] ..s. 17619198.981284: 0: eno1np0--I: FIB ipv4_dst=aa9d27f
      <idle>-0       [014] ..s. 17619198.981285: 0: eno1np0--I: Traffic is towards the host namespace, doing Linux FIB lookup
      <idle>-0       [014] ..s. 17619198.981287: 0: eno1np0--I: FIB lookup failed (FIB problem): 7.
      <idle>-0       [014] ..s. 17619198.981287: 0: eno1np0--I: Traffic is towards host namespace, marking with 0x1000000.
      <idle>-0       [014] ..s. 17619198.981288: 0: eno1np0--I: Final result=ALLOW (0). Program execution time: 16040ns
       vhost-3084463-3084499 [008] .... 17619198.981418: 0: eno1np0--E: New packet at ifindex=2; mark=0
       vhost-3084463-3084499 [008] .... 17619198.981419: 0: eno1np0--E: IP id=42046 s=aa9d27f d=aa9d0e5

The test process is the same on both systems: we ping a VM hosted on a machine with the Calico eBPF dataplane enabled from another host.
On the arm64 platform the ping packet can't reach the VM because it is falsely forwarded by the eBPF program (the forward_or_drop function).
The difference lies in the FIB lookup result: on x86 the FIB lookup failed with code 7 and the packet was marked 0x1000000; on arm64 the FIB lookup succeeded with a neighbour, the packet was marked 0x3000000 and re-appeared in the egress direction of the same interface.

I think a packet destined for a VM rather than for the host itself should first be checked against the eBPF route map to see whether it is actually for the host. If the route lookup comes back unknown, the packet should be treated as NOT destined for this host and returned with TC_ACT_OK, skipping the rest of the eBPF processing.

I saw similar handling of unrelated traffic in the Cilium eBPF implementation:
ep = lookup_ip4_endpoint(ip4);
https://github.com/cilium/cilium/blob/master/bpf/bpf_host.c#L571

and
if (!from_host)
return CTX_ACT_OK;
https://github.com/cilium/cilium/blob/master/bpf/bpf_host.c#L586

Here Cilium's notion of an endpoint plays the same role as the route in Calico eBPF.

I will put up a PR to address this issue; thanks in advance for your review.

Calico versions tested: v3.23.2, v3.24.1 and v3.25.0-0.dev.

@lmm
Contributor

lmm commented Sep 6, 2022

@tomastigera @mazdakn could you guys please take a look?

TrevorTaoARM added a commit to TrevorTaoARM/calico that referenced this issue Oct 3, 2022
@tomastigera
Contributor

@TrevorTaoARM sorry for not responding sooner, totally missed this, 👀 now! And thanks for a great analysis! 🙏

@tomastigera
Contributor

@TrevorTaoARM I commented at your patch ⬆️

@tomastigera
Contributor

The difference lies in the FIB lookup result: on x86 the FIB lookup failed with code 7 and the packet was marked 0x1000000; on arm64 the FIB lookup succeeded with a neighbour, the packet was marked 0x3000000 and re-appeared in the egress direction of the same interface.

It seems the packets ultimately ended up on the egress side of the same device regardless of whether the FIB lookup failed. But I am not quite sure what the packet looks like in the ARM case, as that is missing from the logs when the BYPASS mark is set. Perhaps the host mangled the packet?

tomastigera added a commit to tomastigera/project-calico-calico that referenced this issue Oct 20, 2022
Disable FIB and let the packet go through the host after it is
policed. It is ingress into the system and we do not know the
packet's exact destination. It may be a local VM or something
similar, so we let the host route it or drop it.

projectcalico#6450
@TrevorTaoARM
Contributor Author

The difference lies in the FIB lookup result: on x86 the FIB lookup failed with code 7 and the packet was marked 0x1000000; on arm64 the FIB lookup succeeded with a neighbour, the packet was marked 0x3000000 and re-appeared in the egress direction of the same interface.

It seems the packets ultimately ended up on the egress side of the same device regardless of whether the FIB lookup failed. But I am not quite sure what the packet looks like in the ARM case, as that is missing from the logs when the BYPASS mark is set. Perhaps the host mangled the packet?

@tomastigera Yes, the difference in FIB lookup results between the two platforms really confused me. But it looks like the packet flow towards a given VM is blocked only when eBPF is enabled. I don't know what the subsequent data path for the packet is once the BYPASS mark is set. The only trace I saw was:
-0 [088] d.s. 1810775.267249: bpf_trace_printk: enp9s0---E: New packet at ifindex=2; mark=3000000
-0 [088] d.s. 1810775.267250: bpf_trace_printk: enp9s0---E: Final result=ALLOW (3). Bypass mark bit set.

which shows the packet had been transferred to the egress direction, while on x86 the packet is still in the ingress direction:
-0 [014] ..s. 17619198.981287: 0: eno1np0--I: Traffic is towards host namespace, marking with 0x1000000.
-0 [014] ..s. 17619198.981288: 0: eno1np0--I: Final result=ALLOW (0). Program execution time: 16040ns

@lwr20 lwr20 added the kind/bug label Nov 1, 2022
tomastigera added a commit to tomastigera/project-calico-calico that referenced this issue Nov 3, 2022
@Dimonyga

Dimonyga commented Nov 23, 2022

@tomastigera Fixed, but not completely.
version v3.25.0-0.dev-490-g3b818a2f1494
Topology:
eth0(without IP) ---- bond0(10.208.201.15/24) ---- app(port 2200)

-0       [005] dNs3.   220.318140: bpf_trace_printk: eth0-----I: New packet at ifindex=2; mark=0          
-0       [005] dNs3.   220.318151: bpf_trace_printk: eth0-----I: No metadata is shared by XDP          
-0       [005] dNs3.   220.318152: bpf_trace_printk: eth0-----I: IP id=0 s=a97d428 d=ad0c90f          
-0       [005] dNs3.   220.318153: bpf_trace_printk: eth0-----I: TCP; ports: s=50634 d=2200          
-0       [005] dNs3.   220.318153: bpf_trace_printk: eth0-----I: CT-6 lookup from a97d428:50634          
-0       [005] dNs3.   220.318154: bpf_trace_printk: eth0-----I: CT-6 lookup to   ad0c90f:2200          
-0       [005] dNs3.   220.318155: bpf_trace_printk: eth0-----I: CT-6 Miss for TCP SYN, NEW flow.          
-0       [005] dNs3.   220.318156: bpf_trace_printk: eth0-----I: CT-6 result: NEW.          
-0       [005] dNs3.   220.318156: bpf_trace_printk: eth0-----I: conntrack entry flags 0x0          
-0       [005] dNs3.   220.318157: bpf_trace_printk: eth0-----I: NAT: 1st level lookup addr=ad0c90f port=2200 protocol=6.          
-0       [005] dNs3.   220.318158: bpf_trace_printk: eth0-----I: NAT: Miss.          
-0       [005] dNs3.   220.318160: bpf_trace_printk: eth0-----I: Host RPF check src=a97d428 skb iface=2 strict if 3          
-0       [005] dNs3.   220.318161: bpf_trace_printk: eth0-----I: Host RPF check src=a97d428 skb iface=2 fib rc 0          
-0       [005] dNs3.   220.318161: bpf_trace_printk: eth0-----I: Host RPF check src=a97d428 skb iface=2 result 0          
-0       [005] dNs3.   220.318162: bpf_trace_printk: eth0-----I: Final result=DENY (0). Program execution time: 10037ns

dropped by the RPF check. With BPFEnforceRPF=Disabled:

-0 [005] d.s3. 6710.121268: bpf_trace_printk: eth0-----I: TCP; ports: s=52905 d=2200
-0 [005] d.s3. 6710.121269: bpf_trace_printk: eth0-----I: CT-6 lookup from a97d428:52905
-0 [005] d.s3. 6710.121270: bpf_trace_printk: eth0-----I: CT-6 lookup to ad0c90f:2200
-0 [005] d.s3. 6710.121271: bpf_trace_printk: eth0-----I: CT-6 Miss for TCP SYN, NEW flow.
-0 [005] d.s3. 6710.121274: bpf_trace_printk: eth0-----I: CT-6 result: NEW.
-0 [005] d.s3. 6710.121275: bpf_trace_printk: eth0-----I: conntrack entry flags 0x0
-0 [005] d.s3. 6710.121277: bpf_trace_printk: eth0-----I: NAT: 1st level lookup addr=ad0c90f port=2200 protocol=6.
-0 [005] d.s3. 6710.121280: bpf_trace_printk: eth0-----I: NAT: Miss.
-0 [005] d.s3. 6710.121282: bpf_trace_printk: eth0-----I: Host RPF check disabled
-0 [005] d.s3. 6710.121284: bpf_trace_printk: eth0-----I: Post-NAT dest IP is local host.
-0 [005] d.s3. 6710.121285: bpf_trace_printk: eth0-----I: About to jump to policy program.
-0 [005] d.s3. 6710.121285: bpf_trace_printk: eth0-----I: HEP with no policy, allow.
-0 [005] d.s3. 6710.121287: bpf_trace_printk: eth0-----I: Entering calico_tc_skb_accepted_entrypoint
-0 [005] d.s3. 6710.121288: bpf_trace_printk: eth0-----I: Entering calico_tc_skb_accepted
-0 [005] d.s3. 6710.121289: bpf_trace_printk: eth0-----I: src=a97d428 dst=ad0c90f
-0 [005] d.s3. 6710.121290: bpf_trace_printk: eth0-----I: post_nat=ad0c90f:2200
-0 [005] d.s3. 6710.121291: bpf_trace_printk: eth0-----I: tun_ip=0
-0 [005] d.s3. 6710.121297: bpf_trace_printk: eth0-----I: pol_rc=1
-0 [005] d.s3. 6710.121298: bpf_trace_printk: eth0-----I: sport=52905
-0 [005] d.s3. 6710.121299: bpf_trace_printk: eth0-----I: flags=24
-0 [005] d.s3. 6710.121300: bpf_trace_printk: eth0-----I: ct_rc=0
-0 [005] d.s3. 6710.121301: bpf_trace_printk: eth0-----I: ct_related=0
-0 [005] d.s3. 6710.121302: bpf_trace_printk: eth0-----I: mark=0x1000000
-0 [005] d.s3. 6710.121304: bpf_trace_printk: eth0-----I: ip->ttl 57
-0 [005] d.s3. 6710.121307: bpf_trace_printk: eth0-----I: Allowed by policy: ACCEPT
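
For reference, the BPFEnforceRPF=Disabled setting used for the second trace can be applied as a Felix configuration patch; a sketch only (the bpfEnforceRPF field name is taken from the Felix configuration reference, adjust to your setup):

```shell
calicoctl patch felixconfiguration default --allow-version-mismatch \
  --patch='{"spec": {"bpfEnforceRPF": "Disabled"}}'
```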

@tomastigera
Contributor

@Dimonyga Not sure whether this is related to the original issue. However, if you apply bpf programs to eth0 in this setup, then surely you cannot pass a strict RPF check, because routing says the return path is via bond0, not eth0. So bpfDataIfacePattern must not include eth0 and must include bond0. Note that this is also much more logically correct. However, there is an issue: if you change the pattern, the programs attached to eth0 are not cleared. You can either remove them manually or reboot the nodes. That issue is addressed by #7008.

@Dimonyga

Dimonyga commented Dec 6, 2022

Sorry, my mistake; the setup is a little different:
eth0(no IP) ---- bond0(SUBNET1) --- bond0.208@bond0(SUBNET2) ---- application(port 2200)

When we start calico-node with
bpfDataIfacePattern: ^(bond.*|tunl0$|wireguard.cali$|vxlan.calico$)
access to SUBNET2 is denied. When I pass bpfEnforceRPF: Disabled, access is restored.
In this case, the debug output shows that all packets that should go to bond0.208 are dropped at the bond0 level.
Suggestion: skip packets with VLAN id != 0.

tomastigera added a commit to tomastigera/project-calico-calico that referenced this issue Jun 17, 2024
* Fix: when CALI_ST_SKIP_FIB is set on the way to the host, also set
  CALI_CT_FLAG_SKIP_FIB on the conntrack entry - not just when the packet
  comes from a WEP.

* Add a test for the above and for issue projectcalico#6450.

* In addition to skipping the FIB when there is no route to the post-DNAT
  destination, also skip it when there is a route but it is not local and
  no service was involved. In that case we are not forwarding a service
  (NodePort) to another node and should only forward locally; let the host
  decide what to do with such a packet.

Fixes projectcalico#8918
aitorpazos pushed a commit to team-telnyx/infra-oci-calico-upstream that referenced this issue Jun 18, 2024

(cherry picked from commit 327c4fd)
tomastigera added a commit to tomastigera/project-calico-calico that referenced this issue Jun 18, 2024
tomastigera added a commit to tomastigera/project-calico-calico that referenced this issue Jun 18, 2024