Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let route unknown traffic skip the eBPF processing #1

Merged
merged 1 commit into from
Oct 12, 2022

Conversation

Dimonyga
Copy link

When we enable Calico eBPF dataplane, and a packet(e.g, a ping ICMP packet) destined for a VM of the host(VMs are ususually connected to host's physical interface through the macvtap/macvlan interface in either Bridge, VEPA or passthrough mode) from the physical interface, would be falsely bypassed by the eBPF program here and can't reach the target VM from the virtual interface (macvtap/macvlan).

When a packet comes into the eBPF program of traffic control, its destination address(daddr) should be checked if it's for a known route by checking the route map, and if it's for an unknown route, it should be thought as it's not destined for this system, so we should just let it go through(skip) all our eBPF programs processing here by setting the action to TC_ACT_OK, which would skip for subsequent eBPF checkings and processings.

So here we also should not check the unknown route traffic against FIB by bpf_fib_lookup (in forward_or_drop()), since in some systems, the lookup result would be successful like this:
-0 [088] d.s. 1810775.267240: bpf_trace_printk: enp9s0---I: Traffic is towards the host namespace, doing Linux FIB lookup
-0 [088] d.s. 1810775.267243: bpf_trace_printk: enp9s0---I: FIB lookup succeeded - with neigh
-0 [088] d.s. 1810775.267244: bpf_trace_printk: enp9s0---I: Got Linux FIB hit, redirecting to iface 2.
-0 [088] d.s. 1810775.267245: bpf_trace_printk: enp9s0---I: Traffic is towards host namespace, marking with 0x3000000.
-0 [088] d.s. 1810775.267247: bpf_trace_printk: enp9s0---I: Final result=ALLOW (0). Program execution time: 31307ns
-0 [088] d.s. 1810775.267249: bpf_trace_printk: enp9s0---E: New packet at ifindex=2; mark=3000000
-0 [088] d.s. 1810775.267250: bpf_trace_printk: enp9s0---E: Final result=ALLOW (3). Bypass mark bit set.

and it's a wrong processing here since for the packet of a mark of 3000000 at the egress direction would be discarded by the system.

On the other side, we also noticed in some systems, the issue of VM access blocking seems to be disappeared, and the packet can go through the eBPF program and finally reach the target VM. In this case, it does not mean the original action is correct, but just because the FIB lookup just fails here(see the log below), so the packet would be bypass by the eBPF program here with a mark 0x1000000:
-0 [014] ..s. 17619198.981285: 0: eno1np0--I: Traffic is towards the host namespace, doing Linux FIB lookup
-0 [014] ..s. 17619198.981287: 0: eno1np0--I: FIB lookup failed (FIB problem): 7.
-0 [014] ..s. 17619198.981287: 0: eno1np0--I: Traffic is towards host namespace, marking with 0x1000000.
-0 [014] ..s. 17619198.981288: 0: eno1np0--I: Final result=ALLOW (0). Program execution time: 16040ns
So it can correctly skip the wrong marking action above.

At the same time, we would like to say there is a similar processing for the unrelevant traffic in Cilium eBPF implementation:
ep = lookup_ip4_endpoint(ip4);
https://github.com/cilium/cilium/blob/master/bpf/bpf_host.c#L571

and
if (!from_host)
return CTX_ACT_OK;
https://github.com/cilium/cilium/blob/master/bpf/bpf_host.c#L586

Here the endpoint of Cilium eBPF is similar to the route of Calico eBPF.

This patch is also a fix for the issue of
"VM access was blocked when eBPF dataplane used"
projectcalico#6450

Signed-off-by: trevor tao [email protected]

Description

Related issues/PRs

Todos

  • Tests
  • Documentation
  • Release note

Release Note

TBD

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

When we enable Calico eBPF dataplane, and a packet(e.g, a ping ICMP packet) destined
for a VM of the host(VMs are ususually connected to host's physical interface through
the macvtap/macvlan interface in either Bridge, VEPA or passthrough mode)
from the physical interface, would be falsely bypassed
by the eBPF program here and can't reach the target VM from the virtual interface
(macvtap/macvlan).

When a packet comes into the eBPF program of traffic control, its
destination address(daddr) should be checked if it's for a known route
by checking the route map, and if it's for an unknown route, it should
be thought as it's not destined for this system, so we should just let
it go through(skip) all our eBPF programs processing here by setting the
action to TC_ACT_OK, which would skip for subsequent eBPF checkings
and processings.

So here we also should not check the unknown route traffic against FIB by bpf_fib_lookup
(in forward_or_drop()), since in some systems, the lookup result would be successful
like this:
  <idle>-0       [088] d.s. 1810775.267240: bpf_trace_printk: enp9s0---I: Traffic is towards the host namespace, doing Linux FIB lookup
  <idle>-0       [088] d.s. 1810775.267243: bpf_trace_printk: enp9s0---I: FIB lookup succeeded - with neigh
  <idle>-0       [088] d.s. 1810775.267244: bpf_trace_printk: enp9s0---I: Got Linux FIB hit, redirecting to iface 2.
  <idle>-0       [088] d.s. 1810775.267245: bpf_trace_printk: enp9s0---I: Traffic is towards host namespace, marking with 0x3000000.
  <idle>-0       [088] d.s. 1810775.267247: bpf_trace_printk: enp9s0---I: Final result=ALLOW (0). Program execution time: 31307ns
  <idle>-0       [088] d.s. 1810775.267249: bpf_trace_printk: enp9s0---E: New packet at ifindex=2; mark=3000000
  <idle>-0       [088] d.s. 1810775.267250: bpf_trace_printk: enp9s0---E: Final result=ALLOW (3). Bypass mark bit set.

and it's a wrong processing here since for the packet of a mark of 3000000 at
the egress direction would be discarded by the system.

On the other side, we also noticed in some systems, the issue of VM
access blocking seems to be disappeared, and the packet can go through the
eBPF program and finally reach the target VM. In this case, it does not mean the
original action is correct, but just because the FIB lookup just fails here(see the log
below), so the packet would be bypass by the eBPF program here with a mark 0x1000000:
  <idle>-0       [014] ..s. 17619198.981285: 0: eno1np0--I: Traffic is towards the host namespace, doing Linux FIB lookup
  <idle>-0       [014] ..s. 17619198.981287: 0: eno1np0--I: FIB lookup failed (FIB problem): 7.
  <idle>-0       [014] ..s. 17619198.981287: 0: eno1np0--I: Traffic is towards host namespace, marking with 0x1000000.
  <idle>-0       [014] ..s. 17619198.981288: 0: eno1np0--I: Final result=ALLOW (0). Program execution time: 16040ns
So it can correctly skip the wrong marking action above.

At the same time, we would like to say there is a similar processing for
the unrelevant traffic in Cilium eBPF implementation:
        ep = lookup_ip4_endpoint(ip4);
https://github.com/cilium/cilium/blob/master/bpf/bpf_host.c#L571

and
        if (!from_host)
                return CTX_ACT_OK;
https://github.com/cilium/cilium/blob/master/bpf/bpf_host.c#L586

Here the endpoint of Cilium eBPF is similar to the route of Calico eBPF.

This patch is also a fix for the issue of
"VM access was blocked when eBPF dataplane used"
projectcalico#6450

Signed-off-by: trevor tao <[email protected]>
@Dimonyga Dimonyga merged commit 3c41f22 into wgkind:master Oct 12, 2022
Copy link

@tomastigera tomastigera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry again for looking at this too late. Great patch Pls, see my comment. Would you like to submit it to calico upstream? Or should I do that. I think we may just set the skip fib flag where I pointed to. It would be also great to have an fv test of some sort.

@@ -63,6 +64,23 @@ static CALI_BPF_INLINE int parse_packet_ip(struct cali_tc_ctx *ctx) {
}
}

// For ingress packet, check if it has local route matched, for unknown packet,
// just allow it with CALI_REASON_RT_UNKNOWN
if (CALI_F_FROM_HEP) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd take this out of parsing, probably to https://github.com/projectcalico/calico/blob/master/felix/bpf-gpl/tc.c#L505

My reasoning would be that in most setups we do not need to spend time on the route lookup for every packet that comes from a HEP. We would like to do it after we miss in conntrack. Also the benefit would be that we create a conntrack entry and ever subsequent packet would take the same path. Also this would be subject to a policy. Not sure whether that is what you want, but it feels like it should be 🤔

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not so desired to do the route lookup after we miss in conntrack. For the incoming packet from HEP(physical interfaces I think), we need to do an initial check for the incoming packet if its dest is for our Calico system by check if its dst addr is route known, for the route unknown traffic, e.g, the traffic sent to VMs in this host, we should just bypass it (by set it TC_ACT_OK) to skip subsequent processings in the Linux kernel, then it would be forwarded to their interfaces(ususually macvlan/macvtap intf bridged to this physical interface) then. This type of route unknown traffic should not be processed by any Calico ebpf logic here.
Similar processing was seen in the Cilium code, as what I gave out in the commit msg.
The conntrack check is mostly for post_nat_ip_dst addr which had been processed Calico's eBPF logic, I think it may be too late to determine if this packet really belongs to this Calico system.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not so desired to do the route lookup after we miss in conntrack. For the incoming packet from HEP(physical interfaces I think), we need to do an initial check for the incoming packet if its dest is for our Calico system by check if its dst addr is route known...

We definitely do not want to do route lookup for every packet on HEP. In most cases that would be followed by conntrack lookup if that is for something within the cluster. We want to do it the other way around. If there is a miss, we classify the flow and then create conntrack so that we have the same classification for subsequent traffic. The difference here is performance and not what ultimately happens to that packet.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here we also should not check the unknown route traffic against FIB by bpf_fib_lookup (in forward_or_drop()), since in some systems, the lookup result would be successful like this:
-0 [088] d.s. 1810775.267240: bpf_trace_printk: enp9s0---I: Traffic is towards the host namespace, doing Linux FIB lookup
-0 [088] d.s. 1810775.267243: bpf_trace_printk: enp9s0---I: FIB lookup succeeded - with neigh
-0 [088] d.s. 1810775.267244: bpf_trace_printk: enp9s0---I: Got Linux FIB hit, redirecting to iface 2.
-0 [088] d.s. 1810775.267245: bpf_trace_printk: enp9s0---I: Traffic is towards host namespace, marking with 0x3000000.
-0 [088] d.s. 1810775.267247: bpf_trace_printk: enp9s0---I: Final result=ALLOW (0). Program execution time: 31307ns
-0 [088] d.s. 1810775.267249: bpf_trace_printk: enp9s0---E: New packet at ifindex=2; mark=3000000
-0 [088] d.s. 1810775.267250: bpf_trace_printk: enp9s0---E: Final result=ALLOW (3). Bypass mark bit set.

and it's a wrong processing here since for the packet of a mark of 3000000 at the egress direction would be discarded by the system.

Why is "packet of a mark of 3000000 at the egress direction" discarded by the system? That mark says, don't look at it, just allow it:

4944
4945 -0 [088] d.s. 1810775.267249: bpf_trace_printk: enp9s0---E: New packet at ifindex=2; mark=3000000
4946
4947 -0 [088] d.s. 1810775.267250: bpf_trace_printk: enp9s0---E: Final result=ALLOW (3). Bypass mark bit set.

Copy link

@TrevorTaoARM TrevorTaoARM Oct 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not so desired to do the route lookup after we miss in conntrack. For the incoming packet from HEP(physical interfaces I think), we need to do an initial check for the incoming packet if its dest is for our Calico system by check if its dst addr is route known...

We definitely do not want to do route lookup for every packet on HEP. In most cases that would be followed by conntrack lookup if that is for something within the cluster. We want to do it the other way around. If there is a miss, we classify the flow and then create conntrack so that we have the same classification for subsequent traffic. The difference here is performance and not what ultimately happens to that packet.

Hi Tomas:
Thanks for your comment.
For the issue of the route lookup for every packet on HEP, I think what you concerned here is performance. It's just adding a bpf map(cali_v4_routes) lookup action here which I think should be very quick and have very little effect on the final performance since there are still so many other actions remained.

On the other side, I think what Cilium had done about this issue could be a good referrence for what we care about:
For Cilium, there is a function:
handle_ipv4(struct __ctx_buff ctx, __u32 secctx,
__u32 ipcache_srcid __maybe_unused, const bool from_host)
in https://github.com/cilium/cilium/blob/master/bpf/bpf_host.c#L460
which calls an endpoint lookup action for every incoming packet :
/
Lookup IPv4 address in list of local endpoints and host IPs /
ep = lookup_ip4_endpoint(ip4);
in https://github.com/cilium/cilium/blob/master/bpf/bpf_host.c#L565,
This lookup action is also done agains a map name ENDPOINTS_MAP, for which I think it's sth like our route map here.
and if there is not valid ep (ep is Nil) found here, it would let it go:
/
Below remainder is only relevant when traffic is pushed via cilium_host.
* For traffic coming from external, we're done here.
*/
if (!from_host)
return CTX_ACT_OK;
in https://github.com/cilium/cilium/blob/master/bpf/bpf_host.c#L577, here !from_host means the packet comes from net devices(NETDEV), which is of the same meaning for Calico's from HEP, I think.
This handle_ipv4() function would be called by
static __always_inline int
tail_handle_ipv4(struct __ctx_buff *ctx, __u32 ipcache_srcid, const bool from_host)
in https://github.com/cilium/cilium/blob/master/bpf/bpf_host.c#L656
then called by:
__section_tail(CILIUM_MAP_CALLS, CILIUM_CALL_IPV4_FROM_NETDEV)
int tail_handle_ipv4_from_netdev(struct __ctx_buff *ctx)
{
return tail_handle_ipv4(ctx, 0, false);
}
in https://github.com/cilium/cilium/blob/master/bpf/bpf_host.c#L683
Here the "false" means the packet comes from the netdev.

So I think Cilium also has similar processing for unknown traffic, and AFAIK, Cilium ebpf performance is also very good.

Copy link

@tomastigera tomastigera Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot quite comment on the cilium code as I am not familiar with it. However, if we can have the same feature with less overhead, why not to do that. Otherwise one lookup here and one lookup there adds up 😅

Would you be able to test whether the patch a propose would work for you? 🙏

@tomastigera
Copy link

we should just bypass it (by set it TC_ACT_OK) to skip subsequent processings in the Linux kernel, then it would be forwarded to their interfaces(ususually macvlan/macvtap intf bridged to this physical interface) then.

In contrary, that would make the packet follow through full processing within the host kernel. And that is probably necessary. There is no dispute here. But there are different ways how to achieve this. That is all I am suggesting :)

@tomastigera
Copy link

Would this work for you? projectcalico#6882

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants