feat(bpf): implement stack bypass #458

Merged

Conversation

jschwinger233
Member

@jschwinger233 jschwinger233 commented Feb 19, 2024

Stack bypass implementation for the hijack path

Background

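The problems so far have centered on:

  1. netfilter: e.g. once wan ingress traffic enters the stack, extra nft configuration is needed to allow 0x8000000; lan ingress conflicts with nft flow tables; and so on.
  2. sysctl: e.g. wan ingress requires sysctl accept_local=1, dae0 requires rp_filter=0, and so on.
  3. Layer 2 neighbor system: e.g. dae0's lladdr gets modified by systemd.

This PR tries to make the "hijack path" (the "diversion path") bypass the kernel stack while keeping the datapath symmetric, which should resolve most of these problems.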

Datapath

0.5 wan: note that requests and replies on the hijack path are asymmetric, and that the network stack between wan0 and dae caused most of the problems


            bpf_redirect
            bpf_sk_assign
               ┌────┐
         ┌─────►wan0├─────┐
         │     └────┘     │
 request │                │request
         │                │
       ┌─┴──┐  reply    ┌─▼──┐
       │curl◄───────────┤dae │
       └────┘           └────┘

0.5 lan for udp: requests and replies are also asymmetric, and the fact that systemd may modify dae0's lladdr + sysctl caused quite a few problems as well


  bpf_sk_assign
     ┌────┐  request  ┌───┐
     │lan0├───────────►dae│
     └─▲──┘           └─┬─┘
       │                │reply
 reply │       ┌────────┼───┐
       │       │        │   │
       │  ┌────┼────┐   │   │
       └──┤dae0│peer◄───┘   │
          └────┼────┘       │
               │   dae netns│
               └────────────┘

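New wan datapath: note that traffic between wan0 and dae0 skips the kernel stack in both directions via bpf_redirect, so no nft or sysctl configuration is needed. The numbered labels in the diagram mean:

  1. bpf_wan_egress: makes the diversion decision: direct traffic is let through to the network, diverted traffic is redirected to dae0 with bpf_redirect
  2. bpf_peer_ingress: only diverted traffic can reach this point; it calls bpf_sk_assign to assign the traffic to the dae socket
  3. bpf_dae0_ingress: only replies to diverted traffic can reach this point; it calls bpf_redirect to send them back to wan0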
                           ┌──────────────────┐ 
             1             │ 2                │ 
┌────┐     ┌────┐     ┌────┼────┐      ┌───┐  │ 
│    ├─────►    ├─────►    │    ├──────►   │  │ 
│curl│     │wan0│     │dae0│peer│      │dae│  │ 
│    ◄─────┤    ◄─────┤    │    ◄──────┤   │  │ 
└────┘     └────┘     └────┼────┘      └───┘  │ 
                        3  │     dae netns    │ 
                           └──────────────────┘ 

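New lan datapath: bpf_redirect is used between lan0 and dae0 as well. The numbered labels in the diagram mean:

  1. bpf_lan_ingress: makes the diversion decision: direct traffic is let through into the network stack, diverted traffic is redirected to dae0 with bpf_redirect
  2. bpf_peer_ingress: (the same bpf program as in the wan case above) only diverted traffic can reach this point; it calls bpf_sk_assign to assign the traffic to the dae socket
  3. bpf_dae0_ingress: (the same bpf program as in the wan case above) only replies to diverted traffic can reach this point; it calls bpf_redirect to send them back to lan0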
                ┌──────────────────┐ 
  1             │ 2                │ 
┌────┐     ┌────┼────┐      ┌───┐  │ 
│    ├─────►    │    ├──────►   │  │ 
│lan0│     │dae0│peer│      │dae│  │ 
│    ◄─────┤    │    ◄──────┤   │  │ 
└────┘     └────┼────┘      └───┘  │ 
             3  │     dae netns    │ 
                └──────────────────┘ 

The new path is fully symmetric, which should keep potential problems to a minimum.

Implementation

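  1. bpf prog

Four tc bpf progs are needed:

    • lan_ingress: calls route() to make the diversion decision; for diverted traffic it calls bpf_redirect to send it to dae0. Before redirecting, it records a redirect_track entry whose key is (sip, dip, l4proto) and whose value is (smac, dmac, ifindex), and it rewrites ethhdr->dest to the lladdr of dae0-peer
    • lan_egress: not needed
    • wan_egress: calls route() to make the diversion decision; for diverted traffic it calls bpf_redirect to send it to dae0. Before redirecting, it records a redirect_track entry whose key is (sip, dip, l4proto) and whose value is (smac, dmac, ifindex), and it rewrites ethhdr->dest to the lladdr of dae0-peer
    • wan_ingress: not needed
    • dae0_egress: not needed
    • dae0_ingress: handles reply traffic from the dae process. Looks up redirect_track, rewrites the layer 2 header, and calls bpf_redirect to send the packet back to wan0 or lan0.
    • dae0peer_ingress: handles diverted request traffic; calls sk_lookup + sk_assign.
  2. control

a. Listen on :12345 only inside the dae netns
b. No need to monitor dae0's lladdr any more; only the lladdr of dae0-peer is needed now, and since it lives inside the dae netns it should not get modified (hopefully...)
c. Remove the autoConfigFirewall flag, since nft no longer needs to be configured
d. ip rule only needs to be set inside the dae netns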

Code Walkthrough

  • kernel-test.yaml: bump the action version and add a few tests
  • cmd/run.go: run c.ListenAndServe() inside DaeNetns
  • config/config.go: remove AutoConfigFirewallRule
  • bpf_utils.go: inject a few new constants into the bpf progs
  • control_plane.go: run DaeNetns.Setup() earlier, remove the nft AcceptInputMark, run setupRoutingPolicy inside DaeNetns, and add a call to a new bindDaens function
  • control_plane_core.go: mainly implements bindDaens, which attaches the dae0_ingress and dae0peer_ingress bpf progs
  • tproxy.c: omitted
  • netns_utils.go: trim the settings in setupSysctl and remove monitorDae0LinkAddr
  • udp.go: no need to check for port conflicts; reply to udp directly inside the dae netns

Hijack Path Stack Bypass Implementation

Background

Previous problems centered on:

  1. Netfilter: For example, once wan ingress traffic enters the stack, additional nft configuration is needed to allow 0x8000000; lan ingress conflicts with nft flow tables; and so on.
  2. Sysctl: For example, wan ingress requires setting sysctl accept_local=1, dae0 requires setting rp_filter=0, and so on.
  3. Layer 2 Neighbor System: For example, dae0's lladdr is modified by systemd.

Solution

This PR attempts to bypass the kernel stack for the "hijack path" ("diversion path") and keep the datapath symmetric, which is expected to solve most of the problems.

Datapath

0.5 wan:

Note

Note that hijack path requests and replies are asymmetric, and the network stack from wan0 to dae causes most of the problems.


            bpf_redirect
            bpf_sk_assign
               ┌────┐
         ┌─────►wan0├─────┐
         │     └────┘     │
 request │                │request
         │                │
       ┌─┴──┐  reply    ┌─▼──┐
       │curl◄───────────┤dae │
       └────┘           └────┘

0.5 lan for udp:

  • Requests and replies are also asymmetric.
  • dae0's lladdr + sysctl may be modified by systemd, which also causes many problems.

  bpf_sk_assign
     ┌────┐  request  ┌───┐
     │lan0├───────────►dae│
     └─▲──┘           └─┬─┘
       │                │reply
 reply │       ┌────────┼───┐
       │       │        │   │
       │  ┌────┼────┐   │   │
       └──┤dae0│peer◄───┘   │
          └────┼────┘       │
               │   dae netns│
               └────────────┘

New wan datapath:

  • Note that bpf_redirect is used to bypass the kernel stack in both directions between wan0 and dae0, so no nft or sysctl configuration is required.
                            ┌──────────────────┐
                            │                  │
 ┌────┐     ┌────┐     ┌────┼────┐      ┌───┐  │
 │    ├─────►    ├─────►    │    ├──────►   │  │
 │curl│     │wan0│     │dae0│peer│      │dae│  │
 │    ◄─────┤    ◄─────┤    │    ◄──────┤   │  │
 └────┘     └────┘     └────┼────┘      └───┘  │
                            │     dae netns    │
                            └──────────────────┘

New lan datapath:

  • bpf_redirect is also used between lan0 and dae0.
                 ┌──────────────────┐
                 │                  │
 ┌────┐     ┌────┼────┐      ┌───┐  │
 │    ├─────►    │    ├──────►   │  │
 │lan0│     │dae0│peer│      │dae│  │
 │    ◄─────┤    │    ◄──────┤   │  │
 └────┘     └────┼────┘      └───┘  │
                 │     dae netns    │
                 └──────────────────┘

Implementation

  1. bpf prog

    Four tc bpf progs are needed:

    • lan_ingress: Invokes route() to make the diversion decision; for diverted traffic, calls bpf_redirect to send it to dae0. Before redirecting, it records a redirect_track entry whose key is (sip, dip, l4proto) and whose value is (smac, dmac, ifindex), and it rewrites ethhdr->dest to dae0-peer's lladdr.
    • lan_egress: Not needed
    • wan_egress: Invokes route() to make the diversion decision; for diverted traffic, calls bpf_redirect to send it to dae0. Before redirecting, it records a redirect_track entry whose key is (sip, dip, l4proto) and whose value is (smac, dmac, ifindex), and it rewrites ethhdr->dest to dae0-peer's lladdr.
    • wan_ingress: Not needed
    • dae0_egress: Not needed
    • dae0_ingress: Handles reply traffic from dae process. Queries redirect_track, modifies layer 2 header, calls bpf_redirect to redirect to wan0 or lan0.
    • dae0peer_ingress: Handles routed request traffic, calls sk_lookup + sk_assign.
  2. control

    a. Listen only in dae netns on :12345.
    b. No need to monitor dae0's lladdr any more; only dae0-peer's lladdr is needed now, and since it lives inside the dae netns it shouldn't be modified (hopefully...).
    c. Remove autoConfigFirewall flag because nft configuration is not needed.
    d. Set ip rule only in dae netns.

Code Walkthrough

  • kernel-test.yaml: Bump action version, added a few tests.
  • cmd/run.go: Move c.ListenAndServe() to run within DaeNetns.
  • config/config.go: Remove AutoConfigFirewallRule.
  • bpf_utils.go: Need to inject a few new constants to bpf prog.
  • control_plane.go: Run DaeNetns.Setup() earlier, remove nft AcceptInputMark, move setupRoutingPolicy to run within DaeNetns, add a new function call bindDaens.
  • control_plane_core.go: Mainly implements bindDaens function, where dae0_ingress and dae0peer_ingress bpf programs are attached.
  • tproxy.c: Omitted here.
  • netns_utils.go: Reduce setupSysctl settings, remove monitorDae0LinkAddr.
  • udp.go: No need to check port conflicts, reply udp directly within dae netns.

Checklist

Full Changelogs

  • [Implement ...]

Issue Reference

Closes #[issue number]

Test Result

@jschwinger233 jschwinger233 marked this pull request as ready for review February 19, 2024 16:41
@jschwinger233 jschwinger233 requested review from a team as code owners February 19, 2024 16:41
@QiuSimons

QiuSimons commented Feb 20, 2024

Could not get proxying to work (openwrt23.05_aarch64); log below

time="2024-02-20T12:16:08+08:00" level=info msg="Loading eBPF programs and maps into the kernel..."
time="2024-02-20T12:16:08+08:00" level=info msg="The loading process takes about 120MB free memory, which will be released after loading. Insufficient memory will cause loading failure."
time="2024-02-20T12:16:24+08:00" level=info msg="Loaded eBPF programs and maps"
time="2024-02-20T12:16:24+08:00" level=info msg="Routing match set len: 1/64"
time="2024-02-20T12:16:24+08:00" level=warning msg="[Reload] Received reload signal; prepare to reload"
time="Feb 20 12:16:24" level=warning msg="[Reload] Load new control plane"
time="Feb 20 12:16:28" level=warning msg="[Reload] Stopped old control plane"
time="2024-02-20T12:16:28+08:00" level=warning msg="IpRuleDel: no such file or directory; no such file or directory"
time="2024-02-20T12:16:28+08:00" level=warning msg="IpRouteDel: no such process; no such file or directory"
time="Feb 20 12:16:28" level=warning msg="[Reload] Serve"
time="Feb 20 12:16:28" level=warning msg="[Reload] Finished"

update: this may be because I paired this new core with dae-wing

@Testeera

If the new hijack path works out, will it be able to proxy pppoe-wan?

@douglarek
Contributor

douglarek commented Feb 20, 2024

Awesome. Tested successfully on x86_64 (kernel 6.7.5), it seems to work well. BTW: to apply this pull request, the auto_config_firewall_rule configuration needs to be removed if you have set it.

@Basstorm

Basstorm commented Feb 20, 2024

Runs stably on immortalWrt 23.05.01, and the various DNS problems from 0.5 appear to be resolved as well

@douglarek
Contributor

douglarek commented Feb 20, 2024

Runs stably on immortalWrt 23.05.01, and the various DNS problems from 0.5 appear to be resolved as well

Nice testing! If convenient, could you test which of the items in issue #79 are now unnecessary? Let's see if dae can run without those modifications.

@sdgrfe

sdgrfe commented Feb 20, 2024

Great, traffic is diverted successfully

@douglarek
Contributor

@douglarek this patch may be helpful to get rid of the warnings by dropping unknown traffic at dae0-peer: b7d836d

Thank you for your efforts to fix this issue, unfortunately the warning log has not been cleared.

@douglarek
Contributor

douglarek commented Feb 28, 2024

@douglarek this patch may be helpful to get rid of the warnings by dropping unknown traffic at dae0-peer: b7d836d

Thank you for your efforts to fix this issue, unfortunately the warning log has not been cleared.

Sorry, I pulled the source code but forgot to compile it. Damn. There's no problem now. ✅ By the way, through PR 458, DAE will have a widespread impact on the OpenWRT world's proxy. Good job, big brother.

@piyoki piyoki removed the tested label Feb 29, 2024
@jschwinger233 jschwinger233 force-pushed the gray/exp/wan-redirect-to-dae0 branch from c533215 to e7525dc Compare February 29, 2024 03:57
@douglarek
Contributor

why drop b7d836d ?

@sumire88
Contributor

why drop b7d836d ?

After close inspection, we found out that the MTU patch will bring in additional DNS conflicts, so we propose removing it. When it comes to b7d836d, we are still testing it.

skb->mark will be reset when going across netns (skb_scrub_packet), so
this commit sets a special value in cb[0] which can survive bpf_redirect
and netns crossing.

This solves issues like:

level=warning msg="No AddrPort presented: reading map: key [[::ffff:0.0.0.0]:68, 17, 255.255.255.255:67]: lookup: key does not exist"
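
A toy Go model of the behavior this commit message describes may help: a marker kept only in skb->mark is lost when the packet crosses into another netns, while a marker written to cb[0] is still visible afterwards. Everything here is hypothetical (the `skb` struct, the `redirectMagic` value, and the function names); the model simply encodes the claim from the commit message and is not kernel or bpf code.

```go
package main

import "fmt"

// skb is a toy stand-in for the packet metadata; only the two fields
// discussed in the commit message are modeled.
type skb struct {
	Mark uint32
	CB   [5]uint32
}

// redirectMagic is a hypothetical marker value chosen for this sketch.
const redirectMagic uint32 = 0xdae0

// tagBeforeRedirect sets the marker in both places so the difference
// after the netns crossing is visible.
func tagBeforeRedirect(s skb) skb {
	s.Mark = redirectMagic
	s.CB[0] = redirectMagic
	return s
}

// crossNetns encodes the commit message's statement: mark is reset when
// crossing netns, while cb[0] survives.
func crossNetns(s skb) skb {
	s.Mark = 0
	return s
}

// isRedirected is what a program on the dae0-peer side could check in
// order to tell redirected traffic apart from unknown traffic.
func isRedirected(s skb) bool {
	return s.CB[0] == redirectMagic
}

func main() {
	tagged := crossNetns(tagBeforeRedirect(skb{}))
	stray := crossNetns(skb{Mark: 0x8000000})
	fmt.Println(tagged.Mark, isRedirected(tagged), isRedirected(stray))
}
```

Under this model, unknown traffic such as the DHCP broadcast in the warning above would carry no marker at dae0-peer and could be identified as not redirected.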
@jschwinger233 jschwinger233 force-pushed the gray/exp/wan-redirect-to-dae0 branch from e7525dc to a1a4012 Compare February 29, 2024 04:09
@sumire88
Contributor

a1a4012 does not cause any conflicts - just confirmed. cc @douglarek @jschwinger233

@jschwinger233 jschwinger233 merged commit 6f1db5e into daeuniverse:main Mar 1, 2024
27 checks passed
Vigilans referenced this pull request in hack3ric/mimic Mar 31, 2024
Current implementation simply throws away these packets. This commit is the first step of implementing re-sending them after handshake.
@dae-prow dae-prow bot mentioned this pull request Apr 2, 2024
@dae-prow dae-prow bot mentioned this pull request Jun 11, 2024
Labels
documentation Improvements or additions to documentation feature