patch/optimize(bpf): improve wan tcp hijack datapath performance #481

Merged: 8 commits, Mar 31, 2024

jschwinger233 (Member) commented Mar 25, 2024

Background

This PR introduces two new BPF programs to accelerate WAN TCP.

Broadly, the data plane of the original WAN TCP hijack path looks like this:

 ┌─────────┐                   ┌─────────┐ 
 │ process │                   │ process │ 
 └────┬────┘                   └────▲────┘ 
      │                             │      
 ┌────▼────┐                   ┌────┴────┐ 
 │ socket  │                   │ socket  │ 
 └────┬────┘                   └────▲────┘ 
      │                             │      
 ┌────▼────┐                   ┌────┴────┐ 
 │ tcp/ip  │                   │ tcp/ip  │ 
 └────┬────┘                   └────▲────┘ 
      │                             │      
 ┌────▼────┐    ┌────┬────┐    ┌────┴────┐ 
 │ routing ├────►veth│veth├────► routing │ 
 └─────────┘    └────┴────┘    └─────────┘ 

This PR shortens that path to:

 ┌─────────┐                   ┌─────────┐ 
 │ process │                   │ process │ 
 └────┬────┘                   └────▲────┘ 
      │                             │      
 ┌────▼────┐                   ┌────┴────┐ 
 │ socket  ├───────────────────► socket  │ 
 └─────────┘                   └─────────┘ 
                                           
 ┌─────────┐                   ┌─────────┐ 
 │ tcp/ip  │                   │ tcp/ip  │ 
 └─────────┘                   └─────────┘ 
                                           
 ┌─────────┐    ┌────┬────┐    ┌─────────┐ 
 │ routing │    │veth│veth│    │ routing │ 
 └─────────┘    └────┴────┘    └─────────┘ 

The results of the optimization are in the Benchmark section.

Implementation Details

Two BPF programs work together:

  1. BPF_PROG_TYPE_SOCK_OPS: This program type is attached to a cgroup and can be triggered when a TCP socket completes its three-way handshake. We look the socket up in routing_tuples_map to decide whether it belongs to a WAN proxy connection; if it does, we call bpf_sock_hash_update to add it to the sockmap.
  2. BPF_PROG_TYPE_SK_MSG: This program type is attached to a sockmap, here the one holding the hijacked WAN proxy sockets collected in step 1. It is triggered whenever one of those sockets sends a message, and it calls bpf_msg_redirect_hash to deliver the TCP segment directly to the peer socket.

Note that TCP handshakes and teardowns still go through the kernel stack and are not accelerated; acceleration only applies once the connection is established.
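
The interplay of the two programs can be pictured with a small userspace model. This is only a sketch in Python, not the BPF code from this PR: the names `Sock`, `FastPath`, `on_established` and `on_sendmsg` are made up for illustration, the dictionary stands in for the sockmap, and looking up the peer by the reversed 4-tuple is an assumption about how the two ends of a hijacked connection are matched.

```python
from collections import deque

# Toy userspace model of the fast path; NOT real BPF code.
# A tuple is (src_ip, src_port, dst_ip, dst_port).


class Sock:
    def __init__(self, tuple4):
        self.tuple4 = tuple4
        self.rx = deque()  # receive queue read by the owning process


def reverse(t):
    """Swap source and destination of a 4-tuple."""
    return (t[2], t[3], t[0], t[1])


class FastPath:
    def __init__(self, routing_tuples):
        # Stands in for routing_tuples_map (hypothetical simplification: a set).
        self.routing_tuples = set(routing_tuples)
        # Stands in for the sockmap, keyed by the socket's own 4-tuple.
        self.sockhash = {}

    def on_established(self, sock):
        """SOCK_OPS step: remember sockets that belong to a hijacked WAN connection."""
        t = sock.tuple4
        if t in self.routing_tuples or reverse(t) in self.routing_tuples:
            self.sockhash[t] = sock
            return True
        return False

    def on_sendmsg(self, sock, data):
        """SK_MSG step: hand the payload straight to the peer socket if it is known."""
        if sock.tuple4 not in self.sockhash:
            return "stack"  # socket was never collected, normal kernel path
        peer = self.sockhash.get(reverse(sock.tuple4))
        if peer is None:
            return "stack"  # peer unknown, fall back to the normal kernel path
        peer.rx.append(data)
        return "redirected"
```

In this model a send from a collected socket lands in the peer's receive queue without touching anything that represents the TCP/IP stack, routing, or the veth pair, which is the shortcut the second diagram shows.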

Benchmark

Latency was measured with sockperf.

dae-0.4.0 results:

# nsenter -t $(pidof dae-0.4.0) -n sockperf ping-pong -i 172.18.0.3 --tcp --time 10
sockperf: == version #3.7-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 172.18.0.3      PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=10.000 sec; Warm up time=400 msec; SentMessages=134874; ReceivedMessages=134873
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=9.550 sec; SentMessages=128877; ReceivedMessages=128877
sockperf: ====> avg-latency=37.006 (std-dev=5.955)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 37.006 usec
sockperf: Total 128877 observations; each percentile contains 1288.77 observations
sockperf: ---> <MAX> observation =  420.339
sockperf: ---> percentile 99.999 =  313.563
sockperf: ---> percentile 99.990 =  206.996
sockperf: ---> percentile 99.900 =   79.486
sockperf: ---> percentile 99.000 =   50.174
sockperf: ---> percentile 90.000 =   42.508
sockperf: ---> percentile 75.000 =   39.476
sockperf: ---> percentile 50.000 =   36.514
sockperf: ---> percentile 25.000 =   34.145
sockperf: ---> <MIN> observation =   21.565

Results with this PR:

# nsenter -t $(pidof dae) -n sockperf ping-pong -i 172.18.0.3 --tcp --time 10
sockperf: == version #3.7-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 172.18.0.3      PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=10.000 sec; Warm up time=400 msec; SentMessages=143488; ReceivedMessages=143487
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=9.550 sec; SentMessages=137069; ReceivedMessages=137069
sockperf: ====> avg-latency=34.788 (std-dev=6.701)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 34.788 usec
sockperf: Total 137069 observations; each percentile contains 1370.69 observations
sockperf: ---> <MAX> observation =  425.241
sockperf: ---> percentile 99.999 =  407.120
sockperf: ---> percentile 99.990 =  244.703
sockperf: ---> percentile 99.900 =   80.511
sockperf: ---> percentile 99.000 =   47.190
sockperf: ---> percentile 90.000 =   40.633
sockperf: ---> percentile 75.000 =   37.325
sockperf: ---> percentile 50.000 =   34.607
sockperf: ---> percentile 25.000 =   31.777
sockperf: ---> <MIN> observation =   20.779

TCP latency improves by about 6% (average 37.006 usec down to 34.788 usec).
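
The figure can be reproduced from the two sockperf runs. The snippet below is a small sketch in Python; the helper name `improvement_pct` is made up here, and the numbers are copied from the sockperf output above.

```python
def improvement_pct(before_usec, after_usec):
    """Relative latency reduction, in percent."""
    return (before_usec - after_usec) / before_usec * 100.0


# (dae-0.4.0, this PR) in usec, taken from the two sockperf runs
results = {
    "avg": (37.006, 34.788),
    "p50": (36.514, 34.607),
    "p99": (50.174, 47.190),
}

for name, (before, after) in results.items():
    print(f"{name}: {improvement_pct(before, after):.2f}% lower")
```

The average and p99 both come out just under 6%. The extreme tail (99.99th percentile and above) is higher in the PR run, so a single 10-second run should not be read as a statement about tail latency.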

However, latency is only one aspect of performance. Running an iperf TCP RR (round-trip) test on my virtual machine blows up memory outright:

[Mon Mar 25 18:17:02 2024] Out of memory: Killed process 1233 (dae) total-vm:1315492kB, anon-rss:86784kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:296kB oom_score_adj:0
[Mon Mar 25 18:17:02 2024] TCP: out of memory -- consider tuning tcp_mem

In real-world workloads, such as redis-server with redis-benchmark, the p99 improvement often exceeds 10%.


jschwinger233 changed the title from "patch/optimize(bpf): improve wan hijack datapath performance" to "patch/optimize(bpf): improve wan tcp hijack datapath performance" Mar 25, 2024
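mzz2017 (Contributor) commented Mar 27, 2024

This optimization is very exciting. It may already be the best-performing approach available on Linux today (for the WAN proxy case). Socket redirection cuts the path down to the shortest possible one. A truly extreme optimization!

Does this optimization require a newer kernel? If so, we may need to add some checks and warnings (as the earlier code does) and update the documentation.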

@sumire88 sumire88 marked this pull request as ready for review March 27, 2024 04:44
@sumire88 sumire88 requested a review from a team as a code owner March 27, 2024 04:44
sotux

This comment was marked as abuse.

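jschwinger233 (Member Author) commented

> Does this optimization require a newer kernel? If so, we may need to add some checks and warnings (as the earlier code does) and update the documentation.

CI tested 5.10 and it appears to work. dae currently requires >= 5.8, so I will build a 5.8 kernel myself and try it.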

wanlce (Contributor) left a comment

Brilliant code!

sumire88 (Contributor) previously approved these changes Mar 27, 2024

LGTM. I've tested the changes on my end and everything works fine. Thanks for proposing the solution; it does improve throughput.

dae-prow[bot] (Contributor) previously approved these changes Mar 27, 2024

🧪 Since the PR has been fully tested, please consider merging it.

@sumire88 sumire88 requested a review from mzz2017 March 27, 2024 15:44
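jschwinger233 (Member Author) commented

> Does this optimization require a newer kernel? If so, we may need to add some checks and warnings (as the earlier code does) and update the documentation.

I built 5.8 (annoyingly, this version is EOL, so I had to patch objtool/elf.c by hand to get it to compile, and the build filled up my disk). dae does not run on it and fails with `in-kernel BTF is malformed`. I think that is simply because 5.8 is old and EOL, so binutils did not generate BTF correctly at build time; it does not mean the code truly cannot run on 5.8.

That said, since it will be hard for me to test 5.8 going forward, it would be better to raise the kernel requirement slightly, to 5.10. 5.10 is an LTS release supported until 31 Dec 2026 ( https://endoflife.date/linux ), and the current CI kernel test already covers it.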
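
A minimum-version check of the kind discussed here could look roughly like the following. This is an illustrative Python sketch, not dae's actual check (dae itself is not written in Python); the function names are hypothetical, and it only compares the major and minor numbers of the release string.

```python
import platform
import re

MIN_KERNEL = (5, 10)  # the minimum proposed in this thread


def parse_kernel_release(release):
    """Extract (major, minor) from strings such as '5.10.0-28-amd64'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def kernel_supported(release, minimum=MIN_KERNEL):
    """True if the release is at least the required (major, minor)."""
    return parse_kernel_release(release) >= minimum


def running_kernel_supported():
    """Check the kernel this process is running on (Linux release strings assumed)."""
    return kernel_supported(platform.release())
```

Comparing tuples keeps the ordering correct across major versions, so 6.1 passes while 5.8 does not.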

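amtoaer commented Mar 28, 2024

I ran into a problem with this build of dae. Reduced to its essentials, the setup is as follows.
Two Docker containers, A and B, run on the dae host and provide web services, both with network_mode: bridge. A's port mapping is a:a and B's is b:b.
Docker's default bridge looks like this: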

docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255

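In this setup, requesting http://172.17.0.1:b/ from inside container A should reach B's web service, but with the build from this PR the request gets no response.


With daily main, these requests are unaffected whether or not dae is enabled:
[image]
With this PR, the request gets no response once dae is enabled:
[image]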

jschwinger233 (Member Author) commented

@amtoaer Is dae configured with lan_interface: docker0?

amtoaer commented Mar 28, 2024

@jschwinger233 Yes, my configuration is:

    lan_interface: docker0,br0
    wan_interface: br0

jschwinger233 (Member Author) commented

@amtoaer OK, I forgot about this scenario. It can be handled.

@jschwinger233 jschwinger233 dismissed stale reviews from dae-prow[bot] and sumire88 via 6f73108 March 28, 2024 05:31
The previous check `if (!bpf_map_lookup_elem(&routing_tuples_map,
&rev_tuple))` can also add local LAN connections made via docker0; this
patch excludes that traffic by checking `!routing_result->pid`.
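
The effect of the extra check can be sketched as a plain predicate. This is a Python model rather than the BPF C code: `RoutingResult` and `should_add_to_sockmap` are hypothetical names, a dictionary stands in for routing_tuples_map, and the model assumes that `pid` is zero for LAN connections such as those arriving via docker0 and non-zero only when a local process originated the connection.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Tuple4 = Tuple[str, int, str, int]  # (src_ip, src_port, dst_ip, dst_port)


@dataclass
class RoutingResult:
    # Assumption: non-zero only when a local process originated the connection.
    pid: int


def should_add_to_sockmap(routing_tuples: Dict[Tuple4, RoutingResult],
                          rev_tuple: Tuple4) -> bool:
    result: Optional[RoutingResult] = routing_tuples.get(rev_tuple)
    if result is None:
        # Old check: no routing entry, so the connection is not hijacked.
        return False
    if not result.pid:
        # New check: an entry exists but has no owning pid, so skip it.
        return False
    return True
```

Under these assumptions a docker0 connection still has a routing entry, which is why the lookup alone was not enough to keep it out of the sockmap.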
amtoaer commented Mar 28, 2024

@jschwinger233 It works correctly now, thanks!
[image]

mzz2017 (Contributor) commented Mar 28, 2024

@jschwinger233 Sure, raising it to 5.10 is fine.

mzz2017 (Contributor) commented Mar 28, 2024

@jschwinger233 Please raise the requirement to 5.10 in the relevant code and documentation. Thanks.

@jschwinger233 jschwinger233 requested a review from a team as a code owner March 28, 2024 07:24
@jschwinger233 jschwinger233 added the documentation Improvements or additions to documentation label Mar 28, 2024
@sumire88 sumire88 requested review from wanlce and a team March 28, 2024 11:07
mzz2017 (Contributor) left a comment

LGTM!

Labels: bpf, documentation, feature, optimize, tested
6 participants