
perf: use recvmmsg in addition to GRO #2137

Draft: 69 commits into base `main`
Conversation

@mxinden (Collaborator) commented on Sep 28, 2024

Previously we would only do GRO.

Depends on #2093.
Part of #1693.

Draft only. Just testing on the benchmark runner for now.

This change is best summarized by the `process` function signature.

On the `main` branch, the `process` function looks like this:

```rust
pub fn process(&mut self, dgram: Option<&Datagram>, now: Instant) -> Output {
```

- It takes as **input** an optional reference to a `Datagram`. That `Datagram` owns
an allocation of the UDP payload, i.e. a `Vec<u8>`. Thus for each incoming UDP
datagram, its payload is allocated in a new `Vec`.
- It returns as **output** an owned `Output`. Most relevantly the `Output` variant
`Output::Datagram(Datagram)` contains a `Datagram` that again owns an allocation of
the UDP payload, i.e. a `Vec<u8>`. Thus for each outgoing UDP datagram too, its
payload is allocated in a new `Vec`.

This commit changes the `process` function to:

```rust
pub fn process_into<'a>(
    &mut self,
    input: Option<Datagram<&[u8]>>,
    now: Instant,
    write_buffer: &'a mut Vec<u8>,
) -> Output<&'a [u8]> {
```

(Note the rename to `process_into` is temporary.)

- It takes as **input** an optional `Datagram<&[u8]>`. But contrary to before,
`Datagram<&[u8]>` does not own an allocation of the UDP payload, but represents
a view into a long-lived receive buffer containing the UDP payload.
- It returns as **output** an `Output<&'a [u8]>` where the
`Output::Datagram(Datagram<&'a [u8]>)` variant does not own an allocation of the
UDP payload, but here as well represents a view into a long-lived write buffer
the payload is written into. That write buffer lives outside of
`neqo_transport::Connection` and is provided to `process` as `write_buffer: &'a
mut Vec<u8>`. Note that both `write_buffer` and `Output` use the lifetime `'a`,
i.e. the latter is a view into the former.
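The owned-versus-borrowed payload distinction can be sketched with a payload-generic `Datagram`. This is a simplified, hypothetical stand-in for neqo's actual type (which carries more fields, e.g. TOS/ECN marking), meant only to illustrate the two models:

```rust
use std::net::SocketAddr;

// Simplified, hypothetical stand-in for a payload-generic `Datagram`.
#[derive(Debug)]
struct Datagram<D> {
    src: SocketAddr,
    dst: SocketAddr,
    payload: D,
}

impl<D: AsRef<[u8]>> Datagram<D> {
    fn payload(&self) -> &[u8] {
        self.payload.as_ref()
    }
}

fn main() {
    let src: SocketAddr = "127.0.0.1:4433".parse().unwrap();
    let dst: SocketAddr = "127.0.0.1:443".parse().unwrap();

    // Long-lived receive buffer, filled by the OS (e.g. via recvmmsg).
    let recv_buffer = vec![0u8; 2048];

    // `main`-branch model: `Datagram<Vec<u8>>` owns a fresh allocation
    // per incoming datagram.
    let owned: Datagram<Vec<u8>> = Datagram {
        src,
        dst,
        payload: recv_buffer[..1200].to_vec(),
    };

    // This PR's model: `Datagram<&[u8]>` is a zero-copy view into the
    // long-lived buffer.
    let view: Datagram<&[u8]> = Datagram {
        src,
        dst,
        payload: &recv_buffer[..1200],
    };

    assert_eq!(owned.payload(), view.payload());
}
```

The generic parameter lets one API surface serve both the owning and the borrowing use, which is why the input type in the new signature can shrink to `Datagram<&[u8]>` without a separate type.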

This change to the `process` function enables the following:

1. A user of `neqo_transport` (e.g. `neqo_bin`) has the OS write incoming UDP
datagrams into a long-lived receive buffer (via e.g. `recvmmsg`).
2. They pass that receive buffer to `neqo_transport::Connection::process` along
with a long-lived write buffer.
3. `process` reads the UDP datagram from the long-lived receive buffer through
the `Datagram<&[u8]>` view and writes outgoing datagrams into the provided
long-lived `write_buffer`, returning a view into said buffer via a `Datagram<&'a
[u8]>`.
4. The user, after having called `process` can then pass the write buffer to the
OS (e.g. via `sendmsg`).

To summarize: a user can receive and send UDP datagrams without any allocation on the UDP IO path.
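The four steps above can be sketched as follows. Note that the `Connection` here is a mock that merely echoes its input; the names mirror the PR, but this is not the real `neqo_transport` API:

```rust
// Hypothetical sketch of the allocation-free IO loop described above.
struct Connection;

enum Output<'a> {
    Datagram(&'a [u8]),
    None,
}

impl Connection {
    fn process_into<'a>(
        &mut self,
        input: Option<&[u8]>,
        write_buffer: &'a mut Vec<u8>,
    ) -> Output<'a> {
        // Mock behavior: copy the incoming payload into the caller's write
        // buffer and hand back a borrowed view of it. Nothing allocates as
        // long as `write_buffer` has sufficient capacity.
        match input {
            Some(payload) => {
                write_buffer.clear();
                write_buffer.extend_from_slice(payload);
                Output::Datagram(&write_buffer[..])
            }
            None => Output::None,
        }
    }
}

fn main() {
    let mut conn = Connection;
    // Long-lived buffers: the OS fills `recv_buffer` (e.g. via recvmmsg),
    // and `write_buffer` is reused across loop iterations.
    let recv_buffer = [1u8, 2, 3, 4];
    let mut write_buffer: Vec<u8> = Vec::with_capacity(2048);

    if let Output::Datagram(view) = conn.process_into(Some(&recv_buffer), &mut write_buffer) {
        // In the real loop, `view` would now be handed to the OS via sendmsg.
        assert_eq!(view, &recv_buffer[..]);
    }
}
```

The lifetime `'a` tying `Output` to `write_buffer` is what forces the caller to consume (or copy) the returned view before reusing the buffer for the next iteration.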

As an aside, the above is compatible with GSO and GRO, where a single send or receive buffer contains multiple consecutive UDP datagram segments.
(When there is no input datagram, one can just use `process(None, ...)`.)
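Splitting a GRO receive buffer into per-datagram views is itself allocation-free. A minimal sketch, assuming the segment size has already been read from the `UDP_GRO` control message (the helper name is illustrative):

```rust
/// Split a GRO receive buffer into per-datagram views. `segment_size` is the
/// value a real receive path would read from the UDP_GRO cmsg; all segments
/// share that size except possibly the final, shorter one.
fn segments(recv_buffer: &[u8], segment_size: usize) -> impl Iterator<Item = &[u8]> {
    recv_buffer.chunks(segment_size)
}

fn main() {
    // 8 bytes of payload, i.e. two full 3-byte segments plus a 2-byte tail,
    // as a single coalesced buffer filled by e.g. one recvmmsg call.
    let recv_buffer: Vec<u8> = (0u8..8).collect();
    for (i, seg) in segments(&recv_buffer, 3).enumerate() {
        println!("segment {i}: {} bytes", seg.len());
    }
}
```

Each yielded slice can then be wrapped in a `Datagram<&[u8]>` and fed to `process` without copying.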

github-actions bot commented Sep 28, 2024

Failed Interop Tests

- QUIC Interop Runner, client vs. server: neqo-latest as client, neqo-latest as server (all results)

Succeeded Interop Tests

- QUIC Interop Runner, client vs. server: neqo-latest as client, neqo-latest as server

Unsupported Interop Tests

- QUIC Interop Runner, client vs. server: neqo-latest as client, neqo-latest as server


Benchmark results

Performance differences relative to 55e3a93.

coalesce_acked_from_zero 1+1 entries: 💚 Performance has improved.
       time:   [98.803 ns 99.130 ns 99.472 ns]
       change: [-12.775% -12.368% -11.953%] (p = 0.00 < 0.05)

Found 15 outliers among 100 measurements (15.00%)
11 (11.00%) high mild
4 (4.00%) high severe

coalesce_acked_from_zero 3+1 entries: 💚 Performance has improved.
       time:   [116.87 ns 117.20 ns 117.56 ns]
       change: [-33.461% -33.093% -32.696%] (p = 0.00 < 0.05)

Found 19 outliers among 100 measurements (19.00%)
2 (2.00%) low severe
1 (1.00%) low mild
4 (4.00%) high mild
12 (12.00%) high severe

coalesce_acked_from_zero 10+1 entries: 💚 Performance has improved.
       time:   [116.24 ns 116.64 ns 117.13 ns]
       change: [-39.896% -35.657% -33.117%] (p = 0.00 < 0.05)

Found 13 outliers among 100 measurements (13.00%)
4 (4.00%) low severe
3 (3.00%) high mild
6 (6.00%) high severe

coalesce_acked_from_zero 1000+1 entries: 💚 Performance has improved.
       time:   [97.407 ns 97.529 ns 97.668 ns]
       change: [-31.875% -31.315% -30.596%] (p = 0.00 < 0.05)

Found 9 outliers among 100 measurements (9.00%)
3 (3.00%) high mild
6 (6.00%) high severe

RxStreamOrderer::inbound_frame(): Change within noise threshold.
       time:   [111.62 ms 111.67 ms 111.72 ms]
       change: [+0.1861% +0.2531% +0.3192%] (p = 0.00 < 0.05)

Found 7 outliers among 100 measurements (7.00%)
6 (6.00%) low mild
1 (1.00%) high mild

transfer/pacing-false/varying-seeds: No change in performance detected.
       time:   [26.891 ms 27.987 ms 29.092 ms]
       change: [-7.3239% -2.0801% +3.3604%] (p = 0.45 > 0.05)
transfer/pacing-true/varying-seeds: No change in performance detected.
       time:   [36.651 ms 38.242 ms 39.826 ms]
       change: [-5.9905% -0.1635% +5.8478%] (p = 0.95 > 0.05)
transfer/pacing-false/same-seed: No change in performance detected.
       time:   [26.723 ms 27.507 ms 28.285 ms]
       change: [-3.3295% +0.5804% +4.8142%] (p = 0.78 > 0.05)
transfer/pacing-true/same-seed: No change in performance detected.
       time:   [41.198 ms 43.275 ms 45.393 ms]
       change: [-7.2694% -1.0401% +5.9147%] (p = 0.75 > 0.05)

Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild

1-conn/1-100mb-resp (aka. Download)/client: 💚 Performance has improved.
       time:   [106.99 ms 107.48 ms 108.11 ms]
       thrpt:  [924.95 MiB/s 930.44 MiB/s 934.71 MiB/s]
change:
       time:   [-7.7356% -7.2870% -6.6739%] (p = 0.00 < 0.05)
       thrpt:  [+7.1511% +7.8598% +8.3842%]

Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high severe

1-conn/10_000-parallel-1b-resp (aka. RPS)/client: No change in performance detected.
       time:   [319.80 ms 323.00 ms 326.21 ms]
       thrpt:  [30.655 Kelem/s 30.960 Kelem/s 31.270 Kelem/s]
change:
       time:   [-0.7331% +0.8246% +2.4125%] (p = 0.31 > 0.05)
       thrpt:  [-2.3557% -0.8178% +0.7385%]
1-conn/1-1b-resp (aka. HPS)/client: 💔 Performance has regressed.
       time:   [36.454 ms 36.648 ms 36.855 ms]
       thrpt:  [27.134  elem/s 27.286  elem/s 27.432  elem/s]
change:
       time:   [+7.7777% +8.6189% +9.3890%] (p = 0.00 < 0.05)
       thrpt:  [-8.5831% -7.9350% -7.2165%]

Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild

Client/server transfer results

Transfer of 33554432 bytes over loopback.

| Client | Server | CC | Pacing | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|:---|:---|:---|---:|---:|---:|---:|
| msquic | msquic | | | 174.2 ± 91.5 | 100.2 | 414.4 | 1.00 |
| neqo | msquic | reno | on | 211.1 ± 10.9 | 192.1 | 224.9 | 1.00 |
| neqo | msquic | reno | | 220.7 ± 20.5 | 200.7 | 265.3 | 1.00 |
| neqo | msquic | cubic | on | 208.0 ± 14.7 | 191.8 | 234.0 | 1.00 |
| neqo | msquic | cubic | | 226.8 ± 44.1 | 192.9 | 363.9 | 1.00 |
| msquic | neqo | reno | on | 126.1 ± 73.5 | 83.7 | 328.4 | 1.00 |
| msquic | neqo | reno | | 129.7 ± 91.4 | 84.0 | 455.9 | 1.00 |
| msquic | neqo | cubic | on | 131.8 ± 81.8 | 82.6 | 336.3 | 1.00 |
| msquic | neqo | cubic | | 103.9 ± 48.7 | 81.9 | 321.9 | 1.00 |
| neqo | neqo | reno | on | 125.1 ± 11.7 | 107.7 | 146.4 | 1.00 |
| neqo | neqo | reno | | 183.7 ± 137.4 | 106.9 | 687.8 | 1.00 |
| neqo | neqo | cubic | on | 185.2 ± 91.9 | 101.0 | 360.2 | 1.00 |
| neqo | neqo | cubic | | 127.9 ± 20.0 | 103.5 | 172.4 | 1.00 |


On Sep 29, 2024, @mxinden changed the title from "feat: use recvmmsg in addition to GRO" to "perf: use recvmmsg in addition to GRO".