
perf: use recvmmsg in addition to GRO #2137

Draft: 69 commits into base `main`
Conversation

@mxinden (Collaborator) commented on Sep 28, 2024

Previously we would only do GRO.

Depends on #2093.
Part of #1693.

Draft only. Just testing on the benchmark runner for now.

This change is best summarized by the `process` function signature.

On the `main` branch, the `process` function looks like this:

```rust
pub fn process(&mut self, dgram: Option<&Datagram>, now: Instant) -> Output {
```

- It takes as **input** an optional reference to a `Datagram`. That `Datagram` owns
an allocation of the UDP payload, i.e. a `Vec<u8>`. Thus for each incoming UDP
datagram, its payload is allocated in a new `Vec`.
- It returns as **output** an owned `Output`. Most relevantly the `Output` variant
`Output::Datagram(Datagram)` contains a `Datagram` that again owns an allocation of
the UDP payload, i.e. a `Vec<u8>`. Thus for each outgoing UDP datagram too, its
payload is allocated in a new `Vec`.

This commit changes the `process` function to:

```rust
pub fn process_into<'a>(
    &mut self,
    input: Option<Datagram<&[u8]>>,
    now: Instant,
    write_buffer: &'a mut Vec<u8>,
) -> Output<&'a [u8]> {
```

(Note the rename to `process_into` is temporary.)

- It takes as **input** an optional `Datagram<&[u8]>`. But contrary to before,
`Datagram<&[u8]>` does not own an allocation of the UDP payload, but represents
a view into a long-lived receive buffer containing the UDP payload.
- It returns as **output** an `Output<&'a [u8]>` where the
`Output::Datagram(Datagram<&'a [u8]>)` variant does not own an allocation of the
UDP payload, but here as well represents a view into a long-lived write buffer
the payload is written into. That write buffer lives outside of
`neqo_transport::Connection` and is provided to `process` as `write_buffer: &'a
mut Vec<u8>`. Note that both `write_buffer` and `Output` use the lifetime `'a`,
i.e. the latter is a view into the former.
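The owned-versus-borrowed payload distinction can be sketched with a payload-generic `Datagram`. This is a simplified, hypothetical stand-in for neqo's actual type (which carries more fields, e.g. TOS/ECN marking), meant only to illustrate the two models:

```rust
use std::net::SocketAddr;

// Simplified, hypothetical stand-in for a payload-generic `Datagram`.
#[derive(Debug)]
struct Datagram<D> {
    src: SocketAddr,
    dst: SocketAddr,
    payload: D,
}

impl<D: AsRef<[u8]>> Datagram<D> {
    fn payload(&self) -> &[u8] {
        self.payload.as_ref()
    }
}

fn main() {
    let src: SocketAddr = "127.0.0.1:4433".parse().unwrap();
    let dst: SocketAddr = "127.0.0.1:443".parse().unwrap();

    // Long-lived receive buffer, filled by the OS (e.g. via recvmmsg).
    let recv_buffer = vec![0u8; 2048];

    // `main`-branch model: `Datagram<Vec<u8>>` owns a fresh allocation
    // per incoming datagram.
    let owned: Datagram<Vec<u8>> = Datagram {
        src,
        dst,
        payload: recv_buffer[..1200].to_vec(),
    };

    // This PR's model: `Datagram<&[u8]>` is a zero-copy view into the
    // long-lived buffer.
    let view: Datagram<&[u8]> = Datagram {
        src,
        dst,
        payload: &recv_buffer[..1200],
    };

    assert_eq!(owned.payload(), view.payload());
}
```

The generic parameter lets one API surface serve both the owning and the borrowing use, which is why the input type in the new signature can shrink to `Datagram<&[u8]>` without a separate type.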

This change to the `process` function enables the following:

1. A user of `neqo_transport` (e.g. `neqo_bin`) has the OS write incoming UDP
datagrams into a long-lived receive buffer (via e.g. `recvmmsg`).
2. They pass that receive buffer to `neqo_transport::Connection::process` along
with a long-lived write buffer.
3. `process` reads the UDP datagram from the long-lived receive buffer through
the `Datagram<&[u8]>` view and writes outgoing datagrams into the provided
long-lived `write_buffer`, returning a view into said buffer via a `Datagram<&'a
[u8]>`.
4. The user, after having called `process` can then pass the write buffer to the
OS (e.g. via `sendmsg`).

To summarize: a user can receive and send UDP datagrams without any allocation on the UDP IO path.
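The four steps above can be sketched as follows. Note that the `Connection` here is a mock that merely echoes its input; the names mirror the PR, but this is not the real `neqo_transport` API:

```rust
// Hypothetical sketch of the allocation-free IO loop described above.
struct Connection;

enum Output<'a> {
    Datagram(&'a [u8]),
    None,
}

impl Connection {
    fn process_into<'a>(
        &mut self,
        input: Option<&[u8]>,
        write_buffer: &'a mut Vec<u8>,
    ) -> Output<'a> {
        // Mock behavior: copy the incoming payload into the caller's write
        // buffer and hand back a borrowed view of it. Nothing allocates as
        // long as `write_buffer` has sufficient capacity.
        match input {
            Some(payload) => {
                write_buffer.clear();
                write_buffer.extend_from_slice(payload);
                Output::Datagram(&write_buffer[..])
            }
            None => Output::None,
        }
    }
}

fn main() {
    let mut conn = Connection;
    // Long-lived buffers: the OS fills `recv_buffer` (e.g. via recvmmsg),
    // and `write_buffer` is reused across loop iterations.
    let recv_buffer = [1u8, 2, 3, 4];
    let mut write_buffer: Vec<u8> = Vec::with_capacity(2048);

    if let Output::Datagram(view) = conn.process_into(Some(&recv_buffer), &mut write_buffer) {
        // In the real loop, `view` would now be handed to the OS via sendmsg.
        assert_eq!(view, &recv_buffer[..]);
    }
}
```

The lifetime `'a` tying `Output` to `write_buffer` is what forces the caller to consume (or copy) the returned view before reusing the buffer for the next iteration.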

As an aside, the above is compatible with GSO and GRO, where a single send or receive buffer contains multiple consecutive UDP datagram segments.
(When there is no input datagram, one can just use `process(None, ...)`.)
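Splitting a GRO receive buffer into per-datagram views is itself allocation-free. A minimal sketch, assuming the segment size has already been read from the `UDP_GRO` control message (the helper name is illustrative):

```rust
/// Split a GRO receive buffer into per-datagram views. `segment_size` is the
/// value a real receive path would read from the UDP_GRO cmsg; all segments
/// share that size except possibly the final, shorter one.
fn segments(recv_buffer: &[u8], segment_size: usize) -> impl Iterator<Item = &[u8]> {
    recv_buffer.chunks(segment_size)
}

fn main() {
    // 8 bytes of payload, i.e. two full 3-byte segments plus a 2-byte tail,
    // as a single coalesced buffer filled by e.g. one recvmmsg call.
    let recv_buffer: Vec<u8> = (0u8..8).collect();
    for (i, seg) in segments(&recv_buffer, 3).enumerate() {
        println!("segment {i}: {} bytes", seg.len());
    }
}
```

Each yielded slice can then be wrapped in a `Datagram<&[u8]>` and fed to `process` without copying.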

github-actions bot commented Sep 28, 2024

Failed Interop Tests

- QUIC Interop Runner, client vs. server: neqo-latest as client, neqo-latest as server (all results)

Succeeded Interop Tests

- QUIC Interop Runner, client vs. server: neqo-latest as client, neqo-latest as server

Unsupported Interop Tests

- QUIC Interop Runner, client vs. server: neqo-latest as client, neqo-latest as server


Benchmark results

Performance differences relative to 55e3a93.

coalesce_acked_from_zero 1+1 entries: 💚 Performance has improved.
       time:   [98.803 ns 99.130 ns 99.472 ns]
       change: [-12.775% -12.368% -11.953%] (p = 0.00 < 0.05)

Found 15 outliers among 100 measurements (15.00%)
11 (11.00%) high mild
4 (4.00%) high severe

coalesce_acked_from_zero 3+1 entries: 💚 Performance has improved.
       time:   [116.87 ns 117.20 ns 117.56 ns]
       change: [-33.461% -33.093% -32.696%] (p = 0.00 < 0.05)

Found 19 outliers among 100 measurements (19.00%)
2 (2.00%) low severe
1 (1.00%) low mild
4 (4.00%) high mild
12 (12.00%) high severe

coalesce_acked_from_zero 10+1 entries: 💚 Performance has improved.
       time:   [116.24 ns 116.64 ns 117.13 ns]
       change: [-39.896% -35.657% -33.117%] (p = 0.00 < 0.05)

Found 13 outliers among 100 measurements (13.00%)
4 (4.00%) low severe
3 (3.00%) high mild
6 (6.00%) high severe

coalesce_acked_from_zero 1000+1 entries: 💚 Performance has improved.
       time:   [97.407 ns 97.529 ns 97.668 ns]
       change: [-31.875% -31.315% -30.596%] (p = 0.00 < 0.05)

Found 9 outliers among 100 measurements (9.00%)
3 (3.00%) high mild
6 (6.00%) high severe

RxStreamOrderer::inbound_frame(): Change within noise threshold.
       time:   [111.62 ms 111.67 ms 111.72 ms]
       change: [+0.1861% +0.2531% +0.3192%] (p = 0.00 < 0.05)

Found 7 outliers among 100 measurements (7.00%)
6 (6.00%) low mild
1 (1.00%) high mild

transfer/pacing-false/varying-seeds: No change in performance detected.
       time:   [26.891 ms 27.987 ms 29.092 ms]
       change: [-7.3239% -2.0801% +3.3604%] (p = 0.45 > 0.05)
transfer/pacing-true/varying-seeds: No change in performance detected.
       time:   [36.651 ms 38.242 ms 39.826 ms]
       change: [-5.9905% -0.1635% +5.8478%] (p = 0.95 > 0.05)
transfer/pacing-false/same-seed: No change in performance detected.
       time:   [26.723 ms 27.507 ms 28.285 ms]
       change: [-3.3295% +0.5804% +4.8142%] (p = 0.78 > 0.05)
transfer/pacing-true/same-seed: No change in performance detected.
       time:   [41.198 ms 43.275 ms 45.393 ms]
       change: [-7.2694% -1.0401% +5.9147%] (p = 0.75 > 0.05)

Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild

1-conn/1-100mb-resp (aka. Download)/client: 💚 Performance has improved.
       time:   [106.99 ms 107.48 ms 108.11 ms]
       thrpt:  [924.95 MiB/s 930.44 MiB/s 934.71 MiB/s]
change:
       time:   [-7.7356% -7.2870% -6.6739%] (p = 0.00 < 0.05)
       thrpt:  [+7.1511% +7.8598% +8.3842%]

Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high severe

1-conn/10_000-parallel-1b-resp (aka. RPS)/client: No change in performance detected.
       time:   [319.80 ms 323.00 ms 326.21 ms]
       thrpt:  [30.655 Kelem/s 30.960 Kelem/s 31.270 Kelem/s]
change:
       time:   [-0.7331% +0.8246% +2.4125%] (p = 0.31 > 0.05)
       thrpt:  [-2.3557% -0.8178% +0.7385%]
1-conn/1-1b-resp (aka. HPS)/client: 💔 Performance has regressed.
       time:   [36.454 ms 36.648 ms 36.855 ms]
       thrpt:  [27.134  elem/s 27.286  elem/s 27.432  elem/s]
change:
       time:   [+7.7777% +8.6189% +9.3890%] (p = 0.00 < 0.05)
       thrpt:  [-8.5831% -7.9350% -7.2165%]

Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild

Client/server transfer results

Transfer of 33554432 bytes over loopback.

| Client | Server | CC | Pacing | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|:---|:---|:---|---:|---:|---:|---:|
| msquic | msquic | | | 174.2 ± 91.5 | 100.2 | 414.4 | 1.00 |
| neqo | msquic | reno | on | 211.1 ± 10.9 | 192.1 | 224.9 | 1.00 |
| neqo | msquic | reno | | 220.7 ± 20.5 | 200.7 | 265.3 | 1.00 |
| neqo | msquic | cubic | on | 208.0 ± 14.7 | 191.8 | 234.0 | 1.00 |
| neqo | msquic | cubic | | 226.8 ± 44.1 | 192.9 | 363.9 | 1.00 |
| msquic | neqo | reno | on | 126.1 ± 73.5 | 83.7 | 328.4 | 1.00 |
| msquic | neqo | reno | | 129.7 ± 91.4 | 84.0 | 455.9 | 1.00 |
| msquic | neqo | cubic | on | 131.8 ± 81.8 | 82.6 | 336.3 | 1.00 |
| msquic | neqo | cubic | | 103.9 ± 48.7 | 81.9 | 321.9 | 1.00 |
| neqo | neqo | reno | on | 125.1 ± 11.7 | 107.7 | 146.4 | 1.00 |
| neqo | neqo | reno | | 183.7 ± 137.4 | 106.9 | 687.8 | 1.00 |
| neqo | neqo | cubic | on | 185.2 ± 91.9 | 101.0 | 360.2 | 1.00 |
| neqo | neqo | cubic | | 127.9 ± 20.0 | 103.5 | 172.4 | 1.00 |


On Sep 29, 2024, @mxinden changed the title from "feat: use recvmmsg in addition to GRO" to "perf: use recvmmsg in addition to GRO".