make noise transport perform fewer reallocations #4221

Closed

alindima opened this issue Jul 19, 2023 · 12 comments

@alindima

Currently, the noise protocol implementation spends a lot of time performing vec resizes and truncations, which are costly on a vector that does not have preallocated capacity:

    this.send_buffer.resize(n, 0u8);

    .resize(frame.len() + EXTRA_ENCRYPT_SPACE, 0u8);

    this.write_buffer.truncate(n);

This is noticeable in a flamegraph gathered while running a polkadot node. About 15% of the overall sampled period is spent in vec resizes.

I suggest we preallocate the vectors with a capacity equal to the maximum noise frame length, 65535, which should be a pretty non-intrusive change.
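For illustration, here is a minimal sketch of the proposed preallocation; the struct layout and the MAX_NOISE_FRAME_LEN constant are hypothetical stand-ins, not the actual identifiers in the noise implementation:

    /// Maximum size of a noise frame (hypothetical constant name; the issue
    /// text gives the value 65535).
    const MAX_NOISE_FRAME_LEN: usize = 65535;

    struct Output {
        /// Allocated once up front. Later `resize`/`truncate` calls then only
        /// adjust the length within the existing capacity instead of
        /// reallocating.
        send_buffer: Vec<u8>,
    }

    impl Output {
        fn new() -> Self {
            Output {
                send_buffer: Vec::with_capacity(MAX_NOISE_FRAME_LEN),
            }
        }
    }

With the capacity reserved, resizing up to the maximum frame length never reallocates; it only adjusts the length (and zero-fills any newly exposed bytes).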

@thomaseizinger
Contributor

Sounds good to me! Ideally, we could accompany this with a (micro)-benchmark.

Are you happy to prepare a PR?

@thomaseizinger
Contributor

It would be even better if we could measure this improvement using the new performance benchmark suite: https://github.com/libp2p/test-plans/tree/master/perf

@alindima
Author

Sure. I'd first like to test this in a polkadot test network to see whether it is indeed a noticeable improvement, because the flamegraph was gathered on a local machine.

I'm having a bit of trouble updating the libp2p version in substrate; I get some unexpected compilation errors. I'll see if I can sort them out.

@thomaseizinger
Contributor

> I'm having a bit of trouble updating the libp2p version in substrate; I get some unexpected compilation errors. I'll see if I can sort them out.

Check the release blog post for a summary of all updates in the latest release: https://github.com/libp2p/rust-libp2p/releases/tag/libp2p-v0.52.0

@melekes
Contributor

melekes commented Jul 20, 2023

> I'm having a bit of trouble updating the libp2p version in substrate; I get some unexpected compilation errors. I'll see if I can sort them out.

paritytech/substrate#14429 should be merged sometime this week

@alindima
Author

alindima commented Jul 20, 2023

I finally managed to cherry-pick my changes onto the old version of libp2p that substrate uses and integrate that into polkadot.
After testing in a polkadot testnet, I don't see any obvious change in the CPU utilisation of the libp2p node, so the reallocations could be a problem only on my isolated node.

Still, I think it's a useful optimisation to make, even if it's just a micro-optimisation.

I tried creating a micro-benchmark, using criterion and a scenario similar to the quickcheck smoke test present in the noise/tests folder.
However, I cannot see any improvement in performance. I'm guessing this is because the benchmark uses real TCP sockets, and the overhead of the network stack could render this small improvement unnoticeable. The noise code does not have good support for mocking, and adding it would be quite a big change.

I'll open a small PR with my changes if you think they're valuable.

@thomaseizinger
Contributor

If we can't prove that they are an improvement, it is difficult to argue for merging them :)

Is there any other advantage, like an easier-to-reason-about implementation?

@alindima
Author

Conceptually, it's quite obvious that it's an improvement; we just don't have the right benchmarking setup to prove it, and building one is not trivial.

I mimicked the logic of Output::poll_write in a benchmark:

    // Assumed context, not shown in the original comment: this runs inside a
    // criterion benchmark function with `c: &mut Criterion`, with
    // `use criterion::black_box`, `use rand::Rng`, MAX_FRAME_LEN = 65535 (the
    // maximum noise frame length), and a `Message(Vec<u8>)` helper whose
    // `random(len)` presumably produces `len` random bytes.
    let messages = (0..10000)
        .map(|_| {
            let mut rng = rand::thread_rng();

            let len = rng.gen_range(1..(MAX_FRAME_LEN / 10));
            Message::random(len)
        })
        .collect::<Vec<_>>();

    // Preallocated variant: the buffer is allocated at full frame size once
    // per iteration, so the inner loop only copies bytes into it.
    c.bench_function("no_resize", |b| {
        let _ = env_logger::try_init();

        b.iter(black_box(|| {
            let mut buffer = vec![0u8; MAX_FRAME_LEN];
            let mut index = 0;
            for msg in &messages {
                // Start a new frame once the buffer is full, as poll_write
                // does after flushing.
                if index == MAX_FRAME_LEN {
                    index = 0;
                }

                let n = std::cmp::min(MAX_FRAME_LEN - index, msg.0.len());
                buffer[index..index + n].copy_from_slice(&msg.0[..n]);
                index += n;
            }
        }));
    });

    // Resizing variant: the buffer starts empty and is grown with `resize`
    // for every message, mirroring the current poll_write implementation.
    c.bench_function("resize", |b| {
        let _ = env_logger::try_init();

        b.iter(black_box(|| {
            let mut buffer = Vec::new();
            let mut index: usize = 0;
            for msg in &messages {
                if index == MAX_FRAME_LEN {
                    index = 0;
                    buffer.clear();
                }
                // Grow the buffer just enough to hold the next message,
                // capped at the maximum frame length.
                let n = std::cmp::min(MAX_FRAME_LEN, index.saturating_add(msg.0.len()));
                buffer.resize(n, 0u8);
                let n = std::cmp::min(MAX_FRAME_LEN - index, msg.0.len());
                buffer[index..index + n].copy_from_slice(&msg.0[..n]);
                index += n;
            }
        }));
    });
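For context, here is a sketch of the criterion harness such a benchmark would plug into; the function name bench is my own placeholder, since the original comment does not show this wiring:

    use criterion::{criterion_group, criterion_main, Criterion};

    fn bench(c: &mut Criterion) {
        // ... the two `bench_function` calls above go here ...
    }

    criterion_group!(benches, bench);
    criterion_main!(benches);

Registered this way, both variants run under cargo bench, and criterion reports the relative change between runs, which is what the output below shows.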

The resize variant is about 43% slower:

                  time:   [993.42 µs 993.73 µs 994.07 µs]
                  change: [+43.107% +43.438% +43.782%] (p = 0.00 < 0.05)
                  Performance has regressed.

@thomaseizinger
Contributor

Mind posting a PR with this? I'd like to see it with the context around it :)

Judging just from this, it looks simple enough to merge.

@alindima
Author

Here it is: #4230

@alindima
Author

Turns out this is not a visible problem when running a polkadot validator node.

The initial diagnosis was performed on a full polkadot node that was not a validator (it performs less work, so the resizes accounted for a higher percentage of the samples). When profiling a validator under load testing, this overhead becomes negligible in comparison.

Therefore, I'll close this issue.

While I think that #4230 is a useful addition for lighter clients, you may close it if you disagree, since we're not impacted by this (we have much higher network CPU consumption coming from other places).

@thomaseizinger
Contributor

Thank you for the detailed report!
