make noise transport perform fewer reallocations #4221

Closed

alindima opened this issue Jul 19, 2023 · 12 comments

@alindima

Currently, the noise protocol implementation spends a lot of time performing vec resizes and truncations, which are costly on a vector that does not have preallocated capacity:

    this.send_buffer.resize(n, 0u8);

    .resize(frame.len() + EXTRA_ENCRYPT_SPACE, 0u8);

    this.write_buffer.truncate(n);

This is noticeable in a flamegraph gathered while running a polkadot node. About 15% of the overall sampled period is spent in vec resizes.

I suggest we preallocate the vectors with a capacity equal to the maximum noise frame length, 65535, which should be a pretty non-intrusive change.
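For illustration, here is a minimal sketch of the proposed preallocation; the struct layout and the MAX_NOISE_FRAME_LEN constant are hypothetical stand-ins, not the actual identifiers in the noise implementation:

    /// Maximum size of a noise frame (hypothetical constant name; the issue
    /// text gives the value 65535).
    const MAX_NOISE_FRAME_LEN: usize = 65535;

    struct Output {
        /// Allocated once up front. Later `resize`/`truncate` calls then only
        /// adjust the length within the existing capacity instead of
        /// reallocating.
        send_buffer: Vec<u8>,
    }

    impl Output {
        fn new() -> Self {
            Output {
                send_buffer: Vec::with_capacity(MAX_NOISE_FRAME_LEN),
            }
        }
    }

With the capacity reserved, resizing up to the maximum frame length never reallocates; it only adjusts the length (and zero-fills any newly exposed bytes).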

@thomaseizinger
Contributor

Sounds good to me! Ideally, we could accompany this with a (micro)-benchmark.

Are you happy to prepare a PR?

@thomaseizinger
Contributor

It would be even better if we could measure this improvement using the new performance benchmark suite: https://github.com/libp2p/test-plans/tree/master/perf

@alindima
Author

Sure. I'd first like to test this in a polkadot test network to see whether it is indeed a noticeable improvement, because the flamegraph was gathered on a local machine.

I'm having a bit of trouble updating the libp2p version in substrate; I get some unexpected compilation errors. I'll see if I can sort them out.

@thomaseizinger
Contributor

> I'm having a bit of trouble updating the libp2p version in substrate; I get some unexpected compilation errors. I'll see if I can sort them out.

Check the release blog post for a summary of all updates in the latest release: https://github.com/libp2p/rust-libp2p/releases/tag/libp2p-v0.52.0

@melekes
Contributor

melekes commented Jul 20, 2023

> I'm having a bit of trouble updating the libp2p version in substrate; I get some unexpected compilation errors. I'll see if I can sort them out.

paritytech/substrate#14429 should be merged sometime this week

@alindima
Author

alindima commented Jul 20, 2023

I finally managed to cherry-pick my changes onto the old version of libp2p that substrate uses and integrate that into polkadot.
After testing in a polkadot testnet, I don't see any obvious change in the CPU utilisation of the libp2p node, so the reallocations could be a problem only on my isolated node.

Still, I think it's a useful optimisation to make, even if it's just a micro-optimisation.

I tried creating a micro-benchmark, using criterion and a scenario similar to the quickcheck smoke test present in the noise/tests folder.
However, I cannot see any improvement in performance. I'm guessing this is because the benchmark uses real TCP sockets, and the overhead of the network stack could render this small improvement unnoticeable. The noise code does not have good support for mocking, and adding it would be quite a big change.

I'll open a small PR with my changes if you think they're valuable.

@thomaseizinger
Contributor

If we can't prove that they are an improvement, it is difficult to argue for merging them :)

Is there any other advantage, like an easier-to-reason-about implementation?

@alindima
Author

Conceptually, it's quite obvious that it's an improvement; we just don't have the right benchmarking setup to prove it, and building one is not trivial.

I mimicked the logic of Output::poll_write in a benchmark:

    // Assumed context, not shown in the original comment: this runs inside a
    // criterion benchmark function with `c: &mut Criterion`, with
    // `use criterion::black_box`, `use rand::Rng`, MAX_FRAME_LEN = 65535 (the
    // maximum noise frame length), and a `Message(Vec<u8>)` helper whose
    // `random(len)` presumably produces `len` random bytes.
    let messages = (0..10000)
        .map(|_| {
            let mut rng = rand::thread_rng();

            let len = rng.gen_range(1..(MAX_FRAME_LEN / 10));
            Message::random(len)
        })
        .collect::<Vec<_>>();

    // Preallocated variant: the buffer is allocated at full frame size once
    // per iteration, so the inner loop only copies bytes into it.
    c.bench_function("no_resize", |b| {
        let _ = env_logger::try_init();

        b.iter(black_box(|| {
            let mut buffer = vec![0u8; MAX_FRAME_LEN];
            let mut index = 0;
            for msg in &messages {
                // Start a new frame once the buffer is full, as poll_write
                // does after flushing.
                if index == MAX_FRAME_LEN {
                    index = 0;
                }

                let n = std::cmp::min(MAX_FRAME_LEN - index, msg.0.len());
                buffer[index..index + n].copy_from_slice(&msg.0[..n]);
                index += n;
            }
        }));
    });

    // Resizing variant: the buffer starts empty and is grown with `resize`
    // for every message, mirroring the current poll_write implementation.
    c.bench_function("resize", |b| {
        let _ = env_logger::try_init();

        b.iter(black_box(|| {
            let mut buffer = Vec::new();
            let mut index: usize = 0;
            for msg in &messages {
                if index == MAX_FRAME_LEN {
                    index = 0;
                    buffer.clear();
                }
                // Grow the buffer just enough to hold the next message,
                // capped at the maximum frame length.
                let n = std::cmp::min(MAX_FRAME_LEN, index.saturating_add(msg.0.len()));
                buffer.resize(n, 0u8);
                let n = std::cmp::min(MAX_FRAME_LEN - index, msg.0.len());
                buffer[index..index + n].copy_from_slice(&msg.0[..n]);
                index += n;
            }
        }));
    });
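For context, here is a sketch of the criterion harness such a benchmark would plug into; the function name bench is my own placeholder, since the original comment does not show this wiring:

    use criterion::{criterion_group, criterion_main, Criterion};

    fn bench(c: &mut Criterion) {
        // ... the two `bench_function` calls above go here ...
    }

    criterion_group!(benches, bench);
    criterion_main!(benches);

Registered this way, both variants run under cargo bench, and criterion reports the relative change between runs, which is what the output below shows.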

The resize variant is about 43% slower:

                  time:   [993.42 µs 993.73 µs 994.07 µs]
                  change: [+43.107% +43.438% +43.782%] (p = 0.00 < 0.05)
                  Performance has regressed.

@thomaseizinger
Contributor

Mind posting a PR with this? I'd like to see it with the context around it :)

Judging just from this, it looks simple enough to merge.

@alindima
Author

Here it is: #4230

@alindima
Author

Turns out this is not a visible problem when running a polkadot validator node.

The initial diagnosis was performed on a full polkadot node that was not a validator (it performs less work, so the resizes accounted for a higher percentage of the samples). When profiling a validator under load testing, this overhead becomes negligible in comparison.

Therefore, I'll close this issue.

While I think that #4230 is a useful addition for lighter clients, you may close it if you disagree, since we're not impacted by this (we have much higher network CPU consumption coming from other places).

@thomaseizinger
Contributor

Thank you for the detailed report!
