relative performance of Mongomery and Edwards scalarmul #588

oconnor663 · 2023-10-27T19:04:07Z

Apologies for filing an issue that's more of a question :) My very basic understanding of the performance relationship between Curve25519 and Edwards25519 is that the former is faster when you're scalarmult'ing an arbitrary point, like you do in the second half of Diffie-Hellman. But when I benchmark scalarmult in this library (i5-1145G7 Linux laptop), it looks like the Edwards curve is faster:

use criterion::{criterion_group, criterion_main, Criterion};

pub fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("curve25519_dalek Montgomery scalarmul", |b| {
        let point = curve25519_dalek::montgomery::MontgomeryPoint::mul_base_clamped(rand::random());
        b.iter(|| point.mul_clamped(rand::random()))
    });

    c.bench_function("curve25519_dalek Edwards scalarmul", |b| {
        let point = curve25519_dalek::edwards::EdwardsPoint::mul_base_clamped(rand::random());
        b.iter(|| point.mul_clamped(rand::random()))
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

$ cargo bench
   Compiling scratch v0.1.0 (/tmp/tmp.x9omrxNKmT/scratch)
    Finished bench [optimized] target(s) in 1.17s
     Running unittests src/main.rs (target/release/deps/scratch-67fe8039c398f6ac)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/bench.rs (target/release/deps/bench-3a6cb2fec14f5ab6)
curve25519_dalek Montgomery scalarmul
                        time:   [41.320 µs 41.428 µs 41.561 µs]
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe

curve25519_dalek Edwards scalarmul
                        time:   [23.729 µs 23.757 µs 23.789 µs]
Found 15 outliers among 100 measurements (15.00%)
  6 (6.00%) high mild
  9 (9.00%) high severe

There's a big chance that my assumption going in was wrong, or that I'm not benchmarking what I think I'm benchmarking. Am I right that this is surprising? Does this suggest that doing Diffie-Hellman on Edwards25519 directly (not converting to the Mongtomery form first) would actually be faster than regular X25519?

The text was updated successfully, but these errors were encountered:

tarcieri · 2023-10-27T19:35:30Z

curve25519-dalek implements fixed-based Montgomery multiplication in terms of Edwards multiplication:

https://github.com/dalek-cryptography/curve25519-dalek/blob/8e0cef5/curve25519-dalek/src/montgomery.rs#L123

Notably this allows it to leverage the precomputed basepoint tables for the Edwards form, so we don't need to maintain two different copies of the basepoint tables for the two different curve forms.

You're free to use either form for ECDH, though for interop purposes you may find it easier to use X25519.

oconnor663 · 2023-10-27T19:40:50Z

Fascinating, thanks for the lightning fast reply.

oconnor663 · 2023-10-27T19:47:04Z

Actually, I'm not sure we're talking about the same thing. What I'm trying to benchmark above is arbitrary/variable base scalarmult, via mul_clamped, which I think bottoms out at mul_bits_be (which by complete coincidence was introduced in the commit you linked to?). The calls to mul_base_clamped aren't part of the benchmarking loop. Am I right to think that that does not go through the Edwards form?

tarcieri · 2023-10-27T20:00:34Z

Aah yes, sorry I misread your benchmark. mul_base_clamped goes through mul_base where indeed mul_clamped goes through mul_bits_be. So fixed-base multiplication converts to Edwards form, whereas variable-base multiplication uses the Montgomery ladder, in case that's unclear.

oconnor663 · 2023-10-27T20:45:50Z

Yes that's what it looks like to me, which brings me back to the performance question: Isn't the Mongtomery ladder supposed to be faster than scalar multiplication on the Edwards curve? Is it weird that Edwards seems faster in this benchmark?

tarcieri · 2023-10-27T20:55:13Z

Edwards arithmetic in curve25519-dalek has multiple backends including highly optimized architecture-specific SIMD backends including ones for AVX2 and AVX-512, whereas the Montgomery backend is relatively simplistic.

oconnor663 · 2023-10-27T20:57:20Z

Got it, that makes sense. And yeah, I do notice a decent speedup (~17µs instead of ~23) when I benchmark nightly and pick up the AVX-512 optimizations. Thanks again!

tarcieri closed this as completed Oct 27, 2023

elichai mentioned this issue Oct 8, 2024

MontgomeryPoint Basepoint multiplication Table and performance #714

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

relative performance of Mongomery and Edwards scalarmul #588

relative performance of Mongomery and Edwards scalarmul #588

oconnor663 commented Oct 27, 2023

tarcieri commented Oct 27, 2023

oconnor663 commented Oct 27, 2023

oconnor663 commented Oct 27, 2023 •

edited

Loading

tarcieri commented Oct 27, 2023 •

edited

Loading

oconnor663 commented Oct 27, 2023

tarcieri commented Oct 27, 2023

oconnor663 commented Oct 27, 2023 •

edited

Loading

relative performance of Mongomery and Edwards scalarmul #588

relative performance of Mongomery and Edwards scalarmul #588

Comments

oconnor663 commented Oct 27, 2023

tarcieri commented Oct 27, 2023

oconnor663 commented Oct 27, 2023

oconnor663 commented Oct 27, 2023 • edited Loading

tarcieri commented Oct 27, 2023 • edited Loading

oconnor663 commented Oct 27, 2023

tarcieri commented Oct 27, 2023

oconnor663 commented Oct 27, 2023 • edited Loading

oconnor663 commented Oct 27, 2023 •

edited

Loading

tarcieri commented Oct 27, 2023 •

edited

Loading

oconnor663 commented Oct 27, 2023 •

edited

Loading