
Benchmarks #1

Open

mratsim opened this issue Jul 13, 2020 · 2 comments

@mratsim
Contributor

mratsim commented Jul 13, 2020

x86-64

nim c -r --passC:-g -d:danger --hints:off --warnings:off --verbosity:0 --outdir:build benchmarks/bench_all.nim
Warmup: 0.9026 s, result 224 (displayed to avoid compiler optimizing warmup away)


Compiled with GCC
Optimization level => no optimization: false | release: true | danger: true
Using Milagro with 64-bit limbs
Running on Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz



⚠️ Cycles measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
i.e. a 20% overclock will be about 20% off (assuming no dynamic frequency scaling)

=====================================================================================================================

Scalar multiplication G1 (255-bit)                             7649.939 ops/s       130720 ns/op       392165 cycles
Scalar multiplication G2 (255-bit)                             2973.783 ops/s       336272 ns/op      1008830 cycles
EC add G1                                                   1295336.788 ops/s          772 ns/op         2317 cycles
EC add G2                                                    452488.688 ops/s         2210 ns/op         6631 cycles
Pairing (Miller loop + Final Exponentiation)                   1315.289 ops/s       760289 ns/op      2280892 cycles
Hash to G2 (Draft #8)                                          3240.304 ops/s       308613 ns/op       925851 cycles

Broadwell CPUs (Intel, 2015), Ryzen CPUs (AMD, 2017) and later support the "ADX" instructions dedicated to big-integer arithmetic.
You might want to benchmark with --passC:-madx or --passC:"-march=native" to use them.
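Note that the cycle counts above are just ns/op scaled by the nominal clock (3.0 GHz for this i9-9980XE), which is why Turbo Boost and overclocking skew them. A minimal sanity check of that conversion, assuming the 3.0 GHz figure from the CPU string (the constant is not taken from the benchmark code):

```nim
# Reconstruct a cycle count from ns/op at the nominal clock:
# 3.0 GHz means 3 cycles per nanosecond.
let nominalGHz = 3.0
let nsPerOp = 130_720.0                # Scalar multiplication G1 above
echo int(nsPerOp * nominalGHz)         # 392160; the table reports 392165
```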

x86-64 + ADX instructions

nim c -r --passC:"-g -madx" -d:danger --hints:off --warnings:off --verbosity:0 --outdir:build benchmarks/bench_all.nim
Warmup: 0.9030 s, result 224 (displayed to avoid compiler optimizing warmup away)


Compiled with GCC
Optimization level => no optimization: false | release: true | danger: true
Using Milagro with 64-bit limbs
Running on Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz



⚠️ Cycles measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
i.e. a 20% overclock will be about 20% off (assuming no dynamic frequency scaling)

=====================================================================================================================

Scalar multiplication G1 (255-bit)                             9631.777 ops/s       103823 ns/op       311473 cycles
Scalar multiplication G2 (255-bit)                             3768.863 ops/s       265332 ns/op       796006 cycles
EC add G1                                                   1706484.642 ops/s          586 ns/op         1758 cycles
EC add G2                                                    598444.045 ops/s         1671 ns/op         5015 cycles
Pairing (Miller loop + Final Exponentiation)                   1639.054 ops/s       610108 ns/op      1830347 cycles
Hash to G2 (Draft #8)                                          4270.876 ops/s       234144 ns/op       702442 cycles

Broadwell CPUs (Intel, 2015), Ryzen CPUs (AMD, 2017) and later support the "ADX" instructions dedicated to big-integer arithmetic.
You might want to benchmark with --passC:-madx or --passC:"-march=native" to use them.
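To check whether a CPU actually exposes ADX before passing --passC:-madx, CPUID leaf 7 reports it. A minimal sketch, assuming GCC or Clang: it wraps `__get_cpuid_count` from `<cpuid.h>` (x86-64 only; this helper is not part of the repository):

```nim
proc getCpuidCount(leaf, subleaf: cuint,
                   eax, ebx, ecx, edx: ptr cuint): cint {.
  importc: "__get_cpuid_count", header: "<cpuid.h>".}

var eax, ebx, ecx, edx: cuint
if getCpuidCount(7, 0, addr eax, addr ebx, addr ecx, addr edx) != 0:
  # CPUID.(EAX=7, ECX=0):EBX bit 19 = ADX, bit 8 = BMI2 (MULX)
  echo "ADX:  ", (ebx and (cuint(1) shl 19)) != 0
  echo "BMI2: ", (ebx and (cuint(1) shl 8)) != 0
```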

Comparison

Compare with Milagro and MCL at status-im/nim-blscurve#47

(MCL JIT vs BLST)

  • Scalar mul G1: 200 kcycles vs 300 kcycles
  • Scalar mul G2: 400 kcycles vs 800 kcycles
  • Pairing: 2.2 Mcycles vs 1.8 Mcycles
  • Hash to G2: 467 kcycles vs 702 kcycles

Analysis:

Side-note on EC Add

MCL's EC add is not constant-time: there are branches to detect the point at infinity and the cases of adding a point to itself or to its opposite. BLST, by contrast, always handles all (add, double, infinity) cases.
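To make that concrete, here is a minimal sketch of a constant-time conditional copy, the primitive a branchless EC add is built on (illustrative only; BLST implements its equivalent in assembly, and the 4-limb size is just an example):

```nim
# Copy `src` into `dst` when `ctl` is true, without a data-dependent
# branch: `mask` is all ones when ctl is true, all zeroes otherwise.
func ccopy(dst: var array[4, uint64], src: array[4, uint64], ctl: bool) =
  let mask = 0'u64 - uint64(ctl)
  for i in 0 ..< dst.len:
    dst[i] = dst[i] xor (mask and (dst[i] xor src[i]))

# A variable-time add branches on "is infinity?" / "same point?";
# a constant-time add computes every candidate result and selects
# the right one with ccopy, so the trace is identical for all inputs.
var acc = [1'u64, 2, 3, 4]
acc.ccopy([5'u64, 6, 7, 8], ctl = true)
echo acc   # prints [5, 6, 7, 8]
```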

@dot-asm

dot-asm commented Jul 19, 2020

Just in case, for reference: among other things, performance is also about "perspectives" and priorities, most notably multi-processor scalability. This is why some components are not at 100% yet, and the keyword is "yet". However, this is not to say that feedback is not appreciated. It certainly is, as are new pointers and reminders :-) Thanks and cheers!

@mratsim
Contributor Author

mratsim commented Jul 20, 2020

> Just in case, for reference: among other things, performance is also about "perspectives" and priorities, most notably multi-processor scalability. This is why some components are not at 100% yet, and the keyword is "yet". However, this is not to say that feedback is not appreciated. It certainly is, as are new pointers and reminders :-) Thanks and cheers!

Thanks. From discussions with some of the Consensys ZK team during EthCC, they were indeed investigating an issue where they couldn't scale SNARKs beyond 16 cores and were looking for solutions. It seems to be an important issue for all zero-knowledge actors, as Loopring (which uses a completely different stack) was also only scalable up to 16 cores: https://medium.com/loopring-protocol/zksnark-prover-optimizations-3e9a3e5578c0

I'm not sure what the current status is.
