Benchmark vs MCL on x86 #47 (Closed)

mratsim opened this issue Apr 8, 2020 · 3 comments

mratsim commented Apr 8, 2020

How to reproduce

MCL

git clone https://github.com/herumi/mcl
cd mcl

make -j ${ncpu}
make bin/bls12_test.exe # even on Linux

bin/bls12_test.exe

nim-blscurve

git clone https://github.com/status-im/nim-blscurve
cd nim-blscurve
nimble bench

Results

On an i9-9980XE. Note: the CPU is overclocked to 4.1 GHz while the nominal clock is 3.0 GHz, so cycle counts are off by a factor of 4.1/3.0.

Reading the results:

  • G2 scalar multiplication corresponds to sign (a signature is sk · H(m) in G2)
  • pairing corresponds to verify; it is composed of, among other steps, the Miller loop and the final exponentiation (verification checks e(g1, sig) = e(pk, H(m)), i.e. two pairings)

MCL using JIT (x86-only)

The important parts are highlighted with arrows.

$  bin/bls12_test.exe
JIT 1
ctest:module=size
ctest:module=naive
i=0 curve=BLS12_381
G1
G2
GT
G1::mulCT      198.538Kclk <-------------------------
G1::mul        207.139Kclk <-------------------------
G1::add          1.252Kclk <-------------------------
G1::dbl        880.09 clk
G2::mulCT      383.361Kclk <-------------------------
G2::mul        407.538Kclk <-------------------------
G2::add          3.424Kclk <-------------------------
G2::dbl          2.169Kclk
GT::pow        802.285Kclk
G1::setStr chk 289.740Kclk
G1::setStr       2.606Kclk
G2::setStr chk 723.913Kclk
G2::setStr       5.467Kclk
hashAndMapToG1 225.867Kclk
hashAndMapToG2 467.456Kclk
Fp::add         14.27 clk
Fp::sub          9.28 clk
Fp::neg          8.05 clk
Fp::mul        107.01 clk
Fp::sqr         98.88 clk
Fp::inv          4.542Kclk
Fp2::add        19.23 clk
Fp2::sub        14.30 clk
Fp2::neg        13.90 clk
Fp2::mul       305.68 clk
Fp2::mul_xi     22.51 clk
Fp2::sqr       228.53 clk
Fp2::inv         5.111Kclk
FpDbl::addPre   13.24 clk
FpDbl::subPre   13.21 clk
FpDbl::add      19.52 clk
FpDbl::sub      12.88 clk
FpDbl::mulPre   46.70 clk
FpDbl::sqrPre   40.28 clk
FpDbl::mod      58.60 clk
Fp2Dbl::mulPre  203.47 clk
Fp2Dbl::sqrPre  113.19 clk
GT::add        114.57 clk
GT::mul          7.663Kclk
GT::sqr          5.490Kclk
GT::inv         16.616Kclk
FpDbl::mulPre   46.79 clk
pairing          2.237Mclk <-------------------------
millerLoop     906.605Kclk <-------------------------
finalExp         1.329Mclk <-------------------------
precomputeG2   201.104Kclk
precomputedML  686.266Kclk
millerLoopVec    4.616Mclk
ctest:module=finalExp
finalExp   1.338Mclk
ctest:module=mul_012
ctest:module=pairing
ctest:module=multi
BN254
calcBN1  32.201Kclk
naiveG2  17.132Kclk
calcBN2  64.564Kclk
naiveG2  46.680Kclk
BLS12_381
calcBN1  77.113Kclk
naiveG1  56.032Kclk
calcBN2 157.989Kclk
naiveG2 129.702Kclk
ctest:module=eth2
mapToG2  org-cofactor 904.585Kclk
mapToG2 fast-cofactor 476.747Kclk
ctest:module=deserialize
verifyOrder(1)
deserializeG1 345.128Kclk
deserializeG2 842.046Kclk
verifyOrder(0)
deserializeG1  57.978Kclk
deserializeG2 121.257Kclk
ctest:name=bls12_test, module=8, total=3600, ok=3600, ng=0, exception=0

MCL using assembly generated via LLVM i256 and i384 (x86 and ARM)

$  bin/bls12_test.exe -m llvm_mont
JIT 1
ctest:module=size
ctest:module=naive
i=0 curve=BLS12_381
G1
G2
GT
G1::mulCT      208.328Kclk <-------------------------
G1::mul        212.953Kclk <-------------------------
G1::add          1.279Kclk <-------------------------
G1::dbl        919.59 clk
G2::mulCT      522.604Kclk <-------------------------
G2::mul        537.523Kclk <-------------------------
G2::add          4.638Kclk <-------------------------
G2::dbl          2.704Kclk
GT::pow        916.182Kclk
G1::setStr chk 301.920Kclk
G1::setStr       2.684Kclk
G2::setStr chk 935.164Kclk
G2::setStr       5.723Kclk
hashAndMapToG1 239.817Kclk
hashAndMapToG2 582.365Kclk
Fp::add         13.37 clk
Fp::sub         13.86 clk
Fp::neg          8.78 clk
Fp::mul        110.30 clk
Fp::sqr        110.73 clk
Fp::inv          4.594Kclk
Fp2::add        25.68 clk
Fp2::sub        27.10 clk
Fp2::neg        19.76 clk
Fp2::mul       453.66 clk
Fp2::mul_xi     35.83 clk
Fp2::sqr       254.18 clk
Fp2::inv         5.168Kclk
FpDbl::addPre   14.72 clk
FpDbl::subPre   14.64 clk
FpDbl::add      19.02 clk
FpDbl::sub      17.55 clk
FpDbl::mulPre   54.56 clk
FpDbl::sqrPre   51.65 clk
FpDbl::mod      80.36 clk
Fp2Dbl::mulPre  243.72 clk
Fp2Dbl::sqrPre  145.11 clk
GT::add        168.85 clk
GT::mul          8.658Kclk
GT::sqr          6.252Kclk
GT::inv         20.416Kclk
FpDbl::mulPre   53.03 clk
pairing          2.717Mclk <-------------------------
millerLoop       1.105Mclk <-------------------------
finalExp         1.550Mclk <-------------------------
precomputeG2   269.778Kclk
precomputedML  852.854Kclk
millerLoopVec    5.598Mclk
ctest:module=finalExp
finalExp   1.547Mclk
ctest:module=mul_012
ctest:module=pairing
ctest:module=multi
BN254
calcBN1  34.282Kclk
naiveG2  17.954Kclk
calcBN2  65.749Kclk
naiveG2  48.280Kclk
BLS12_381
calcBN1  82.024Kclk
naiveG1  60.722Kclk
calcBN2 167.170Kclk
naiveG2 139.621Kclk
ctest:module=eth2
mapToG2  org-cofactor   1.150Mclk
mapToG2 fast-cofactor 585.033Kclk
ctest:module=deserialize
verifyOrder(1)
deserializeG1 366.132Kclk
deserializeG2   1.152Mclk
verifyOrder(0)
deserializeG1  62.822Kclk
deserializeG2 130.251Kclk
ctest:name=bls12_test, module=8, total=3600, ok=3600, ng=0, exception=0

Nim-blscurve using Milagro

Warmup: 0.9038 s, result 224 (displayed to avoid compiler optimizing warmup away)


Compiled with GCC
Optimization level => no optimization: false | release: true | danger: true
Using Milagro with 64-bit limbs
Running on Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz



⚠️ Cycles measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
i.e. a 20% overclock will be about 20% off (assuming no dynamic frequency scaling)

=================================================================================================================

Scalar multiplication G1                                   2399.307 ops/s       416787 ns/op      1250378 cycles
Scalar multiplication G2                                    890.692 ops/s      1122722 ns/op      3368208 cycles
EC add G1                                                911577.028 ops/s         1097 ns/op         3291 cycles
EC add G2                                                323310.702 ops/s         3093 ns/op         9281 cycles
Pairing (Milagro builtin double pairing)                    420.039 ops/s      2380729 ns/op      7142276 cycles
Pairing (Multi-Pairing with delayed Miller and Exp)         417.711 ops/s      2393999 ns/op      7182089 cycles

⚠️ Warning: using draft v5 of IETF Hash-To-Curve (HKDF-based).
           This is an outdated draft.

Hash to G2 (Draft #5)                                       937.034 ops/s      1067197 ns/op      3201632 cycles

Conclusion

(Our cycles and MCL clocks/clk are the same unit.)

  • Scalar multiplication in G2 / signing is about 8x slower
  • Pairing / verification is about 3x slower
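
As a quick sanity check of those ratios (not part of the original runs), a minimal stand-alone sketch that recomputes them from the figures reported above, assuming as stated that Milagro cycles and MCL clk are the same unit:

#include <stdio.h>

int main(void)
{
    /* Figures copied from the MCL JIT run and the nim-blscurve/Milagro run above. */
    const double milagro_g2_mul  = 3368208.0;  /* "Scalar multiplication G2" cycles */
    const double mcl_g2_mul      =  407538.0;  /* "G2::mul" clk */
    const double milagro_pairing = 7142276.0;  /* "Pairing (Milagro builtin double pairing)" cycles */
    const double mcl_pairing     = 2237000.0;  /* "pairing" clk (2.237 Mclk) */

    printf("G2 mul  : %.1fx slower\n", milagro_g2_mul  / mcl_g2_mul);   /* ~8.3x */
    printf("Pairing : %.1fx slower\n", milagro_pairing / mcl_pairing);  /* ~3.2x */
    return 0;
}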


mratsim commented Jun 4, 2020

Explanation for the scalar multiplication slowness:

  • The order r has a bit width of 255, but the benchmark uses 381-bit scalars:

    proc benchScalarMultG1*(iters: int) =
      var x = generator1()
      var scal: BIG_384
      random(scal)
      bench("Scalar multiplication G1", iters):
        x.mul(scal)

    proc benchScalarMultG2*(iters: int) =
      var x = generator2()
      var scal: BIG_384
      random(scal)
      bench("Scalar multiplication G2", iters):
        x.mul(scal)

    This means about 50% more operations (381/255 ≈ 1.5). A corrected sampling is sketched below.
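
A minimal sketch of a fairer benchmark, written against the AMCL C API rather than the Nim wrapper (BIG_384_58_randomnum, ECP2_BLS381_generator and the csprng type are assumptions taken from AMCL, not from nim-blscurve): sample the scalar below the group order so the multiplication only processes ~255 bits.

#include "amcl.h"          /* csprng */
#include "pair_BLS381.h"   /* BIG_384_58, ECP2_BLS381, CURVE_Order_BLS381 */

/* Sample the scalar uniformly in [0, r) so that nbits(scal) is ~255, not ~381. */
void bench_scalar_mult_g2(csprng *rng, int iters)
{
    BIG_384_58 r, scal;
    ECP2_BLS381 P;

    BIG_384_58_rcopy(r, CURVE_Order_BLS381);  /* group order r, 255 bits */
    BIG_384_58_randomnum(scal, r, rng);       /* assumed AMCL helper: uniform mod r */

    for (int i = 0; i < iters; i++) {
        ECP2_BLS381_generator(&P);            /* reset the point; negligible next to the mul */
        ECP2_BLS381_mul(&P, scal);            /* same generic mul as the current bench */
    }
}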


mratsim commented Jun 14, 2020

Actually, looking at the code, the scalar multiplication uses nbits, which does check (and leaks) the number of exponent bits:

int BIG_384_58_nbits(BIG_384_58 a)
{
    int bts, k = NLEN_384_58 - 1;
    BIG_384_58 t;
    chunk c;
    BIG_384_58_copy(t, a);
    BIG_384_58_norm(t);
    while (k >= 0 && t[k] == 0) k--;
    if (k < 0) return 0;
    bts = BASEBITS_384_58 * k;
    c = t[k];
    while (c != 0)
    {
        c /= 2;
        bts++;
    }
    return bts;
}

nb = 1 + (BIG_384_58_nbits(t) + 3) / 4;
/* convert exponent to signed 4-bit window */
for (i = 0; i < nb; i++)
{
    w[i] = BIG_384_58_lastbits(t, 5) - 16;
    BIG_384_58_dec(t, w[i]);
    BIG_384_58_norm(t);
    BIG_384_58_fshr(t, 4);
}
w[nb] = BIG_384_58_lastbits(t, 5);
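
As a quick check of the "50% more operations" estimate from the previous comment, the window formula above (nb = 1 + (nbits + 3) / 4) gives about 1.5x more iterations for a ~381-bit scalar than for a 255-bit one; a minimal stand-alone sketch:

#include <stdio.h>

int main(void)
{
    /* Same 4-bit signed-window count formula as in the code above. */
    int nb_381 = 1 + (381 + 3) / 4;   /* 97 windows for a ~381-bit scalar */
    int nb_255 = 1 + (255 + 3) / 4;   /* 65 windows for a 255-bit scalar  */

    printf("381 bits: %d windows, 255 bits: %d windows, ratio %.2f\n",
           nb_381, nb_255, (double)nb_381 / nb_255);   /* ratio ~1.49 */
    return 0;
}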

However, there is a special path in pair_BLS381.c that uses endomorphism acceleration when it is activated:

/* Multiply P by e in group G2 */
void PAIR_BLS381_G2mul(ECP2_BLS381 *P, BIG_384_58 e)
{
#ifdef USE_GS_G2_BLS381 /* Well I didn't patent it :) */
    int i, np, nn;
    ECP2_BLS381 Q[4];
    FP2_BLS381 X;
    FP_BLS381 fx, fy;
    BIG_384_58 x, y, u[4];
    FP_BLS381_rcopy(&fx, Fra_BLS381);
    FP_BLS381_rcopy(&fy, Frb_BLS381);
    FP2_BLS381_from_FPs(&X, &fx, &fy);
#if SEXTIC_TWIST_BLS381==M_TYPE
    FP2_BLS381_inv(&X, &X);
    FP2_BLS381_norm(&X);
#endif
    BIG_384_58_rcopy(y, CURVE_Order_BLS381);
    gs(u, e);
    ECP2_BLS381_copy(&Q[0], P);
    for (i = 1; i < 4; i++)
    {
        ECP2_BLS381_copy(&Q[i], &Q[i-1]);
        ECP2_BLS381_frob(&Q[i], &X);
    }
    for (i = 0; i < 4; i++)
    {
        np = BIG_384_58_nbits(u[i]);
        BIG_384_58_modneg(x, u[i], y);
        nn = BIG_384_58_nbits(x);
        if (nn < np)
        {
            BIG_384_58_copy(u[i], x);
            ECP2_BLS381_neg(&Q[i]);
        }
        BIG_384_58_norm(u[i]);
    }
    ECP2_BLS381_mul4(P, Q, u);
#else
    ECP2_BLS381_mul(P, e);
#endif
}
This path should be benchmarked as well, and should probably be used instead of the generic ecp_mul; a sketch of that comparison follows.
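
A minimal sketch of that comparison, again assuming the AMCL C API (ECP2_BLS381_generator and the clock()-based timing are placeholders for whatever harness is actually used):

#include <stdio.h>
#include <time.h>
#include "pair_BLS381.h"

#define ITERS 1000

int main(void)
{
    BIG_384_58 r, scal;
    ECP2_BLS381 P;
    clock_t t0;

    BIG_384_58_rcopy(r, CURVE_Order_BLS381);
    BIG_384_58_copy(scal, r);
    BIG_384_58_dec(scal, 1);                  /* worst-case ~255-bit scalar: r - 1 */

    t0 = clock();
    for (int i = 0; i < ITERS; i++) {
        ECP2_BLS381_generator(&P);
        ECP2_BLS381_mul(&P, scal);            /* generic fixed-window multiplication */
    }
    printf("ECP2_BLS381_mul   : %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (int i = 0; i < ITERS; i++) {
        ECP2_BLS381_generator(&P);
        PAIR_BLS381_G2mul(&P, scal);          /* GLS endomorphism path when USE_GS_G2_BLS381 is set */
    }
    printf("PAIR_BLS381_G2mul : %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
    return 0;
}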


mratsim commented Sep 24, 2020

Updated MCL perf as of https://github.com/herumi/mcl/tree/b390d6d4ffded57727ff752fb0209e0d397ed946 (Sept 21)

JIT 1
ctest:module=size
ctest:module=naive
i=0 curve=BLS12_381
G1
G2
GT
G1::mulCT      224.863Kclk
G1::mul        207.081Kclk
G1::add          1.225Kclk
G1::dbl        900.69 clk
G2::mulCT      464.260Kclk
G2::mul        402.097Kclk
G2::add          3.414Kclk
G2::dbl          2.120Kclk
GT::pow        684.362Kclk
G1::setStr chk 289.395Kclk
G1::setStr       2.485Kclk
G2::setStr chk 702.704Kclk
G2::setStr       5.159Kclk
hashAndMapToG1 221.856Kclk
hashAndMapToG2 453.844Kclk
Fp::add         13.62 clk
Fp::sub          9.28 clk
Fp::neg          8.05 clk
Fp::mul        101.33 clk
Fp::sqr         98.98 clk
Fp::inv          4.536Kclk
Fp::pow         52.610Kclk
Fr::add         10.24 clk
Fr::sub          9.57 clk
Fr::neg          5.88 clk
Fr::mul         53.96 clk
Fr::sqr         54.48 clk
Fr::inv          1.895Kclk
Fr::pow         19.889Kclk
Fp2::add        21.71 clk
Fp2::sub        16.92 clk
Fp2::neg        14.85 clk
Fp2::mul       310.27 clk
Fp2::mul_xi     29.50 clk
Fp2::sqr       225.28 clk
Fp2::inv         5.093Kclk
FpDbl::addPre    9.66 clk
FpDbl::subPre    9.60 clk
FpDbl::add      15.96 clk
FpDbl::sub      10.69 clk
FpDbl::mulPre   46.81 clk
FpDbl::sqrPre   40.39 clk
FpDbl::mod      58.99 clk
Fp2Dbl::mulPre  187.35 clk
Fp2Dbl::sqrPre  111.38 clk
GT::add        114.84 clk
GT::mul          6.364Kclk
GT::sqr          4.603Kclk
GT::inv         15.891Kclk
FpDbl::mulPre   46.84 clk
pairing          1.980Mclk
millerLoop     841.630Kclk
finalExp         1.115Mclk
precomputeG2   202.196Kclk
precomputedML  613.377Kclk
millerLoopVec    4.277Mclk
ctest:module=finalExp
finalExp   1.103Mclk
ctest:module=mul_012
ctest:module=pairing
ctest:module=multi
BN254
calcBN1  32.363Kclk
naiveG2  17.237Kclk
calcBN2  64.593Kclk
naiveG2  46.927Kclk
BLS12_381
calcBN1  76.468Kclk
naiveG1  55.494Kclk
calcBN2 156.801Kclk
naiveG2 128.367Kclk
ctest:module=deserialize
verifyOrder(1)
deserializeG1 346.770Kclk
deserializeG2 818.290Kclk
verifyOrder(0)
deserializeG1  59.815Kclk
deserializeG2 119.463Kclk
ctest:module=verifyG1
ctest:module=verifyG2
ctest:name=bls12_test, module=9, total=3727, ok=3727, ng=0, exception=0

Pairing has been accelerated by 14%

mratsim closed this as completed on Oct 6, 2020