Assembly backend #69

Merged
merged 18 commits into from
Jul 24, 2020

@mratsim mratsim commented Jul 23, 2020

This adds an inline assembly backend to Constantine, using the x86-64 MULX (BMI2) and ADCX/ADOX (ADX) instructions.
Closes #39
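
The description names the instructions but does not show a kernel, so here is a minimal Python model of the general technique they enable: MULX produces a 128-bit product without touching flags, which lets ADCX and ADOX run two independent carry chains through the same row of a multi-limb multiplication. This is a sketch of that data flow, not Constantine's actual code; the helper names (`mulx`, `add_carry`, `mul_limbs`, `to_limbs`, `from_limbs`) are hypothetical, and it models a plain schoolbook product rather than a modular (field) multiplication.

```python
MASK64 = (1 << 64) - 1  # one 64-bit limb


def mulx(a, b):
    """Model of MULX: 64x64 -> 128-bit product as (hi, lo), no flags involved."""
    p = a * b
    return p >> 64, p & MASK64


def add_carry(x, y, carry_in):
    """Model of ADCX/ADOX: 64-bit add with carry-in, returns (sum, carry_out)."""
    s = x + y + carry_in
    return s & MASK64, s >> 64


def mul_limbs(a, b):
    """Schoolbook product of two little-endian limb lists of equal length."""
    n = len(a)
    assert len(b) == n
    r = [0] * (2 * n)
    for i in range(n):
        cf = 0  # carry chain that ADCX would keep in the carry flag
        of = 0  # carry chain that ADOX would keep in the overflow flag
        for j in range(n):
            hi, lo = mulx(a[j], b[i])
            # Low halves go through one chain, high halves through the other.
            r[i + j], cf = add_carry(r[i + j], lo, cf)
            r[i + j + 1], of = add_carry(r[i + j + 1], hi, of)
        # Fold the remaining low-half carry into the top limb of this row.
        r[i + n], cf = add_carry(r[i + n], 0, cf)
        # The partial product always fits in limbs 0..i+n.
        assert cf == 0 and of == 0
    return r


def to_limbs(x, n):
    """Split a non-negative integer into n little-endian 64-bit limbs."""
    return [(x >> (64 * k)) & MASK64 for k in range(n)]


def from_limbs(limbs):
    """Recombine little-endian 64-bit limbs into an integer."""
    return sum(limb << (64 * k) for k, limb in enumerate(limbs))


if __name__ == "__main__":
    x = (1 << 256) - 1
    y = (1 << 255) + 12345
    assert from_limbs(mul_limbs(to_limbs(x, 4), to_limbs(y, 4))) == x * y
```

In hardware the benefit is that the two chains use different flags, so the additions of low and high halves can be interleaved without saving and restoring a carry between them.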

Benchmarks

Field arithmetic

GCC

-------------------------------------------------------------------------------------------------------------------------------------------------
Addition                                           Fp[BN254_Snarks]     166666666.667 ops/s             6 ns/op            18 CPU cycles (approx)
Substraction                                       Fp[BN254_Snarks]     333333333.333 ops/s             3 ns/op            10 CPU cycles (approx)
Negation                                           Fp[BN254_Snarks]     500000000.000 ops/s             2 ns/op             6 CPU cycles (approx)
Multiplication                                     Fp[BN254_Snarks]      30303030.303 ops/s            33 ns/op           101 CPU cycles (approx)
Squaring                                           Fp[BN254_Snarks]      30303030.303 ops/s            33 ns/op           101 CPU cycles (approx)
Inversion (constant-time Euclid)                   Fp[BN254_Snarks]        122428.991 ops/s          8168 ns/op         24506 CPU cycles (approx)
Inversion via exponentiation p-2 (Little Fermat)   Fp[BN254_Snarks]         85918.034 ops/s         11639 ns/op         34918 CPU cycles (approx)
Square Root + square check (constant-time)         Fp[BN254_Snarks]         86140.064 ops/s         11609 ns/op         34828 CPU cycles (approx)
Exp curve order (constant-time) - 254-bit          Fp[BN254_Snarks]         68240.753 ops/s         14654 ns/op         43964 CPU cycles (approx)
Exp curve order (Leak exponent bits) - 254-bit     Fp[BN254_Snarks]         85800.086 ops/s         11655 ns/op         34967 CPU cycles (approx)
-------------------------------------------------------------------------------------------------------------------------------------------------
Addition                                           Fp[BLS12_381]        111111111.111 ops/s             9 ns/op            27 CPU cycles (approx)
Substraction                                       Fp[BLS12_381]        200000000.000 ops/s             5 ns/op            16 CPU cycles (approx)
Negation                                           Fp[BLS12_381]        333333333.333 ops/s             3 ns/op            10 CPU cycles (approx)
Multiplication                                     Fp[BLS12_381]         16129032.258 ops/s            62 ns/op           186 CPU cycles (approx)
Squaring                                           Fp[BLS12_381]         16393442.623 ops/s            61 ns/op           185 CPU cycles (approx)
Inversion (constant-time Euclid)                   Fp[BLS12_381]            43550.213 ops/s         22962 ns/op         68888 CPU cycles (approx)
Inversion via exponentiation p-2 (Little Fermat)   Fp[BLS12_381]            31477.226 ops/s         31769 ns/op         95307 CPU cycles (approx)
Square Root + square check (constant-time)         Fp[BLS12_381]            32491.796 ops/s         30777 ns/op         92332 CPU cycles (approx)
Exp curve order (constant-time) - 255-bit          Fp[BLS12_381]            39643.211 ops/s         25225 ns/op         75678 CPU cycles (approx)
Exp curve order (Leak exponent bits) - 255-bit     Fp[BLS12_381]            47205.438 ops/s         21184 ns/op         63553 CPU cycles (approx)
-------------------------------------------------------------------------------------------------------------------------------------------------

Clang

-------------------------------------------------------------------------------------------------------------------------------------------------
Addition                                           Fp[BN254_Snarks]               inf ops/s             0 ns/op             0 CPU cycles (approx)
Substraction                                       Fp[BN254_Snarks]               inf ops/s             0 ns/op             0 CPU cycles (approx)
Negation                                           Fp[BN254_Snarks]               inf ops/s             0 ns/op             0 CPU cycles (approx)
Multiplication                                     Fp[BN254_Snarks]      47619047.619 ops/s            21 ns/op            64 CPU cycles (approx)
Squaring                                           Fp[BN254_Snarks]      47619047.619 ops/s            21 ns/op            64 CPU cycles (approx)
Inversion (constant-time Euclid)                   Fp[BN254_Snarks]        163719.712 ops/s          6108 ns/op         18324 CPU cycles (approx)
Inversion via exponentiation p-2 (Little Fermat)   Fp[BN254_Snarks]        107770.234 ops/s          9279 ns/op         27838 CPU cycles (approx)
Square Root + square check (constant-time)         Fp[BN254_Snarks]        108119.797 ops/s          9249 ns/op         27748 CPU cycles (approx)
Exp curve order (constant-time) - 254-bit          Fp[BN254_Snarks]         88214.538 ops/s         11336 ns/op         34010 CPU cycles (approx)
Exp curve order (Leak exponent bits) - 254-bit     Fp[BN254_Snarks]        107411.386 ops/s          9310 ns/op         27931 CPU cycles (approx)
-------------------------------------------------------------------------------------------------------------------------------------------------
Addition                                           Fp[BLS12_381]                  inf ops/s             0 ns/op             0 CPU cycles (approx)
Substraction                                       Fp[BLS12_381]                  inf ops/s             0 ns/op             0 CPU cycles (approx)
Negation                                           Fp[BLS12_381]                  inf ops/s             0 ns/op             0 CPU cycles (approx)
Multiplication                                     Fp[BLS12_381]         22222222.222 ops/s            45 ns/op           137 CPU cycles (approx)
Squaring                                           Fp[BLS12_381]         22222222.222 ops/s            45 ns/op           137 CPU cycles (approx)
Inversion (constant-time Euclid)                   Fp[BLS12_381]            63991.809 ops/s         15627 ns/op         46882 CPU cycles (approx)
Inversion via exponentiation p-2 (Little Fermat)   Fp[BLS12_381]            39909.007 ops/s         25057 ns/op         75173 CPU cycles (approx)
Square Root + square check (constant-time)         Fp[BLS12_381]            39870.819 ops/s         25081 ns/op         75245 CPU cycles (approx)
Exp curve order (constant-time) - 255-bit          Fp[BLS12_381]            49096.622 ops/s         20368 ns/op         61105 CPU cycles (approx)
Exp curve order (Leak exponent bits) - 255-bit     Fp[BLS12_381]            58008.005 ops/s         17239 ns/op         51718 CPU cycles (approx)
-------------------------------------------------------------------------------------------------------------------------------------------------

Assembly (GCC compiler)

-------------------------------------------------------------------------------------------------------------------------------------------------
Addition                                           Fp[BN254_Snarks]     333333333.333 ops/s             3 ns/op             9 CPU cycles (approx)
Substraction                                       Fp[BN254_Snarks]     500000000.000 ops/s             2 ns/op             8 CPU cycles (approx)
Negation                                           Fp[BN254_Snarks]               inf ops/s             0 ns/op             2 CPU cycles (approx)
Multiplication                                     Fp[BN254_Snarks]      71428571.429 ops/s            14 ns/op            43 CPU cycles (approx)
Squaring                                           Fp[BN254_Snarks]      71428571.429 ops/s            14 ns/op            43 CPU cycles (approx)
Inversion (constant-time Euclid)                   Fp[BN254_Snarks]        122549.020 ops/s          8160 ns/op         24482 CPU cycles (approx)
Inversion via exponentiation p-2 (Little Fermat)   Fp[BN254_Snarks]        153917.193 ops/s          6497 ns/op         19492 CPU cycles (approx)
Square Root + square check (constant-time)         Fp[BN254_Snarks]        154249.576 ops/s          6483 ns/op         19449 CPU cycles (approx)
Exp curve order (constant-time) - 254-bit          Fp[BN254_Snarks]        107273.117 ops/s          9322 ns/op         27966 CPU cycles (approx)
Exp curve order (Leak exponent bits) - 254-bit     Fp[BN254_Snarks]        154535.620 ops/s          6471 ns/op         19415 CPU cycles (approx)
-------------------------------------------------------------------------------------------------------------------------------------------------
Addition                                           Fp[BLS12_381]        250000000.000 ops/s             4 ns/op            14 CPU cycles (approx)
Substraction                                       Fp[BLS12_381]        250000000.000 ops/s             4 ns/op            13 CPU cycles (approx)
Negation                                           Fp[BLS12_381]       1000000000.000 ops/s             1 ns/op             4 CPU cycles (approx)
Multiplication                                     Fp[BLS12_381]         35714285.714 ops/s            28 ns/op            86 CPU cycles (approx)
Squaring                                           Fp[BLS12_381]         33333333.333 ops/s            30 ns/op            91 CPU cycles (approx)
Inversion (constant-time Euclid)                   Fp[BLS12_381]            44012.147 ops/s         22721 ns/op         68165 CPU cycles (approx)
Inversion via exponentiation p-2 (Little Fermat)   Fp[BLS12_381]            64143.682 ops/s         15590 ns/op         46770 CPU cycles (approx)
Square Root + square check (constant-time)         Fp[BLS12_381]            64020.487 ops/s         15620 ns/op         46860 CPU cycles (approx)
Exp curve order (constant-time) - 255-bit          Fp[BLS12_381]            68714.354 ops/s         14553 ns/op         43661 CPU cycles (approx)
Exp curve order (Leak exponent bits) - 255-bit     Fp[BLS12_381]            93170.595 ops/s         10733 ns/op         32201 CPU cycles (approx)
-------------------------------------------------------------------------------------------------------------------------------------------------

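On field multiplication, inline assembly with MULX/ADCX/ADOX instructions is about 2.2x faster than GCC and 1.6x faster than Clang on Fp[BLS12_381] (86 vs 186 and 137 CPU cycles), and about 2.3x and 1.5x faster on Fp[BN254_Snarks] (43 vs 101 and 64 CPU cycles).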
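
The speedup figures can be rechecked from the "CPU cycles (approx)" column of the Multiplication rows above. The short sketch below does that arithmetic; the dictionary layout and the `speedup` helper are illustrative choices, and the cycle counts are copied from the tables.

```python
# Approximate CPU cycles for Fp Multiplication, copied from the tables above.
FP_MUL_CYCLES = {
    "BN254_Snarks": {"GCC": 101, "Clang": 64, "Assembly": 43},
    "BLS12_381": {"GCC": 186, "Clang": 137, "Assembly": 86},
}


def speedup(baseline_cycles, optimized_cycles):
    """How many times faster the optimized build is than the baseline."""
    return baseline_cycles / optimized_cycles


if __name__ == "__main__":
    for curve, cycles in FP_MUL_CYCLES.items():
        vs_gcc = speedup(cycles["GCC"], cycles["Assembly"])
        vs_clang = speedup(cycles["Clang"], cycles["Assembly"])
        print(f"{curve}: {vs_gcc:.1f}x vs GCC, {vs_clang:.1f}x vs Clang")
```

The 2.2x and 1.6x figures correspond to the BLS12_381 rows; the BN254_Snarks rows give a slightly larger gain over GCC and a slightly smaller one over Clang.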

Elliptic G1 Arithmetic

GCC

=================================================================================================================

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Add G1                                                    ECP_SWei_Proj[Fp[BN254_Snarks]]              1893939.394 ops/s           528 ns/op          1585 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Double G1                                                 ECP_SWei_Proj[Fp[BN254_Snarks]]              2923976.608 ops/s           342 ns/op          1026 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G1 (unsafe reference DoubleAdd)                 ECP_SWei_Proj[Fp[BN254_Snarks]]                 6159.380 ops/s        162354 ns/op        487070 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 4)                    ECP_SWei_Proj[Fp[BN254_Snarks]]                 4428.482 ops/s        225811 ns/op        677443 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 8)                    ECP_SWei_Proj[Fp[BN254_Snarks]]                 6246.486 ops/s        160090 ns/op        480277 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 16)                   ECP_SWei_Proj[Fp[BN254_Snarks]]                 7102.475 ops/s        140796 ns/op        422393 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G1 (endomorphism accelerated)                   ECP_SWei_Proj[Fp[BN254_Snarks]]                 8578.904 ops/s        116565 ns/op        349699 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Add G1                                                    ECP_SWei_Proj[Fp[BLS12_381]]                 1074113.856 ops/s           931 ns/op          2794 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Double G1                                                 ECP_SWei_Proj[Fp[BLS12_381]]                 1680672.269 ops/s           595 ns/op          1787 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G1 (unsafe reference DoubleAdd)                 ECP_SWei_Proj[Fp[BLS12_381]]                    3590.522 ops/s        278511 ns/op        835545 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 4)                    ECP_SWei_Proj[Fp[BLS12_381]]                    2525.629 ops/s        395941 ns/op       1187839 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 8)                    ECP_SWei_Proj[Fp[BLS12_381]]                    3568.650 ops/s        280218 ns/op        840667 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 16)                   ECP_SWei_Proj[Fp[BLS12_381]]                    4062.035 ops/s        246182 ns/op        738557 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G1 (endomorphism accelerated)                   ECP_SWei_Proj[Fp[BLS12_381]]                    4875.789 ops/s        205095 ns/op        615293 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Clang

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Add G1                                                    ECP_SWei_Proj[Fp[BN254_Snarks]]              3039513.678 ops/s           329 ns/op           987 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Double G1                                                 ECP_SWei_Proj[Fp[BN254_Snarks]]              4651162.791 ops/s           215 ns/op           645 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G1 (unsafe reference DoubleAdd)                 ECP_SWei_Proj[Fp[BN254_Snarks]]                10343.508 ops/s         96679 ns/op        290041 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 4)                    ECP_SWei_Proj[Fp[BN254_Snarks]]                 6994.230 ops/s        142975 ns/op        428932 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 8)                    ECP_SWei_Proj[Fp[BN254_Snarks]]                 9755.241 ops/s        102509 ns/op        307531 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 16)                   ECP_SWei_Proj[Fp[BN254_Snarks]]                11101.613 ops/s         90077 ns/op        270234 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G1 (endomorphism accelerated)                   ECP_SWei_Proj[Fp[BN254_Snarks]]                13578.470 ops/s         73646 ns/op        220942 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Add G1                                                    ECP_SWei_Proj[Fp[BLS12_381]]                 1515151.515 ops/s           660 ns/op          1981 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Double G1                                                 ECP_SWei_Proj[Fp[BLS12_381]]                 2364066.194 ops/s           423 ns/op          1271 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G1 (unsafe reference DoubleAdd)                 ECP_SWei_Proj[Fp[BLS12_381]]                    4960.047 ops/s        201611 ns/op        604840 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 4)                    ECP_SWei_Proj[Fp[BLS12_381]]                    3545.898 ops/s        282016 ns/op        846058 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 8)                    ECP_SWei_Proj[Fp[BLS12_381]]                    4995.055 ops/s        200198 ns/op        600603 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 16)                   ECP_SWei_Proj[Fp[BLS12_381]]                    5692.232 ops/s        175678 ns/op        527040 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G1 (endomorphism accelerated)                   ECP_SWei_Proj[Fp[BLS12_381]]                    6865.256 ops/s        145661 ns/op        436988 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Assembly

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Add G1                                                    ECP_SWei_Proj[Fp[BN254_Snarks]]              3571428.571 ops/s           280 ns/op           842 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Double G1                                                 ECP_SWei_Proj[Fp[BN254_Snarks]]              5882352.941 ops/s           170 ns/op           511 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G1 (unsafe reference DoubleAdd)                 ECP_SWei_Proj[Fp[BN254_Snarks]]                13011.177 ops/s         76857 ns/op        230574 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 4)                    ECP_SWei_Proj[Fp[BN254_Snarks]]                 8443.094 ops/s        118440 ns/op        355323 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 8)                    ECP_SWei_Proj[Fp[BN254_Snarks]]                11824.664 ops/s         84569 ns/op        253710 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 16)                   ECP_SWei_Proj[Fp[BN254_Snarks]]                13258.379 ops/s         75424 ns/op        226275 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G1 (endomorphism accelerated)                   ECP_SWei_Proj[Fp[BN254_Snarks]]                15889.662 ops/s         62934 ns/op        188805 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Add G1                                                    ECP_SWei_Proj[Fp[BLS12_381]]                 2053388.090 ops/s           487 ns/op          1462 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Double G1                                                 ECP_SWei_Proj[Fp[BLS12_381]]                 3389830.508 ops/s           295 ns/op           887 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G1 (unsafe reference DoubleAdd)                 ECP_SWei_Proj[Fp[BLS12_381]]                    7599.650 ops/s        131585 ns/op        394761 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 4)                    ECP_SWei_Proj[Fp[BLS12_381]]                    4902.345 ops/s        203984 ns/op        611960 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 8)                    ECP_SWei_Proj[Fp[BLS12_381]]                    6900.883 ops/s        144909 ns/op        434732 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G1 (scratchsize = 16)                   ECP_SWei_Proj[Fp[BLS12_381]]                    7757.531 ops/s        128907 ns/op        386726 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G1 (endomorphism accelerated)                   ECP_SWei_Proj[Fp[BLS12_381]]                    9318.883 ops/s        107309 ns/op        321931 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

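On scalar multiplication (used for signing and for deriving a public key from a private key), inline assembly is 1.9x faster than GCC and 1.36x faster than Clang, measured on the endomorphism-accelerated G1 ScalarMul for BLS12_381 (321931 vs 615293 and 436988 CPU cycles).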
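
For anyone comparing these listings programmatically, the following sketch parses rows in the format shown above and recomputes the scalar multiplication ratios. The regular expression and the `parse_row` helper are assumptions based only on the layout of the pasted output, not on the benchmark tool's source; the three sample rows are copied from the GCC, Clang and Assembly listings.

```python
import re

# Assumed row layout: name, two or more spaces, type, ops/s, ns/op, CPU cycles.
ROW = re.compile(
    r"^(?P<name>.+?)\s{2,}(?P<type>\S+)\s+(?P<ops>inf|[\d.]+) ops/s"
    r"\s+(?P<ns>\d+) ns/op\s+(?P<cycles>\d+) CPU cycles"
)


def parse_row(line):
    """Return a dict for a benchmark row, or None for separators and headings."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "type": m.group("type"),
        "ops_per_s": float(m.group("ops")),  # float("inf") is valid
        "ns_per_op": int(m.group("ns")),
        "cycles": int(m.group("cycles")),
    }


# Endomorphism-accelerated G1 ScalarMul on BLS12_381, one row per build.
SAMPLES = {
    "GCC": "EC ScalarMul G1 (endomorphism accelerated)                   ECP_SWei_Proj[Fp[BLS12_381]]                    4875.789 ops/s        205095 ns/op        615293 CPU cycles (approx)",
    "Clang": "EC ScalarMul G1 (endomorphism accelerated)                   ECP_SWei_Proj[Fp[BLS12_381]]                    6865.256 ops/s        145661 ns/op        436988 CPU cycles (approx)",
    "Assembly": "EC ScalarMul G1 (endomorphism accelerated)                   ECP_SWei_Proj[Fp[BLS12_381]]                    9318.883 ops/s        107309 ns/op        321931 CPU cycles (approx)",
}

if __name__ == "__main__":
    cycles = {build: parse_row(line)["cycles"] for build, line in SAMPLES.items()}
    print(f"vs GCC:   {cycles['GCC'] / cycles['Assembly']:.2f}x")
    print(f"vs Clang: {cycles['Clang'] / cycles['Assembly']:.2f}x")
```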

Elliptic G2 Arithmetic

GCC

=================================================================================================================

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Add G2                                                    ECP_SWei_Proj[Fp2[BN254_Snarks]]              480769.231 ops/s          2080 ns/op          6240 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Double G2                                                 ECP_SWei_Proj[Fp2[BN254_Snarks]]              874890.639 ops/s          1143 ns/op          3430 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G2 (unsafe reference DoubleAdd)                 ECP_SWei_Proj[Fp2[BN254_Snarks]]                1804.032 ops/s        554314 ns/op       1662965 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 4)                    ECP_SWei_Proj[Fp2[BN254_Snarks]]                1193.246 ops/s        838050 ns/op       2514177 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 8)                    ECP_SWei_Proj[Fp2[BN254_Snarks]]                1747.815 ops/s        572143 ns/op       1716451 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 16)                   ECP_SWei_Proj[Fp2[BN254_Snarks]]                2014.638 ops/s        496367 ns/op       1489121 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Add G2                                                    ECP_SWei_Proj[Fp2[BLS12_381]]                 308546.745 ops/s          3241 ns/op          9724 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Double G2                                                 ECP_SWei_Proj[Fp2[BLS12_381]]                 534188.034 ops/s          1872 ns/op          5617 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G2 (unsafe reference DoubleAdd)                 ECP_SWei_Proj[Fp2[BLS12_381]]                   1132.412 ops/s        883071 ns/op       2649242 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 4)                    ECP_SWei_Proj[Fp2[BLS12_381]]                    758.775 ops/s       1317914 ns/op       3953792 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 8)                    ECP_SWei_Proj[Fp2[BLS12_381]]                   1091.443 ops/s        916218 ns/op       2748688 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 16)                   ECP_SWei_Proj[Fp2[BLS12_381]]                   1261.319 ops/s        792821 ns/op       2378491 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Clang

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Add G2                                                    ECP_SWei_Proj[Fp2[BN254_Snarks]]              743494.424 ops/s          1345 ns/op          4035 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Double G2                                                 ECP_SWei_Proj[Fp2[BN254_Snarks]]             1298701.299 ops/s           770 ns/op          2312 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G2 (unsafe reference DoubleAdd)                 ECP_SWei_Proj[Fp2[BN254_Snarks]]                2602.797 ops/s        384202 ns/op       1152620 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 4)                    ECP_SWei_Proj[Fp2[BN254_Snarks]]                1800.232 ops/s        555484 ns/op       1666474 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 8)                    ECP_SWei_Proj[Fp2[BN254_Snarks]]                2592.500 ops/s        385728 ns/op       1157199 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 16)                   ECP_SWei_Proj[Fp2[BN254_Snarks]]                2989.519 ops/s        334502 ns/op       1003517 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Add G2                                                    ECP_SWei_Proj[Fp2[BLS12_381]]                 437636.761 ops/s          2285 ns/op          6857 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Double G2                                                 ECP_SWei_Proj[Fp2[BLS12_381]]                 747384.155 ops/s          1338 ns/op          4015 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G2 (unsafe reference DoubleAdd)                 ECP_SWei_Proj[Fp2[BLS12_381]]                   1584.289 ops/s        631198 ns/op       1893617 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 4)                    ECP_SWei_Proj[Fp2[BLS12_381]]                   1059.301 ops/s        944019 ns/op       2832091 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 8)                    ECP_SWei_Proj[Fp2[BLS12_381]]                   1513.326 ops/s        660796 ns/op       1982414 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 16)                   ECP_SWei_Proj[Fp2[BLS12_381]]                   1743.123 ops/s        573683 ns/op       1721071 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Assembly

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Add G2                                                    ECP_SWei_Proj[Fp2[BN254_Snarks]]              757575.758 ops/s          1320 ns/op          3961 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Double G2                                                 ECP_SWei_Proj[Fp2[BN254_Snarks]]             1364256.480 ops/s           733 ns/op          2199 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G2 (unsafe reference DoubleAdd)                 ECP_SWei_Proj[Fp2[BN254_Snarks]]                2840.812 ops/s        352012 ns/op       1056050 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 4)                    ECP_SWei_Proj[Fp2[BN254_Snarks]]                1915.023 ops/s        522187 ns/op       1566581 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 8)                    ECP_SWei_Proj[Fp2[BN254_Snarks]]                2756.537 ops/s        362774 ns/op       1088336 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 16)                   ECP_SWei_Proj[Fp2[BN254_Snarks]]                3189.732 ops/s        313506 ns/op        940531 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Add G2                                                    ECP_SWei_Proj[Fp2[BLS12_381]]                 509164.969 ops/s          1964 ns/op          5892 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC Double G2                                                 ECP_SWei_Proj[Fp2[BLS12_381]]                 914913.083 ops/s          1093 ns/op          3280 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul G2 (unsafe reference DoubleAdd)                 ECP_SWei_Proj[Fp2[BLS12_381]]                   1917.307 ops/s        521565 ns/op       1564715 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 4)                    ECP_SWei_Proj[Fp2[BLS12_381]]                   1273.104 ops/s        785482 ns/op       2356472 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 8)                    ECP_SWei_Proj[Fp2[BLS12_381]]                   1829.843 ops/s        546495 ns/op       1639507 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
EC ScalarMul Generic G2 (scratchsize = 16)                   ECP_SWei_Proj[Fp2[BLS12_381]]                   2107.921 ops/s        474401 ns/op       1423220 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

On scalar multiplication (signing and public key derivation from private key)
Inline assembly is 1.71x faster than GCC and 1.21x faster than Clang
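
As a rough cross-check, the speedups can be recomputed from the throughput column of the tables above. The Python sketch below is not part of Constantine. It assumes the G2 BLS12-381 rows with scratchsize = 16 are a fair comparison; the 1.71x and 1.21x figures quoted above may come from other rows, so small differences are expected.

```python
# Throughput in ops/s, copied from the
# "EC ScalarMul Generic G2 (scratchsize = 16)" rows for
# ECP_SWei_Proj[Fp2[BLS12_381]] in the GCC, Clang and Assembly tables above.
ops_per_sec = {
    "GCC": 1261.319,
    "Clang": 1743.123,
    "Assembly": 2107.921,
}


def speedup(fast: str, slow: str) -> float:
    """Ratio of throughputs: how many times faster `fast` is than `slow`."""
    return ops_per_sec[fast] / ops_per_sec[slow]


asm_vs_gcc = speedup("Assembly", "GCC")
asm_vs_clang = speedup("Assembly", "Clang")
print(f"Assembly vs GCC:   {asm_vs_gcc:.2f}x")    # about 1.67x on these rows
print(f"Assembly vs Clang: {asm_vs_clang:.2f}x")  # about 1.21x on these rows
```

The same helper can be pointed at any other row by replacing the three throughput values.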


mratsim commented Jul 24, 2020

We are now 18% faster than MCL on the base field Fp for BLS12-381.

For elliptic curve arithmetic, we are competitive on addition and doubling despite being constant-time and using projective coordinates, which are more expensive than Jacobian coordinates. We are, however, 50% slower on scalar multiplication. Possible remedies:

  • Introduce a window method for the GLV-SAC representation (Window method for GLV acceleration #45). According to the paper, it should close the gap with the wNAF technique used in MCL.
  • Use constant-time Jacobian coordinates, similar to BLST (the cost of constant-time Jacobian formulas is still unknown).

Note that BLST, with Jacobian coordinates, a window method of size 5, and no endomorphism acceleration, is only slightly slower (see status-im/nim-blst#1).
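
For readers unfamiliar with the wNAF technique mentioned above, here is a minimal Python sketch of width-w NAF recoding and the scalar multiplication loop built on it. It is an illustration only, not code from Constantine, MCL, or BLST. The names `wnaf` and `scalar_mul_wnaf` are hypothetical, and the integers modulo `modulus` under addition stand in for elliptic curve points. As written, the pattern of additions depends on the scalar, so this version is not constant-time.

```python
def wnaf(k: int, w: int) -> list:
    """Width-w NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)        # low w bits of k
            if d >= 1 << (w - 1):   # map into the signed range
                d -= 1 << w
            k -= d                  # k is now divisible by 2^w
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def scalar_mul_wnaf(k: int, p: int, modulus: int, w: int = 5) -> int:
    """Compute k*p in the toy additive group Z/modulus using wNAF digits."""
    # Precompute the odd multiples 1p, 3p, ..., (2^(w-1) - 1)p.
    table = {i: (i * p) % modulus for i in range(1, 1 << (w - 1), 2)}
    acc = 0
    for d in reversed(wnaf(k, w)):
        acc = (2 * acc) % modulus              # stands in for point doubling
        if d > 0:
            acc = (acc + table[d]) % modulus   # stands in for point addition
        elif d < 0:
            acc = (acc - table[-d]) % modulus  # addition of the negated point
    return acc


# Toy check: the digits reconstruct the scalar, and the result matches k*p.
k, p, modulus = 0xC0FFEE, 7, 1_000_003
digits = wnaf(k, 5)
assert sum(d * (1 << i) for i, d in enumerate(digits)) == k
assert scalar_mul_wnaf(k, p, modulus) == (k * p) % modulus
```

Every nonzero digit is followed by at least w - 1 zero digits, so the loop performs fewer additions than plain double-and-add, paid for with a small table of odd multiples.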

@mratsim mratsim linked an issue Jul 24, 2020 that may be closed by this pull request
@mratsim mratsim merged commit d97bc9b into master Jul 24, 2020
@mratsim mratsim deleted the assembly-backend branch July 25, 2020 16:37