Double-Precision towering #155

Merged: 22 commits into master from tower-2x on Feb 9, 2021


@mratsim mratsim commented Feb 8, 2021

This PR reworks the towering to introduce double-precision finite fields.

This moderately accelerates G2 operations and significantly improves Fp12 and pairing computations.

  • Double-precision towering: now uses lazy reduction (related to #15 "Implement lazy carries and reductions"; continues #144 "FpDbl revisited")
  • Consistent naming: diffUnr for unreduced Fp operations instead of diffNoReduce
  • Renamed double-width -> double-precision
  • Consistent naming diff2xMod and diff2xUnr for double-precision operations.
    • prod2x, square2x and redc2x for double-precision modular arithmetic
  • Removed canUseNoCarryMontyMul and canUseNoCarryMontySquare in favor of getSpareBits, which gives the number of spare bits in the field element representation.
  • Slightly improved assembly for add/sub via interleaving copies and operations and dependency chain latency hiding.
  • Small addition chains for Bigint, Fp, FpDbl, extension fields and double-precision extension fields use fewer temporaries and copies
  • Alternate formulae for Fp4[BN254_Snarks] squaring, which is 25% faster; fixes #154 "BN254-Snarks: bad performance on Fp4 squaring"
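
To illustrate what lazy reduction buys in the tower, here is a minimal Python sketch of an Fp2 multiplication (with u² = -1) done two ways: reducing after every product, and accumulating unreduced double-precision products before reducing once per coordinate. It is not the code from this PR. The function names only echo the prod2x / diff2xMod / redc2x naming above, the modulus is a toy prime rather than a real curve modulus, and a plain `%` stands in for the actual Montgomery reduction.

```python
# Toy modulus (assumption for illustration): a Mersenne prime with P % 4 == 3,
# so that u^2 = -1 defines a quadratic extension Fp2.
P = 2**61 - 1

_reductions = 0  # counts calls to the expensive reduction


def prod2x(a, b):
    """Double-precision product: the result is NOT reduced modulo P."""
    return a * b


def diff2x_mod(a, b):
    """Double-precision subtraction kept in [0, P*P) by adding P*P on underflow."""
    d = a - b
    return d + P * P if d < 0 else d


def redc2x(x):
    """Reduce a double-precision value to a field element (stand-in: plain %)."""
    global _reductions
    _reductions += 1
    return x % P


def fp2_mul_eager(a, b):
    """Karatsuba in Fp2, reducing after each of the 3 products."""
    (a0, a1), (b0, b1) = a, b
    v0 = redc2x(prod2x(a0, b0))
    v1 = redc2x(prod2x(a1, b1))
    t = redc2x(prod2x((a0 + a1) % P, (b0 + b1) % P))
    # Single-precision modular subtractions (cheap, not counted as reductions)
    return ((v0 - v1) % P, (t - v0 - v1) % P)


def fp2_mul_lazy(a, b):
    """Karatsuba in Fp2 with lazy reduction: 3 products, only 2 reductions."""
    (a0, a1), (b0, b1) = a, b
    v0 = prod2x(a0, b0)
    v1 = prod2x(a1, b1)
    # a0 + a1 is left unreduced; in fixed-width limbs this needs a spare bit
    t = prod2x(a0 + a1, b0 + b1)
    c0 = redc2x(diff2x_mod(v0, v1))   # a0*b0 - a1*b1
    c1 = redc2x(t - v0 - v1)          # a0*b1 + a1*b0, never negative
    return (c0, c1)


def count_reductions(fn, a, b):
    """Run fn(a, b) and return (result, number of reductions performed)."""
    global _reductions
    _reductions = 0
    result = fn(a, b)
    return result, _reductions
```

Calling `count_reductions(fp2_mul_lazy, a, b)` and `count_reductions(fp2_mul_eager, a, b)` on the same inputs should return the same product with 2 and 3 reductions respectively. The unreduced sum in the lazy variant is the kind of operation that depends on spare bits being available in a fixed-width representation.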

mratsim commented Feb 9, 2021

Some benches:

Master Fp2 BLS12-381

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Addition                                                               Fp2[BLS12_381]       125000000.000 ops/s             8 ns/op            24 CPU cycles (approx)
Substraction                                                           Fp2[BLS12_381]       125000000.000 ops/s             8 ns/op            24 CPU cycles (approx)
Negation                                                               Fp2[BLS12_381]       200000000.000 ops/s             5 ns/op            17 CPU cycles (approx)
Conditional Copy                                                       Fp2[BLS12_381]       333333333.333 ops/s             3 ns/op            10 CPU cycles (approx)
Division by 2                                                          Fp2[BLS12_381]                 inf ops/s             0 ns/op             0 CPU cycles (approx)
Multiplication                                                         Fp2[BLS12_381]         9174311.927 ops/s           109 ns/op           328 CPU cycles (approx)
Squaring                                                               Fp2[BLS12_381]        13513513.514 ops/s            74 ns/op           223 CPU cycles (approx)
Inversion (constant-time default impl)                                 Fp2[BLS12_381]           68956.006 ops/s         14502 ns/op         43507 CPU cycles (approx)
Square Root + isSquare (constant-time default impl)                    Fp2[BLS12_381]           34209.086 ops/s         29232 ns/op         87700 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Tower 2x Fp2 BLS12-377 and BLS12-381

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Addition                                                               Fp2[BLS12_377]       125000000.000 ops/s             8 ns/op            24 CPU cycles (approx)
Substraction                                                           Fp2[BLS12_377]       125000000.000 ops/s             8 ns/op            24 CPU cycles (approx)
Negation                                                               Fp2[BLS12_377]       200000000.000 ops/s             5 ns/op            15 CPU cycles (approx)
Conditional Copy                                                       Fp2[BLS12_377]       333333333.333 ops/s             3 ns/op            10 CPU cycles (approx)
Division by 2                                                          Fp2[BLS12_377]                 inf ops/s             0 ns/op             0 CPU cycles (approx)
Multiplication                                                         Fp2[BLS12_377]         7751937.984 ops/s           129 ns/op           389 CPU cycles (approx)
Squaring                                                               Fp2[BLS12_377]         8547008.547 ops/s           117 ns/op           353 CPU cycles (approx)
Inversion (constant-time default impl)                                 Fp2[BLS12_377]           70264.193 ops/s         14232 ns/op         42698 CPU cycles (approx)
Square Root + isSquare (constant-time default impl)                    Fp2[BLS12_377]            8640.205 ops/s        115738 ns/op        347223 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Addition                                                               Fp2[BLS12_381]       125000000.000 ops/s             8 ns/op            24 CPU cycles (approx)
Substraction                                                           Fp2[BLS12_381]       125000000.000 ops/s             8 ns/op            24 CPU cycles (approx)
Negation                                                               Fp2[BLS12_381]       200000000.000 ops/s             5 ns/op            15 CPU cycles (approx)
Conditional Copy                                                       Fp2[BLS12_381]       333333333.333 ops/s             3 ns/op            10 CPU cycles (approx)
Division by 2                                                          Fp2[BLS12_381]                 inf ops/s             0 ns/op             0 CPU cycles (approx)
Multiplication                                                         Fp2[BLS12_381]        10204081.633 ops/s            98 ns/op           294 CPU cycles (approx)
Squaring                                                               Fp2[BLS12_381]        14084507.042 ops/s            71 ns/op           213 CPU cycles (approx)
Inversion (constant-time default impl)                                 Fp2[BLS12_381]           69386.622 ops/s         14412 ns/op         43237 CPU cycles (approx)
Square Root + isSquare (constant-time default impl)                    Fp2[BLS12_381]           34511.320 ops/s         28976 ns/op         86932 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

So about a 10% improvement in multiplication (328 -> 294 cycles) and about 4% in squaring (223 -> 213 cycles) on BLS12-381

Master GT / Fp12 BLS12-381

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Addition                                                               Fp12[BLS12_381]       22222222.222 ops/s            45 ns/op           136 CPU cycles (approx)
Substraction                                                           Fp12[BLS12_381]       22727272.727 ops/s            44 ns/op           133 CPU cycles (approx)
Negation                                                               Fp12[BLS12_381]       32258064.516 ops/s            31 ns/op            93 CPU cycles (approx)
Multiplication                                                         Fp12[BLS12_381]         389408.100 ops/s          2568 ns/op          7704 CPU cycles (approx)
Squaring                                                               Fp12[BLS12_381]         590318.772 ops/s          1694 ns/op          5082 CPU cycles (approx)
Inversion (constant-time default impl)                                 Fp12[BLS12_381]          52162.120 ops/s         19171 ns/op         57516 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Tower 2x GT / Fp12 BLS12-381

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Addition                                                               Fp12[BLS12_381]       22727272.727 ops/s            44 ns/op           134 CPU cycles (approx)
Substraction                                                           Fp12[BLS12_381]       22727272.727 ops/s            44 ns/op           132 CPU cycles (approx)
Negation                                                               Fp12[BLS12_381]       34482758.621 ops/s            29 ns/op            89 CPU cycles (approx)
Multiplication                                                         Fp12[BLS12_381]         465332.713 ops/s          2149 ns/op          6449 CPU cycles (approx)
Squaring                                                               Fp12[BLS12_381]         648508.431 ops/s          1542 ns/op          4628 CPU cycles (approx)
Inversion (constant-time default impl)                                 Fp12[BLS12_381]          54153.580 ops/s         18466 ns/op         55399 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

So a 16% improvement on multiplication and 9% on squarings

Master Pairing BLS12-381

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Line double                                                  BLS12_381            931966.449 ops/s          1073 ns/op          3216 CPU cycles (approx)
Line add                                                     BLS12_381            654450.262 ops/s          1528 ns/op          4583 CPU cycles (approx)
Mul 𝔽p12 by line xy000z                                      BLS12_381            586166.471 ops/s          1706 ns/op          5116 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Final Exponentiation Easy                                    BLS12_381             40466.170 ops/s         24712 ns/op         74134 CPU cycles (approx)
Final Exponentiation Hard BLS12                              BLS12_381              2409.360 ops/s        415048 ns/op       1245168 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Miller Loop BLS12                                            BLS12_381              3311.302 ops/s        301996 ns/op        906004 CPU cycles (approx)
Final Exponentiation BLS12                                   BLS12_381              2273.296 ops/s        439890 ns/op       1319693 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Pairing BLS12                                                BLS12_381              1295.107 ops/s        772137 ns/op       2316448 CPU cycles (approx)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Tower 2x Pairing BLS12-381

Line double                              BLS12_381            960614.793 ops/s          1041 ns/op          3121 CPU cycles (approx)
Line add                                 BLS12_381            709219.858 ops/s          1410 ns/op          4229 CPU cycles (approx)
Mul 𝔽p12 by line xy000z                  BLS12_381            682128.240 ops/s          1466 ns/op          4396 CPU cycles (approx)
------------------------------------------------------------------------------------------------------------------------------------
Final Exponentiation Easy                BLS12_381             43402.778 ops/s         23040 ns/op         69118 CPU cycles (approx)
Final Exponentiation Hard BLS12          BLS12_381              2621.177 ops/s        381508 ns/op       1144545 CPU cycles (approx)
------------------------------------------------------------------------------------------------------------------------------------
Miller Loop BLS12                        BLS12_381              3676.984 ops/s        271962 ns/op        815898 CPU cycles (approx)
Final Exponentiation BLS12               BLS12_381              2472.396 ops/s        404466 ns/op       1213422 CPU cycles (approx)
------------------------------------------------------------------------------------------------------------------------------------
Pairing BLS12                            BLS12_381              1419.436 ops/s        704505 ns/op       2113557 CPU cycles (approx)
------------------------------------------------------------------------------------------------------------------------------------

So roughly a 9-10% improvement overall (2316448 -> 2113557 cycles per pairing).
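
For reference, the percentages quoted between the tables can be recomputed from the cycle counts. The short Python sketch below does this for the BLS12-381 rows; the helper name and the dictionary layout are illustrative only, while the numbers are copied from the benchmark output above.

```python
def cycle_reduction_pct(before, after):
    """Percentage of CPU cycles saved going from `before` to `after`."""
    return 100.0 * (before - after) / before


# (master cycles, tower-2x cycles), copied from the BLS12-381 tables above
benches = {
    "Fp2 multiplication": (328, 294),
    "Fp2 squaring": (223, 213),
    "Fp12 multiplication": (7704, 6449),
    "Fp12 squaring": (5082, 4628),
    "Pairing": (2316448, 2113557),
}

for name, (before, after) in benches.items():
    print(f"{name:20s} {cycle_reduction_pct(before, after):5.1f}% fewer cycles")
```

Measured as throughput instead (ops/s), the pairing goes from 1295.107 to 1419.436 ops/s, which is where a figure closer to 10% comes from.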

Future perspectives

There are some low-hanging fruits that can help accelerate further:

  • Use double-precision in cubic extension fields. Right now it is only used for Fp->Fp2 and Fp2->Fp4, not for Fp4->Fp12, and the savings there might be even larger than in this PR.
  • Pairings need 3 inversions: one on G1/Fp, one on G2/Fp2 and one on GT/Fp12. Together they cost ~150000 cycles, or about 50 µs, which is 7% of the running time. The new (constant-time) inversion techniques from this summer can cut that by at least 5x, and the G1 and G2 inversions can be fused using Montgomery simultaneous inversion (see #49 "Simultaneous inversion").
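
As a reminder of how Montgomery simultaneous inversion works, here is a minimal Python sketch over plain integers modulo a prime. It is not constant-time and it is not the library's implementation; the function name and the toy modulus in the usage note are assumptions for illustration. The idea is to trade n inversions for one inversion plus about 3(n-1) multiplications.

```python
def batch_inverse(values, p):
    """Invert every element of `values` modulo prime p with ONE modular inversion.

    All inputs must be non-zero modulo p.
    """
    n = len(values)
    # prefix[i] = values[0] * ... * values[i-1] mod p, with prefix[0] = 1
    prefix = [1] * (n + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = prefix[i] * v % p

    # The single inversion: inverse of the product of all inputs (Python 3.8+)
    inv_acc = pow(prefix[n], -1, p)

    out = [0] * n
    for i in range(n - 1, -1, -1):
        # inv_acc == (values[0] * ... * values[i])^-1 at this point
        out[i] = inv_acc * prefix[i] % p
        inv_acc = inv_acc * values[i] % p
    return out
```

For example, `batch_inverse([3, 5, 7], 2**61 - 1)` returns three values whose products with 3, 5 and 7 are all 1 modulo the toy prime. Since an Fp2 inversion (with u² = -1) reduces to one Fp inversion of the norm a0² + a1², the G1 and G2 inversions can in principle share a single Fp inversion this way.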

mratsim commented Feb 9, 2021

testutils pulls in chronos, which pulls in BearSSL, and cloning the BearSSL submodule fails:

   Success: testutils installed successfully.
 Installing chronos@any version
Downloading https://github.com/status-im/nim-chronos using git
  Verifying dependencies for [email protected]
      Info: Dependency on stew@any version already satisfied
  Verifying dependencies for [email protected]
 Installing bearssl@any version
Downloading https://github.com/status-im/nim-bearssl using git
       Tip: 136 messages have been suppressed, use --verbose to show them.
     Error: Execution failed with exit code 1
        ... Command: git clone --recursive --shallow-submodules --depth 1 https://github.com/status-im/nim-bearssl /tmp/nimble_5788/githubcom_statusimnimbearssl
        ... Output: Cloning into '/tmp/nimble_5788/githubcom_statusimnimbearssl'...
        ... Submodule 'bearssl/csources' (https://www.bearssl.org/git/BearSSL) registered for path 'bearssl/csources'
        ... Cloning into '/tmp/nimble_5788/githubcom_statusimnimbearssl/bearssl/csources'...
        ... error: Server does not allow request for unadvertised object dda1f8a0c46e15b4a235163470ff700b2f13dcc5
        ... Fetched in submodule path 'bearssl/csources', but it did not contain dda1f8a0c46e15b4a235163470ff700b2f13dcc5. Direct fetching of that commit failed.

@mratsim mratsim merged commit 5806cc4 into master Feb 9, 2021
@mratsim mratsim deleted the tower-2x branch February 9, 2021 21:58
Successfully merging this pull request may close these issues.

BN254-Snarks: bad performance on Fp4 squaring