Assembly backend #69

Merged
merged 18 commits into from
Jul 24, 2020
3 changes: 3 additions & 0 deletions .travis.yml
Expand Up @@ -99,6 +99,9 @@ script:
- nimble refresh
- nimble install gmp stew
- nimble test_parallel
- if [[ "$ARCH" != "arm64" ]]; then
nimble test_parallel_no_assembler;
fi
branches:
except:
- gh-pages
123 changes: 75 additions & 48 deletions README.md
Expand Up @@ -19,12 +19,18 @@ You can install the developement version of the library through nimble with the
nimble install https://github.com/mratsim/constantine@#master
```

For speed it is recommended to prefer Clang, MSVC or ICC over GCC.
GCC does not properly optimize add-with-carry and sub-with-borrow loops (see [Compiler-caveats](#Compiler-caveats)).
For speed it is recommended to prefer Clang, MSVC or ICC over GCC (see [Compiler-caveats](#Compiler-caveats)).

Furthermore, if using GCC, GCC 7 at minimum is required;
previous versions generated incorrect add-with-carry code.

On x86-64, inline assembly is used to work around compilers having issues optimizing large integer arithmetic,
and also to ensure constant-time code.
This can be deactivated with `"-d:ConstantineASM=false"`:
- at a significant performance cost with GCC (~50% slower than Clang).
- at a missed opportunity on recent CPUs that support MULX/ADCX/ADOX instructions (~60% faster than Clang).
- There is a 2.4x performance ratio between plain GCC and GCC with inline assembly.
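
As a rough illustration of how such a switch can be consumed, here is a minimal Nim sketch using a `booldefine` constant. The name `UseX86ASM` appears in this PR's benchmark templates, but the condition used to compute it below is an assumption made for illustration, and `backendName` is a hypothetical helper, not part of the library.

```nim
# Hypothetical sketch: gate an assembly backend behind a compile-time flag.
# Compile with `nim c -d:ConstantineASM=false thisfile.nim` to switch it off.
const ConstantineASM {.booldefine.} = true

# Assumed condition: use inline assembly only on x86-64 and only if the flag is on.
const UseX86ASM = ConstantineASM and defined(amd64)

proc backendName(): string =
  ## Selected at compile time: the branch that is not taken is never compiled.
  when UseX86ASM:
    "x86-64 inline assembly"
  else:
    "portable fallback"

echo "Arithmetic backend: ", backendName()
```

Because the choice is made with `when`, a build with the flag disabled contains no assembly at all, which is what allows CI to test both configurations separately.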

## Target audience

The library aims to be a portable, compact and hardened library for elliptic curve cryptography needs, in particular for blockchain protocols and zero-knowledge proof systems.
Expand All @@ -39,10 +45,13 @@ in this order
## Curves supported

At the moment the following curves are supported, adding a new curve only requires adding the prime modulus
and its bitsize in [constantine/config/curves.nim](constantine/config/curves.nim).
and its bitsize in [constantine/config/curves.nim](constantine/config/curves_declaration.nim).

The following curves are configured:

> Note: At the moment, finite field arithmetic is fully supported
> but elliptic curve arithmetic is work-in-progress.

### ECDH / ECDSA curves

- NIST P-224
Expand All @@ -58,7 +67,8 @@ Families:
- FKM: Fotiadis-Konstantinou-Martindale

Curves:
- BN254 (Zero-Knowledge Proofs, Snarks, Starks, Zcash, Ethereum 1)
- BN254_Nogami
- BN254_Snarks (Zero-Knowledge Proofs, Snarks, Starks, Zcash, Ethereum 1)
- BLS12-377 (Zexe)
- BLS12-381 (Algorand, Chia Networks, Dfinity, Ethereum 2, Filecoin, Zcash Sapling)
- BN446
Expand Down Expand Up @@ -137,42 +147,65 @@ To measure the performance of Constantine

```bash
git clone https://github.com/mratsim/constantine
nimble bench_fp_clang
nimble bench_fp2_clang
nimble bench_fp # Using Assembly (+ GCC)
nimble bench_fp_clang # Using Clang only
nimble bench_fp_gcc # Using GCC only (very slow)
nimble bench_fp2
# ...
nimble bench_ec_g1
nimble bench_ec_g2
```

As mentioned in the [Compiler caveats](#compiler-caveats) section, GCC is up to 2x slower than Clang due to mishandling of carries and register usage.

On my machine, for selected benchmarks on the prime field for popular pairing-friendly curves.

```
⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

All benchmarks are using constant-time implementations to protect against side-channel attacks.

Compiled with Clang
Running on Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz (overclocked all-core Turbo @4.1GHz)

--------------------------------------------------------------------------------
Addition Fp[BN254] 0 ns 0 cycles
Substraction Fp[BN254] 0 ns 0 cycles
Negation Fp[BN254] 0 ns 0 cycles
Multiplication Fp[BN254] 21 ns 65 cycles
Squaring Fp[BN254] 18 ns 55 cycles
Inversion Fp[BN254] 6266 ns 18799 cycles
--------------------------------------------------------------------------------
Addition Fp[BLS12_381] 0 ns 0 cycles
Substraction Fp[BLS12_381] 0 ns 0 cycles
Negation Fp[BLS12_381] 0 ns 0 cycles
Multiplication Fp[BLS12_381] 45 ns 136 cycles
Squaring Fp[BLS12_381] 39 ns 118 cycles
Inversion Fp[BLS12_381] 15683 ns 47050 cycles
--------------------------------------------------------------------------------

Compiled with GCC
Optimization level =>
no optimization: false
release: true
danger: true
inline assembly: true
Using Constantine with 64-bit limbs
Running on Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz

⚠️ Cycles measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
i.e. a 20% overclock will be about 20% off (assuming no dynamic frequency scaling)

=================================================================================================================

-------------------------------------------------------------------------------------------------------------------------------------------------
Addition Fp[BN254_Snarks] 333333333.333 ops/s 3 ns/op 9 CPU cycles (approx)
Substraction Fp[BN254_Snarks] 500000000.000 ops/s 2 ns/op 8 CPU cycles (approx)
Negation Fp[BN254_Snarks] 1000000000.000 ops/s 1 ns/op 3 CPU cycles (approx)
Multiplication Fp[BN254_Snarks] 71428571.429 ops/s 14 ns/op 44 CPU cycles (approx)
Squaring Fp[BN254_Snarks] 71428571.429 ops/s 14 ns/op 44 CPU cycles (approx)
Inversion (constant-time Euclid) Fp[BN254_Snarks] 122579.063 ops/s 8158 ns/op 24474 CPU cycles (approx)
Inversion via exponentiation p-2 (Little Fermat) Fp[BN254_Snarks] 153822.489 ops/s 6501 ns/op 19504 CPU cycles (approx)
Square Root + square check (constant-time) Fp[BN254_Snarks] 153491.942 ops/s 6515 ns/op 19545 CPU cycles (approx)
Exp curve order (constant-time) - 254-bit Fp[BN254_Snarks] 104580.632 ops/s 9562 ns/op 28687 CPU cycles (approx)
Exp curve order (Leak exponent bits) - 254-bit Fp[BN254_Snarks] 153798.831 ops/s 6502 ns/op 19506 CPU cycles (approx)
-------------------------------------------------------------------------------------------------------------------------------------------------
Addition Fp[BLS12_381] 250000000.000 ops/s 4 ns/op 14 CPU cycles (approx)
Substraction Fp[BLS12_381] 250000000.000 ops/s 4 ns/op 13 CPU cycles (approx)
Negation Fp[BLS12_381] 1000000000.000 ops/s 1 ns/op 4 CPU cycles (approx)
Multiplication Fp[BLS12_381] 35714285.714 ops/s 28 ns/op 84 CPU cycles (approx)
Squaring Fp[BLS12_381] 35714285.714 ops/s 28 ns/op 85 CPU cycles (approx)
Inversion (constant-time Euclid) Fp[BLS12_381] 43763.676 ops/s 22850 ns/op 68552 CPU cycles (approx)
Inversion via exponentiation p-2 (Little Fermat) Fp[BLS12_381] 63983.620 ops/s 15629 ns/op 46889 CPU cycles (approx)
Square Root + square check (constant-time) Fp[BLS12_381] 63856.960 ops/s 15660 ns/op 46982 CPU cycles (approx)
Exp curve order (constant-time) - 255-bit Fp[BLS12_381] 68535.399 ops/s 14591 ns/op 43775 CPU cycles (approx)
Exp curve order (Leak exponent bits) - 255-bit Fp[BLS12_381] 93222.709 ops/s 10727 ns/op 32181 CPU cycles (approx)
-------------------------------------------------------------------------------------------------------------------------------------------------
Notes:
GCC is significantly slower than Clang on multiprecision arithmetic.
The simplest operations might be optimized away by the compiler.
- Compilers:
Compilers are severely limited on multiprecision arithmetic.
Inline Assembly is used by default (nimble bench_fp).
Bench without assembly can use "nimble bench_fp_gcc" or "nimble bench_fp_clang".
GCC is significantly slower than Clang on multiprecision arithmetic due to catastrophic handling of carries.
- The simplest operations might be optimized away by the compiler.
- Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)
```
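
The columns of the report above are related by simple arithmetic, which helps when reading the warning about the nominal clock. The following Nim sketch assumes the 3.00 GHz nominal frequency shown in the report header; `opsPerSec` and `approxCycles` are hypothetical helper names, not procedures from the benchmark templates, and the benchmarks themselves measure cycles separately rather than deriving them from nanoseconds.

```nim
import std/math

# Assumption: nominal clock of the benchmark machine, taken from the report header.
const nominalGHz = 3.0

func opsPerSec(nsPerOp: float): float =
  ## Throughput is the inverse of the latency per operation.
  1e9 / nsPerOp

func approxCycles(nsPerOp: float): int =
  ## Nanoseconds times cycles-per-nanosecond at the nominal clock.
  int(round(nsPerOp * nominalGHz))

# Constant-time Euclid inversion on Fp[BN254_Snarks]: 8158 ns/op in the report.
echo opsPerSec(8158.0)
echo approxCycles(8158.0)
```

For short operations the reported nanoseconds are rounded to an integer, so recomputing cycles from them gives only a coarse estimate (14 ns at 3 GHz gives 42, while the report shows 44).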

### Compiler caveats
Expand Down Expand Up @@ -234,25 +267,15 @@ add256:
retq
```

### Inline assembly

Constantine uses inline assembly for a very restricted use-case: "conditional mov",
and a temporary use-case "hardware 128-bit division" that will be replaced ASAP (as hardware division is not constant-time).

Using intrinsics otherwise significantly improve code readability, portability, auditability and maintainability.
As a workaround key procedures use inline assembly.

#### Future optimizations
### Inline assembly

In the future more inline assembly primitives might be added provided the performance benefit outvalues the significant complexity.
In particular, multiprecision multiplication and squaring on x86 can use the instructions MULX, ADCX and ADOX
to multiply-accumulate on 2 carry chains in parallel (with instruction-level parallelism)
and improve performance by 15~20% over an uint128-based implementation.
As no compiler is able to generate such code even when using the `_mulx_u64` and `_addcarryx_u64` intrinsics,
either the assembly for each supported bigint size must be hardcoded
or a "compiler" must be implemented in macros that will generate the required inline assembly at compile-time.
While using intrinsics significantly improves code readability, portability, auditability and maintainability,
Constantine uses inline assembly on x86-64 to ensure performance portability despite poor optimization (for GCC)
and also to use the dedicated large integer instructions MULX, ADCX and ADOX that compilers cannot generate.

Such a compiler can also be used to overcome GCC codegen deficiencies, here is an example for add-with-carry:
https://github.com/mratsim/finite-fields/blob/d7f6d8bb/macro_add_carry.nim
The speed improvement on finite field arithmetic is up to 60% with MULX, ADCX, ADOX on BLS12-381 (6 limbs).
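
For context, the kind of carry-propagating loop that compilers struggle to turn into a clean chain of add-with-carry instructions looks roughly like the sketch below. This is a portable illustration only: `Limbs` and `addLimbs` are hypothetical names, it is not the library's implementation, and nothing here guarantees that a compiler will emit branch-free, constant-time code for the comparisons.

```nim
type Limbs = array[4, uint64]  # 256-bit integer, least-significant limb first

func addLimbs(r: var Limbs, a, b: Limbs): uint64 =
  ## Portable multi-limb addition; returns the final carry (0 or 1).
  var carry = 0'u64
  for i in 0 ..< a.len:
    let t = a[i] + b[i]                # unsigned addition wraps modulo 2^64
    let c1 = uint64(ord(t < a[i]))     # carry out of a[i] + b[i]
    r[i] = t + carry
    let c2 = uint64(ord(r[i] < t))     # carry out of adding the incoming carry
    carry = c1 or c2                   # at most one of the two can be set
  carry

var r: Limbs
let a: Limbs = [high(uint64), high(uint64), 0'u64, 0'u64]
let b: Limbs = [1'u64, 0'u64, 0'u64, 0'u64]
let carry = addLimbs(r, a, b)
echo r, " carry=", carry
```

In hand-written assembly each iteration is one instruction (ADD for the first limb, then ADC), with the carry kept in the flags register instead of being rebuilt from comparisons.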

## Sizes: code size, stack usage

Expand Down Expand Up @@ -286,3 +309,7 @@ or
* Apache License, Version 2.0, ([LICENSE-APACHEv2](LICENSE-APACHEv2) or http://www.apache.org/licenses/LICENSE-2.0)

at your option. This file may not be copied, modified, or distributed except according to those terms.

This library has **no external dependencies**.
In particular GMP is used only for testing and differential fuzzing
and is not linked in the library.
11 changes: 9 additions & 2 deletions azure-pipelines.yml
Expand Up @@ -186,12 +186,19 @@ steps:
echo "PATH=${PATH}"
export ucpu=${UCPU}
nimble test_parallel
displayName: 'Testing the package (including GMP)'
displayName: 'Testing Constantine with Assembler and with GMP'
condition: ne(variables['Agent.OS'], 'Windows_NT')

- bash: |
echo "PATH=${PATH}"
export ucpu=${UCPU}
nimble test_parallel_no_assembler
displayName: 'Testing Constantine without Assembler and with GMP'
condition: ne(variables['Agent.OS'], 'Windows_NT')

- bash: |
echo "PATH=${PATH}"
export ucpu=${UCPU}
nimble test_no_gmp
displayName: 'Testing the package (without GMP)'
displayName: 'Testing the package (without Assembler or GMP)'
condition: eq(variables['Agent.OS'], 'Windows_NT')
6 changes: 1 addition & 5 deletions benchmarks/bench_ec_g1.nim
Expand Up @@ -64,8 +64,4 @@ proc main() =
separator()

main()

echo "\nNotes:"
echo " - GCC is significantly slower than Clang on multiprecision arithmetic."
echo " - The simplest operations might be optimized away by the compiler."
echo " - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)"
notes()
6 changes: 1 addition & 5 deletions benchmarks/bench_ec_g2.nim
Expand Up @@ -65,8 +65,4 @@ proc main() =
separator()

main()

echo "\nNotes:"
echo " - GCC is significantly slower than Clang on multiprecision arithmetic."
echo " - The simplest operations might be optimized away by the compiler."
echo " - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)"
notes()
18 changes: 16 additions & 2 deletions benchmarks/bench_elliptic_template.nim
Expand Up @@ -14,7 +14,7 @@

import
# Internals
../constantine/config/curves,
../constantine/config/[curves, common],
../constantine/arithmetic,
../constantine/io/io_bigints,
../constantine/elliptic/[ec_weierstrass_projective, ec_scalar_mul, ec_endomorphism_accel],
Expand Down Expand Up @@ -57,7 +57,11 @@ elif defined(icc):
else:
echo "\nCompiled with an unknown compiler"

echo "Optimization level => no optimization: ", not defined(release), " | release: ", defined(release), " | danger: ", defined(danger)
echo "Optimization level => "
echo " no optimization: ", not defined(release)
echo " release: ", defined(release)
echo " danger: ", defined(danger)
echo " inline assembly: ", UseX86ASM

when (sizeof(int) == 4) or defined(Constantine32):
echo "⚠️ Warning: using Constantine with 32-bit limbs"
Expand All @@ -84,6 +88,16 @@ proc report(op, elliptic: string, start, stop: MonoTime, startClk, stopClk: int6
else:
echo &"{op:<60} {elliptic:<40} {throughput:>15.3f} ops/s {ns:>9} ns/op"

proc notes*() =
echo "Notes:"
echo " - Compilers:"
echo " Compilers are severely limited on multiprecision arithmetic."
echo " Inline Assembly is used by default (nimble bench_fp)."
echo " Bench without assembly can use \"nimble bench_fp_gcc\" or \"nimble bench_fp_clang\"."
echo " GCC is significantly slower than Clang on multiprecision arithmetic due to catastrophic handling of carries."
echo " - The simplest operations might be optimized away by the compiler."
echo " - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)"

macro fixEllipticDisplay(T: typedesc): untyped =
# At compile-time, enums are integers and their display is buggy
# we get the Curve ID instead of the curve name.
18 changes: 16 additions & 2 deletions benchmarks/bench_fields_template.nim
Expand Up @@ -14,7 +14,7 @@

import
# Internals
../constantine/config/curves,
../constantine/config/[curves, common],
../constantine/arithmetic,
../constantine/towers,
# Helpers
Expand Down Expand Up @@ -54,7 +54,11 @@ elif defined(icc):
else:
echo "\nCompiled with an unknown compiler"

echo "Optimization level => no optimization: ", not defined(release), " | release: ", defined(release), " | danger: ", defined(danger)
echo "Optimization level => "
echo " no optimization: ", not defined(release)
echo " release: ", defined(release)
echo " danger: ", defined(danger)
echo " inline assembly: ", UseX86ASM

when (sizeof(int) == 4) or defined(Constantine32):
echo "⚠️ Warning: using Constantine with 32-bit limbs"
Expand All @@ -81,6 +85,16 @@ proc report(op, field: string, start, stop: MonoTime, startClk, stopClk: int64,
else:
echo &"{op:<50} {field:<18} {throughput:>15.3f} ops/s {ns:>9} ns/op"

proc notes*() =
echo "Notes:"
echo " - Compilers:"
echo " Compilers are severely limited on multiprecision arithmetic."
echo " Inline Assembly is used by default (nimble bench_fp)."
echo " Bench without assembly can use \"nimble bench_fp_gcc\" or \"nimble bench_fp_clang\"."
echo " GCC is significantly slower than Clang on multiprecision arithmetic due to catastrophic handling of carries."
echo " - The simplest operations might be optimized away by the compiler."
echo " - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)"

macro fixFieldDisplay(T: typedesc): untyped =
# At compile-time, enums are integers and their display is buggy
# we get the Curve ID instead of the curve name.
6 changes: 1 addition & 5 deletions benchmarks/bench_fp.nim
Expand Up @@ -59,8 +59,4 @@ proc main() =
separator()

main()

echo "Notes:"
echo " - GCC is significantly slower than Clang on multiprecision arithmetic."
echo " - The simplest operations might be optimized away by the compiler."
echo " - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)"
notes()
6 changes: 1 addition & 5 deletions benchmarks/bench_fp12.nim
Expand Up @@ -50,8 +50,4 @@ proc main() =
separator()

main()

echo "Notes:"
echo " - GCC is significantly slower than Clang on multiprecision arithmetic."
echo " - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)"
echo " - The tower of extension fields chosen can lead to a large difference of performance between primes of similar bitwidth."
notes()
6 changes: 1 addition & 5 deletions benchmarks/bench_fp2.nim
Expand Up @@ -51,8 +51,4 @@ proc main() =
separator()

main()

echo "Notes:"
echo " - GCC is significantly slower than Clang on multiprecision arithmetic."
echo " - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)"
echo " - The tower of extension fields chosen can lead to a large difference of performance between primes of similar bitwidth."
notes()
6 changes: 1 addition & 5 deletions benchmarks/bench_fp6.nim
Expand Up @@ -50,8 +50,4 @@ proc main() =
separator()

main()

echo "Notes:"
echo " - GCC is significantly slower than Clang on multiprecision arithmetic."
echo " - Fast Squaring and Fast Multiplication are possible if there are spare bits in the prime representation (i.e. the prime uses 254 bits out of 256 bits)"
echo " - The tower of extension fields chosen can lead to a large difference of performance between primes of similar bitwidth."
notes()