Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve codegen for carry/borrow #10

Closed
mratsim opened this issue Mar 29, 2018 · 3 comments
Closed

Improve codegen for carry/borrow #10

mratsim opened this issue Mar 29, 2018 · 3 comments

Comments

@mratsim
Copy link
Contributor

mratsim commented Mar 29, 2018

The current code add/sub/mul can be tuned to help GCC/Clang significantly improve the assembly generated.

Next commit will bring the ADD code from this:

2018-03-30_01-28-37

to this

2018-03-30_01-27-05

This is unrolled, add is just add + adc now instead of leaq, addq, cmpq, adcq

Notice how the leaq (corresponding to the "tmp" and the cmpq were removed.

Multiplications deserve the same.

Test case:

import ../src/mpint, times

let a = 42.initMpUint(128)

let b = 1000.initMpUint(128)

var c = 0.initMpUint(128)

proc carry(foo: var MpUint[128], a: MpUint[128]) =
  for _ in 0 ..< 1_000_000_000:
    foo += a


proc borrow(foo: var MpUint[128], b: MpUint[128]) =
  for _ in 0 ..< 1_000_000_000:
    foo -= a


var start = cpuTime()
carry(c, a)
var stop = cpuTime()

echo "Carry: " & $(stop - start) & "s"

start = cpuTime()
borrow(c, b)
stop = cpuTime()

echo "Borrow: " & $(stop - start) & "s"

echo c
echo (c - 1.initMpUint(128))
@mratsim
Copy link
Contributor Author

mratsim commented Mar 30, 2018

Todo: comprensive test suite.

Due to 2-complement the borrow test for x -= y should probably be x <= not y

i.e. with a 3-bit 0..8: a - b = a + (-b) +1 => 1 - 7 = 1 + 0 + 1 = 2

Small case with uint8

let a = 1'u8
let b = 255'u8


echo "a - b: " & $(a - b)
let c = not b

echo "c: " & $c
echo "borrow: " & $(c <= not b)

echo "################"

let x = 127'u8
let y = 128'u8

echo "x - y: " & $(x - y)
let z = not y

echo "z: " & $z
echo "borrow: " & $(z <= not y)

mratsim added a commit that referenced this issue Mar 30, 2018
@mratsim
Copy link
Contributor Author

mratsim commented Mar 30, 2018

Benchmarking, the new substraction code seems consistently 10% slower than Addition.

Also Clang manage the carry with xorl, cmpq, setbe, subq instead of just sbb.

10% less perf for truly portable code seems perfectly acceptable (provided the new version passes the future comprehensive test suite).

Previous ASM with a tmp. Notice how much time is spent on leaq (loading the value in register)

2018-03-30_14-26-08

New version is spending time doing actual computation, xor, cmpq, setbe have negligible cost.

2018-03-30_15-07-37

Furthermore, compiling with march=native brings more than 100% speed increase on both addition and substraction as GCC is vectorizing the loop and there are no vectorized ADC/SBB so we should not go down to assembly to leave the opportunity to vectorize code to the compiler (AVX2 on x86-64 or Neon on ARM)

Addition

2018-03-30_15-18-39

Substraction

2018-03-30_15-18-02

@mratsim
Copy link
Contributor Author

mratsim commented Apr 20, 2018

Latest change from the code.

Previously:

proc `-=`*(x: var MpUintImpl, y: MpUintImpl) {.noSideEffect, inline.}=
  type SubTy = type x.lo
  let original = x.lo
  x.lo -= y.lo
  x.hi -= (original < x.lo).toSubtype(SubTy) + y.hi

x86_64:
2018-04-20_11-46-23

AVX2:
2018-04-20_11-53-21

New:

proc `-=`*(x: var MpUintImpl, y: MpUintImpl) {.noSideEffect, inline.}=
  type SubTy = type x.lo
  x.hi -= (x.lo < y.lo).toSubtype(SubTy) + y.hi
  x.lo -= y.lo

x86_64:
2018-04-20_11-49-48

AVX2:
2018-04-20_11-51-31

Notice how the generated AVX 2 code is shorter.

For reference instruction altency of LEAQ vs ADDQ: http://www.agner.org/optimize/instruction_tables.pdf

2018-04-20_11-48-40
2018-04-20_11-49-00

mratsim added a commit that referenced this issue Apr 21, 2018
* Clean-up code, part 1

* Managed to get best borrow code for the not inlined substraction #10

* Implement in place substraction in terms of substraction #10

* Another unneed proc removal/temporary step

* more cleanup

* Upgrade benchmark to Uint256

* Special case when divisor is less than halfSize x2 speed 🔥 (still 4x slower than ttmath on Uint256)

* Division: special case if dividend can overflow. 10% improvement.

* forgot to undo normalization (why did the test pass :??)

* 1st part, special cases of fast division

* Change bitops, simplify bithacks to detect new fast division cases

* 25% speed increase. Within 3x of ttmath

* Reimplement multiplication with minimum allocation

* Fix call. Now only 2x slower than ttmath

* Prepare for optimizing comparison operators

* Comparison inlining and optimization. 25% speed increase. 50% slower than ttmath now 🔥

* Fix comparison, optimize one()

* inline initMpUintImpl for another 20% speed. Only 20% slower than ttmath without ASM
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant