Improve codegen for carry/borrow #10

mratsim · 2018-03-29T23:39:56Z

The current code add/sub/mul can be tuned to help GCC/Clang significantly improve the assembly generated.

Next commit will bring the ADD code from this:

to this

This is unrolled, add is just add + adc now instead of leaq, addq, cmpq, adcq

Notice how the leaq (corresponding to the "tmp" and the cmpq were removed.

Multiplications deserve the same.

Test case:

import ../src/mpint, times

let a = 42.initMpUint(128)

let b = 1000.initMpUint(128)

var c = 0.initMpUint(128)

proc carry(foo: var MpUint[128], a: MpUint[128]) =
  for _ in 0 ..< 1_000_000_000:
    foo += a


proc borrow(foo: var MpUint[128], b: MpUint[128]) =
  for _ in 0 ..< 1_000_000_000:
    foo -= a


var start = cpuTime()
carry(c, a)
var stop = cpuTime()

echo "Carry: " & $(stop - start) & "s"

start = cpuTime()
borrow(c, b)
stop = cpuTime()

echo "Borrow: " & $(stop - start) & "s"

echo c
echo (c - 1.initMpUint(128))

The text was updated successfully, but these errors were encountered:

mratsim · 2018-03-30T12:23:17Z

Todo: comprensive test suite.

Due to 2-complement the borrow test for x -= y should probably be x <= not y

i.e. with a 3-bit 0..8: a - b = a + (-b) +1 => 1 - 7 = 1 + 0 + 1 = 2

Small case with uint8

let a = 1'u8
let b = 255'u8


echo "a - b: " & $(a - b)
let c = not b

echo "c: " & $c
echo "borrow: " & $(c <= not b)

echo "################"

let x = 127'u8
let y = 128'u8

echo "x - y: " & $(x - y)
let z = not y

echo "z: " & $z
echo "borrow: " & $(z <= not y)

mratsim · 2018-03-30T13:19:52Z

Benchmarking, the new substraction code seems consistently 10% slower than Addition.

Also Clang manage the carry with xorl, cmpq, setbe, subq instead of just sbb.

10% less perf for truly portable code seems perfectly acceptable (provided the new version passes the future comprehensive test suite).

Previous ASM with a tmp. Notice how much time is spent on leaq (loading the value in register)

New version is spending time doing actual computation, xor, cmpq, setbe have negligible cost.

Furthermore, compiling with march=native brings more than 100% speed increase on both addition and substraction as GCC is vectorizing the loop and there are no vectorized ADC/SBB so we should not go down to assembly to leave the opportunity to vectorize code to the compiler (AVX2 on x86-64 or Neon on ARM)

Addition

Substraction

mratsim · 2018-04-20T09:59:51Z

Latest change from the code.

Previously:

proc `-=`*(x: var MpUintImpl, y: MpUintImpl) {.noSideEffect, inline.}=
  type SubTy = type x.lo
  let original = x.lo
  x.lo -= y.lo
  x.hi -= (original < x.lo).toSubtype(SubTy) + y.hi

x86_64:

AVX2:

New:

proc `-=`*(x: var MpUintImpl, y: MpUintImpl) {.noSideEffect, inline.}=
  type SubTy = type x.lo
  x.hi -= (x.lo < y.lo).toSubtype(SubTy) + y.hi
  x.lo -= y.lo

x86_64:

AVX2:

Notice how the generated AVX 2 code is shorter.

For reference instruction altency of LEAQ vs ADDQ: http://www.agner.org/optimize/instruction_tables.pdf

* Clean-up code, part 1 * Managed to get best borrow code for the not inlined substraction #10 * Implement in place substraction in terms of substraction #10 * Another unneed proc removal/temporary step * more cleanup * Upgrade benchmark to Uint256 * Special case when divisor is less than halfSize x2 speed 🔥 (still 4x slower than ttmath on Uint256) * Division: special case if dividend can overflow. 10% improvement. * forgot to undo normalization (why did the test pass :??) * 1st part, special cases of fast division * Change bitops, simplify bithacks to detect new fast division cases * 25% speed increase. Within 3x of ttmath * Reimplement multiplication with minimum allocation * Fix call. Now only 2x slower than ttmath * Prepare for optimizing comparison operators * Comparison inlining and optimization. 25% speed increase. 50% slower than ttmath now 🔥 * Fix comparison, optimize one() * inline initMpUintImpl for another 20% speed. Only 20% slower than ttmath without ASM

mratsim added the Performance label Mar 29, 2018

mratsim added a commit that referenced this issue Mar 29, 2018

Significantly improve carry/borrow assembly generation #10

6c21b54

mratsim added a commit that referenced this issue Mar 30, 2018

Fix borrow border case: see #10

55dd38c

mratsim closed this as completed in 8a06bb2 Apr 20, 2018

mratsim added a commit that referenced this issue Apr 20, 2018

Managed to get best borrow code for the not inlined substraction #10

fab45af

mratsim added a commit that referenced this issue Apr 20, 2018

Implement in place substraction in terms of substraction #10

537f1da

mratsim mentioned this issue Sep 25, 2019

add/mul bad codegen for uint256+ #87

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve codegen for carry/borrow #10

Improve codegen for carry/borrow #10

mratsim commented Mar 29, 2018 •

edited

Loading

mratsim commented Mar 30, 2018 •

edited

Loading

mratsim commented Mar 30, 2018 •

edited

Loading

mratsim commented Apr 20, 2018 •

edited

Loading

Improve codegen for carry/borrow #10

Improve codegen for carry/borrow #10

Comments

mratsim commented Mar 29, 2018 • edited Loading

mratsim commented Mar 30, 2018 • edited Loading

mratsim commented Mar 30, 2018 • edited Loading

mratsim commented Apr 20, 2018 • edited Loading

mratsim commented Mar 29, 2018 •

edited

Loading

mratsim commented Mar 30, 2018 •

edited

Loading

mratsim commented Mar 30, 2018 •

edited

Loading

mratsim commented Apr 20, 2018 •

edited

Loading