Improve codegen for carry/borrow #10
TODO: comprehensive test suite. Due to two's complement, the borrow logic can be reasoned about on a small domain, e.g. 3 bits (range 0..7). Small case with uint8:

```nim
let a = 1'u8
let b = 255'u8
echo "a - b: " & $(a - b)
let c = not b
echo "c: " & $c
echo "borrow: " & $(a < b)   # borrow occurred iff minuend < subtrahend
echo "################"
let x = 127'u8
let y = 128'u8
echo "x - y: " & $(x - y)
let z = not y
echo "z: " & $z
echo "borrow: " & $(x < y)
```
Benchmarking: the new subtraction code seems consistently 10% slower than addition. Also, Clang manages the carry with xorl, cmpq, setbe, subq instead of just sbb. 10% less performance for truly portable code seems perfectly acceptable (provided the new version passes the future comprehensive test suite).

The previous ASM used a tmp; notice in the profile how much time is spent on leaq (loading the value into a register). The new version spends its time doing actual computation: xorl, cmpq and setbe have negligible cost.

Furthermore, compiling with -march=native brings a more than 100% speed increase on both addition and subtraction, as GCC vectorizes the loop. Since there is no vectorized ADC/SBB, we should not drop down to assembly; we should leave the compiler the opportunity to vectorize the code (AVX2 on x86-64 or NEON on ARM).

(Benchmark screenshots: Addition, Subtraction)
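For reference, one way to pass that flag through the Nim compiler (the benchmark file name here is hypothetical):

```
nim c -d:release --passC:"-march=native" benchmark.nim
```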
Latest change to the code. Previously:

```nim
proc `-=`*(x: var MpUintImpl, y: MpUintImpl) {.noSideEffect, inline.}=
  type SubTy = type x.lo
  let original = x.lo
  x.lo -= y.lo
  x.hi -= (original < x.lo).toSubtype(SubTy) + y.hi
```

New:

```nim
proc `-=`*(x: var MpUintImpl, y: MpUintImpl) {.noSideEffect, inline.}=
  type SubTy = type x.lo
  x.hi -= (x.lo < y.lo).toSubtype(SubTy) + y.hi
  x.lo -= y.lo
```

Notice how the generated AVX2 code is shorter. For reference, instruction latency of LEAQ vs ADDQ: http://www.agner.org/optimize/instruction_tables.pdf
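For illustration, a minimal self-contained version of the new scheme (a sketch: the two-limb toy type and the bool-to-limb conversion standing in for toSubtype are assumptions, not the library's actual definitions):

```nim
# Hypothetical two-limb toy type; the real MpUintImpl is generic
# over the limb width.
type MpUintToy = object
  hi, lo: uint8

proc `-=`(x: var MpUintToy, y: MpUintToy) =
  # Compute the borrow *before* mutating x.lo, so no temporary is
  # needed; this is what lets the compiler emit sub + sbb.
  x.hi -= uint8(ord(x.lo < y.lo)) + y.hi
  x.lo -= y.lo

var a = MpUintToy(hi: 1, lo: 0)      # represents 256
a -= MpUintToy(hi: 0, lo: 1)         # subtract 1
doAssert a.hi == 0 and a.lo == 255   # 256 - 1 == 255
```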
* Clean-up code, part 1
* Managed to get the best borrow code for the not-inlined subtraction #10
* Implement in-place subtraction in terms of subtraction #10
* Another unneeded proc removal/temporary step
* More cleanup
* Upgrade benchmark to Uint256
* Special case when divisor is less than halfSize: x2 speed 🔥 (still 4x slower than ttmath on Uint256)
* Division: special case if dividend can overflow. 10% improvement.
* Forgot to undo normalization (why did the test pass?)
* 1st part, special cases of fast division
* Change bitops, simplify bithacks to detect new fast division cases
* 25% speed increase. Within 3x of ttmath
* Reimplement multiplication with minimum allocation
* Fix call. Now only 2x slower than ttmath
* Prepare for optimizing comparison operators
* Comparison inlining and optimization. 25% speed increase. 50% slower than ttmath now 🔥
* Fix comparison, optimize one()
* Inline initMpUintImpl for another 20% speed. Only 20% slower than ttmath without ASM
The current add/sub/mul code can be tuned to help GCC/Clang generate significantly better assembly.
The next commit will bring the ADD code from this:

(assembly screenshot: before)

to this:

(assembly screenshot: after)

The new code is unrolled; add is now just add + adc instead of leaq, addq, cmpq, adcq.
Notice how the `leaq` (corresponding to the "tmp") and the `cmpq` were removed. Multiplication deserves the same treatment.
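For context, a sketch of the matching addition pattern (again using the hypothetical two-limb toy type; the real MpUintImpl code is analogous). Computing the carry from the freshly updated low limb, with no temporary, is what lets the compiler emit the plain add + adc pair mentioned above:

```nim
type MpUintToy = object
  hi, lo: uint8

proc `+=`(x: var MpUintToy, y: MpUintToy) =
  x.lo += y.lo
  # After the wrapping addition, a carry occurred iff x.lo < y.lo.
  # No "tmp" is involved, so there is no leaq/cmpq in the output.
  x.hi += uint8(ord(x.lo < y.lo)) + y.hi

var a = MpUintToy(hi: 0, lo: 255)   # represents 255
a += MpUintToy(hi: 0, lo: 1)        # add 1
doAssert a.hi == 1 and a.lo == 0    # == 256
```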
Test case: