In the language of https://ieeexplore.ieee.org/document/502403, Mbed TLS' Montgomery multiplication implements "Coarsely Integrated Operand Scanning" (CIOS): in each step of the Montgomery multiplication loop, we perform two separate big x small multiplications to compute first

`T -> T + u0*B`

and then

`T -> T + u0*B + u1*N`

where

`u1 = (T + u0*B)_0 * Pinv_0`

See https://github.com/ARMmbed/mbedtls/blob/e44d8e7eea886a472684bb830295a0ac1c283007/library/bignum.c#L1931-L1938

Those loops should be merged to compute `T + u0*B + u1*N` in a single loop, introducing a basic primitive `mpi_mul_dbl_hlp()` (or similar) which performs two multiply-accumulates. This results in what the above paper calls "Finely Integrated Operand Scanning" (FIOS).

The impact is particularly large for M-Profile CPUs which implement the DSP extension and thus feature the UMAAL instruction. With this instruction, the bottleneck of the entire Montgomery multiplication is memory access. We have already reported one optimization, #5360, which reduces the cycle count for a 2048-bit Montgomery multiplication from 37kC to 25kC by pairing loads and stores. Moving from CIOS to FIOS will further reduce the cycle count to around 20kC, since a load/store pair for the accumulator `T` is removed.
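To make the difference concrete, here is a minimal portable C sketch of both loop structures. It is an illustration under assumptions, not the Mbed TLS code: the names `mont_mul_cios`, `mont_mul_fios`, `mont_neg_inv` and `mont_reduce_once` are hypothetical, limbs are assumed to be 32 bits wide, `B < N` and odd `N` are assumed, and the final subtraction is not constant-time. The body of the inner loop in `mont_mul_fios` is roughly what a primitive like `mpi_mul_dbl_hlp()` would do; its two carry chains mirror the shape of two UMAAL instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LIMBS 64 /* assumption: up to 2048 bits with 32-bit limbs */

/* Returns -n0^{-1} mod 2^32 for odd n0, via Newton iteration. */
static uint32_t mont_neg_inv(uint32_t n0)
{
    uint32_t x = n0;                 /* correct to 3 bits: n0*n0 == 1 mod 8 */
    for (int i = 0; i < 5; i++)
        x *= 2u - n0 * x;            /* each step doubles the correct bits */
    return 0u - x;
}

/* out = T - N if T >= N, else out = T. T has n+1 limbs. NOT constant-time. */
static void mont_reduce_once(uint32_t *out, const uint32_t *T,
                             const uint32_t *N, size_t n)
{
    int ge = (T[n] != 0);
    if (!ge) {
        ge = 1;                      /* T == N also needs the subtraction */
        for (size_t j = n; j-- > 0;)
            if (T[j] != N[j]) { ge = T[j] > N[j]; break; }
    }
    uint64_t borrow = 0;
    for (size_t j = 0; j < n; j++) {
        uint64_t d = (uint64_t)T[j] - (ge ? N[j] : 0) - borrow;
        out[j] = (uint32_t)d;
        borrow = (d >> 63) & 1;
    }
}

/* CIOS: two passes over T per outer step, so T is loaded and stored twice. */
static void mont_mul_cios(uint32_t *out, const uint32_t *A, const uint32_t *B,
                          const uint32_t *N, size_t n)
{
    assert(n >= 1 && n <= MAX_LIMBS);
    uint32_t T[MAX_LIMBS + 2] = {0};
    uint32_t mm = mont_neg_inv(N[0]);
    for (size_t i = 0; i < n; i++) {
        uint32_t u0 = A[i];
        uint64_t c = 0, x;
        for (size_t j = 0; j < n; j++) {         /* pass 1: T += u0*B */
            x = (uint64_t)u0 * B[j] + T[j] + c;
            T[j] = (uint32_t)x;
            c = x >> 32;
        }
        x = (uint64_t)T[n] + c;
        T[n] = (uint32_t)x;
        T[n + 1] = (uint32_t)(x >> 32);
        uint32_t u1 = T[0] * mm;
        c = ((uint64_t)u1 * N[0] + T[0]) >> 32;  /* low limb is zero */
        for (size_t j = 1; j < n; j++) {         /* pass 2: T = (T + u1*N) >> 32 */
            x = (uint64_t)u1 * N[j] + T[j] + c;
            T[j - 1] = (uint32_t)x;
            c = x >> 32;
        }
        x = (uint64_t)T[n] + c;
        T[n - 1] = (uint32_t)x;
        T[n] = T[n + 1] + (uint32_t)(x >> 32);
    }
    mont_reduce_once(out, T, N, n);
}

/* FIOS: one pass per outer step, so each limb of T is loaded and stored once. */
static void mont_mul_fios(uint32_t *out, const uint32_t *A, const uint32_t *B,
                          const uint32_t *N, size_t n)
{
    assert(n >= 1 && n <= MAX_LIMBS);
    uint32_t T[MAX_LIMBS + 1] = {0};
    uint32_t mm = mont_neg_inv(N[0]);
    for (size_t i = 0; i < n; i++) {
        uint32_t u0 = A[i];
        uint32_t u1 = (T[0] + u0 * B[0]) * mm;
        uint64_t x = (uint64_t)u0 * B[0] + T[0];
        uint64_t y = (uint64_t)u1 * N[0] + (uint32_t)x;  /* low limb is zero */
        uint64_t c0 = x >> 32, c1 = y >> 32;             /* two carry chains */
        for (size_t j = 1; j < n; j++) {
            x = (uint64_t)u0 * B[j] + T[j] + c0;         /* multiply-accumulate 1 */
            y = (uint64_t)u1 * N[j] + (uint32_t)x + c1;  /* multiply-accumulate 2 */
            c0 = x >> 32;
            c1 = y >> 32;
            T[j - 1] = (uint32_t)y;
        }
        x = (uint64_t)T[n] + c0 + c1;
        T[n - 1] = (uint32_t)x;
        T[n] = (uint32_t)(x >> 32);
    }
    mont_reduce_once(out, T, N, n);
}
```

Two separate carries are kept in the merged loop because a single 64-bit accumulator cannot hold `T[j] + u0*B[j] + u1*N[j]` plus one carry, whereas each individual multiply-accumulate with its own 32-bit carry fits exactly. Both functions compute `A*B*R^-1 mod N` with `R = 2^(32*n)`, so they can be cross-checked against each other.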