Perf: Assembly code generator for ARM and ARM64 #200

mratsim · 2022-08-06T11:44:10Z

#69 introduced an assembly code generator for x86 and x86-64
at https://github.com/mratsim/constantine/blob/7d29cb9/constantine/platforms/isa/macro_assembler_x86.nim

We need the same for ARM and ARM64, for efficiency on Raspberry Pi, phones, Apple Silicon and other resource-restricted devices.

Efficient multiplication on ARM:

slides: http://arith24.arithsymposium.org/slides/s2-liu.pdf
paper 1: https://orbilu.uni.lu/bitstream/10993/34104/1/ARMv8_KJ_zhe.pdf
paper 2: https://core.ac.uk/download/pdf/275655534.pdf
Multiprecision Multiplication on ARMv8

Related papers:

https://eprint.iacr.org/2021/185.pdf

No Silver Bullet: Optimized Montgomery Multiplication on Various 64-bit ARM Platforms

Abstract

In this paper, we firstly presented optimized implementations of Montgomery multiplication on 64-bit ARM processors by taking advantages of Karatsuba algorithm and efficient multiplication instruction sets for ARM64 architectures. The implementation of Montgomery multiplication can improve the performance of (pre-quantum and post-quantum) public key cryptography (e.g. CSIDH, ECC, and RSA) implementations on ARM64 architectures, directly. Last but not least, the performance of Karatsuba algorithm does not ensure the fastest speed record on various ARM architectures, while it is determined by the clock cycles per multiplication instruction of target ARM architectures. In particular, recent Apple processors based on ARM64 architecture show lower cycles per instruction of multiplication than that of ARM Cortex-A series. For this reason, the schoolbook method shows much better performance than the sophisticated Karatsuba algorithm on Apple processors. With this observation, we can determine the proper approach for multiplication of cryptography library (e.g. Microsoft-SIDH) on Apple processors and ARM Cortex-A processors.

mratsim · 2022-08-06T12:13:45Z

Relevant:

https://eprint.iacr.org/2022/439.pdf - Efficient Multiplication of Somewhat Small Integers using Number-Theoretic Transforms
https://eprint.iacr.org/2021/1355.pdf - Curve448 on 32-bit ARM Cortex-M4
https://tches.iacr.org/index.php/TCHES/article/view/9295/8861 - Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1
https://eprint.iacr.org/2021/561.pdf - Kyber on ARM64
https://eprint.iacr.org/2019/721.pdf - Optimized SIKE Round 2 on 64-bit ARM
Mbed-TLS/mbedtls#5666 - Improve Montgomery multiplication strategy with UMAAL instruction for fused {C|D} <- A*B + C + D
Mbed-TLS/mbedtls#5360 - Improve inline assembly for Cortex-M + DSP
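
To make the fused `{C|D} <- A*B + C + D` operation mentioned above concrete, here is a minimal Python sketch that emulates the 32-bit ARM `UMAAL` semantics and uses it as the inner step of a schoolbook limb multiplication. The helper names (`umaal`, `schoolbook_mul`, `to_limbs`, `from_limbs`) are hypothetical and only illustrate the arithmetic; this is not code from mbedtls or Constantine.

```python
MASK32 = (1 << 32) - 1  # UMAAL operates on 32-bit registers

def umaal(a, b, c, d):
    """Emulate UMAAL: return (hi, lo) halves of a*b + c + d for 32-bit inputs."""
    for x in (a, b, c, d):
        assert 0 <= x <= MASK32
    t = a * b + c + d
    # Worst case: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1, so the sum
    # always fits in the 64-bit {hi|lo} register pair without overflow.
    return t >> 32, t & MASK32

def schoolbook_mul(x, y):
    """Multiply two little-endian lists of 32-bit limbs, one umaal per limb pair."""
    r = [0] * (len(x) + len(y))
    for i, xi in enumerate(x):
        carry = 0
        for j, yj in enumerate(y):
            # Multiply, add the accumulator limb and add the running carry in one step.
            carry, r[i + j] = umaal(xi, yj, r[i + j], carry)
        r[i + len(y)] = carry
    return r

def to_limbs(n, count):
    """Split a non-negative integer into `count` little-endian 32-bit limbs."""
    return [(n >> (32 * k)) & MASK32 for k in range(count)]

def from_limbs(limbs):
    """Recombine little-endian 32-bit limbs into an integer."""
    return sum(limb << (32 * k) for k, limb in enumerate(limbs))
```

The point of the fused form is that the accumulate and carry additions come for free with the multiplication, and the bound in the comment shows why no extra carry flag is needed.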

mratsim · 2024-02-11T22:49:56Z

https://eprint.iacr.org/2018/700.pdf - SIDH on ARM: Faster Modular Multiplications for Faster Post-Quantum Supersingular Isogeny Key Exchange
- slides: https://ches.iacr.org/2018/slides/ches2018-session5-talk3-slides.pdf
https://eprint.iacr.org/2016/645.pdf - FourQNEON: Faster Elliptic Curve Scalar Multiplications on ARM Processors
https://rielac.cujae.edu.cu/index.php/rieac/article/download/797/420 - Speeding up elliptic curve arithmetic on ARM processors using NEON instructions
https://eprint.iacr.org/2015/465.pdf - Efficient Arithmetic on ARM-NEON and Its Application for High-Speed RSA Implementation
https://eprint.iacr.org/2014/760.pdf - Montgomery Modular Multiplication on ARM-NEON Revisited

https://eprint.iacr.org/2021/185.pdf is particularly interesting regarding general ARM CPUs and Apple CPUs:

Multiplications are 3x slower than additions on the Raspberry Pi 4 but run at roughly the same speed as additions on Apple CPUs.
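
As a rough illustration of why that cost ratio matters, the sketch below counts word multiplications for schoolbook and for a subtractive Karatsuba on 64-bit limbs. It is a toy model on Python integers, not the paper's implementation: the function names, the `stats` dictionary and the `addsub_limbs` tally (a coarse estimate of Karatsuba's extra limb additions and subtractions) are assumptions made for illustration.

```python
W = 64  # limb width in bits (assumption: 64-bit limbs, as on ARM64)

def schoolbook(a, b, n, stats):
    """n x n limb product; one word multiplication per limb pair."""
    mask = (1 << W) - 1
    r = 0
    for i in range(n):
        for j in range(n):
            stats["mul"] += 1
            ai = (a >> (W * i)) & mask
            bj = (b >> (W * j)) & mask
            r += (ai * bj) << (W * (i + j))
    return r

def karatsuba(a, b, n, stats):
    """Subtractive Karatsuba on n-limb operands (n must be a power of two)."""
    if n == 1:
        stats["mul"] += 1
        return a * b
    h = n // 2
    mask = (1 << (W * h)) - 1
    a0, a1 = a & mask, a >> (W * h)
    b0, b1 = b & mask, b >> (W * h)
    z0 = karatsuba(a0, b0, h, stats)
    z2 = karatsuba(a1, b1, h, stats)
    # |a0 - a1| and |b0 - b1| both fit in h limbs, so the recursion stays in range.
    zm = karatsuba(abs(a0 - a1), abs(b0 - b1), h, stats)
    if (a0 < a1) != (b0 < b1):
        zm = -zm  # (a0 - a1) * (b0 - b1) is negative when exactly one factor is
    # a0*b1 + a1*b0 = z0 + z2 - (a0 - a1)*(b0 - b1)
    z1 = z0 + z2 - zm
    # Rough tally (assumption): two h-limb subtractions plus three n-limb add/subs.
    stats["addsub_limbs"] += 2 * h + 3 * n
    return z0 + (z1 << (W * h)) + (z2 << (2 * W * h))
```

For 4-limb operands this counts 16 word multiplications for schoolbook and 9 for Karatsuba. Whether saving those multiplications pays for the extra additions and subtractions depends on the multiply-to-add cost ratio of the target CPU, which is the difference between the Raspberry Pi 4 and Apple CPUs noted above.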

mratsim added the performance 🏁 label Aug 6, 2022

mratsim mentioned this issue Jun 28, 2024

Constantine bindings for EIP196 hyperledger/besu-native#184

Merged

mratsim mentioned this issue Aug 14, 2024

LLVM: field addition with saturated fields #456

Merged
