In encode_arm64.s, encodeBlock, two of the registers added during the port from
amd64 were not saved or restored for the memmove call. Instead of saving them,
just recalculate their values. Additionally, I made a few small changes to
improve things since I've learned a bit more about ARMv8 assembly.
- The CMP instruction accepts an immediate as the first argument
- use LDP/STP instead of SIMD instructions
The change to use the load-pair and store-pair instructions instead of the SIMD
instructions results in some modest performance improvements as meastured on
Neoverse N1 (Graviton 2).
name old time/op new time/op delta
WordsDecode1e1-2 25.9ns ± 1% 26.1ns ± 1% +0.66% (p=0.005 n=10+10)
WordsDecode1e2-2 107ns ± 0% 105ns ± 0% -1.87% (p=0.000 n=10+10)
WordsDecode1e3-2 953ns ± 0% 901ns ± 0% -5.50% (p=0.000 n=10+10)
WordsDecode1e4-2 10.6µs ± 0% 9.9µs ± 2% -6.60% (p=0.000 n=7+10)
WordsDecode1e5-2 170µs ± 1% 164µs ± 1% -3.12% (p=0.000 n=10+9)
WordsDecode1e6-2 1.71ms ± 0% 1.66ms ± 0% -2.98% (p=0.000 n=10+10)
WordsEncode1e1-2 22.0ns ± 1% 21.9ns ± 1% -0.67% (p=0.006 n=8+10)
WordsEncode1e2-2 248ns ± 0% 245ns ± 0% -1.21% (p=0.002 n=8+10)
WordsEncode1e3-2 2.50µs ± 0% 2.49µs ± 0% ~ (p=0.103 n=10+9)
WordsEncode1e4-2 27.8µs ± 3% 28.0µs ± 2% ~ (p=0.075 n=10+10)
WordsEncode1e5-2 339µs ± 0% 343µs ± 0% +1.18% (p=0.000 n=9+10)
WordsEncode1e6-2 3.39ms ± 0% 3.42ms ± 0% +0.94% (p=0.000 n=10+10)
RandomEncode-2 74.8µs ± 1% 77.1µs ± 1% +3.16% (p=0.000 n=10+10)
_UFlat0-2 68.8µs ± 1% 66.4µs ± 2% -3.54% (p=0.000 n=10+10)
_UFlat1-2 770µs ± 0% 740µs ± 1% -3.93% (p=0.000 n=10+10)
_UFlat2-2 6.57µs ± 0% 6.55µs ± 0% -0.25% (p=0.000 n=8+10)
_UFlat3-2 183ns ± 0% 178ns ± 1% -2.84% (p=0.000 n=9+10)
_UFlat4-2 9.76µs ± 1% 9.56µs ± 0% -2.07% (p=0.000 n=10+9)
_UFlat5-2 301µs ± 0% 293µs ± 0% -2.67% (p=0.000 n=9+10)
_UFlat6-2 280µs ± 1% 267µs ± 1% -4.63% (p=0.000 n=10+10)
_UFlat7-2 241µs ± 0% 230µs ± 1% -4.68% (p=0.000 n=9+10)
_UFlat8-2 745µs ± 0% 715µs ± 1% -4.11% (p=0.000 n=10+10)
_UFlat9-2 1.01ms ± 0% 0.96ms ± 0% -4.60% (p=0.000 n=10+10)
_UFlat10-2 62.3µs ± 1% 59.3µs ± 1% -4.72% (p=0.000 n=10+9)
_UFlat11-2 258µs ± 0% 252µs ± 1% -2.56% (p=0.000 n=10+10)
_ZFlat0-2 135µs ± 1% 132µs ± 1% -1.88% (p=0.000 n=10+8)
_ZFlat1-2 1.76ms ± 0% 1.74ms ± 0% -1.00% (p=0.000 n=9+9)
_ZFlat2-2 9.54µs ± 0% 9.84µs ± 5% +3.18% (p=0.000 n=10+10)
_ZFlat3-2 449ns ± 0% 447ns ± 0% -0.38% (p=0.000 n=10+9)
_ZFlat4-2 15.6µs ± 0% 16.0µs ± 4% ~ (p=0.118 n=9+10)
_ZFlat5-2 560µs ± 1% 555µs ± 1% -0.89% (p=0.000 n=9+9)
_ZFlat6-2 531µs ± 0% 534µs ± 0% +0.64% (p=0.000 n=10+10)
_ZFlat7-2 466µs ± 0% 468µs ± 0% +0.32% (p=0.003 n=10+10)
_ZFlat8-2 1.42ms ± 0% 1.42ms ± 0% +0.43% (p=0.000 n=10+10)
_ZFlat9-2 1.93ms ± 0% 1.94ms ± 0% +0.44% (p=0.000 n=10+10)
_ZFlat10-2 120µs ± 0% 121µs ± 3% ~ (p=0.436 n=9+9)
_ZFlat11-2 433µs ± 0% 437µs ± 0% +1.03% (p=0.000 n=10+10)
ExtendMatch-2 9.77µs ± 0% 9.76µs ± 0% -0.13% (p=0.050 n=10+10)
As measured on Cortex-A53 (Raspberry Pi 3)
name old time/op new time/op delta
WordsDecode1e1-4 152ns ± 2% 151ns ± 0% ~ (p=0.536 n=10+8)
WordsDecode1e2-4 639ns ± 0% 617ns ± 0% -3.54% (p=0.000 n=9+8)
WordsDecode1e3-4 6.74µs ± 2% 6.35µs ± 0% -5.75% (p=0.000 n=10+9)
WordsDecode1e4-4 66.7µs ± 0% 63.5µs ± 0% -4.69% (p=0.000 n=9+9)
WordsDecode1e5-4 715µs ± 0% 684µs ± 0% -4.38% (p=0.000 n=8+8)
WordsDecode1e6-4 6.87ms ± 2% 6.53ms ± 1% -4.99% (p=0.000 n=10+9)
WordsEncode1e1-4 127ns ± 2% 126ns ± 0% ~ (p=0.065 n=10+9)
WordsEncode1e2-4 1.58µs ± 0% 1.57µs ± 0% -0.99% (p=0.000 n=8+8)
WordsEncode1e3-4 15.1µs ± 0% 14.9µs ± 0% -1.46% (p=0.000 n=9+8)
WordsEncode1e4-4 148µs ± 0% 148µs ± 4% ~ (p=0.497 n=9+10)
WordsEncode1e5-4 1.54ms ± 0% 1.54ms ± 0% +0.12% (p=0.012 n=10+8)
WordsEncode1e6-4 14.4ms ± 0% 14.4ms ± 1% -0.47% (p=0.015 n=9+8)
RandomEncode-4 1.13ms ± 1% 1.13ms ± 1% ~ (p=0.529 n=10+10)
_UFlat0-4 294µs ± 0% 288µs ± 1% -2.08% (p=0.000 n=9+9)
_UFlat1-4 3.05ms ± 1% 2.98ms ± 1% -2.22% (p=0.000 n=9+9)
_UFlat2-4 37.3µs ± 0% 37.4µs ± 1% ~ (p=0.093 n=8+9)
_UFlat3-4 909ns ± 0% 914ns ± 2% ~ (p=0.526 n=8+10)
_UFlat4-4 58.7µs ± 0% 58.1µs ± 0% -1.09% (p=0.000 n=8+10)
_UFlat5-4 1.22ms ± 0% 1.19ms ± 1% -2.14% (p=0.000 n=8+8)
_UFlat6-4 1.03ms ± 0% 0.99ms ± 0% -3.28% (p=0.000 n=9+8)
_UFlat7-4 895µs ± 0% 861µs ± 0% -3.79% (p=0.000 n=8+8)
_UFlat8-4 2.83ms ± 0% 2.75ms ± 0% -2.88% (p=0.000 n=7+8)
_UFlat9-4 3.85ms ± 1% 3.73ms ± 1% -3.03% (p=0.000 n=8+9)
_UFlat10-4 286µs ± 0% 282µs ± 0% -1.59% (p=0.000 n=9+9)
_UFlat11-4 1.06ms ± 0% 1.02ms ± 0% -3.58% (p=0.000 n=8+9)
_ZFlat0-4 620µs ± 0% 620µs ± 1% ~ (p=0.963 n=9+8)
_ZFlat1-4 9.49ms ± 1% 9.67ms ± 3% +1.87% (p=0.000 n=9+10)
_ZFlat2-4 61.8µs ± 0% 62.3µs ± 3% ~ (p=0.829 n=8+10)
_ZFlat3-4 2.80µs ± 1% 2.79µs ± 0% -0.55% (p=0.000 n=8+8)
_ZFlat4-4 108µs ± 0% 109µs ± 0% +0.55% (p=0.000 n=10+8)
_ZFlat5-4 2.59ms ± 2% 2.58ms ± 1% ~ (p=0.274 n=10+8)
_ZFlat6-4 2.39ms ± 3% 2.40ms ± 1% ~ (p=0.631 n=10+10)
_ZFlat7-4 2.11ms ± 0% 2.08ms ± 1% -1.23% (p=0.000 n=10+9)
_ZFlat8-4 6.86ms ± 0% 6.92ms ± 1% +0.78% (p=0.000 n=9+8)
_ZFlat9-4 9.42ms ± 0% 9.40ms ± 1% ~ (p=0.606 n=8+9)
_ZFlat10-4 620µs ± 1% 621µs ± 4% ~ (p=0.173 n=8+10)
_ZFlat11-4 1.94ms ± 0% 1.93ms ± 0% -0.52% (p=0.001 n=9+8)
ExtendMatch-4 69.3µs ± 2% 69.2µs ± 0% ~ (p=0.515 n=10+8)