neon base64: vectorize with vector factor 16
The performance with this change is slightly worse than VF8: the code
generated by LLVM contains too many mov's instead of byte vzip and vuzp.
GCC is also generating too many movs and dups which make the code slower
than when compiled with LLVM.

Experiments from an A72 firefly cpu freq set to 1.2GHz:

$ sudo cat /sys/devices/system/cpu/cpu4/cpufreq/cpuinfo_cur_freq
1200000

Before the patch with trunk LLVM as of today:

-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
neon_base64_decode_lena               214964 ns     214955 ns       3256
neon_base64_decode_peppers             19452 ns      19452 ns      35989
neon_base64_decode_mandril            502020 ns     502002 ns       1394
neon_base64_decode_moby_dick            2290 ns       2290 ns     305775
neon_base64_decode_googlelogo           4820 ns       4820 ns     145098
neon_base64_decode_bingsocialicon       2778 ns       2778 ns     251984
neon_base64_decode_all                748928 ns     748916 ns        934

with the patch:

-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
neon_base64_decode_lena               316154 ns     316148 ns       2214
neon_base64_decode_peppers             28442 ns      28442 ns      24608
neon_base64_decode_mandril            738890 ns     738872 ns        947
neon_base64_decode_moby_dick            3362 ns       3362 ns     208250
neon_base64_decode_googlelogo           7056 ns       7056 ns      99171
neon_base64_decode_bingsocialicon       4087 ns       4087 ns     171265
neon_base64_decode_all               1097039 ns    1097017 ns        638
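To put a number on the gap between the two runs above, the per-benchmark slowdown can be computed directly from the Time column. The short Python sketch below is not part of the patch; the dictionary names and the `slowdown` helper are made up for illustration, and the values are copied from the two tables. On these numbers every benchmark takes roughly 1.46x to 1.47x as long with the patch.

```python
# Wall-clock times in ns, copied from the "Time" column of the two
# benchmark runs in this commit message (A72 at 1.2GHz, trunk LLVM).
before_ns = {
    "neon_base64_decode_lena": 214964,
    "neon_base64_decode_peppers": 19452,
    "neon_base64_decode_mandril": 502020,
    "neon_base64_decode_moby_dick": 2290,
    "neon_base64_decode_googlelogo": 4820,
    "neon_base64_decode_bingsocialicon": 2778,
    "neon_base64_decode_all": 748928,
}
after_ns = {
    "neon_base64_decode_lena": 316154,
    "neon_base64_decode_peppers": 28442,
    "neon_base64_decode_mandril": 738890,
    "neon_base64_decode_moby_dick": 3362,
    "neon_base64_decode_googlelogo": 7056,
    "neon_base64_decode_bingsocialicon": 4087,
    "neon_base64_decode_all": 1097039,
}


def slowdown(before, after):
    """Return after/before time ratio per benchmark (hypothetical helper)."""
    return {name: after[name] / before[name] for name in before}


ratios = slowdown(before_ns, after_ns)
for name, ratio in ratios.items():
    # A ratio above 1.0 means the patched (VF16) build is slower.
    print(f"{name:<36}{ratio:.2f}x")
```

The ratio is nearly constant across inputs of very different sizes, which is consistent with the extra cost sitting in the vectorized loop body itself rather than in per-call setup.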