
neon base64: vectorize with vector factor 16 #6

Merged · 1 commit · Mar 1, 2018

Conversation

sebpop (Contributor) commented Feb 20, 2018

The performance with this change is slightly worse than with VF8: the code generated
by LLVM contains too many movs instead of byte vzip and vuzp instructions. GCC also
generates too many movs and dups, which makes its code slower than the code compiled
with LLVM.
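
For context, here is a minimal illustrative sketch (not code from this pull request; the function names are made up) of the byte interleave and de-interleave that vzip and vuzp perform on 16-byte NEON registers, which is the lowering the decoder's shuffles should ideally get instead of long runs of per-byte movs:

#include <arm_neon.h>

/* Illustration only: shows the interleave (vzip) and de-interleave (vuzp)
 * patterns discussed above, each operating on two 16-byte vectors. */
static inline uint8x16x2_t zip_bytes(uint8x16_t a, uint8x16_t b) {
    /* result.val[0] = {a0,b0,a1,b1,...,a7,b7}
       result.val[1] = {a8,b8,a9,b9,...,a15,b15} */
    return vzipq_u8(a, b);
}

static inline uint8x16x2_t unzip_bytes(uint8x16_t a, uint8x16_t b) {
    /* result.val[0] = even lanes {a0,a2,...,a14,b0,b2,...,b14}
       result.val[1] = odd lanes  {a1,a3,...,a15,b1,b3,...,b15} */
    return vuzpq_u8(a, b);
}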

Experiments on an A72 Firefly board with the CPU frequency set to 1.2 GHz:

$ sudo cat /sys/devices/system/cpu/cpu4/cpufreq/cpuinfo_cur_freq
1200000

Before the patch with trunk LLVM as of today:

-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
neon_base64_decode_lena               214964 ns     214955 ns       3256
neon_base64_decode_peppers             19452 ns      19452 ns      35989
neon_base64_decode_mandril            502020 ns     502002 ns       1394
neon_base64_decode_moby_dick            2290 ns       2290 ns     305775
neon_base64_decode_googlelogo           4820 ns       4820 ns     145098
neon_base64_decode_bingsocialicon       2778 ns       2778 ns     251984
neon_base64_decode_all                748928 ns     748916 ns        934

with the patch:

-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
neon_base64_decode_lena               316154 ns     316148 ns       2214
neon_base64_decode_peppers             28442 ns      28442 ns      24608
neon_base64_decode_mandril            738890 ns     738872 ns        947
neon_base64_decode_moby_dick            3362 ns       3362 ns     208250
neon_base64_decode_googlelogo           7056 ns       7056 ns      99171
neon_base64_decode_bingsocialicon       4087 ns       4087 ns     171265
neon_base64_decode_all               1097039 ns    1097017 ns        638

sebpop (Contributor, Author) commented Feb 20, 2018

Here are the numbers for the chromium_base64 implementation on the same A72 processor:

-----------------------------------------------------------------------------
Benchmark                                      Time           CPU Iterations
-----------------------------------------------------------------------------
chromium_base64_decode_lena               268360 ns     268355 ns       2608
chromium_base64_decode_peppers             23784 ns      23784 ns      29432
chromium_base64_decode_mandril            623872 ns     623852 ns       1121
chromium_base64_decode_moby_dick            2814 ns       2814 ns     248726
chromium_base64_decode_googlelogo           5936 ns       5935 ns     117943
chromium_base64_decode_bingsocialicon       3426 ns       3426 ns     204350
chromium_base64_decode_all                930364 ns     930343 ns        751

lemire (Owner) commented Feb 20, 2018

So this patch is a negative? If so, any chance you might add it (with new function names) instead of replacing the code we have in place? In this way, other people could inspect your work and improve upon it.

sebpop (Contributor, Author) commented Feb 20, 2018

VF16 should be a win over VF8. I will fix LLVM to produce byte vzip and vuzp instead of the 16 byte moves.
As you suggest, I will send a patch that keeps both VF8 and VF16 in different functions.

Overall, this patch is a win over the vectorization by 8.
The benchmarks were compiled with LLVM trunk as of today, comparing
neon_base64_decode_all against chromium_base64_decode_all:
- a positive number is a speedup
- vf8 is the version before this patch, with vector factor 8
- vf16 is this patch, with vector factor 16
- chromium is the scalar reference implementation
- S8 Exynos-M2 is a Galaxy S8 phone running on its Exynos-M2 cores
- S8 A-53 is the little (Cortex-A53) core of the Galaxy S8

CPU            vf8/chromium   vf16/chromium   vf16/vf8
S8 Exynos-M2         -18.3%           17.7%      44.1%
S8 A-53               45.3%          126.8%      56.1%
A72 firefly           24.0%           29.0%       4.1%
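
As a sanity check on how these percentages appear to be computed (an assumption; the formula is not stated in the thread), dividing the chromium _all time by the neon _all time from the Feb 20 A72 runs posted above reproduces roughly the vf8/chromium entry for the A72 firefly:

#include <stdio.h>

/* Assumed formula: speedup% = (reference_time / new_time - 1) * 100,
 * using the _all timings posted earlier for the A72 firefly. */
int main(void) {
    double chromium_all = 930364.0;  /* ns, chromium_base64_decode_all */
    double neon_vf8_all = 748928.0;  /* ns, neon_base64_decode_all before the patch */
    double speedup = (chromium_all / neon_vf8_all - 1.0) * 100.0;
    printf("vf8/chromium: %.1f%%\n", speedup);  /* ~24.2%, close to the 24.0% in the table */
    return 0;
}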

lemire (Owner) commented Mar 1, 2018

Ping.

sebpop (Contributor, Author) commented Mar 1, 2018

Daniel, do you still want the two functions, vf8 and vf16, side by side?
The amended patch for vf16 is now better than vf8 on all aarch64 CPUs I have access to:
see the commit message for 739e5fd.

lemire (Owner) commented Mar 1, 2018

Ah. Ok. So you are recommending an as-is merge then, right? Just to be clear.

sebpop (Contributor, Author) commented Mar 1, 2018

Yes, please merge in the patch as-is: the new implementation is better across the board now.

On the LLVM side: I have submitted one of the two patches needed to make the original, slower version of this patch fast: https://reviews.llvm.org/D43903. The other change will be a bit more complicated, as it disables an instruction-selection feature that was added for x86_64 blend instructions; I will have to make that feature specific to x86 and disable it for aarch64. The good news is that the current shape of the vf16 code no longer exposes the slow patterns, so we do not need the LLVM changes to achieve a reasonable speedup.

lemire (Owner) commented Mar 1, 2018

+1

lemire merged commit 7e63ba3 into lemire:master on Mar 1, 2018

sebpop (Contributor, Author) commented Mar 5, 2018

For reference, https://reviews.llvm.org/D44118 is the second patch needed to get LLVM to produce decent code for aarch64: it avoids generating many byte movs and replaces them with zip1 and zip2 instructions.
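
For readers who have not seen these instructions, here is a minimal AArch64 sketch (not the code touched by D44118; the function name is made up) showing that zip1 and zip2 each interleave two 16-byte registers in a single instruction, which is the lowering that replaces the runs of byte movs:

#include <arm_neon.h>

/* Illustration only: on AArch64, vzip1q_u8/vzip2q_u8 map to the zip1/zip2
 * instructions mentioned above. Each interleaves two 16-byte registers
 * in one instruction instead of many single-byte moves. */
static inline void zip_bytes_aarch64(uint8x16_t a, uint8x16_t b,
                                     uint8x16_t *lo, uint8x16_t *hi) {
    *lo = vzip1q_u8(a, b);  /* {a0,b0,a1,b1,...,a7,b7}   -> zip1 Vd.16b, Vn.16b, Vm.16b */
    *hi = vzip2q_u8(a, b);  /* {a8,b8,a9,b9,...,a15,b15} -> zip2 Vd.16b, Vn.16b, Vm.16b */
}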
