
neon base64: vectorize with vector factor 16 #6

Merged · 1 commit · Mar 1, 2018

Conversation

sebpop (Contributor) commented Feb 20, 2018

The performance with this change is slightly worse than with VF8: the code generated
by LLVM contains too many movs instead of byte vzip and vuzp instructions. GCC also
generates too many movs and dups, which makes its code slower than the code compiled
with LLVM.
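
For context, here is a minimal illustrative sketch (not code from this pull request; the function names are made up) of the byte interleave and de-interleave that vzip and vuzp perform on 16-byte NEON registers, which is the lowering the decoder's shuffles should ideally get instead of long runs of per-byte movs:

#include <arm_neon.h>

/* Illustration only: shows the interleave (vzip) and de-interleave (vuzp)
 * patterns discussed above, each operating on two 16-byte vectors. */
static inline uint8x16x2_t zip_bytes(uint8x16_t a, uint8x16_t b) {
    /* result.val[0] = {a0,b0,a1,b1,...,a7,b7}
       result.val[1] = {a8,b8,a9,b9,...,a15,b15} */
    return vzipq_u8(a, b);
}

static inline uint8x16x2_t unzip_bytes(uint8x16_t a, uint8x16_t b) {
    /* result.val[0] = even lanes {a0,a2,...,a14,b0,b2,...,b14}
       result.val[1] = odd lanes  {a1,a3,...,a15,b1,b3,...,b15} */
    return vuzpq_u8(a, b);
}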

Experiments on an A72 Firefly board with the CPU frequency set to 1.2 GHz:

$ sudo cat /sys/devices/system/cpu/cpu4/cpufreq/cpuinfo_cur_freq
1200000

Before the patch with trunk LLVM as of today:

-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
neon_base64_decode_lena               214964 ns     214955 ns       3256
neon_base64_decode_peppers             19452 ns      19452 ns      35989
neon_base64_decode_mandril            502020 ns     502002 ns       1394
neon_base64_decode_moby_dick            2290 ns       2290 ns     305775
neon_base64_decode_googlelogo           4820 ns       4820 ns     145098
neon_base64_decode_bingsocialicon       2778 ns       2778 ns     251984
neon_base64_decode_all                748928 ns     748916 ns        934

with the patch:

-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
neon_base64_decode_lena               316154 ns     316148 ns       2214
neon_base64_decode_peppers             28442 ns      28442 ns      24608
neon_base64_decode_mandril            738890 ns     738872 ns        947
neon_base64_decode_moby_dick            3362 ns       3362 ns     208250
neon_base64_decode_googlelogo           7056 ns       7056 ns      99171
neon_base64_decode_bingsocialicon       4087 ns       4087 ns     171265
neon_base64_decode_all               1097039 ns    1097017 ns        638

sebpop (Contributor, Author) commented Feb 20, 2018

Here are the numbers for the chromium_base64 implementation on the same A72 processor:

-----------------------------------------------------------------------------
Benchmark                                      Time           CPU Iterations
-----------------------------------------------------------------------------
chromium_base64_decode_lena               268360 ns     268355 ns       2608
chromium_base64_decode_peppers             23784 ns      23784 ns      29432
chromium_base64_decode_mandril            623872 ns     623852 ns       1121
chromium_base64_decode_moby_dick            2814 ns       2814 ns     248726
chromium_base64_decode_googlelogo           5936 ns       5935 ns     117943
chromium_base64_decode_bingsocialicon       3426 ns       3426 ns     204350
chromium_base64_decode_all                930364 ns     930343 ns        751

lemire (Owner) commented Feb 20, 2018

So this patch is a negative? If so, any chance you might add it (with new function names) instead of replacing the code we have in place? In this way, other people could inspect your work and improve upon it.

sebpop (Contributor, Author) commented Feb 20, 2018

VF16 should be a win over VF8. I will fix LLVM to produce byte vzip and vuzp instead of the 16 byte moves.
As you suggest, I will send a patch that keeps both VF8 and VF16 in different functions.

Overall, this patch is a win over the vectorization by 8.
The benchmarks were compiled with LLVM trunk as of today, comparing
neon_base64_decode_all against chromium_base64_decode_all:
- a positive number is a speedup
- vf8 is the version before this patch, with vector factor 8
- vf16 is this patch, with vector factor 16
- chromium is the scalar reference implementation
- S8 Exynos-M2 is a Galaxy S8 phone running on its Exynos-M2 cores
- S8 A-53 is the little (Cortex-A53) core of the Galaxy S8

CPU            vf8/chromium   vf16/chromium   vf16/vf8
S8 Exynos-M2         -18.3%           17.7%      44.1%
S8 A-53               45.3%          126.8%      56.1%
A72 firefly           24.0%           29.0%       4.1%
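
As a sanity check on how these percentages appear to be computed (an assumption; the formula is not stated in the thread), dividing the chromium _all time by the neon _all time from the Feb 20 A72 runs posted above reproduces roughly the vf8/chromium entry for the A72 firefly:

#include <stdio.h>

/* Assumed formula: speedup% = (reference_time / new_time - 1) * 100,
 * using the _all timings posted earlier for the A72 firefly. */
int main(void) {
    double chromium_all = 930364.0;  /* ns, chromium_base64_decode_all */
    double neon_vf8_all = 748928.0;  /* ns, neon_base64_decode_all before the patch */
    double speedup = (chromium_all / neon_vf8_all - 1.0) * 100.0;
    printf("vf8/chromium: %.1f%%\n", speedup);  /* ~24.2%, close to the 24.0% in the table */
    return 0;
}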

lemire (Owner) commented Mar 1, 2018

Ping.

sebpop (Contributor, Author) commented Mar 1, 2018

Daniel, do you still want the two functions, vf8 and vf16, side by side?
The amended patch for vf16 is now better than vf8 on all aarch64 CPUs I have access to:
see the commit message for 739e5fd.

lemire (Owner) commented Mar 1, 2018

Ah. Ok. So you are recommending an as-is merge then, right? Just to be clear.

sebpop (Contributor, Author) commented Mar 1, 2018

Yes, please merge in the patch as-is: the new implementation is better across the board now.

On the LLVM side: I have submitted one of the two patches needed to make the original, slower version of this patch fast: https://reviews.llvm.org/D43903. The other change will be a bit more complicated, as it disables an instruction-selection feature that was added for x86_64 blend instructions; I will have to make that feature specific to x86 and disable it for aarch64. The good news is that the current shape of the vf16 code no longer exposes the slow patterns, so we do not need the LLVM changes to achieve a reasonable speedup.

lemire (Owner) commented Mar 1, 2018

+1

lemire merged commit 7e63ba3 into lemire:master on Mar 1, 2018

sebpop (Contributor, Author) commented Mar 5, 2018

For reference, https://reviews.llvm.org/D44118 is the second patch needed to get LLVM to produce decent code for aarch64: it avoids generating many byte movs and replaces them with zip1 and zip2 instructions.
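
For readers who have not seen these instructions, here is a minimal AArch64 sketch (not the code touched by D44118; the function name is made up) showing that zip1 and zip2 each interleave two 16-byte registers in a single instruction, which is the lowering that replaces the runs of byte movs:

#include <arm_neon.h>

/* Illustration only: on AArch64, vzip1q_u8/vzip2q_u8 map to the zip1/zip2
 * instructions mentioned above. Each interleaves two 16-byte registers
 * in one instruction instead of many single-byte moves. */
static inline void zip_bytes_aarch64(uint8x16_t a, uint8x16_t b,
                                     uint8x16_t *lo, uint8x16_t *hi) {
    *lo = vzip1q_u8(a, b);  /* {a0,b0,a1,b1,...,a7,b7}   -> zip1 Vd.16b, Vn.16b, Vm.16b */
    *hi = vzip2q_u8(a, b);  /* {a8,b8,a9,b9,...,a15,b15} -> zip2 Vd.16b, Vn.16b, Vm.16b */
}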
