neon base64: vectorize with vector factor 16 #6
Conversation
Here are the numbers for the chromium_base64 implementation on the same A72 processor:
So this patch is a negative? If so, any chance you might add it (with new function names) instead of replacing the code we have in place? That way, other people could inspect your work and improve upon it.
VF16 should be a win over VF8. I will fix LLVM to produce byte vzip and vuzp instead of the 16 byte moves.
Overall this patch is a win over the vectorization by 8. The benchmarks were compiled with LLVM trunk as of today, comparing neon_base64_decode_all against chromium_base64_decode_all:

- a positive number is a speedup
- vf8 is the version before this patch, with vector factor 8
- vf16 is with this patch, vector factor 16
- chromium is the scalar reference implementation
- S8 Exynos-M2 is a Galaxy-S8 phone with an Exynos-M2
- S8 A-53 is the little core of the Galaxy-S8 CPU

                  vf8/chromium   vf16/chromium   vf16/vf8
    S8 Exynos-M2     -18.3%          17.7%         44.1%
    S8 A-53           45.3%         126.8%         56.1%
    A72 firefly       24.0%          29.0%          4.1%
Ping.
Daniel, do you still want the two functions vf8 and vf16 side by side?
Ah, OK. So you are recommending an as-is merge then, right? Just to be clear.
Yes, please merge the patch as-is: the new implementation is better across the board now. On the LLVM side, I have submitted one of the two patches needed to make the original, slower version faster: https://reviews.llvm.org/D43903. The other change will be a bit more complicated, as it disables an instruction-selection feature that was added for x86_64 blend instructions; I will have to make that feature x86-specific and disable it for AArch64. The good news is that the current shape of the vf16 code no longer exposes the slow patterns, so we don't need the LLVM changes to achieve a reasonable speedup.
+1 |
For reference, https://reviews.llvm.org/D44118 is the second patch to get LLVM to produce decent code for AArch64: it avoids generating many byte movs and replaces them with zip1 and zip2 instructions.
The performance with this change is slightly worse than VF8: the code generated by LLVM contains too many movs instead of byte vzip and vuzp instructions. GCC also generates too many movs and dups, which makes its code slower than LLVM's.
Experiments on an A72 firefly board with the CPU frequency set to 1.2 GHz:
Before the patch, with trunk LLVM as of today:
With the patch: