[BUG] NDK r23b neon intrinsics too slow #1607
Comments
Can you get the preprocessed input source file (and show your compiler flags) for this file/function? You can get the preprocessed file by adding |
@DaydreamCoding Can you please provide complete code that calls the mentioned method? Some argument values are still unknown, such as srcw, srch, strstride, w, h, stride, type, etc. In other words, a minimal reproducible example: https://stackoverflow.com/help/minimal-reproducible-example |
Let me give a reproducible example. Benchmark result, src image: width=1920, height=1080
Complete code and building script
|
ndk-r23b and ndk-r24-beta1: all NEON intrinsics are too slow. Benchmark result
Benchmarked on a Qualcomm SM7250 chip
Complete code and build script: https://github.com/DaydreamCoding/neon-intrinsics-test
#export ANDROID_NDK=$ANDROID_NDK_R21e
./build/android_build.sh && ./build/run_android64.sh |
Reproduced this with a Pixel 6 Pro: NDK 22.1.7171670, clang 11.0.5:
NDK 23.1.7779620, clang 12.0.8:
|
This is probably the same as Issue #1619 but we can confirm that by trying an updated toolchain with the cherry-pick applied. |
Discussed this with stephenhines, Dan, and Pirama. This is not the same issue as #1619. |
Ruled out a header issue: Preprocessed the test case with NDK r22, then compiled with r23. The performance is still slow. |
Compile commands: r22:
r23:
|
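When two compile commands like the ones above need to be compared, one rough way is to list the flags that appear in one command but not the other. The sketch below assumes plain space-separated flags with no quoting or glob characters; the function name `flags_only_in_first` and the two flag strings are hypothetical and are not the actual commands from this thread.

```shell
#!/bin/sh
# Print each flag that appears in the first flag string but not in the second.
# Both arguments are plain strings of space-separated flags (no quoting, no globs).
flags_only_in_first() {
    for flag in $1; do
        case " $2 " in
            *" $flag "*) ;;                 # flag is also in the second command
            *) printf '%s\n' "$flag" ;;     # flag is missing from the second command
        esac
    done
}

# Hypothetical flag sets, for illustration only.
old_cmd="-g -DNDEBUG -fPIC -Wall"
new_cmd="-g -fPIC -Wall"

flags_only_in_first "$old_cmd" "$new_cmd"   # prints -DNDEBUG
```

Running it in both directions (swapping the arguments) shows what each toolchain adds or drops.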
Using the r22 version of arm_neon.h with the r23 NDK does not change the performance. |
The performance regression still exists at HEAD of the NDK, which reports clang 14.0.0 |
FYI Android toolchain version numbers don't exactly match upstream LLVM version numbers. In particular, the version number becomes 14.0.0 once the 13.x release ships in TOT, so we label our toolchain as 14.0.0. When Clang 14.0.0 releases, it will likely be further along than our 14.0.0. This is why we also keep track of the SHA in our version output. |
@jfgoog I can't see a clear asm regression in the trunk output that would account for an 8x slowdown; the instructions are reordered, but that's probably fine. Could you run a sampling profiler and see whether the hot instruction paths differ? Streamline from Arm Mobile Studio worked for me. |
Yury, did you look at the Godbolt link or look at the .s files I attached? If Godbolt, I recommend taking a look at the .s files, as they are the output of the actual NDK. I am new to ARM asm, and compilers in general, but it looked to me like we are emitting way more asm for r23 than r22: 9k lines vs 3k lines, according to And, when I look at, for example, the asm associated with
Whereas for r23 I get:
The branch to the allocator function is odd. In r22, we never do that. But in r23:
|
@jfgoog We may just replace the
// std::vector<int> adelta(w);
// std::vector<int> bdelta(w);
int adelta[w];
int bdelta[w];
And replace But on my Android device (XiaoMI 11) it still shows the same poor performance with Android NDK r23b, same as the vector version. |
Well, I'm an idiot for missing this earlier. This is a problem with build configuration. If you look at the compile commands up above, NDK 22 has When I add
The build script, https://github.com/DaydreamCoding/neon-intrinsics-test/blob/develop/build/android_build.sh is apparently making some assumptions about the cmake configuration that changed between NDK22 and 23. I tried commenting out the following lines:
diff --git a/build/android_build.sh b/build/android_build.sh
index 6c650af..1a58d9e 100755
--- a/build/android_build.sh
+++ b/build/android_build.sh
@@ -50,8 +50,8 @@ COMMON_C_FLAGS="$COMMON_C_FLAGS "
COMMON_CXX_FLAGS="$COMMON_CXX_FLAGS "
# Common FLAGS_RELEASE; add -Ofast, -O3, -O2, etc. as the project requires; release defaults to -Os
-COMMON_C_FLAGS_RELEASE="$COMMON_C_FLAGS_RELEASE "
-COMMON_CXX_FLAGS_RELEASE="$COMMON_CXX_FLAGS_RELEASE "
+#COMMON_C_FLAGS_RELEASE="$COMMON_C_FLAGS_RELEASE "
+#COMMON_CXX_FLAGS_RELEASE="$COMMON_CXX_FLAGS_RELEASE "
# hidden symbol
if [ "$BUILD_HIDDEN_SYMBOL" != "OFF" ]; then
@@ -99,8 +99,8 @@ do_build()
CMAKE_ARGS+=("-DCMAKE_C_FLAGS=$COMMON_C_FLAGS")
CMAKE_ARGS+=("-DCMAKE_CXX_FLAGS=$COMMON_CXX_FLAGS")
- CMAKE_ARGS+=("-DCMAKE_C_FLAGS_RELEASE=$COMMON_C_FLAGS_RELEASE")
- CMAKE_ARGS+=("-DCMAKE_CXX_FLAGS_RELEASE=$COMMON_CXX_FLAGS_RELEASE")
+ #CMAKE_ARGS+=("-DCMAKE_C_FLAGS_RELEASE=$COMMON_C_FLAGS_RELEASE")
+ #CMAKE_ARGS+=("-DCMAKE_CXX_FLAGS_RELEASE=$COMMON_CXX_FLAGS_RELEASE")
# Build and install directory
if [ "$BUILD_BASE_DIR" = "" ]; then
Then the code is compiled with
|
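As an alternative to commenting the lines out, the script could pass the release-flag overrides only when they actually contain something. This is a minimal sketch, not the repository's code: the helper name `cmake_flag_arg` is hypothetical, and it assumes that passing a blank `-DCMAKE_C_FLAGS_RELEASE=` on the command line replaces whatever default the toolchain file would otherwise supply.

```shell
#!/bin/sh
# Emit a -D<name>=<value> CMake argument only when the value is non-blank.
# Assumption: a blank value passed to cmake would override the toolchain default,
# so in that case nothing is printed and the default is left alone.
cmake_flag_arg() {
    name=$1
    value=$2
    # Strip all whitespace so that values such as " " count as blank.
    stripped=$(printf '%s' "$value" | tr -d '[:space:]')
    if [ -n "$stripped" ]; then
        printf '%s\n' "-D${name}=${value}"
    fi
}

# A lone space, which is what the script's assignment yields if the variable started unset.
COMMON_C_FLAGS_RELEASE=" "
cmake_flag_arg CMAKE_C_FLAGS_RELEASE "$COMMON_C_FLAGS_RELEASE"   # prints nothing
cmake_flag_arg CMAKE_C_FLAGS_RELEASE "-O3"                       # prints -DCMAKE_C_FLAGS_RELEASE=-O3
```

In the build script, the printed argument would be appended to `CMAKE_ARGS` only when it is non-empty, which keeps explicit settings such as `-O3` working while leaving the NDK defaults in place otherwise.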
Description
Code using NEON intrinsics runs very slowly when compiled with NDK r23b:
https://github.com/Tencent/ncnn/blob/master/src/mat_pixel_affine.cpp#L1173
warpaffine_bilinear_c4
params:
constexpr int width = 160;
constexpr int height = 160;
constexpr float image_matrix[6] = {
-0.00673565036, 0.146258384, 4.34562492,
-0.146258384, -0.00673565036, 162.753372,
};
With NDK r23b this function costs 8.40 ms.
With NDK r22b this function costs 0.302 ms.
Environment Details