
[BUG] NDK r23b neon intrinsics too slow #1607

Closed

DaydreamCoding opened this issue Nov 17, 2021 · 17 comments

@DaydreamCoding

Description

Neon intrinsics code compiled with NDK r23b runs very slowly:

https://github.com/Tencent/ncnn/blob/master/src/mat_pixel_affine.cpp#L1173

warpaffine_bilinear_c4

params:
constexpr int width = 160;
constexpr int height = 160;
constexpr float image_matrix[6] = {
-0.00673565036, 0.146258384, 4.34562492,
-0.146258384, -0.00673565036, 162.753372,
};

With NDK r23b this function costs 8.40 ms.
With NDK r22b this function costs 0.302 ms.
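
For context, here is a minimal sketch of how these parameters would drive the call, assuming the convenience overload ncnn declares in mat.h (the signature in the comment and the source image size below are assumptions, not taken from this report):

    #include <vector>

    #include "mat.h" // ncnn header assumed to declare warpaffine_bilinear_c4()

    int main()
    {
        constexpr int width = 160;
        constexpr int height = 160;
        constexpr float image_matrix[6] = {
                -0.00673565036f, 0.146258384f,    4.34562492f,
                -0.146258384f,   -0.00673565036f, 162.753372f,
        };

        // Illustrative RGBA source image; a later comment in this thread uses 1080x1440.
        constexpr int src_width = 1080;
        constexpr int src_height = 1440;
        std::vector<unsigned char> src(src_width * src_height * 4);
        std::vector<unsigned char> dst(width * height * 4);

        // Assumed signature:
        // void warpaffine_bilinear_c4(const unsigned char* src, int srcw, int srch,
        //                             unsigned char* dst, int w, int h,
        //                             const float* tm, int type = 0, unsigned int v = 0);
        ncnn::warpaffine_bilinear_c4(src.data(), src_width, src_height,
                                     dst.data(), width, height, image_matrix);
        return 0;
    }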

Environment Details

  • NDK Version: r23b
  • Build system: CMake
  • Host OS: Mac
  • ABI: arm64-v8a
  • NDK API level: android-21
  • Device API level: android-30
@stephenhines
Collaborator

Can you get the preprocessed input source file (and show your compiler flags) for this file/function? You can get the preprocessed file by adding -save-temps to the command line used for this file, and then attaching the *.ii source file, which should contain everything without needing additional headers. Please also let us know the rest of the command line flags you are using, as that also helps with reproducing the problem.

@zchrissirhcz

zchrissirhcz commented Nov 17, 2021

@DaydreamCoding Can you please provide the full code that calls the mentioned method? Some argument values are still unknown, such as srcw, srch, srcstride, w, h, stride, type, etc.

i.e., a minimal reproducible example: https://stackoverflow.com/help/minimal-reproducible-example

@zchrissirhcz

zchrissirhcz commented Nov 17, 2021

Let me give a reproducible example.

Benchmark result

src image: width=1920, height=1080
dst image: width=960, height=540

NDK version     Time cost
ndk-r21b        3.621770 ms
ndk-r22b        3.475886 ms
ndk-r23b        177.813490 ms
ndk-r24-beta1   3.544636 ms

Complete code and building script

https://github.com/zchrissirhcz/test_ncnn_warp
https://github.com/zchrissirhcz/min-repros/tree/master/test_ncnn_warp

DaydreamCoding pushed a commit to DaydreamCoding/neon-intrinsics-test that referenced this issue Nov 18, 2021
@DaydreamCoding
Author

DaydreamCoding commented Nov 18, 2021

@zchrissirhcz
@stephenhines

Neon intrinsics are still too slow with both ndk-r23b and ndk-r24-beta1.

Benchmark result

    constexpr int src_width = 1080;
    constexpr int src_height = 1440;
    constexpr int dst_width = 160;
    constexpr int dst_height = 160;
    constexpr float image_matrix[6] = {
            -0.00673565036, 0.146258384,    4.34562492,
            -0.146258384,   -0.00673565036, 162.753372,
    };

Benchmark on a Qualcomm SM7250 chip:

NDK version     Time cost
ndk-r21e        1.60 ms
ndk-r22b        0.590 ms
ndk-r23b        8.43 ms
ndk-r24-beta1   7.86 ms

Complete code and building script

https://github.com/DaydreamCoding/neon-intrinsics-test

#export ANDROID_NDK=$ANDROID_NDK_R21e
#export ANDROID_NDK=$ANDROID_NDK_R22b
export ANDROID_NDK=$ANDROID_NDK_R23b
#export ANDROID_NDK=$ANDROID_NDK_R24_beta1

./build/android_build.sh && ./build/run_android64.sh

@jfgoog
Collaborator

jfgoog commented Dec 14, 2021

Reproduced this with a Pixel 6 Pro:

NDK 22.1.7171670, clang 11.0.5:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_WarpNCNN_C4/iterations:10       1.32 ms         1.29 ms           10

NDK 23.1.7779620, clang 12.0.8:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_WarpNCNN_C4/iterations:10       10.4 ms         10.3 ms           10

@stephenhines
Collaborator

This is probably the same as Issue #1619 but we can confirm that by trying an updated toolchain with the cherry-pick applied.

@jfgoog
Collaborator

jfgoog commented Dec 14, 2021

Discussed this with stephenhines, Dan, and Pirama. This is not the same issue as #1619.

@jfgoog
Collaborator

jfgoog commented Dec 14, 2021

Ruled out a header issue: Preprocessed the test case with NDK r22, then compiled with r23. The performance is still slow.

@jfgoog
Collaborator

jfgoog commented Dec 14, 2021

Compile commands:

r22:

cd /usr/local/google/home/jamesfarrell/neon-intrinsics-test/build/libsample/build/android/arm64-v8a/test && /usr/local/google/home/jamesfarrell/android_sdk/ndk/22.1.7171670/toolchains/llvm/prebuilt/linux-x86_64/bin/clang++ --target=aarch64-none-linux-android21 --gcc-toolchain=/usr/local/google/home/jamesfarrell/android_sdk/ndk/22.1.7171670/toolchains/llvm/prebuilt/linux-x86_64 --sysroot=/usr/local/google/home/jamesfarrell/android_sdk/ndk/22.1.7171670/toolchains/llvm/prebuilt/linux-x86_64/sysroot -DNDEBUG -isystem /usr/local/google/home/jamesfarrell/neon-intrinsics-test/3rdparty/libbenchmark/android/arm64-v8a/usr/include -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables -fstack-protector-strong -no-canonical-prefixes -D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fvisibility=hidden -fvisibility-inlines-hidden -fno-rtti -fno-exceptions -O2 -DNDEBUG -fPIE -fno-rtti -ffunction-sections -fdata-sections -g -ffast-math -Wall -gline-tables-only -std=c++17 -MD -MT test/CMakeFiles/test.dir/WarpAffine.cpp.o -MF CMakeFiles/test.dir/WarpAffine.cpp.o.d -o CMakeFiles/test.dir/WarpAffine.cpp.o -c /usr/local/google/home/jamesfarrell/neon-intrinsics-test/test/WarpAffine.cpp -save-temps

r23:

cd /usr/local/google/home/jamesfarrell/neon-intrinsics-test/build/libsample/build/android/arm64-v8a/test && /usr/local/google/home/jamesfarrell/android_sdk/ndk/23.1.7779620/toolchains/llvm/prebuilt/linux-x86_64/bin/clang++ --target=aarch64-none-linux-android21 --sysroot=/usr/local/google/home/jamesfarrell/android_sdk/ndk/23.1.7779620/toolchains/llvm/prebuilt/linux-x86_64/sysroot -DNDEBUG -isystem /usr/local/google/home/jamesfarrell/neon-intrinsics-test/3rdparty/libbenchmark/android/arm64-v8a/usr/include -fvisibility=hidden -fvisibility-inlines-hidden -fno-rtti -fno-exceptions -fPIE -fno-rtti -ffunction-sections -fdata-sections -g -ffast-math -Wall -gline-tables-only -std=c++17 -MD -MT test/CMakeFiles/test.dir/WarpAffine.cpp.o -MF CMakeFiles/test.dir/WarpAffine.cpp.o.d -o CMakeFiles/test.dir/WarpAffine.cpp.o -c /usr/local/google/home/jamesfarrell/neon-intrinsics-test/test/WarpAffine.cpp -save-temps

@jfgoog
Collaborator

jfgoog commented Dec 14, 2021

@jfgoog
Collaborator

jfgoog commented Dec 15, 2021

Using the r22 version of arm_neon.h with the r23 NDK does not change the performance.

@jfgoog
Collaborator

jfgoog commented Dec 15, 2021

The performance regression still exists at HEAD of the NDK, which reports clang 14.0.0.

@stephenhines
Collaborator

> The performance regression still exists at HEAD of the NDK, which reports clang 14.0.0.

FYI, Android toolchain version numbers don't exactly match upstream LLVM version numbers. In particular, the version number in ToT becomes 14.0.0 once the 13.x release branches, so we label our toolchain as 14.0.0. When Clang 14.0.0 releases, it will likely be further along than our 14.0.0. This is why we also keep track of the SHA in our version output.

@Over17

Over17 commented Dec 16, 2021

@jfgoog I can't see a clear asm regression in the trunk output that would explain an 8x slowdown; the instructions are reordered, but that's probably fine. Could you run a sampling profiler and see if the hot instruction paths differ? Streamline from Arm Mobile Studio worked for me.

@jfgoog
Collaborator

jfgoog commented Dec 16, 2021

Yury, did you look at the Godbolt link or look at the .s files I attached? If Godbolt, I recommend taking a look at the .s files, as they are the output of the actual NDK.

I am new to ARM asm, and compilers in general, but it looked to me like we are emitting way more asm for r23 than r22: 9k lines vs 3k lines, according to wc -l.

And, when I look at, for example, the asm associated with int32x4_t _Xl = vaddq_s32(vdupq_n_s32(X0), vld1q_s32(adelta.data() + x)); (the line after the first #if __ARM_NEON, which is line 103 in my modified source), for r22 I see:

	.loc	1 103 114 is_stmt 1             // test/WarpAffine.cpp:103:114
	ldr	q18, [x17]
--
	.loc	1 103 33                        // test/WarpAffine.cpp:103:33
	add	v20.4s, v18.4s, v16.4s

Whereas for r23 I get:

        .loc    1 103 55 is_stmt 1              // test/WarpAffine.cpp:103:55
        ldr     w9, [sp, #4052]
        .loc    1 103 43 is_stmt 0              // test/WarpAffine.cpp:103:43
        str     w9, [sp, #4284]
        add     x9, sp, #1, lsl #12             // =4096
        add     x9, x9, #188                    // =188
        ld1r    { v0.4s }, [x9]
        str     q0, [x8, #3696]
        ldr     q0, [x8, #3696]
        str     q0, [x8, #3712]
        ldr     q0, [x8, #3712]
        str     q0, [sp, #96]                   // 16-byte Folded Spill
        add     x0, sp, #1, lsl #12             // =4096
        add     x0, x0, #16                     // =16
        str     x0, [sp, #120]                  // 8-byte Folded Spill
        .loc    1 103 144                       // test/WarpAffine.cpp:103:144
        bl      _ZNSt6__ndk16vectorIiNS_9allocatorIiEEE4dataEv
        ldr     q1, [sp, #96]                   // 16-byte Folded Reload
        ldr     x8, [sp, #320]                  // 8-byte Folded Reload
        mov     x9, x0
        ldr     x0, [sp, #120]                  // 8-byte Folded Reload
        .loc    1 103 153                       // test/WarpAffine.cpp:103:153
        ldrsw   x10, [sp, #4020]
        .loc    1 103 151                       // test/WarpAffine.cpp:103:151
        lsl     x10, x10, #2
        .loc    1 103 114                       // test/WarpAffine.cpp:103:114
        ldr     q0, [x9, x10]
        .loc    1 103 100                       // test/WarpAffine.cpp:103:100
        str     q0, [x8, #3360]
        .loc    1 103 161                       // test/WarpAffine.cpp:103:161
        ldr     q0, [x8, #3360]
        .loc    1 103 94                        // test/WarpAffine.cpp:103:94
        str     q0, [x8, #3344]
        .loc    1 103 168                       // test/WarpAffine.cpp:103:168
        ldr     q0, [x8, #3344]
        .loc    1 103 33                        // test/WarpAffine.cpp:103:33
        str     q1, [x8, #3776]
        str     q0, [x8, #3760]
        ldr     q0, [x8, #3776]
        ldr     q1, [x8, #3760]
        add     v0.4s, v0.4s, v1.4s
        str     q0, [x8, #3744]
        ldr     q0, [x8, #3744]
        .loc    1 103 27                        // test/WarpAffine.cpp:103:27
        str     q0, [x8, #3376]

The branch to the allocator function is odd. In r22, we never do that. But in r23:

$ grep allocator WarpAffine-r23.s.txt | grep bl | wc -l
75
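
For reference, the quoted statement can be isolated roughly like this (a sketch; X0, adelta, and x are the names from the quoted source line, and the wrapper function is hypothetical):

    #include <arm_neon.h>

    #include <vector>

    // In the r22 output the adelta.data() call is inlined away and this statement
    // lowers to a single ldr + add; in the r23 output every temporary is spilled to
    // the stack and vector<int>::data() is emitted as a real call, consistent with
    // the 75 `bl` instructions found by the grep above.
    int32x4_t load_xl(const std::vector<int>& adelta, int X0, int x)
    {
        return vaddq_s32(vdupq_n_s32(X0), vld1q_s32(adelta.data() + x));
    }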

@zchrissirhcz

@jfgoog
The demangled result of _ZNSt6__ndk16vectorIiNS_9allocatorIiEEE4dataEv is std::__ndk1::vector<int, std::__ndk1::allocator<int> >::data(), corresponding to adelta.data().

We may just replace the std::vector<int> with a plain C/C++ array:

    // std::vector<int> adelta(w);
    // std::vector<int> bdelta(w);
    int adelta[w];
    int bdelta[w];

And replace adelta.data() with adelta, bdelta.data() with bdelta.

But on my Android device (Xiaomi Mi 11) it still shows the same poor performance with Android NDK r23b, same as the vector version.
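
For clarity, the replacement mentioned above could also use heap storage instead of a variable-length array (a sketch; std::unique_ptr is just one option):

    #include <memory>

    // std::vector<int> adelta(w);
    // std::vector<int> bdelta(w);
    std::unique_ptr<int[]> adelta(new int[w]());
    std::unique_ptr<int[]> bdelta(new int[w]());

    // ... and pass raw pointers to the intrinsics instead of .data():
    // int32x4_t _Xl = vaddq_s32(vdupq_n_s32(X0), vld1q_s32(adelta.get() + x));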

@jfgoog
Collaborator

jfgoog commented May 17, 2022

Well, I'm an idiot for missing this earlier.

This is a problem with build configuration. If you look at the compile commands up above, NDK 22 has -O2, but NDK 23 does not.

When I add -O2 to NDK 23, the performance difference goes away:

ANDROID_NDK=~/Android/Sdk/ndk/22.1.7171670/ ./android_build.sh && ./run_android64.sh
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_WarpNCNN_C4/iterations:10       1.25 ms         1.22 ms           10

ANDROID_NDK=~/Android/Sdk/ndk/23.1.7779620/ ./android_build.sh && ./run_android64.sh
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_WarpNCNN_C4/iterations:10       9.79 ms         9.64 ms           10

COMMON_CXX_FLAGS="-O2" ANDROID_NDK=~/Android/Sdk/ndk/23.1.7779620/ ./android_build.sh && ./run_android64.sh
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_WarpNCNN_C4/iterations:10       1.25 ms         1.21 ms           10

The build script, https://github.com/DaydreamCoding/neon-intrinsics-test/blob/develop/build/android_build.sh, is apparently making some assumptions about the CMake configuration that changed between NDK 22 and 23.

I tried commenting out the following lines:

diff --git a/build/android_build.sh b/build/android_build.sh
index 6c650af..1a58d9e 100755
--- a/build/android_build.sh
+++ b/build/android_build.sh
@@ -50,8 +50,8 @@ COMMON_C_FLAGS="$COMMON_C_FLAGS "
 COMMON_CXX_FLAGS="$COMMON_CXX_FLAGS "

 # Common FLAGS_RELEASE; options such as -Ofast, -O3, or -O2 can be added as the project requires; release defaults to -Os
-COMMON_C_FLAGS_RELEASE="$COMMON_C_FLAGS_RELEASE "
-COMMON_CXX_FLAGS_RELEASE="$COMMON_CXX_FLAGS_RELEASE "
+#COMMON_C_FLAGS_RELEASE="$COMMON_C_FLAGS_RELEASE "
+#COMMON_CXX_FLAGS_RELEASE="$COMMON_CXX_FLAGS_RELEASE "

 # hidden symbol
 if [ "$BUILD_HIDDEN_SYMBOL" != "OFF" ]; then
@@ -99,8 +99,8 @@ do_build()

     CMAKE_ARGS+=("-DCMAKE_C_FLAGS=$COMMON_C_FLAGS")
     CMAKE_ARGS+=("-DCMAKE_CXX_FLAGS=$COMMON_CXX_FLAGS")
-    CMAKE_ARGS+=("-DCMAKE_C_FLAGS_RELEASE=$COMMON_C_FLAGS_RELEASE")
-    CMAKE_ARGS+=("-DCMAKE_CXX_FLAGS_RELEASE=$COMMON_CXX_FLAGS_RELEASE")
+    #CMAKE_ARGS+=("-DCMAKE_C_FLAGS_RELEASE=$COMMON_C_FLAGS_RELEASE")
+    #CMAKE_ARGS+=("-DCMAKE_CXX_FLAGS_RELEASE=$COMMON_CXX_FLAGS_RELEASE")

 # Build and install directory
     if [ "$BUILD_BASE_DIR" = "" ]; then

Then the code is compiled with -O3 and the performance is good:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_WarpNCNN_C4/iterations:10      0.997 ms        0.990 ms           10

jfgoog closed this as completed on May 17, 2022.