
[BUG] NDK r23b neon intrinsics too slow #1607

Closed

DaydreamCoding opened this issue Nov 17, 2021 · 17 comments

@DaydreamCoding

Description

Neon intrinsics code compiled with NDK r23b runs very slowly:

https://github.com/Tencent/ncnn/blob/master/src/mat_pixel_affine.cpp#L1173

warpaffine_bilinear_c4

params:
constexpr int width = 160;
constexpr int height = 160;
constexpr float image_matrix[6] = {
-0.00673565036, 0.146258384, 4.34562492,
-0.146258384, -0.00673565036, 162.753372,
};

With NDK r23b this function costs 8.40 ms.
With NDK r22b this function costs 0.302 ms.
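
For context, here is a minimal sketch of how these parameters would drive the call, assuming the convenience overload ncnn declares in mat.h (the signature in the comment and the source image size below are assumptions, not taken from this report):

    #include <vector>

    #include "mat.h" // ncnn header assumed to declare warpaffine_bilinear_c4()

    int main()
    {
        constexpr int width = 160;
        constexpr int height = 160;
        constexpr float image_matrix[6] = {
                -0.00673565036f, 0.146258384f,    4.34562492f,
                -0.146258384f,   -0.00673565036f, 162.753372f,
        };

        // Illustrative RGBA source image; a later comment in this thread uses 1080x1440.
        constexpr int src_width = 1080;
        constexpr int src_height = 1440;
        std::vector<unsigned char> src(src_width * src_height * 4);
        std::vector<unsigned char> dst(width * height * 4);

        // Assumed signature:
        // void warpaffine_bilinear_c4(const unsigned char* src, int srcw, int srch,
        //                             unsigned char* dst, int w, int h,
        //                             const float* tm, int type = 0, unsigned int v = 0);
        ncnn::warpaffine_bilinear_c4(src.data(), src_width, src_height,
                                     dst.data(), width, height, image_matrix);
        return 0;
    }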

Environment Details

  • NDK Version: r23b
  • Build system: CMake
  • Host OS: Mac
  • ABI: arm64-v8a
  • NDK API level: android-21
  • Device API level: android-30
@stephenhines
Collaborator

Can you get the preprocessed input source file (and show your compiler flags) for this file/function? You can get the preprocessed file by adding -save-temps to the command line used for this file, and then attaching the *.ii source file, which should contain everything without needing additional headers. Please also let us know the rest of the command line flags you are using, as that also helps with reproducing the problem.

@zchrissirhcz

zchrissirhcz commented Nov 17, 2021

@DaydreamCoding Can you please provide the full code that calls the mentioned method? Some argument values are still unknown, such as srcw, srch, srcstride, w, h, stride, type, etc.

i.e., a minimal reproducible example: https://stackoverflow.com/help/minimal-reproducible-example

@zchrissirhcz

zchrissirhcz commented Nov 17, 2021

Let me give a reproducible example.

Benchmark result

src image: width=1920, height=1080
dst image: width=960, height=540

NDK version     Time cost
ndk-r21b        3.621770 ms
ndk-r22b        3.475886 ms
ndk-r23b        177.813490 ms
ndk-r24-beta1   3.544636 ms

Complete code and building script

https://github.com/zchrissirhcz/test_ncnn_warp
https://github.com/zchrissirhcz/min-repros/tree/master/test_ncnn_warp

DaydreamCoding pushed a commit to DaydreamCoding/neon-intrinsics-test that referenced this issue Nov 18, 2021
@DaydreamCoding
Author

DaydreamCoding commented Nov 18, 2021

@zchrissirhcz
@stephenhines

Neon intrinsics are still too slow with both ndk-r23b and ndk-r24-beta1.

Benchmark result

    constexpr int src_width = 1080;
    constexpr int src_height = 1440;
    constexpr int dst_width = 160;
    constexpr int dst_height = 160;
    constexpr float image_matrix[6] = {
            -0.00673565036, 0.146258384,    4.34562492,
            -0.146258384,   -0.00673565036, 162.753372,
    };

Benchmark on a Qualcomm SM7250 chip:

NDK version     Time cost
ndk-r21e        1.60 ms
ndk-r22b        0.590 ms
ndk-r23b        8.43 ms
ndk-r24-beta1   7.86 ms

Complete code and building script

https://github.com/DaydreamCoding/neon-intrinsics-test

#export ANDROID_NDK=$ANDROID_NDK_R21e
#export ANDROID_NDK=$ANDROID_NDK_R22b
export ANDROID_NDK=$ANDROID_NDK_R23b
#export ANDROID_NDK=$ANDROID_NDK_R24_beta1

./build/android_build.sh && ./build/run_android64.sh

@jfgoog
Collaborator

jfgoog commented Dec 14, 2021

Reproduced this with a Pixel 6 Pro:

NDK 22.1.7171670, clang 11.0.5:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_WarpNCNN_C4/iterations:10       1.32 ms         1.29 ms           10

NDK 23.1.7779620, clang 12.0.8:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_WarpNCNN_C4/iterations:10       10.4 ms         10.3 ms           10

@stephenhines
Collaborator

This is probably the same as Issue #1619 but we can confirm that by trying an updated toolchain with the cherry-pick applied.

@jfgoog
Collaborator

jfgoog commented Dec 14, 2021

Discussed this with stephenhines, Dan, and Pirama. This is not the same issue as #1619.

@jfgoog
Collaborator

jfgoog commented Dec 14, 2021

Ruled out a header issue: Preprocessed the test case with NDK r22, then compiled with r23. The performance is still slow.

@jfgoog
Collaborator

jfgoog commented Dec 14, 2021

Compile commands:

r22:

cd /usr/local/google/home/jamesfarrell/neon-intrinsics-test/build/libsample/build/android/arm64-v8a/test && /usr/local/google/home/jamesfarrell/android_sdk/ndk/22.1.7171670/toolchains/llvm/prebuilt/linux-x86_64/bin/clang++ --target=aarch64-none-linux-android21 --gcc-toolchain=/usr/local/google/home/jamesfarrell/android_sdk/ndk/22.1.7171670/toolchains/llvm/prebuilt/linux-x86_64 --sysroot=/usr/local/google/home/jamesfarrell/android_sdk/ndk/22.1.7171670/toolchains/llvm/prebuilt/linux-x86_64/sysroot -DNDEBUG -isystem /usr/local/google/home/jamesfarrell/neon-intrinsics-test/3rdparty/libbenchmark/android/arm64-v8a/usr/include -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables -fstack-protector-strong -no-canonical-prefixes -D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fvisibility=hidden -fvisibility-inlines-hidden -fno-rtti -fno-exceptions -O2 -DNDEBUG -fPIE -fno-rtti -ffunction-sections -fdata-sections -g -ffast-math -Wall -gline-tables-only -std=c++17 -MD -MT test/CMakeFiles/test.dir/WarpAffine.cpp.o -MF CMakeFiles/test.dir/WarpAffine.cpp.o.d -o CMakeFiles/test.dir/WarpAffine.cpp.o -c /usr/local/google/home/jamesfarrell/neon-intrinsics-test/test/WarpAffine.cpp -save-temps

r23:

cd /usr/local/google/home/jamesfarrell/neon-intrinsics-test/build/libsample/build/android/arm64-v8a/test && /usr/local/google/home/jamesfarrell/android_sdk/ndk/23.1.7779620/toolchains/llvm/prebuilt/linux-x86_64/bin/clang++ --target=aarch64-none-linux-android21 --sysroot=/usr/local/google/home/jamesfarrell/android_sdk/ndk/23.1.7779620/toolchains/llvm/prebuilt/linux-x86_64/sysroot -DNDEBUG -isystem /usr/local/google/home/jamesfarrell/neon-intrinsics-test/3rdparty/libbenchmark/android/arm64-v8a/usr/include -fvisibility=hidden -fvisibility-inlines-hidden -fno-rtti -fno-exceptions -fPIE -fno-rtti -ffunction-sections -fdata-sections -g -ffast-math -Wall -gline-tables-only -std=c++17 -MD -MT test/CMakeFiles/test.dir/WarpAffine.cpp.o -MF CMakeFiles/test.dir/WarpAffine.cpp.o.d -o CMakeFiles/test.dir/WarpAffine.cpp.o -c /usr/local/google/home/jamesfarrell/neon-intrinsics-test/test/WarpAffine.cpp -save-temps

@jfgoog
Collaborator

jfgoog commented Dec 14, 2021

@jfgoog
Collaborator

jfgoog commented Dec 15, 2021

Using the r22 version of arm_neon.h with the r23 NDK does not change the performance.

@jfgoog
Collaborator

jfgoog commented Dec 15, 2021

The performance regression still exists at HEAD of the NDK, which reports clang 14.0.0.

@stephenhines
Collaborator

> The performance regression still exists at HEAD of the NDK, which reports clang 14.0.0.

FYI, Android toolchain version numbers don't exactly match upstream LLVM version numbers. In particular, the version number in ToT becomes 14.0.0 once the 13.x release branches, so we label our toolchain as 14.0.0. When Clang 14.0.0 releases, it will likely be further along than our 14.0.0. This is why we also keep track of the SHA in our version output.

@Over17

Over17 commented Dec 16, 2021

@jfgoog I can't see a clear asm regression in the trunk output that would explain an 8x slowdown; the instructions are reordered, but that's probably fine. Could you run a sampling profiler and see if the hot instruction paths differ? Streamline from Arm Mobile Studio worked for me.

@jfgoog
Collaborator

jfgoog commented Dec 16, 2021

Yury, did you look at the Godbolt link or look at the .s files I attached? If Godbolt, I recommend taking a look at the .s files, as they are the output of the actual NDK.

I am new to ARM asm, and compilers in general, but it looked to me like we are emitting way more asm for r23 than r22: 9k lines vs 3k lines, according to wc -l.

And, when I look at, for example, the asm associated with int32x4_t _Xl = vaddq_s32(vdupq_n_s32(X0), vld1q_s32(adelta.data() + x)); (the line after the first #if __ARM_NEON, which is line 103 in my modified source), for r22 I see:

	.loc	1 103 114 is_stmt 1             // test/WarpAffine.cpp:103:114
	ldr	q18, [x17]
--
	.loc	1 103 33                        // test/WarpAffine.cpp:103:33
	add	v20.4s, v18.4s, v16.4s

Whereas for r23 I get:

        .loc    1 103 55 is_stmt 1              // test/WarpAffine.cpp:103:55
        ldr     w9, [sp, #4052]
        .loc    1 103 43 is_stmt 0              // test/WarpAffine.cpp:103:43
        str     w9, [sp, #4284]
        add     x9, sp, #1, lsl #12             // =4096
        add     x9, x9, #188                    // =188
        ld1r    { v0.4s }, [x9]
        str     q0, [x8, #3696]
        ldr     q0, [x8, #3696]
        str     q0, [x8, #3712]
        ldr     q0, [x8, #3712]
        str     q0, [sp, #96]                   // 16-byte Folded Spill
        add     x0, sp, #1, lsl #12             // =4096
        add     x0, x0, #16                     // =16
        str     x0, [sp, #120]                  // 8-byte Folded Spill
        .loc    1 103 144                       // test/WarpAffine.cpp:103:144
        bl      _ZNSt6__ndk16vectorIiNS_9allocatorIiEEE4dataEv
        ldr     q1, [sp, #96]                   // 16-byte Folded Reload
        ldr     x8, [sp, #320]                  // 8-byte Folded Reload
        mov     x9, x0
        ldr     x0, [sp, #120]                  // 8-byte Folded Reload
        .loc    1 103 153                       // test/WarpAffine.cpp:103:153
        ldrsw   x10, [sp, #4020]
        .loc    1 103 151                       // test/WarpAffine.cpp:103:151
        lsl     x10, x10, #2
        .loc    1 103 114                       // test/WarpAffine.cpp:103:114
        ldr     q0, [x9, x10]
        .loc    1 103 100                       // test/WarpAffine.cpp:103:100
        str     q0, [x8, #3360]
        .loc    1 103 161                       // test/WarpAffine.cpp:103:161
        ldr     q0, [x8, #3360]
        .loc    1 103 94                        // test/WarpAffine.cpp:103:94
        str     q0, [x8, #3344]
        .loc    1 103 168                       // test/WarpAffine.cpp:103:168
        ldr     q0, [x8, #3344]
        .loc    1 103 33                        // test/WarpAffine.cpp:103:33
        str     q1, [x8, #3776]
        str     q0, [x8, #3760]
        ldr     q0, [x8, #3776]
        ldr     q1, [x8, #3760]
        add     v0.4s, v0.4s, v1.4s
        str     q0, [x8, #3744]
        ldr     q0, [x8, #3744]
        .loc    1 103 27                        // test/WarpAffine.cpp:103:27
        str     q0, [x8, #3376]

The branch to the allocator function is odd. In r22, we never do that. But in r23:

$ grep allocator WarpAffine-r23.s.txt | grep bl | wc -l
75
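
For reference, the quoted statement can be isolated roughly like this (a sketch; X0, adelta, and x are the names from the quoted source line, and the wrapper function is hypothetical):

    #include <arm_neon.h>

    #include <vector>

    // In the r22 output the adelta.data() call is inlined away and this statement
    // lowers to a single ldr + add; in the r23 output every temporary is spilled to
    // the stack and vector<int>::data() is emitted as a real call, consistent with
    // the 75 `bl` instructions found by the grep above.
    int32x4_t load_xl(const std::vector<int>& adelta, int X0, int x)
    {
        return vaddq_s32(vdupq_n_s32(X0), vld1q_s32(adelta.data() + x));
    }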

@zchrissirhcz

@jfgoog
The demangled result of _ZNSt6__ndk16vectorIiNS_9allocatorIiEEE4dataEv is std::__ndk1::vector<int, std::__ndk1::allocator<int> >::data(), corresponding to adelta.data().

We may just replace the std::vector<int> with a plain C/C++ array:

    // std::vector<int> adelta(w);
    // std::vector<int> bdelta(w);
    int adelta[w];
    int bdelta[w];

And replace adelta.data() with adelta, bdelta.data() with bdelta.

But on my Android device (Xiaomi Mi 11) it still shows the same poor performance with Android NDK r23b, same as the vector version.
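
For clarity, the replacement mentioned above could also use heap storage instead of a variable-length array (a sketch; std::unique_ptr is just one option):

    #include <memory>

    // std::vector<int> adelta(w);
    // std::vector<int> bdelta(w);
    std::unique_ptr<int[]> adelta(new int[w]());
    std::unique_ptr<int[]> bdelta(new int[w]());

    // ... and pass raw pointers to the intrinsics instead of .data():
    // int32x4_t _Xl = vaddq_s32(vdupq_n_s32(X0), vld1q_s32(adelta.get() + x));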

@jfgoog
Collaborator

jfgoog commented May 17, 2022

Well, I'm an idiot for missing this earlier.

This is a problem with build configuration. If you look at the compile commands up above, NDK 22 has -O2, but NDK 23 does not.

When I add -O2 to NDK 23, the performance difference goes away:

ANDROID_NDK=~/Android/Sdk/ndk/22.1.7171670/ ./android_build.sh && ./run_android64.sh
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_WarpNCNN_C4/iterations:10       1.25 ms         1.22 ms           10

ANDROID_NDK=~/Android/Sdk/ndk/23.1.7779620/ ./android_build.sh && ./run_android64.sh
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_WarpNCNN_C4/iterations:10       9.79 ms         9.64 ms           10

COMMON_CXX_FLAGS="-O2" ANDROID_NDK=~/Android/Sdk/ndk/23.1.7779620/ ./android_build.sh && ./run_android64.sh
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_WarpNCNN_C4/iterations:10       1.25 ms         1.21 ms           10

The build script, https://github.com/DaydreamCoding/neon-intrinsics-test/blob/develop/build/android_build.sh, is apparently making some assumptions about the CMake configuration that changed between NDK 22 and 23.

I tried commenting out the following lines:

diff --git a/build/android_build.sh b/build/android_build.sh
index 6c650af..1a58d9e 100755
--- a/build/android_build.sh
+++ b/build/android_build.sh
@@ -50,8 +50,8 @@ COMMON_C_FLAGS="$COMMON_C_FLAGS "
 COMMON_CXX_FLAGS="$COMMON_CXX_FLAGS "

 # Common FLAGS_RELEASE; options such as -Ofast, -O3, or -O2 can be added as the project requires; release defaults to -Os
-COMMON_C_FLAGS_RELEASE="$COMMON_C_FLAGS_RELEASE "
-COMMON_CXX_FLAGS_RELEASE="$COMMON_CXX_FLAGS_RELEASE "
+#COMMON_C_FLAGS_RELEASE="$COMMON_C_FLAGS_RELEASE "
+#COMMON_CXX_FLAGS_RELEASE="$COMMON_CXX_FLAGS_RELEASE "

 # hidden symbol
 if [ "$BUILD_HIDDEN_SYMBOL" != "OFF" ]; then
@@ -99,8 +99,8 @@ do_build()

     CMAKE_ARGS+=("-DCMAKE_C_FLAGS=$COMMON_C_FLAGS")
     CMAKE_ARGS+=("-DCMAKE_CXX_FLAGS=$COMMON_CXX_FLAGS")
-    CMAKE_ARGS+=("-DCMAKE_C_FLAGS_RELEASE=$COMMON_C_FLAGS_RELEASE")
-    CMAKE_ARGS+=("-DCMAKE_CXX_FLAGS_RELEASE=$COMMON_CXX_FLAGS_RELEASE")
+    #CMAKE_ARGS+=("-DCMAKE_C_FLAGS_RELEASE=$COMMON_C_FLAGS_RELEASE")
+    #CMAKE_ARGS+=("-DCMAKE_CXX_FLAGS_RELEASE=$COMMON_CXX_FLAGS_RELEASE")

 # Build and install directory
     if [ "$BUILD_BASE_DIR" = "" ]; then

Then the code is compiled with -O3 and the performance is good:

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_WarpNCNN_C4/iterations:10      0.997 ms        0.990 ms           10

jfgoog closed this as completed on May 17, 2022.