Why is the GPU slower than the CPU for small networks on Huawei Kirin chips, but faster than the CPU on Snapdragon chips? #1230

MHGL opened this issue Aug 11, 2021 · 4 comments

MHGL commented Aug 11, 2021

1. Environment

  • Build OS and Version: Ubuntu 20.04
  • RunTime OS Version: Android 8.0
  • RunTime DEVICE: ARM

2. GitHub version

  • branch: master
  • commit (optional): 8c4178d

3. Compile method

4. Kirin chip (HUAWEI P9, HiSilicon Kirin 955)

  • shufflenet_v2.tnnproto (official TNN model)
./test/libTNNBenchmarkTest.so: 1 file pushed, 0 skipped. 1326.2 MB/s (230616 bytes in 0.000s)
./libTNN.so: 1 file pushed, 0 skipped. 31.6 MB/s (7525920 bytes in 0.227s)
test/TNNTest: 1 file pushed, 0 skipped. 360.1 MB/s (192504 bytes in 0.001s)
/home/liyang/GitHub/TNN/benchmark/benchmark_android/../benchmark-model/: 18 files pushed, 0 skipped. 3.6 MB/s (283496 bytes in 0.074s)
/data/local/tmp/tnn-benchmark/benchmark_models_result.txt: 1 file pulled, 0 skipped. 10.3 MB/s (52083 bytes in 0.005s)
EVA-AL10

benchmark device: ARM 

                        Summary
--------------------------------------------------------
|        Op Type | Total Kernel Time(ms) | Percent (%) |
--------------------------------------------------------
|    Convolution |                 7.473 |      86.476 |
|   StridedSlice |                 0.451 |       5.215 |
|        Pooling |                 0.262 |       3.034 |
|   BatchNormCxx |                 0.194 |       2.244 |
| ShuffleChannel |                 0.134 |       1.546 |
|         Concat |                 0.128 |       1.486 |
--------------------------------------------------------
kernel runtime total: 8.64113 ms

I/tnn: void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] shufflenet_v2.tnnproto - ARM                  TNN Benchmark time cost: min =  9.404   ms  |  max =  9.596   ms  |  avg =  9.474   ms 

benchmark device: OPENCL 

I/tnn: tnn::Status tnn::OpenCLRuntime::Init() [File /home/liyang/GitHub/TNN/source/tnn/device/opencl/opencl_runtime.cc][Line 120] OpenCL version: CL_TARGET_OPENCL_VERSION 200   CL_HPP_TARGET_OPENCL_VERSION 110   CL_HPP_MINIMUM_OPENCL_VERSION 110
I/tnn: tnn::Status tnn::OpenCLRuntime::Init() [File /home/liyang/GitHub/TNN/source/tnn/device/opencl/opencl_runtime.cc][Line 155] Create common opencl context

                        Summary
--------------------------------------------------------
|        Op Type | Total Kernel Time(ms) | Percent (%) |
--------------------------------------------------------
|       Conv_1x1 |                 5.047 |      49.808 |
|       Conv_3x3 |                 1.331 |      13.131 |
| Conv_Depthwise |                 1.110 |      10.951 |
| ShuffleChannel |                 0.956 |       9.437 |
|         Concat |                 0.680 |       6.705 |
|    StrideSlice |                 0.556 |       5.491 |
|        Pooling |                 0.320 |       3.157 |
|      BatchNorm |                 0.134 |       1.318 |
--------------------------------------------------------
kernel runtime total: 10.1338 ms

I/tnn: void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] shufflenet_v2.tnnproto - OPENCL               TNN Benchmark time cost: min = 32.275   ms  |  max = 47.919   ms  |  avg = 38.150   ms 
  • ESPNetV2 (custom model)
./test/libTNNBenchmarkTest.so: 1 file pushed, 0 skipped. 776.0 MB/s (230616 bytes in 0.000s)
./libTNN.so: 1 file pushed, 0 skipped. 22.5 MB/s (7525920 bytes in 0.320s)
test/TNNTest: 1 file pushed, 0 skipped. 357.9 MB/s (192504 bytes in 0.001s)
/home/liyang/GitHub/TNN/benchmark/benchmark_android/../benchmark-model/: 18 files pushed, 0 skipped. 4.4 MB/s (283496 bytes in 0.062s)
/data/local/tmp/tnn-benchmark/benchmark_models_result.txt: 1 file pulled, 0 skipped. 10.8 MB/s (123124 bytes in 0.011s)
EVA-AL10

benchmark device: ARM 
 Summary
------------------------------------------------------
|      Op Type | Total Kernel Time(ms) | Percent (%) |
------------------------------------------------------
|  Convolution |               171.766 |      75.569 |
|        PReLU |                 9.712 |       4.273 |
|     Upsample |                 8.760 |       3.854 |
|       Concat |                 8.670 |       3.814 |
|          Add |                 8.004 |       3.522 |
|      Pooling |                 6.693 |       2.945 |
|          Pad |                 4.063 |       1.788 |
| BatchNormCxx |                 3.819 |       1.680 |
| SoftmaxCaffe |                 3.749 |       1.649 |
|       SplitV |                 2.059 |       0.906 |
------------------------------------------------------
kernel runtime total: 227.296 ms

I/tnn: void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] portrait.tnnproto - ARM                       TNN Benchmark time cost: min = 227.797  ms  |  max = 232.323  ms  |  avg = 230.944  ms 

benchmark device: OPENCL 

I/tnn: tnn::Status tnn::OpenCLRuntime::Init() [File /home/liyang/GitHub/TNN/source/tnn/device/opencl/opencl_runtime.cc][Line 120] OpenCL version: CL_TARGET_OPENCL_VERSION 200   CL_HPP_TARGET_OPENCL_VERSION 110   CL_HPP_MINIMUM_OPENCL_VERSION 110
I/tnn: tnn::Status tnn::OpenCLRuntime::Init() [File /home/liyang/GitHub/TNN/source/tnn/device/opencl/opencl_runtime.cc][Line 155] Create common opencl context
Summary
--------------------------------------------------------
|        Op Type | Total Kernel Time(ms) | Percent (%) |
--------------------------------------------------------
|       Conv_1x1 |               147.789 |      69.348 |
| Conv_Depthwise |                11.484 |       5.389 |
|         Concat |                 9.341 |       4.383 |
|            Pad |                 8.492 |       3.985 |
|      BatchNorm |                 8.187 |       3.842 |
|          PRelu |                 7.954 |       3.732 |
|            Add |                 7.828 |       3.673 |
|        Pooling |                 3.697 |       1.735 |
|       Conv_3x3 |                 3.306 |       1.551 |
|       Upsample |                 3.086 |       1.448 |
|         SplitV |                 1.397 |       0.655 |
|        SoftMax |                 0.552 |       0.259 |
--------------------------------------------------------
kernel runtime total: 213.112 ms

I/tnn: void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] portrait.tnnproto - OPENCL                    TNN Benchmark time cost: min = 288.641  ms  |  max = 298.735  ms  |  avg = 293.257  ms 

5. Snapdragon chip (Xiaomi Mix 2S, Snapdragon 845)

  • ESPNetV2 (custom model)
./test/libTNNBenchmarkTest.so: 1 file pushed, 0 skipped. 597.1 MB/s (230616 bytes in 0.000s)
./libTNN.so: 1 file pushed, 0 skipped. 61.0 MB/s (7525920 bytes in 0.118s)
test/TNNTest: 1 file pushed, 0 skipped. 342.2 MB/s (192504 bytes in 0.001s)
/home/liyang/GitHub/TNN/benchmark/benchmark_android/../benchmark-model/: 18 files pushed, 0 skipped. 11.6 MB/s (283496 bytes in 0.023s)
E/tnn: tnn::Status tnn::OpenCLRuntime::Init() [File /home/liyang/GitHub/TNN/source/tnn/device/opencl/opencl_runtime.cc][Line 187] load program cache skipped, ret: 40966, msg: code: 0xA006 msg: open program cache file failed, input path: /data/local/tmp//d1_tnn_ocl_fd8c6f613ff9c0d503dbc462bf21353f_abc87b1bd5bec928c91c17fc45884487
/data/local/tmp/tnn-benchmark/benchmark_models_result.txt: 1 file pulled, 0 skipped. 24.3 MB/s (122071 bytes in 0.005s)
MIX 2S

benchmark device: ARM 
  Summary
------------------------------------------------------
|      Op Type | Total Kernel Time(ms) | Percent (%) |
------------------------------------------------------
|  Convolution |               149.520 |      82.845 |
|     Upsample |                 4.961 |       2.749 |
|      Pooling |                 4.927 |       2.730 |
|        PReLU |                 4.814 |       2.667 |
|          Add |                 4.553 |       2.523 |
|       Concat |                 3.513 |       1.947 |
| SoftmaxCaffe |                 2.735 |       1.515 |
|          Pad |                 2.203 |       1.221 |
| BatchNormCxx |                 2.183 |       1.210 |
|       SplitV |                 1.072 |       0.594 |
------------------------------------------------------
kernel runtime total: 180.481 ms

I/tnn: void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] portrait.tnnproto - ARM                       TNN Benchmark time cost: min = 180.755  ms  |  max = 185.242  ms  |  avg = 183.615  ms 

benchmark device: OPENCL 

I/tnn: tnn::Status tnn::OpenCLRuntime::Init() [File /home/liyang/GitHub/TNN/source/tnn/device/opencl/opencl_runtime.cc][Line 120] OpenCL version: CL_TARGET_OPENCL_VERSION 200   CL_HPP_TARGET_OPENCL_VERSION 110   CL_HPP_MINIMUM_OPENCL_VERSION 110
I/tnn: tnn::Status tnn::OpenCLRuntime::Init() [File /home/liyang/GitHub/TNN/source/tnn/device/opencl/opencl_runtime.cc][Line 155] Create common opencl context
Summary
--------------------------------------------------------
|        Op Type | Total Kernel Time(ms) | Percent (%) |
--------------------------------------------------------
|       Conv_1x1 |                19.559 |      61.558 |
| Conv_Depthwise |                 3.476 |      10.940 |
|         Concat |                 2.204 |       6.936 |
|            Add |                 1.651 |       5.197 |
|          PRelu |                 1.597 |       5.026 |
|        Pooling |                 0.777 |       2.445 |
|            Pad |                 0.718 |       2.261 |
|      BatchNorm |                 0.562 |       1.769 |
|       Upsample |                 0.549 |       1.727 |
|       Conv_3x3 |                 0.372 |       1.172 |
|         SplitV |                 0.241 |       0.759 |
|        SoftMax |                 0.067 |       0.211 |
--------------------------------------------------------
kernel runtime total: 31.7729 ms

I/tnn: void tnn::test::Timer::Print() [File /home/liyang/GitHub/TNN/test/timer.cc][Line 60] portrait.tnnproto - OPENCL                    TNN Benchmark time cost: min = 34.534   ms  |  max = 38.537   ms  |  avg = 36.155   ms 

6. How can the inference performance of small networks on Kirin chips be optimized?

MHGL (Author) commented Aug 11, 2021

lnmdlong (Collaborator) commented

@MHGL The speed issue you reported depends on the specific device tested; it is not a general problem with GPUs on Kirin processors. The Kirin 955 device you chose has a quad-core Cortex-A72 + quad-core Cortex-A53 CPU and a Mali-T880 MP4 GPU, and the Mali-T880 MP4 has little performance advantage over the A72 cores. To fully exploit the GPU's speed advantage, you could deploy the project on a Kirin 970/980 device (Mali-G-series GPU).
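
For reference (this is not from the thread): switching TNN between the CPU and GPU backends comes down to one field of tnn::NetworkConfig, so an application can make the choice per device, e.g. DEVICE_ARM on a Kirin 955-class SoC (Mali-T880 MP4) and DEVICE_OPENCL on SoCs with stronger GPUs such as the Snapdragon 845 (Adreno 630). A minimal sketch, assuming the public C++ API in include/tnn/core/common.h; the helper name MakeNetworkConfig is ad hoc:

```cpp
#include "tnn/core/common.h"  // tnn::NetworkConfig, tnn::DEVICE_ARM / tnn::DEVICE_OPENCL

// Ad hoc helper (hypothetical name): build a NetworkConfig for the chosen backend
// instead of hard-coding one device type into the app.
tnn::NetworkConfig MakeNetworkConfig(bool prefer_gpu) {
    tnn::NetworkConfig config;
    // Kirin 955 / Mali-T880 MP4: the numbers above favor the CPU for small nets.
    // Snapdragon 845 / Adreno 630: the OpenCL path is clearly faster.
    config.device_type = prefer_gpu ? tnn::DEVICE_OPENCL : tnn::DEVICE_ARM;
    config.device_id   = 0;
    // Precision enum assumed from common.h; PRECISION_LOW trades accuracy for speed
    // on backends that support fp16.
    config.precision   = tnn::PRECISION_AUTO;
    return config;
}
```

A simple policy is to time a few warm-up inferences on both backends at first launch and cache whichever backend turns out to be faster on that particular device.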

MHGL (Author) commented Aug 25, 2021

@lnmdlong Thank you very much for your reply!
Looking at the Kirin 970 data in that benchmark results file, small networks such as ShuffleNet and SqueezeNet also show the CPU outperforming the GPU. So my question is: how can small TNN models be specifically optimized for Kirin chips? And is there a concrete workflow for deploying TNN on Huawei devices? Thanks.

lnmdlong (Collaborator) commented

@MHGL TNN has already applied some optimizations to the GPU performance of small networks on Kirin chips; that some models are still slower than on the CPU is related to the model structure and the hardware characteristics. There is no further optimization plan for now, and we will share updates promptly once one is scheduled. For the deployment workflow, you can refer to the TNN demo: https://github.com/Tencent/TNN/blob/master/doc/en/user/demo_en.md#ii-introduction-to-android-demo
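
To make the flow behind that demo concrete, here is a hedged sketch of the bare C++ deployment path (not the demo's exact code, which wraps this in a TNNSDKSample helper): load the .tnnproto/.tnnmodel pair, Init, CreateInst, Forward. It assumes the public API in include/tnn/core/; ReadFile is an ad hoc helper, the portrait.tnnmodel file name is assumed, and error handling is minimal:

```cpp
#include <fstream>
#include <sstream>
#include <string>

#include "tnn/core/blob.h"
#include "tnn/core/instance.h"
#include "tnn/core/status.h"
#include "tnn/core/tnn.h"

// Ad hoc helper: read a whole file (.tnnproto text / .tnnmodel weights) into a string.
static std::string ReadFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

int main() {
    // 1. Model description + weights, as produced by the TNN converter.
    tnn::ModelConfig model_config;
    model_config.model_type = tnn::MODEL_TYPE_TNN;
    model_config.params = {ReadFile("portrait.tnnproto"),   // network description
                           ReadFile("portrait.tnnmodel")};  // weights (file name assumed)

    tnn::TNN net;
    tnn::Status status = net.Init(model_config);
    if (status != tnn::TNN_OK) return -1;

    // 2. Backend selection: DEVICE_ARM here, since the measurements above show the
    //    CPU beating the Mali-T880 MP4 for these small networks on Kirin 955.
    tnn::NetworkConfig net_config;
    net_config.device_type = tnn::DEVICE_ARM;

    auto instance = net.CreateInst(net_config, status);
    if (status != tnn::TNN_OK || !instance) return -1;

    // 3. Fill the input blobs (preprocessing omitted), run inference, read the outputs.
    tnn::BlobMap inputs, outputs;
    instance->GetAllInputBlobs(inputs);
    instance->Forward();
    instance->GetAllOutputBlobs(outputs);
    return 0;
}
```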
