
inner product bad performance when 1 < M < K/4 #525

Closed
WilliamTambellini opened this issue Aug 6, 2019 · 13 comments
Labels: bug (A confirmed library bug)

WilliamTambellini (Contributor) commented Aug 6, 2019

Hello
This is a call for review of some strange performance of the inner product forward pass when K = 1024 and 1 < M < K/4, given

  • M = number of rows of src
  • K = N = number of rows/cols of the weights

Note: the performance is good for M = 1 and for M > K/4.

GCC: 6.1.0
Eigen version: 3.3.90
Simd: AVX512, FMA, AVX2, AVX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
EIGEN_IDEAL_MAX_ALIGN_BYTES=64
EIGEN_MAX_ALIGN_BYTES=64
EIGEN_VECTORIZE_AVX2
EIGEN_VECTORIZE_FMA
EIGEN_VECTORIZE_AVX512
TopLevel cache size: 25344 KB
L1 cache size: 32 KB
L2 cache size: 1024 KB
L3 cache size: 25344 KB
Eigen::nbThreads: 1
_OPENMP: 201511
omp_get_num_threads: 1
omp_get_max_threads: 1
MKL-DNN version: 1.0.0 01206f3

inner product fwd: type=f repeat=600
M K=N ETensor MKLDNN
1 1024 642 194
2 1024 759 1327 <-
4 1024 950 1329 <-
8 1024 965 1378 <-
16 1024 976 1491 <-
32 1024 1021 1682 <-
64 1024 1611 2062 <-
128 1024 2389 2838 <-
256 1024 4594 4515
512 1024 8369 8030
1024 1024 16376 12998


Environment

  • CPU make and model
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 16
    On-line CPU(s) list: 0-15
    Thread(s) per core: 2
    Core(s) per socket: 8
    Socket(s): 1
    NUMA node(s): 1
    Vendor ID: GenuineIntel
    CPU family: 6
    Model: 85
    Model name: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
    Stepping: 3
    CPU MHz: 3000.000
    BogoMIPS: 6000.00
    Hypervisor vendor: KVM
    Virtualization type: full
    L1d cache: 32K
    L1i cache: 32K
    L2 cache: 1024K
    L3 cache: 25344K
    NUMA node0 CPU(s): 0-15
    Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku

  • OS version (uname -a)
Linux awsc5 4.4.0-1088-aws #99-Ubuntu SMP Thu Jul 4 14:25:53 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

  • Compiler version (gcc --version)
    6.1

  • CMake version (cmake --version)
    3.5.1

  • CMake command
    cmake ..
    -DCMAKE_BUILD_TYPE=RELEASE
    -DCMAKE_INSTALL_PREFIX=install
    -DMKLDNN_LIBRARY_TYPE=SHARED
    -DMKLDNN_CPU_RUNTIME=OMP
    -DMKLDNN_BUILD_TESTS=ON
    -DMKLDNN_BUILD_EXAMPLES=ON
    -DMKLDNN_ENABLE_JIT_PROFILING=OFF
    -DMKLDNN_ARCH_OPT_FLAGS=""

  • CMake output log
    -- Looking for pthread.h
    -- Looking for pthread.h - found
    -- Looking for pthread_create
    -- Looking for pthread_create - not found
    -- Looking for pthread_create in pthreads
    -- Looking for pthread_create in pthreads - not found
    -- Looking for pthread_create in pthread
    -- Looking for pthread_create in pthread - found
    -- Found Threads: TRUE
    -- Try OpenMP C flag = [-fopenmp]
    -- Performing Test OpenMP_FLAG_DETECTED
    -- Performing Test OpenMP_FLAG_DETECTED - Success
    -- Try OpenMP CXX flag = [-fopenmp]
    -- Performing Test OpenMP_FLAG_DETECTED
    -- Performing Test OpenMP_FLAG_DETECTED - Success
    -- Found OpenMP: -fopenmp
    -- GPU support is disabled
    -- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
    -- Found Git: /usr/bin/git (found version "2.7.4")
-- VTune Amplifier JIT profiling disabled
    -- Configuring done
    -- Generating done

  • git hash (git log -1 --format=%H)
    01206f3

Steps to reproduce

Build and run the program here:
https://gist.github.com/WilliamTambellini/8294f211800e16791d47f3cf59472a49

g++ -std=c++11 -mavx512f -mfma -DEIGEN_NO_DEBUG -DNDEBUG -fopenmp -O3
-I eigen-eigen-9f48e814419e -I $mkldir/include -L $mkldir/lib64 -l mkldnn
eigen_vs_mkldnn.cpp -o eigen_vs_mkldnn

OMP_NUM_THREADS=1 ./eigen_vs_mkldnn

Actual behavior

MKL-DNN is slower than Eigen Tensor for 1 < M < K/4.

Expected behavior

same speed (or faster)

Tks.

@vpirogov vpirogov added the sighting Suspicious library behavior. Should be promoted to a bug when confirmed label Aug 7, 2019
@rsdubtso rsdubtso self-assigned this Aug 7, 2019
vpirogov (Member) commented Aug 8, 2019

Adding @aaraujom

aaraujom (Contributor) commented Aug 9, 2019

Hi @WilliamTambellini,

Thank you for reporting this and submitting a reproducer. I was able to reproduce the issue with your instructions. It seems we can improve performance by using a different GEMM algorithm for the sizes you are interested in. Hopefully we can get this fixed for you soon on the master branch.

@vpirogov vpirogov assigned aaraujom and unassigned rsdubtso Aug 9, 2019
@vpirogov vpirogov added bug A confirmed library bug and removed sighting Suspicious library behavior. Should be promoted to a bug when confirmed labels Aug 9, 2019
WilliamTambellini (Author) commented:

Thank you @aaraujom

WilliamTambellini (Author) commented:

Hello @aaraujom
Here is the speed of my benchmark when using DNNL 1.1 on AWS C5:
OMP_NUM_THREADS=1 LD_LIBRARY_PATH=. ./eigen_vs_dnnl
GCC: 6.1.0
Eigen version: 3.3.90
Simd: AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
EIGEN_IDEAL_MAX_ALIGN_BYTES=32
EIGEN_MAX_ALIGN_BYTES=32
EIGEN_VECTORIZE_AVX2
EIGEN_VECTORIZE_FMA
TopLevel cache size: 25344 KB
L1 cache size: 32 KB
L2 cache size: 1024 KB
L3 cache size: 25344 KB
Eigen::nbThreads: 1
_OPENMP: 201511
omp_get_num_threads: 1
omp_get_max_threads: 1
DNNL version: 1.1.1 28f4c96
Gemm+add: type=f repeat=600
M K=N ETensor DNNL
1 512 63 22
2 512 85 210
4 512 104 210
8 512 108 220
16 512 144 248
32 512 282 295
64 512 474 386
1 1024 318 166
2 1024 403 1305
4 1024 473 1311
8 1024 482 1338
16 1024 614 1439
32 1024 1117 1625
64 1024 1806 1997
1 2048 1495 696
2 2048 1808 5777
4 2048 2032 5793
8 2048 2091 5911
16 2048 2579 6296
32 2048 4610 7030
64 2048 7038 8526

Would you know if v1.2 fixes that issue?
Kind regards,
WT.

aaronjohnson (Contributor) commented:

Hi William,

A fix is currently under code review. It might not be part of v1.2; I will look into whether it can be included and report back.

1483.diff.txt

It is a small change to which kernel is chosen for the sequential case: a copy vs. a no-copy kernel.

Aaron

WilliamTambellini (Author) commented:

OK, thanks @aaraujom.
Trying on my side to move forward faster. TBC.

aaronjohnson (Contributor) commented:

Yes, @WilliamTambellini, this change will be in v1.2.

WilliamTambellini (Author) commented:

Tks.
Using today's master WITHOUT the patch:
GCC: 6.1.0
Eigen version: 3.3.90
Simd: AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
EIGEN_IDEAL_MAX_ALIGN_BYTES=32
EIGEN_MAX_ALIGN_BYTES=32
EIGEN_VECTORIZE_AVX2
EIGEN_VECTORIZE_FMA
TopLevel cache size: 25344 KB
L1 cache size: 32 KB
L2 cache size: 1024 KB
L3 cache size: 25344 KB
Eigen::nbThreads: 1
_OPENMP: 201511
omp_get_num_threads: 1
omp_get_max_threads: 1
DNNL version: 1.2.0 7832294
Gemm+add: type=f repeat=600
M K=N ETensor DNNL
1 512 66 24
2 512 85 208
4 512 104 208
8 512 107 215
16 512 146 243
32 512 297 291
64 512 485 385
128 512 917 575
256 512 1753 965
512 512 4413 1772
1024 512 7270 3714
1 1024 340 194
2 1024 430 1313
4 1024 495 1318
8 1024 505 1344
16 1024 635 1451
32 1024 1154 1638
64 1024 1851 2018
128 1024 3447 2801
256 1024 6627 4491
512 1024 13350 8018
1024 1024 27797 13616
1 2048 1771 846
2 2048 2188 5941
4 2048 2479 5971
8 2048 2484 6095
16 2048 3050 6502
32 2048 5143 7279
64 2048 7874 8831
128 2048 14434 12010
256 2048 26532 18397
512 2048 51551 33212
1024 2048 111513 54185

Patching and retesting ...

WilliamTambellini (Author) commented:

Performance is indeed better with the patch:

GCC: 6.1.0
Eigen version: 3.3.90
Simd: AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
EIGEN_IDEAL_MAX_ALIGN_BYTES=32
EIGEN_MAX_ALIGN_BYTES=32
EIGEN_VECTORIZE_AVX2
EIGEN_VECTORIZE_FMA
TopLevel cache size: 25344 KB
L1 cache size: 32 KB
L2 cache size: 1024 KB
L3 cache size: 25344 KB
Eigen::nbThreads: 1
_OPENMP: 201511
omp_get_num_threads: 1
omp_get_max_threads: 1
DNNL version: 1.2.0 7832294 (with @aaraujom patch)
Gemm+add: type=f repeat=600
M K=N ETensor DNNL
1 512 68 27
2 512 88 86
4 512 108 88
8 512 108 94
16 512 149 121
32 512 290 172
64 512 484 271
128 512 923 469
256 512 1775 864
512 512 4535 1702
1024 512 7506 3487
1 1024 359 195
2 1024 438 539
4 1024 509 546
8 1024 519 578
16 1024 656 679
32 1024 1140 866
64 1024 1850 1252
128 1024 3511 2037
256 1024 6747 3693
512 1024 13582 7228
1024 1024 28542 13823
1 2048 1875 896
2 2048 2282 2903
4 2048 2584 2775
8 2048 2669 3035
16 2048 3146 3462
32 2048 5219 4456
64 2048 8049 6216
128 2048 14571 9467
256 2048 26776 15827
512 2048 52436 28982
1024 2048 113111 54041

Let me rerun the full unit test suite...

WilliamTambellini (Author) commented:

All unit tests passed using:

  • GCC 6.1
    -DDNNL_CPU_RUNTIME=OMP
    -DDNNL_ARCH_OPT_FLAGS=""

Anything you'd like me to do to test this change further?
Kind regards,

aaronjohnson (Contributor) commented:

Nothing further. Thank you for confirming the fix.

vpirogov (Member) commented:

The fix is promoted to master (ae61165) and rls-v1.2 (cf7edb6).

WilliamTambellini (Author) commented:

Thank you gentlemen.

vpirogov pushed a commit that referenced this issue Mar 13, 2020
Rolls back to previous and more conservative no-copy dispatching for
sequential mode to avoid performance regressions. This still keeps the
better performance for inner product primitive listed in #525.
mkl-dnn pushed a commit that referenced this issue Mar 17, 2020
Rolls back to previous and more conservative no-copy dispatching for
sequential mode to avoid performance regressions. This still keeps the
better performance for inner product primitive listed in #525.
vpirogov pushed a commit that referenced this issue Mar 17, 2020
Rolls back to previous and more conservative no-copy dispatching for
sequential mode to avoid performance regressions. This still keeps the
better performance for inner product primitive listed in #525.