
inner product bad performance when 1 < M < K/4 #525

Closed
WilliamTambellini opened this issue Aug 6, 2019 · 13 comments
Labels: bug (A confirmed library bug)

WilliamTambellini (Contributor) commented Aug 6, 2019

Hello
This is a call for review of some strange performance of the inner product forward pass when K = 1024 and 1 < M < K/4, given

  • M = number of rows of src
  • K = N = number of rows/cols of the weights

Note: the performance is good for M = 1 and for M > K/4.

GCC: 6.1.0
Eigen version: 3.3.90
Simd: AVX512, FMA, AVX2, AVX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
EIGEN_IDEAL_MAX_ALIGN_BYTES=64
EIGEN_MAX_ALIGN_BYTES=64
EIGEN_VECTORIZE_AVX2
EIGEN_VECTORIZE_FMA
EIGEN_VECTORIZE_AVX512
TopLevel cache size: 25344 KB
L1 cache size: 32 KB
L2 cache size: 1024 KB
L3 cache size: 25344 KB
Eigen::nbThreads: 1
_OPENMP: 201511
omp_get_num_threads: 1
omp_get_max_threads: 1
MKL-DNN version: 1.0.0 01206f3

inner product fwd: type=f repeat=600
M K=N ETensor MKLDNN
1 1024 642 194
2 1024 759 1327 <-
4 1024 950 1329 <-
8 1024 965 1378 <-
16 1024 976 1491 <-
32 1024 1021 1682 <-
64 1024 1611 2062 <-
128 1024 2389 2838 <-
256 1024 4594 4515
512 1024 8369 8030
1024 1024 16376 12998


Environment

  • CPU make and model
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 16
    On-line CPU(s) list: 0-15
    Thread(s) per core: 2
    Core(s) per socket: 8
    Socket(s): 1
    NUMA node(s): 1
    Vendor ID: GenuineIntel
    CPU family: 6
    Model: 85
    Model name: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
    Stepping: 3
    CPU MHz: 3000.000
    BogoMIPS: 6000.00
    Hypervisor vendor: KVM
    Virtualization type: full
    L1d cache: 32K
    L1i cache: 32K
    L2 cache: 1024K
    L3 cache: 25344K
    NUMA node0 CPU(s): 0-15
    Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku

  • OS version (uname -a)
Linux awsc5 4.4.0-1088-aws #99-Ubuntu SMP Thu Jul 4 14:25:53 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

  • Compiler version (gcc --version)
    6.1

  • CMake version (cmake --version)
    3.5.1

  • CMake command
    cmake ..
    -DCMAKE_BUILD_TYPE=RELEASE
    -DCMAKE_INSTALL_PREFIX=install
    -DMKLDNN_LIBRARY_TYPE=SHARED
    -DMKLDNN_CPU_RUNTIME=OMP
    -DMKLDNN_BUILD_TESTS=ON
    -DMKLDNN_BUILD_EXAMPLES=ON
    -DMKLDNN_ENABLE_JIT_PROFILING=OFF
    -DMKLDNN_ARCH_OPT_FLAGS=""

  • CMake output log
    -- Looking for pthread.h
    -- Looking for pthread.h - found
    -- Looking for pthread_create
    -- Looking for pthread_create - not found
    -- Looking for pthread_create in pthreads
    -- Looking for pthread_create in pthreads - not found
    -- Looking for pthread_create in pthread
    -- Looking for pthread_create in pthread - found
    -- Found Threads: TRUE
    -- Try OpenMP C flag = [-fopenmp]
    -- Performing Test OpenMP_FLAG_DETECTED
    -- Performing Test OpenMP_FLAG_DETECTED - Success
    -- Try OpenMP CXX flag = [-fopenmp]
    -- Performing Test OpenMP_FLAG_DETECTED
    -- Performing Test OpenMP_FLAG_DETECTED - Success
    -- Found OpenMP: -fopenmp
    -- GPU support is disabled
    -- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
    -- Found Git: /usr/bin/git (found version "2.7.4")
-- VTune Amplifier JIT profiling disabled
    -- Configuring done
    -- Generating done

  • git hash (git log -1 --format=%H)
    01206f3

Steps to reproduce

Build and run the program here:
https://gist.github.com/WilliamTambellini/8294f211800e16791d47f3cf59472a49

g++ -std=c++11 -mavx512f -mfma -DEIGEN_NO_DEBUG -DNDEBUG -fopenmp -O3
-I eigen-eigen-9f48e814419e -I $mkldir/include -L $mkldir/lib64 -l mkldnn
eigen_vs_mkldnn.cpp -o eigen_vs_mkldnn

OMP_NUM_THREADS=1 ./eigen_vs_mkldnn

Actual behavior

MKL-DNN is slower than Eigen Tensor for 1 < M < K/4.

Expected behavior

same speed (or faster)

Tks.

@vpirogov vpirogov added the sighting Suspicious library behavior. Should be promoted to a bug when confirmed label Aug 7, 2019
@rsdubtso rsdubtso self-assigned this Aug 7, 2019
vpirogov (Member) commented Aug 8, 2019

Adding @aaraujom

aaraujom (Contributor) commented Aug 9, 2019

Hi @WilliamTambellini,

Thank you for reporting this and submitting a reproducer. I was able to reproduce the issue with your instructions. It seems we can improve performance by using a different GEMM algorithm for the sizes you are interested in. Hopefully we can get this fixed for you soon on the master branch.

@vpirogov vpirogov assigned aaraujom and unassigned rsdubtso Aug 9, 2019
@vpirogov vpirogov added bug A confirmed library bug and removed sighting Suspicious library behavior. Should be promoted to a bug when confirmed labels Aug 9, 2019
WilliamTambellini (Author) commented:

Thank you @aaraujom

WilliamTambellini (Author) commented:

Hello @aaraujom
Here is the speed of my benchmark when using DNNL 1.1 on AWS C5:
OMP_NUM_THREADS=1 LD_LIBRARY_PATH=. ./eigen_vs_dnnl
GCC: 6.1.0
Eigen version: 3.3.90
Simd: AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
EIGEN_IDEAL_MAX_ALIGN_BYTES=32
EIGEN_MAX_ALIGN_BYTES=32
EIGEN_VECTORIZE_AVX2
EIGEN_VECTORIZE_FMA
TopLevel cache size: 25344 KB
L1 cache size: 32 KB
L2 cache size: 1024 KB
L3 cache size: 25344 KB
Eigen::nbThreads: 1
_OPENMP: 201511
omp_get_num_threads: 1
omp_get_max_threads: 1
DNNL version: 1.1.1 28f4c96
Gemm+add: type=f repeat=600
M K=N ETensor DNNL
1 512 63 22
2 512 85 210
4 512 104 210
8 512 108 220
16 512 144 248
32 512 282 295
64 512 474 386
1 1024 318 166
2 1024 403 1305
4 1024 473 1311
8 1024 482 1338
16 1024 614 1439
32 1024 1117 1625
64 1024 1806 1997
1 2048 1495 696
2 2048 1808 5777
4 2048 2032 5793
8 2048 2091 5911
16 2048 2579 6296
32 2048 4610 7030
64 2048 7038 8526

Would you know if v1.2 fixes that issue?
Kind regards,
WT.

aaronjohnson (Contributor) commented:

Hi William,

A fix is currently under code review. It might not be part of v1.2; I will look into whether it can be included and report back.

1483.diff.txt

It is a small change to which kernel is chosen for the sequential case: a copy vs. a no-copy kernel.

Aaron

WilliamTambellini (Author) commented:

OK, thanks @aaraujom.
Trying on my side to move forward faster. TBC.

aaronjohnson (Contributor) commented:

Yes, @WilliamTambellini, this change will be in v1.2.

WilliamTambellini (Author) commented:

Tks.
Using today's master WITHOUT the patch:
GCC: 6.1.0
Eigen version: 3.3.90
Simd: AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
EIGEN_IDEAL_MAX_ALIGN_BYTES=32
EIGEN_MAX_ALIGN_BYTES=32
EIGEN_VECTORIZE_AVX2
EIGEN_VECTORIZE_FMA
TopLevel cache size: 25344 KB
L1 cache size: 32 KB
L2 cache size: 1024 KB
L3 cache size: 25344 KB
Eigen::nbThreads: 1
_OPENMP: 201511
omp_get_num_threads: 1
omp_get_max_threads: 1
DNNL version: 1.2.0 7832294
Gemm+add: type=f repeat=600
M K=N ETensor DNNL
1 512 66 24
2 512 85 208
4 512 104 208
8 512 107 215
16 512 146 243
32 512 297 291
64 512 485 385
128 512 917 575
256 512 1753 965
512 512 4413 1772
1024 512 7270 3714
1 1024 340 194
2 1024 430 1313
4 1024 495 1318
8 1024 505 1344
16 1024 635 1451
32 1024 1154 1638
64 1024 1851 2018
128 1024 3447 2801
256 1024 6627 4491
512 1024 13350 8018
1024 1024 27797 13616
1 2048 1771 846
2 2048 2188 5941
4 2048 2479 5971
8 2048 2484 6095
16 2048 3050 6502
32 2048 5143 7279
64 2048 7874 8831
128 2048 14434 12010
256 2048 26532 18397
512 2048 51551 33212
1024 2048 111513 54185

Patching and retesting ...

WilliamTambellini (Author) commented:

Performance is indeed better with the patch:

GCC: 6.1.0
Eigen version: 3.3.90
Simd: AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
EIGEN_IDEAL_MAX_ALIGN_BYTES=32
EIGEN_MAX_ALIGN_BYTES=32
EIGEN_VECTORIZE_AVX2
EIGEN_VECTORIZE_FMA
TopLevel cache size: 25344 KB
L1 cache size: 32 KB
L2 cache size: 1024 KB
L3 cache size: 25344 KB
Eigen::nbThreads: 1
_OPENMP: 201511
omp_get_num_threads: 1
omp_get_max_threads: 1
DNNL version: 1.2.0 7832294 (with @aaraujom patch)
Gemm+add: type=f repeat=600
M K=N ETensor DNNL
1 512 68 27
2 512 88 86
4 512 108 88
8 512 108 94
16 512 149 121
32 512 290 172
64 512 484 271
128 512 923 469
256 512 1775 864
512 512 4535 1702
1024 512 7506 3487
1 1024 359 195
2 1024 438 539
4 1024 509 546
8 1024 519 578
16 1024 656 679
32 1024 1140 866
64 1024 1850 1252
128 1024 3511 2037
256 1024 6747 3693
512 1024 13582 7228
1024 1024 28542 13823
1 2048 1875 896
2 2048 2282 2903
4 2048 2584 2775
8 2048 2669 3035
16 2048 3146 3462
32 2048 5219 4456
64 2048 8049 6216
128 2048 14571 9467
256 2048 26776 15827
512 2048 52436 28982
1024 2048 113111 54041

Let me rerun the full unit test suite...

WilliamTambellini (Author) commented:

All unit tests passed using:

  • GCC 6.1
    -DDNNL_CPU_RUNTIME=OMP
    -DDNNL_ARCH_OPT_FLAGS=""

Anything you'd like me to do to test this change further?
Kind regards,

aaronjohnson (Contributor) commented:

Nothing further. Thank you for confirming the fix.

vpirogov (Member) commented:

The fix is promoted to master (ae61165) and rls-v1.2 (cf7edb6).

WilliamTambellini (Author) commented:

Thank you gentlemen.

vpirogov pushed a commit that referenced this issue Mar 13, 2020
Rolls back to previous and more conservative no-copy dispatching for
sequential mode to avoid performance regressions. This still keeps the
better performance for inner product primitive listed in #525.
mkl-dnn pushed a commit that referenced this issue Mar 17, 2020
Rolls back to previous and more conservative no-copy dispatching for
sequential mode to avoid performance regressions. This still keeps the
better performance for inner product primitive listed in #525.
vpirogov pushed a commit that referenced this issue Mar 17, 2020
Rolls back to previous and more conservative no-copy dispatching for
sequential mode to avoid performance regressions. This still keeps the
better performance for inner product primitive listed in #525.