inner product bad performance when 1 < M < K/4 #525
Comments
Adding @aaraujom
Thank you for reporting and submitting a reproducer. I was able to reproduce the issue with your instructions. It seems that we can improve performance by using a different GEMM algorithm for the sizes you are interested in. Hopefully we can get this fixed for you soon in the master branch.
Thank you @aaraujom
Hello @aaraujom, do you know whether v1.2 will fix this issue?
Hi William, A fix is currently under code review. It might not be part of v1.2. I will look into whether it can be included and report back. It is a small change in which kernel is chosen for the sequential case: a copy vs. nocopy kernel. Aaron
OK, thanks @aaraujom
Yes, @WilliamTambellini, this change will be in v1.2.
Tks. Patching and retesting ...
The perf does indeed seem better with the patch: GCC: 6.1.0 Let me rerun the full unit test suite...
All unit tests passed using:
Anything you'd like me to do to further test this change?
Nothing further. Thank you for confirming the fix.
Thank you, gentlemen.
Rolls back to previous and more conservative no-copy dispatching for sequential mode to avoid performance regressions. This still keeps the better performance for inner product primitive listed in #525.
Hello
This is a call for review to check some strange performance for the inner product fwd when K=1024 and 1 < M < K/4, given
Note: The perf is good for M = 1 and M > K/4.
GCC: 6.1.0
Eigen version: 3.3.90
Simd: AVX512, FMA, AVX2, AVX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
EIGEN_IDEAL_MAX_ALIGN_BYTES=64
EIGEN_MAX_ALIGN_BYTES=64
EIGEN_VECTORIZE_AVX2
EIGEN_VECTORIZE_FMA
EIGEN_VECTORIZE_AVX512
TopLevel cache size: 25344 KB
L1 cache size: 32 KB
L2 cache size: 1024 KB
L3 cache size: 25344 KB
Eigen::nbThreads: 1
_OPENMP: 201511
omp_get_num_threads: 1
omp_get_max_threads: 1
MKL-DNN version: 1.0.0 01206f3
inner product fwd: type=f repeat=600
   M   K=N  ETensor  MKLDNN
   1  1024      642     194
   2  1024      759    1327  <-
   4  1024      950    1329  <-
   8  1024      965    1378  <-
  16  1024      976    1491  <-
  32  1024     1021    1682  <-
  64  1024     1611    2062  <-
 128  1024     2389    2838  <-
 256  1024     4594    4515
 512  1024     8369    8030
1024  1024    16376   12998
Environment
CPU make and model
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping: 3
CPU MHz: 3000.000
BogoMIPS: 6000.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
OS version (uname -a): Linux awsc5 4.4.0-1088-aws #99-Ubuntu SMP Thu Jul 4 14:25:53 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Compiler version (gcc --version): 6.1
CMake version (cmake --version): 3.5.1
CMake command
cmake .. \
  -DCMAKE_BUILD_TYPE=RELEASE \
  -DCMAKE_INSTALL_PREFIX=install \
  -DMKLDNN_LIBRARY_TYPE=SHARED \
  -DMKLDNN_CPU_RUNTIME=OMP \
  -DMKLDNN_BUILD_TESTS=ON \
  -DMKLDNN_BUILD_EXAMPLES=ON \
  -DMKLDNN_ENABLE_JIT_PROFILING=OFF \
  -DMKLDNN_ARCH_OPT_FLAGS=""
CMake output log
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Try OpenMP C flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Found OpenMP: -fopenmp
-- GPU support is disabled
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
-- Found Git: /usr/bin/git (found version "2.7.4")
VTune Amplifier JIT profiling disabled
-- Configuring done
-- Generating done
git hash (git log -1 --format=%H): 01206f3
Steps to reproduce
Build and run the program there:
https://gist.github.com/WilliamTambellini/8294f211800e16791d47f3cf59472a49
g++ -std=c++11 -mavx512f -mfma -DEIGEN_NO_DEBUG -DNDEBUG -fopenmp -O3 \
  -I eigen-eigen-9f48e814419e -I $mkldir/include -L $mkldir/lib64 -l mkldnn \
  eigen_vs_mkldnn.cpp -o eigen_vs_mkldnn
OMP_NUM_THREADS=1 ./eigen_vs_mkldnn
Actual behavior
mkldnn slower than eigenTensor for 1 < M < K/4
Expected behavior
same speed (or faster)
Tks.