This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL #17980

Closed
kpuatamazon opened this issue Apr 6, 2020 · 39 comments

Comments

@kpuatamazon
Contributor

kpuatamazon commented Apr 6, 2020

Problem

Not sure how much we care about MKL support, but to the extent it still appears in the build system, operator support should be consistent.

When compiled with MKL present (MKL is found in /opt/intel), MXNet calls MKL for dot and batch_dot and DNNL for fully_connected. These are all GEMM operators; why is it inconsistent? This is making Sockeye decoding 22% slower (see below).

This inconsistency did not matter much in MXNet 1.5.0 because MKLDNN would delegate to MKL. However, aa1074d upgraded to MKLDNN 1.0, which hid the ability of MKLDNN to delegate to MKL: oneapi-src/oneDNN@3049150 . (MKLDNN has since been renamed DNNL.)

Since MKLDNN only hid support for delegating to MKL, it's possible to restore delegation (see workaround).

Testing

Tested with MXNet cfb474b. Compiled with mostly-default cmake settings:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..

Then when I run

export MKL_VERBOSE=1
export MKLDNN_VERBOSE=1
python3
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 3.00GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x1a0fdc0,1,0x1a0fdc0,1) 1.47ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24
>>> a = mx.nd.ones(shape=(2,2))
>>> mx.nd.FullyConnected(a,a,num_hidden=2,no_bias=True)
dnnl_verbose,info,DNNL v1.1.2 (commit cb2cc7ac17ff4e2ef50805c7048d33256d82be4d)
dnnl_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb2ic2oc2,74.9971

[[2. 2.]
 [2. 2.]]
<NDArray 2x2 @cpu(0)>
>>> a = mx.nd.ones(shape=(2,2,2))
>>> mx.nd.batch_dot(a,a)
MKL_VERBOSE SGEMM_BATCH(N,N,0x7fc3238b809c,0x7fc3238b80a0,0x7fc3238b80a4,0x7fc3238b80b4,0x7fc228010b90,0x7fc3238b80a8,0x7fc22800f770,0x7fc3238b80ac,0x7fc3238b80b8,0x7fc2280190e0,0x7fc3238b80b0,0x7fc3238b7fc8,0x7 363.79us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24

[[[2. 2.]
  [2. 2.]]

 [[2. 2.]
  [2. 2.]]]
>>> mx.nd.dot(a,a)
MKL_VERBOSE SGEMM(N,N,4,4,2,0x7fc3238b8198,0x7fc2280043c0,4,0x7fc2280043c0,2,0x7fc3238b81a0,0x7fc228004580,4) 8.52us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24

[[[[2. 2.]
   [2. 2.]]

  [[2. 2.]
   [2. 2.]]]


 [[[2. 2.]
   [2. 2.]]

  [[2. 2.]
   [2. 2.]]]]
<NDArray 2x2x2x2 @cpu(0)>

You can see DNNL is called for FullyConnected while MKL is called for dot and batch_dot.

Performance impact

I timed Sockeye decoding. Commit aa1074d made decoding 22% slower (416.878s up from 342.037s for b5d07e3) even with MKL installed in /opt/intel/.

Commit Compilation Time(s)
b5d07e3 (before MKLDNN 1.0 change) Default 342.037
aa1074d (MKLDNN 1.0 change) Default 416.878
aa1074d (MKLDNN 1.0 change) Workaround 343.706
cfb474b (recent) Default 385.587
cfb474b (recent) Workaround 312.509

(Default compilation is cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..; workaround compilation is below.)

Tested on a Skylake Xeon, c5.9xlarge Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz with OMP_NUM_THREADS=4.

Workaround

Since DNNL hid support for delegating to MKL, it's still possible to turn delegation back on.

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..

which compiles but then triggers a link error at runtime: OSError: /home/ubuntu/mxnet/build/3rdparty/mkldnn/src/libmkldnn.so.1: undefined symbol: cblas_gemm_s8u8s32_pack
So I kludged it with export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so and was then able to use MXNet at runtime. There's probably a cleaner way of fixing the linkage.
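
For reference, here is the whole workaround in one place (a sketch; the paths assume MKL is installed under /opt/intel as above):

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
ninja
export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so  # papers over the undefined symbol error above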

Recommended fix

When compiled with MKL, MXNet should call MKL directly from FullyConnected like it already does for dot and batch_dot.

@kpuatamazon kpuatamazon added the Bug label Apr 6, 2020
@TaoLv
Member

TaoLv commented Apr 6, 2020

Thanks for raising the issue, @kpuatamazon. We plan to optimize the dot and batch_dot operators with the DNNL MatMul primitive. But please note that the performance of the MatMul primitive was not optimized until v1.3, which was released last week. That's why we didn't integrate the primitive when it was first introduced in v1.2.

When compiled with MKL, MXNet should call MKL directly from FullyConnected like it already does for dot and batch_dot.

As mentioned above, dot and batch_dot will also be optimized with DNNL. And as DNNL is more dedicated to deep learning, we will give DNNL higher priority when both DNNL and MKL are enabled at compile time.

Please provide a simple reproducer if you find that FullyConnected based on DNNL is slower than FullyConnected based on MKL BLAS.

@kpuatamazon
Contributor Author

kpuatamazon commented Apr 6, 2020

Here are some benchmarks on 2fff11d (current tip of master) with the same Skylake c5.9xlarge. DNNL is substantially slower than MKL.

(Which DNNL is in master?)

#!/usr/bin/env python3
import mxnet as mx
import time

def time_procedure(shape, count, proc):
  rows, inner, cols = shape
  a = mx.nd.random_uniform(shape=(rows, inner), low=-1.0, high=1.0)
  b = mx.nd.random_uniform(shape=(cols, inner), low=-1.0, high=1.0)
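  # Note: b has shape (cols, inner) because FullyConnected stores its weight as
  # (num_hidden, input_dim); the dot variant below uses transpose_b=True so both
  # procedures perform the same (rows, inner) x (inner, cols) GEMM.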
  # Burn in
  proc(a, b, cols)
  mx.nd.waitall()
  begin = time.time()
  for i in range(0, count):
    proc(a, b, cols)
    mx.nd.waitall()
  return (time.time() - begin) / count

shapes = [(5, 512, 512), (5,512,1536), (5,512,2048), (5,2048,512), (4,512,512)]
count = 1000

procedures = {
  "fullyconnected (DNNL)" : (lambda a, b, cols : mx.nd.FullyConnected(a, b, no_bias=True, num_hidden=cols)),
  "dot (MKL)" : (lambda a, b, cols : mx.nd.dot(a, b, transpose_b = True))
}
for s in shapes:
  print("Shape " + str(s))
  stats = {}
  for name, l in procedures.items():
    stats[name] = time_procedure(s, count, l)
    print("{:.7f} seconds for {}".format(stats[name], name))

Run as OMP_NUM_THREADS=4 ./mult_bench.py:

Shape (5, 512, 512)
0.0000961 seconds for fullyconnected (DNNL)
0.0000509 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0002011 seconds for fullyconnected (DNNL)
0.0000735 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0002521 seconds for fullyconnected (DNNL)
0.0001027 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0003569 seconds for fullyconnected (DNNL)
0.0001018 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000946 seconds for fullyconnected (DNNL)
0.0000496 seconds for dot (MKL)

I don't really mind what the default BLAS implementation is. But choosing to use MKL should be 1) atomic (it should apply to every GEMM operator) and 2) not require undocumented compilation options.

@kpuatamazon
Contributor Author

I also tried the latest DNNL:

cd 3rdparty
rm -rf mkldnn
git clone https://github.com/oneapi-src/oneDNN mkldnn
cd ../build && ninja -j 32

and DNNL is still slower:

Shape (5, 512, 512)
0.0000875 seconds for fullyconnected (DNNL)
0.0000507 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0002005 seconds for fullyconnected (DNNL)
0.0000729 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0002516 seconds for fullyconnected (DNNL)
0.0000974 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0003564 seconds for fullyconnected (DNNL)
0.0000981 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000928 seconds for fullyconnected (DNNL)
0.0000497 seconds for dot (MKL)

@TaoLv
Member

TaoLv commented Apr 6, 2020

Thank you @kpuatamazon. I will try to reproduce the performance issue and get back to you later.

(Which DNNL is in master?)

It should be v1.2.2 but was downgraded to v1.1.2 by #17084 (maybe by accident).

But choosing to use MKL should be 1) atomic and 2) not require undocumented compilation options.

Not sure I understand what you mean. But using MKL BLAS is documented in:

@kpuatamazon
Contributor Author

kpuatamazon commented Apr 6, 2020

This still uses MKLDNN for FullyConnected

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DBLAS=MKL -DUSE_MKL_IF_AVAILABLE=ON ..

I was able to get MKL to run by disabling MKLDNN entirely:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DBLAS=MKL -DUSE_MKL_IF_AVAILABLE=ON -DUSE_MKLDNN=OFF ..

But then I lose other kernels like the DNNL softmax, and Sockeye randomly crashes:

mxnet.base.MXNetError: MXNetError: Out of range value for value, value='inf', in operator _full(name="", dtype="float32", value="inf", ctx="cpu(0)", shape="(5, 1)")

(bizarrely this happens after it translated many sentences with that same code.)

How do I achieve the fastest combination of DNNL softmax and MKL matrix multiply for FullyConnected using only documented options?

@TaoLv
Member

TaoLv commented Apr 6, 2020

Yes, USE_MKLDNN is ON by default on Linux. See https://github.com/apache/incubator-mxnet/blob/master/CMakeLists.txt#L44. If you want to disable DNNL optimizations, you need to set USE_MKLDNN to OFF explicitly on the cmake command line.

How do I achieve the fastest combination of DNNL softmax and MKL matrix multiply for FullyConnected using only documented options?

I don't think it's possible at this moment. I think the real question is how to improve the performance of FullyConnected when DNNL is used. In fact, the performance of the DNNL primitive kernel itself looks good compared with the times reported by the script. Take the last shape as an example (on my machine, not EC2):

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb4ic512oc512,0.0490723
0.0001380 seconds for fullyconnected (DNNL)
0.0000556 seconds for dot (MKL)

You can set DNNL_VERBOSE=1 when running the script to get the verbose output.
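
For example (an illustrative invocation of the script above, not an official benchmark command):

OMP_NUM_THREADS=4 MKL_VERBOSE=1 DNNL_VERBOSE=1 ./mult_bench.py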

@kpuatamazon
Contributor Author

kpuatamazon commented Apr 6, 2020

I don't think it's possible at this moment.

Can we please have a build option to force MKL usage for all GEMM calls? This is a pretty big performance regression.

I think the real question is how to improve the performance of FullyConnected when DNNL is used.

Hmmm, when DNNL delegates to MKL, that happens internally to DNNL, right? If the MXNet wrapper is the problem, then why is delegating to MKL much faster?

Applying my workaround:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
ninja -j 30
export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so
export OMP_NUM_THREADS=4
./mult_bench.py #NB I edited the script to label things more clearly.  

Output:

Shape (5, 512, 512)
0.0000628 seconds for fullyconnected (MKL workaround)
0.0000488 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0000816 seconds for fullyconnected (MKL workaround)
0.0000661 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0001109 seconds for fullyconnected (MKL workaround)
0.0000957 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0001078 seconds for fullyconnected (MKL workaround)
0.0000954 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000501 seconds for fullyconnected (MKL workaround)
0.0000434 seconds for dot (MKL)

There is something less efficient about FullyConnected than dot, but it's nowhere near enough to explain DNNL's slowdown.

FYI I usually wear this hat (as opposed to @kpu) on Mondays so expect slower responses on other days.

@TaoLv
Member

TaoLv commented Apr 7, 2020

@emfomenk What's -D_DNNL_USE_MKL=FULL for? I'm not aware of this option and I'm guessing it's not a user-facing option.

@emfomenk

emfomenk commented Apr 7, 2020

@TaoLv, it is a developer option to bring the Intel MKL gemm back (instead of the jitted one). I would suggest avoiding it, as it is an internal option (marked with a leading underscore). I think it was even broken somewhere between v1.1 and v1.3...

@TaoLv
Member

TaoLv commented Apr 7, 2020

@emfomenk Thanks for the explanation. Do you think that enabling this option can improve the performance of the inner product primitive for the above cases?

@pengzhao-intel
Contributor

@bartekkuncer please help take a look at this issue as well.

@emfomenk

emfomenk commented Apr 7, 2020

Do you think that enabling this option can improve the performance of inner product primitive for the above cases?

@TaoLv, yes, it could. Two possible reasons:

  1. If Intel MKL is linked, then most likely libiomp5.so will be used instead of libgomp.so. So, if you see a performance difference, I would suggest testing Intel MKL-DNN v1.0 / DNNL v1.1+ with LD_PRELOAD=libiomp5.so, just to confirm whether or not the issue is due to the OpenMP runtime (a sketch of this check follows the list).
  2. A performance bug in Intel MKL-DNN v1.0 / DNNL v1.1+. We have definitely fixed a few performance issues in the library since v1.0.
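
A sketch of the check from point 1 (the libiomp5.so path is an assumption; adjust it to wherever Intel OpenMP lives on your system, and reuse the mult_bench.py script from above):

OMP_NUM_THREADS=4 LD_PRELOAD=/opt/intel/lib/intel64/libiomp5.so ./mult_bench.py   # Intel OpenMP
OMP_NUM_THREADS=4 ./mult_bench.py                                                 # default GNU libgomp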

I would also suggest trying to upgrade the library to the most recent version and checking the performance.
I strongly suggest not using this developer option. It is unsupported, not tested, and as I said, it is most likely broken in subsequent releases.

@kpu

kpu commented Apr 7, 2020

If Intel MKL is linked

MKL was linked in both cases and in fact called in both cases from dot, just not from FullyConnected.

libiomp5.so will be used instead of libgomp.so

I tried single-threaded (OMP_NUM_THREADS=1) and still saw a performance drop. There shouldn't be much difference between OMP libraries then right?

I would also suggest to try upgrading the library to the most recent version

See my post above "I also tried the latest DNNL". Performance was unchanged.

@emfomenk

emfomenk commented Apr 7, 2020

I tried single-threaded (OMP_NUM_THREADS=1) and still saw a performance drop. There shouldn't be much difference between OMP libraries then right?

Yeah, there should be zero difference in this case :)

I would also suggest to try upgrading the library to the most recent version

See my post above "I also tried the latest DNNL". Performance was unchanged.

Didn't see that, thanks! Yeah, it looks like a performance bug indeed.

@pengzhao-intel
Contributor

Thanks @emfomenk & @TaoLv for identifying the issue. Our team will follow up on the issue and switch more OPs to DNNL soon.

@kpuatamazon
Contributor Author

Single-threaded benchmarks (OMP_NUM_THREADS=1) to confirm that it's not a difference in OMP library:

Shape dot (MKL) FullyConnected (MKL workaround) FullyConnected (DNNL) DNNL slowdown
5, 512, 512 0.0000914 0.0001016 0.0002448 2.40x
5, 512, 1536 0.0002939 0.0003051 0.0006576 2.15x
5, 512, 2048 0.0003828 0.0003919 0.0008582 2.18x
5, 2048, 512 0.0003681 0.0003785 0.0014638 3.86x
4, 512, 512 0.0000917 0.0001038 0.0002364 2.27x

@emfomenk

Summoning @aaraujom

@kpuatamazon
Contributor Author

kpuatamazon commented Apr 13, 2020

@TaoLv Regarding your comment

how to improve the performance of FullyConnected when DNNL is used. In fact, I see the performance of DNNL primitive kernel is good, compared with the output of the script.

I ran with both MKL and DNNL verbose options: OMP_NUM_THREADS=1 MKL_VERBOSE=1 DNNL_VERBOSE=1

verbose.txt

To summarize the verbose output in microseconds:

Shape MKL raw DNNL raw DNNL raw slowdown MKL Python overhead DNNL Python overhead
5, 512, 512 55.71 213.19 3.83x 53.19 61.61
5, 512, 1536 260.18 636.10 2.44x 55.82 62.90
5, 512, 2048 345.66 844.64 2.44x 56.74 65.66
5, 2048, 512 328.68 1346.32 4.10x 56.62 187.78
4, 512, 512 53.75 202.27 3.76x 51.45 60.03

Raw is the time reported by the library's verbose output; note I converted DNNL's milliseconds to microseconds.
Python overhead is the time reported by the above Python script (converted to microseconds) minus the raw time. Note that MKL used dot whilst DNNL used FullyConnected.

In both cases, I excluded the first call for each shape to allow for JIT/burn-in.

So we really have four issues:

  1. Raw speed as reported by MKL's verbose output is much faster than raw speed reported by DNNL's verbose output. This is definitely a speed problem with DNNL.
  2. There is no documented option to compile MKLDNN's kernels with MKL's GEMM. In my opinion, when I compile with -DBLAS set to something, it should use that BLAS. You could add an auto option that picks according to the priorities on the website.
  3. It's weird that dot calls MKL but FullyConnected calls DNNL given that they are both GEMMs. I understand @pengzhao-intel is working on this, though unless 1 and 2 above are fixed, moving more operators to DNNL will only make performance worse.
  4. MXNet / Python overhead is larger for FullyConnected than dot. This is small compared to the DNNL vs MKL difference though. We can also question why there is so much overhead relative to the cost of the multiply.

Intel should fix DNNL but that will take time.

MXNet should make -DBLAS actually choose the BLAS routine both to support a short-term bypass and because it better matches user expectations of what a compilation flag does.

@kpuatamazon
Contributor Author

In case somebody finds this issue and wants their optimized build, here is a different workaround that removes the need for LD_PRELOAD. Just do this before running cmake the first time:

export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"

Then cmake can be run normally:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..

and the compiled MXNet can be run normally without any special environment variables.
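
As a quick sanity check (just a sketch, reusing the toy shape from the original report), the DNNL verbose output for FullyConnected should now report gemm:blas instead of gemm:jit:

DNNL_VERBOSE=1 python3 -c "import mxnet as mx; a = mx.nd.ones((2, 2)); mx.nd.FullyConnected(a, a, num_hidden=2, no_bias=True); mx.nd.waitall()"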

To be clear, the above kludge is an undocumented abomination. The problem with cmake -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include is that it links against MKL twice. DNNL is hard-coded to link dynamically against libmkl_rt.so:
https://github.com/oneapi-src/oneDNN/blob/1b05a28eb9666efef83b281e4cc1936db5e6cf6c/cmake/MKL.cmake#L64
Then MXNet also links statically against MKL, and it still wants the shared library at dlopen time. So the fix is to drop the shared-library link entirely: skip DNNL's MKL.cmake and implement in CXXFLAGS what the rest of that build file would have done:
https://github.com/oneapi-src/oneDNN/blob/1b05a28eb9666efef83b281e4cc1936db5e6cf6c/cmake/MKL.cmake#L65-L67

@leezu
Contributor

leezu commented May 4, 2020

Does DNNL always link to MKL by default if MKL is available on the system, just as MXNet links to MKL in that situation?

Further, regarding the double-linking issue, #17794 should fix it.

@kpuatamazon
Contributor Author

DNNL has only a hidden unsupported option to link to MKL. It will not link to MKL by default.

@ChaiBapchya
Contributor

ChaiBapchya commented May 5, 2020

In case somebody finds this issue and wants their optimized build, here is a different workaround that removes the need for LD_PRELOAD. Just do this before running cmake the first time:

export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"

Then cmake can be run normally:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..

and the compiled MXNet can be run normally without any special environment variables.

@kpuatamazon Hi, I was trying to benchmark MKL [default] vs. the workaround using opperf.
Despite ensuring MKL is installed and using the export CXXFLAGS line followed by the usual cmake command, the build failed with

gemm.cpp:(.text+0xb45): undefined reference to `cblas_gemm_s8u8s32'

I tried the undocumented abominable kludge option you mentioned and that worked smoothly.

export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so
rm -rf build/
mkdir -p build && cd build
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
cmake --build . --parallel 1024

Script for OpPerf: https://gist.github.com/ChaiBapchya/5f2342f75ddeb1e21f14acac665c76ad

Results

Operator LHS RHS MKL Default MKL Workaround
Dot (4, 512, 512) (4, 512, 512) 15.1122 4.1254
(5, 512, 512) (5, 512, 512) 38.1678 7.5323
(5, 512, 1536) (5, 512, 1536) 21.6601 19.2503
(5, 512, 2048) (5, 512, 2048) 29.0369 23.7432
(5, 2048, 512) (5, 2048, 512) 167.5528 129.9957
Batch_dot (4, 512, 512) (4, 512, 512) 1.7898 1.5445
(5, 512, 512) (5, 512, 512) 2.2457 1.9361
(5, 512, 1536) (5, 512, 1536) 6.1453 5.4034
(5, 512, 2048) (5, 512, 2048) 8.246 8.0442
(5, 2048, 512) (5, 2048, 512) 160.6243 29.0772
Operator Data Weight MKL Default MKL Workaround
FC (4, 512) (512, 512) 0.0609 0.068
(5, 512) (512, 512) 0.0633 0.0731
(5, 512) (1536, 512) 0.0916 0.0996
(5, 512) (2048, 512) 0.1081 0.1084

However @kpuatamazon, when I try to test the default again [i.e. default -> workaround -> default] by unsetting the environment variable LD_PRELOAD, the default build fails with gemm.cpp:(.text+0xe6b): undefined reference to `cblas_gemm_s8u8s32`

@kpuatamazon
Contributor Author

@ChaiBapchya to be clear, here's how I am building the second option:

export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
unset LD_PRELOAD  #Technically this should be what exists in your environment by default
rm -rf build
mkdir build
cd build
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
ninja -j 30

Note that cmake does not appear to pick up on changes to CXXFLAGS in an existing build, so it needs a fresh build directory (deleting the cache might work).

@ChaiBapchya
Contributor

ChaiBapchya commented May 15, 2020

Tested with MXNet cfb474b. Compiled with mostly-default cmake settings:

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..

Then when I run

export MKL_VERBOSE=1
export MKLDNN_VERBOSE=1
python3
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime

@kpuatamazon @kpu
Running on Ubuntu 18.04 [which doesn't have MKL installed by default] with the default cmake config doesn't use MKL as the BLAS.
Hence the exports above don't produce the MKL_VERBOSE output shown.

Thus, for the Ubuntu 18.04 base AMI, one has to install MKL in /opt/intel and update the cmake command to

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..

I found this uses MKL as the BLAS, and export MKL_VERBOSE=1 confirms it.
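
For example, a quick check (just an illustration; the shape is arbitrary) that dot really goes through MKL with this build:

MKL_VERBOSE=1 python3 -c "import mxnet as mx; a = mx.nd.ones((512, 512)); mx.nd.dot(a, a); mx.nd.waitall()"  # expect an MKL_VERBOSE SGEMM(...) line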

With this addition to both [default & workaround], I reran opperf and didn't see much of a performance difference.

Commands

Default

cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..

Workaround

export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..

Logs

Default

Batch_dot

MKL_VERBOSE SGEMM_BATCH(T,N,0x7fdafe19ecec,0x7fdafe19ecf0,0x7fdafe19ecf4,0x7fdafe19ed04,0x7fda6001dfd0,0x7fdafe19ecf8,0x7fda6001e490,0x7fdafe19ecfc,0x7fdafe19ed08,0x3fd7ec0,0x7fdafe19ed00,0x7fdafe19ec28,0x7fdafe 28.71ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
MKL_VERBOSE SGEMM_BATCH(T,N,0x7fdafe19ecec,0x7fdafe19ecf0,0x7fdafe19ecf4,0x7fdafe19ed04,0x7fda6001dfd0,0x7fdafe19ecf8,0x7fda6001e490,0x7fdafe19ecfc,0x7fdafe19ed08,0x3fd7ec0,0x7fdafe19ed00,0x7fdafe19ec28,0x7fdafe 28.53ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16

FC

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0551758
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0559082

Workaround

Batch_dot [same as default]

MKL_VERBOSE SGEMM_BATCH(T,N,0x7f985b78acec,0x7f985b78acf0,0x7f985b78acf4,0x7f985b78ad04,0x7f97b4016cd0,0x7f985b78acf8,0x7f97b401e550,0x7f985b78acfc,0x7f985b78ad08,0x26f2890,0x7f985b78ad00,0x7f985b78ac28,0x7f985b 28.72ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
MKL_VERBOSE SGEMM_BATCH(T,N,0x7f985b78acec,0x7f985b78acf0,0x7f985b78acf4,0x7f985b78ad04,0x7f97b4016cd0,0x7f985b78acf8,0x7f97b401e550,0x7f985b78acfc,0x7f985b78ad08,0x26f2890,0x7f985b78ad00,0x7f985b78ac28,0x7f985b 28.77ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16

FC [additional MKL_VERBOSE before DNNL_VERBOSE]

MKL_VERBOSE SGEMM(T,N,512,4,512,0x7f985b789c28,0x7f97b5e52e80,512,0x7f97b401e600,512,0x7f985b789c30,0x7f976e89dd80,512) 39.68us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb4ic512oc512,0.0769043
MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f985b789c28,0x7f976c400100,2048,0x7f976e887100,2048,0x7f985b789c30,0x7f976e8da000,512) 79.41us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.11377

Results

Operator LHS RHS MKL Default MKL Workaround
Dot (4, 512, 512) (4, 512, 512) 4.1112 4.8241
(5, 512, 512) (5, 512, 512) 6.4421 7.607
(5, 512, 1536) (5, 512, 1536) 20.3648 19.2217
(5, 512, 2048) (5, 512, 2048) 23.3236 23.2849
(5, 2048, 512) (5, 2048, 512) 123.1235 123.9806
Batch_dot (4, 512, 512) (4, 512, 512) 1.4105 1.407
(5, 512, 512) (5, 512, 512) 1.7558 1.7511
(5, 512, 1536) (5, 512, 1536) 6.5931 6.5585
(5, 512, 2048) (5, 512, 2048) 9.1452 9.1031
(5, 2048, 512) (5, 2048, 512) 29.0192 28.9236
Operator Data Weight MKL Default MKL Workaround
FC (4, 512) (512, 512) 0.057 0.0685
(5, 512) (512, 512) 0.0591 0.0698
(5, 512) (1536, 512) 0.0823 0.0939
(5, 512) (2048, 512) 0.0916 0.1026
(5, 2048) (512, 2048) 0.1146 0.1267

@kpu

kpu commented May 15, 2020

dot hasn't been taken over by MKLDNN = DNNL = oneAPI yet, so you shouldn't expect to see a difference there because it never calls MKLDNN's GEMM anyway. FullyConnected has been taken over by MKLDNN = DNNL = oneAPI. This inconsistency is part of the bug report. Your logs agree with these statements.

Do you really mean LHS (4, 512, 512)? I'm talking about LHS (4,512) RHS (512, 512).

What CPU is this on? Note that mine were on a Skylake Xeon c5.9xlarge. I'd expect the AVX512 kernels to be pretty different from the AVX2 kernels.

What OMP threading were you using? I'd expect for matrices this small that thread scaling is terrible for anything but a tiny number of threads. (I mentioned using 4).

@ChaiBapchya
Contributor

Yes, the logs agree with these statements, but the perf difference isn't visible via opperf.
My bad, I interpreted LHS/RHS wrongly, but that only affects dot and batch_dot. The FC values for data and weight are the same as yours.

CPU : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Instance p3.8xl

This instance doesn't have avx512

$ lscpu | grep Flags | grep avx
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt

The OMP threading I used was the default.

However, trying FC with OMP_NUM_THREADS=4 still shows that the workaround makes FC slower [despite it using MKL] than the default [which doesn't use MKL].

Workaround [slow]

MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f0b79ff8c28,0x7f0b62cfb040,2048,0x7f0b6800b740,2048,0x7f0b79ff8c30,0x7f0b68074e80,512) 119.93us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:4
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.142822
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0925}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0921}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_forward_FullyConnected': 0.1441}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_forward_FullyConnected': 0.1688}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.1773}]}]

Default [faster]

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.11499
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.115967
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0691}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.0669}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_forward_FullyConnected': 0.1166}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048},  'avg_time_forward_FullyConnected': 0.1438}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_forward_FullyConnected': 0.1509}]}]

Python's timeit function in OpPerf

Also, instead of using MXNet's native [built-in] profiler, if I use Python's timeit function, the result is still the same:

Workaround [slow]

MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f86067f9c28,0x7f85e34fd040,2048,0x7f85f000b740,2048,0x7f86067f9c30,0x7f85f005fc80,512) 120.56us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:4
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.143066
[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.18504947423934937, 'p50_time_FullyConnected': 0.1817569718696177, 'p90_time_FullyConnected': 0.19327133195474744, 'p99_time_FullyConnected': 0.21545797004364425}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.1922117848880589, 'p50_time_FullyConnected': 0.18283800454810262, 'p90_time_FullyConnected': 0.19954713061451912, 'p99_time_FullyConnected': 0.36291924072429527}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.24743830785155296, 'p50_time_FullyConnected': 0.23927190341055393, 'p90_time_FullyConnected': 0.25965459644794464, 'p99_time_FullyConnected': 0.3613424429204313}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.2728361077606678, 'p50_time_FullyConnected': 0.2635499695315957, 'p90_time_FullyConnected': 0.2853184472769499, 'p99_time_FullyConnected': 0.39780672872439016}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.2834270987659693, 'p50_time_FullyConnected': 0.27543300529941916, 'p90_time_FullyConnected': 0.29397032922133803, 'p99_time_FullyConnected': 0.40460130083374674}]}]

Default [faster]

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.116943
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.115967
[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.17584662651643157, 'p50_time_FullyConnected': 0.17386197578161955, 'p90_time_FullyConnected': 0.18297710921615362, 'p99_time_FullyConnected': 0.2026369632221758}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.177996635902673, 'p50_time_FullyConnected': 0.17414247849956155, 'p90_time_FullyConnected': 0.18130375538021326, 'p99_time_FullyConnected': 0.23936283076181997}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.22948215948417783, 'p50_time_FullyConnected': 0.22402300965040922, 'p90_time_FullyConnected': 0.2324321656487882, 'p99_time_FullyConnected': 0.3020400775130837}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.2527528256177902, 'p50_time_FullyConnected': 0.24718954227864742, 'p90_time_FullyConnected': 0.263673288282007, 'p99_time_FullyConnected': 0.3303098364267497}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.26115449611097574, 'p50_time_FullyConnected': 0.25852309772744775, 'p90_time_FullyConnected': 0.26592351496219635, 'p99_time_FullyConnected': 0.304172352189198}]}]

Since Skylake has AVX512, I'm guessing this FC slowdown might be in the AVX512 kernels. Let me reproduce that.

@ChaiBapchya
Contributor

I can confirm that this issue is specific to the AVX512 kernels.
Tried this on c5.xl
$ lscpu

Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke

Results

Default [slower]

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.133789
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.132812
[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.10202302001744101, 'p50_time_FullyConnected': 0.10086749989568489, 'p90_time_FullyConnected': 0.10658760029400582, 'p99_time_FullyConnected': 0.13521948004836298}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.10642346004715364, 'p50_time_FullyConnected': 0.09991750016524747, 'p90_time_FullyConnected': 0.10565369971118344, 'p99_time_FullyConnected': 0.2586996700802042}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.16890607999812346, 'p50_time_FullyConnected': 0.16431500012004108, 'p90_time_FullyConnected': 0.1781331999154645, 'p99_time_FullyConnected': 0.2831235897247094}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.20140223995440465, 'p50_time_FullyConnected': 0.19778950013460417, 'p90_time_FullyConnected': 0.20401089991537447, 'p99_time_FullyConnected': 0.3063294199228036}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.21596427998701984, 'p50_time_FullyConnected': 0.2096700000038254, 'p90_time_FullyConnected': 0.21819640001012885, 'p99_time_FullyConnected': 0.3412436299549877}]}]

MKL Workaround [Faster]

MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f9bcf6fac28,0x7f9bc22f4040,2048,0x7f9b1400ce80,2048,0x7f9bcf6fac30,0x7f9b1405e840,512) 21.25us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:18
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0378418
MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f9bcf6fac28,0x7f9bc22f4040,2048,0x7f9b1400ce80,2048,0x7f9bcf6fac30,0x7f9b14061c00,512) 20.94us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:18
dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0371094
[{'FullyConnected': [
{'inputs': {'data': (4, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.11772135999308375, 'p50_time_FullyConnected': 0.1149684999290912, 'p90_time_FullyConnected': 0.1244978000613628, 'p99_time_FullyConnected': 0.14825501980340045}, 
{'inputs': {'data': (5, 512), 'weight': (512, 512), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.120828840035756, 'p50_time_FullyConnected': 0.11370450010872446, 'p90_time_FullyConnected': 0.12752780021401122, 'p99_time_FullyConnected': 0.2412066401620902}, 
{'inputs': {'data': (5, 512), 'weight': (1536, 512), 'no_bias': True, 'num_hidden': 1536}, 'avg_time_FullyConnected': 0.13385597998421872, 'p50_time_FullyConnected': 0.12600750005731243, 'p90_time_FullyConnected': 0.14806160011175962, 'p99_time_FullyConnected': 0.2509373301927551}, 
{'inputs': {'data': (5, 512), 'weight': (2048, 512), 'no_bias': True, 'num_hidden': 2048}, 'avg_time_FullyConnected': 0.14175208003507578, 'p50_time_FullyConnected': 0.1372545000322134, 'p90_time_FullyConnected': 0.14401020002878798, 'p99_time_FullyConnected': 0.2423993399725075}, 
{'inputs': {'data': (5, 2048), 'weight': (512, 2048), 'no_bias': True, 'num_hidden': 512}, 'avg_time_FullyConnected': 0.143890859962994, 'p50_time_FullyConnected': 0.1397979999637755, 'p90_time_FullyConnected': 0.14637689982919258, 'p99_time_FullyConnected': 0.22678783964693117}]}]

To reproduce
https://gist.github.com/ChaiBapchya/a849cfd566b8114e695454850b48077b
https://gist.github.com/ChaiBapchya/5f2342f75ddeb1e21f14acac665c76ad#file-benchmark_intel_mkl-py

@aaraujom

Hi, thanks for checking. Let me try to reproduce this on my end and fix it.

@ChaiBapchya
Contributor

@aaraujom any update?

@aaraujom

aaraujom commented Jun 3, 2020

Hi @ChaiBapchya - I was able to reproduce the issue on my end. DNNL is missing some kernels for very specific problem sizes, which affects the gemm sizes in this thread. We will address this issue in DNNL.

@kpuatamazon
Contributor Author

Mostly fixed by the v1.6.x release of oneDNN. There is still a speed discrepancy but it's not as bad.

git checkout 6b568fd2c10e8cc14c21d6150ba71933a2af2a41 (current v1.x branch of MXNet including oneDNN 1.6.1)

Shape (5, 512, 512)
0.0000471 seconds for fullyconnected (DNNL)
0.0000413 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0000691 seconds for fullyconnected (DNNL)
0.0000639 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0000936 seconds for fullyconnected (DNNL)
0.0000868 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0000949 seconds for fullyconnected (DNNL)
0.0000870 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000446 seconds for fullyconnected (DNNL)
0.0000392 seconds for dot (MKL)

Reverting to oneDNN v1.3:
git checkout 6b568fd2c10e8cc14c21d6150ba71933a2af2a41 && git revert 17111037176294d86b7dc6f719347aa24c96b317 && git revert eae6171cb56190cf06f5b11597380c082745991b

Shape (5, 512, 512)
0.0000842 seconds for fullyconnected (DNNL)
0.0000407 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0001908 seconds for fullyconnected (DNNL)
0.0000606 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0002421 seconds for fullyconnected (DNNL)
0.0000880 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0003462 seconds for fullyconnected (DNNL)
0.0000912 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000841 seconds for fullyconnected (DNNL)
0.0000395 seconds for dot (MKL)

@aaraujom

@ChaiBapchya - Will those improvements work for you?

@ChaiBapchya
Contributor

Is there a way to bring down the deficit and reach parity? Or do we know the reason why DNNL is still slower? @TaoLv @aaraujom

@b-shi

b-shi commented Aug 18, 2020

Hi, we benchmarked gemm performance for those sizes internally using oneDNN vs oneMKL directly and here is what we got:

Max (20K trials) GFlops oneMKL (1Th) oneDNN (1Th) oneMKL (4Th) oneDNN (4Th) oneMKL (8Th) oneDNN (8Th) oneMKL (28Th) oneDNN (28Th)
5, 512, 512 34.49 40.96 174.76 262.14 291.27 436.91 374.49 524.29
5, 512, 1536 26.57 27.40 178.73 231.30 357.47 491.52 561.74 873.81
5, 512, 2048 25.83 26.55 143.64 174.76 338.25 436.91 616.81 1048.58
5, 2048, 512 25.51 26.82 139.81 183.96 388.36 524.29 873.81 1310.72
4, 512, 512 31.78 42.80 161.32 233.02 262.14 419.43 299.59 419.43
Avg (20K trials) GFlops oneMKL (1Th) oneDNN (1Th) oneMKL (4Th) oneDNN (4Th) oneMKL (8Th) oneDNN (8Th) oneMKL (28Th) oneDNN (28Th)
5, 512, 512 33.96 39.45 169.05 233.88 258.32 364.93 328.68 399.95
5, 512, 1536 26.30 27.13 174.04 226.75 342.51 463.81 492.14 765.09
5, 512, 2048 25.58 26.35 140.57 171.42 328.70 429.25 589.81 873.36
5, 2048, 512 25.27 26.57 137.35 179.91 368.89 508.38 792.29 1101.67
4, 512, 512 31.02 41.91 146.59 216.20 226.68 335.54 271.98 334.98

@TaoLv
Member

TaoLv commented Aug 19, 2020

@ChaiBapchya Is it possible to compare the performance of FC between the two versions? Comparing FC with dot is not a fair comparison to me. As you may know, they use different execution paths, FCompute vs. FComputeEx.

@pengzhao-intel
Contributor

Closing since the fix is included with the updated MKLDNN.

@ChaiBapchya
Contributor

@pengzhao-intel v1.7.0 public MXNet uses MKLDNN v1.3 and hence won't have this optimization, right?
https://github.com/apache/incubator-mxnet/tree/1.7.0/3rdparty
This means public MXNet v1.7 won't have this fix. v1.8, on the contrary, would have the optimized MKLDNN [by virtue of the v1.x branch having MKLDNN v1.6+ prior to 1.8.rc0], right?

@pengzhao-intel
Contributor

@pengzhao-intel v1.7.0 public MXNet uses MKLDNN v1.3 and hence won't have this optimization, right?
https://github.com/apache/incubator-mxnet/tree/1.7.0/3rdparty
This means public MXNet v1.7 won't have this fix. v1.8, on the contrary, would have the optimized MKLDNN [by virtue of the v1.x branch having MKLDNN v1.6+ prior to 1.8.rc0], right?

Thanks for raising the question. The China team is on a public holiday this week.
@anko-intel could you help take a look at the issue?

@anko-intel
Contributor

@ChaiBapchya, you are right, we also have to include 1.6.4 (#19251) to make it happen.
