When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL #17980
Thanks for raising the issue, @kpuatamazon. We plan to optimize the `dot` and `batch_dot` operators with the DNNL MatMul primitive. But please note that the performance of the MatMul primitive was not optimized until v1.3, which was released last week. That's why we didn't integrate the primitive when it was first introduced in v1.2.
As mentioned above, `dot` and `batch_dot` will also be optimized with DNNL. And as DNNL is more dedicated to deep learning, we will give DNNL higher priority when both DNNL and MKL are enabled at compile time. Please provide a simple reproducer if you find that `FullyConnected` based on DNNL is slower than `FullyConnected` based on MKL BLAS.
Here are some benchmarks on 2fff11d (current tip of master) with the same Skylake c5.9xlarge. DNNL is substantially slower than MKL. (Which DNNL is in master?)

```python
#!/usr/bin/env python3
import mxnet as mx
import time

def time_procedure(shape, count, proc):
    rows, inner, cols = shape
    a = mx.nd.random_uniform(shape=(rows, inner), low=-1.0, high=1.0)
    b = mx.nd.random_uniform(shape=(cols, inner), low=-1.0, high=1.0)
    # Burn in
    proc(a, b, cols)
    mx.nd.waitall()
    begin = time.time()
    for i in range(0, count):
        proc(a, b, cols)
        mx.nd.waitall()
    return (time.time() - begin) / count

shapes = [(5, 512, 512), (5, 512, 1536), (5, 512, 2048), (5, 2048, 512), (4, 512, 512)]
count = 1000

procedures = {
    "fullyconnected (DNNL)" : (lambda a, b, cols : mx.nd.FullyConnected(a, b, no_bias=True, num_hidden=cols)),
    "dot (MKL)" : (lambda a, b, cols : mx.nd.dot(a, b, transpose_b=True))
}
for s in shapes:
    print("Shape " + str(s))
    stats = {}
    for name, l in procedures.items():
        stats[name] = time_procedure(s, count, l)
        print("{:.7f} seconds for {}".format(stats[name], name))
```

Run as
I don't really mind what the default BLAS implementation is. But choosing to use MKL should be 1) atomic and 2) not require undocumented compilation options.
I also tried the latest DNNL:

```
cd 3rdparty
rm -rf mkldnn
git clone https://github.com/oneapi-src/oneDNN mkldnn
cd ../build && ninja -j 32
```

and DNNL is still slower:
Thank you @kpuatamazon. I will try to reproduce the performance issue and get back to you later.
It should be v1.2.2 but was downgraded to v1.1.2 by #17084 (maybe by accident).
Not sure I understand what you mean. But using MKL BLAS is documented in:
This still uses MKLDNN for
I was able to get MKL to run by disabling MKLDNN entirely:
But then I lose other kernels like DNNL softmax and actually Sockeye randomly crashes:
(Bizarrely, this happens after it translated many sentences with that same code.) How do I achieve the fastest combination of DNNL softmax and MKL matrix multiply for
Yes,
I don't think it's possible at this moment. I think the real question is how to improve the performance of FullyConnected when DNNL is used. In fact, I see that the performance of the DNNL primitive kernel is good compared with the output of the script. Take the last shape as an example (on my machine, not ec2):
You can set `DNNL_VERBOSE=1` when running the script to get the verbose output.
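For anyone following along, the same can be done from inside Python as long as the variables are set before `mxnet` is imported; a small sketch (the exact verbose line format depends on the DNNL/MKL versions):

```python
import os

# Must be set before mxnet (and thus DNNL/MKL) is loaded.
os.environ["DNNL_VERBOSE"] = "1"  # traces DNNL primitive executions
os.environ["MKL_VERBOSE"] = "1"   # traces MKL BLAS calls

import mxnet as mx

a = mx.nd.random_uniform(shape=(4, 512), low=-1.0, high=1.0)
w = mx.nd.random_uniform(shape=(512, 512), low=-1.0, high=1.0)

# FullyConnected should show up in the dnnl_verbose output,
# while dot should show up in the MKL_VERBOSE output.
mx.nd.FullyConnected(a, w, no_bias=True, num_hidden=512).wait_to_read()
mx.nd.dot(a, w, transpose_b=True).wait_to_read()
```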
Can we please have a build option to force MKL usage for all GEMM calls? This is a pretty big performance regression.
Hmmm, when DNNL delegates to MKL, that happens internally to DNNL, right? If the MXNet wrapper is the problem, then why is delegating to MKL much faster? Applying my workaround:
Output:
There is something less efficient about

FYI I usually wear this hat (as opposed to @kpu) on Mondays so expect slower responses on other days.
@emfomenk What's
@TaoLv, it is a developer option to bring Intel MKL gemm back (instead of the jitted one). I would suggest avoiding it, as it is an internal option (marked with a leading underscore). I think it was even broken somewhere between v1.1 and v1.3...
@emfomenk Thanks for the explanation. Do you think that enabling this option can improve the performance of the inner product primitive for the above cases?
@bartekkuncer please help take a look at this issue as well.
@TaoLv, yes, this could be the case. Two reasons for that:

I would also suggest trying to upgrade the library to the most recent version and checking the performance.
MKL was linked in both cases and in fact called in both cases from
I tried single-threaded (
See my post above "I also tried the latest DNNL". Performance was unchanged.
Yeah, there should be zero difference in this case :)
Didn't see that, thanks! Yeah, it looks like a performance bug indeed.
Single-threaded benchmarks (
Summoning @aaraujom |
@TaoLv Regarding your comment
I ran with both MKL and DNNL verbose options:

To summarize the verbose output in microseconds:
Raw is the time reported by the verbose output; note I converted DNNL ms to microseconds. In both cases, I excluded the first call for each shape to allow for JIT/burn-in. So we really have four issues:
Intel should fix DNNL but that will take time. MXNet should make
In case somebody finds this issue and wants their optimized build, here is a different workaround that removes the need for `LD_PRELOAD`:

```
export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
```

Then

```
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
```

and the compiled MXNet can be run normally without any special environment variables. To be clear, the above kludge is an undocumented abomination. The problem with
Does DNNL always link to MKL by default if MKL is available on the system, just as MXNet links to MKL in such a situation? Further, regarding the double-linking issue, #17794 should fix it.
DNNL has only a hidden unsupported option to link to MKL. It will not link to MKL by default.
@kpuatamazon Hi, I was trying to benchmark MKL [default] vs the workaround using OpPerf.
I tried the undocumented abominable kludge option you mentioned and that worked smoothly.
Script for OpPerf: https://gist.github.com/ChaiBapchya/5f2342f75ddeb1e21f14acac665c76ad

Results
However, @kpuatamazon, when I try to test out the default [i.e. default -> workaround -> default] by unsetting the environment variable LD_PRELOAD, it failed to build the default with
@ChaiBapchya to be clear, here's how I am building the second option:

```
export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
unset LD_PRELOAD # Technically this should be what exists in your environment by default
rm -rf build
mkdir build
cd build
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
ninja -j 30
```

Note that
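A quick way to confirm what a given build ended up with (my suggestion, assuming an MXNet recent enough to ship `mxnet.runtime.Features`, i.e. 1.5+):

```python
from mxnet.runtime import Features

features = Features()
# MKLDNN means the DNNL/MKLDNN backend was compiled in;
# BLAS_MKL means MKL was selected as the BLAS library.
print("MKLDNN  :", features.is_enabled("MKLDNN"))
print("BLAS_MKL:", features.is_enabled("BLAS_MKL"))
```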
@kpuatamazon @kpu Thus, for the Ubuntu 18.04 base AMI, one has to install MKL in /opt/intel and update the cmake command to
This I found uses MKL as BLAS, and `export MKL_VERBOSE=1` confirms it. With this addition to both [default & workaround], I reran OpPerf and didn't see much perf difference.

Commands

Default
Workaround
Logs

Default

Batch_dot
FC
Workaround

Batch_dot [same as default]
FC [additional MKL_VERBOSE before DNNL_VERBOSE]
Results
Do you really mean LHS (4, 512, 512)? I'm talking about LHS (4, 512), RHS (512, 512). What CPU is this on? Note that mine were on a Skylake Xeon. What OMP threading were you using? I'd expect for matrices this small that thread scaling is terrible for anything but a tiny number of threads. (I mentioned using 4.)
Yes, the logs agree with these statements. But the perf difference isn't visible via OpPerf.

CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz. This instance doesn't have AVX512.
The OMP threading I used was the default. However, trying FC with OMP_NUM_THREADS=4 still shows that the workaround makes FC slower [despite it using MKL] than the default [which doesn't use MKL].

Workaround [slow]
Default [faster]
Python's timeit function in OpPerf

Also, instead of using MXNet's native [built-in] profiler, if I use Python's timeit function, still:

Workaround [slow]
Default [faster]
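For reference, a rough sketch of the timeit approach (not the exact OpPerf gist; shapes assumed to match the (4, 512, 512) case discussed above):

```python
import timeit
import mxnet as mx

a = mx.nd.random_uniform(shape=(4, 512), low=-1.0, high=1.0)
b = mx.nd.random_uniform(shape=(512, 512), low=-1.0, high=1.0)

def fc():
    mx.nd.FullyConnected(a, b, no_bias=True, num_hidden=512)
    mx.nd.waitall()  # force the asynchronous call to finish before the clock stops

def dot():
    mx.nd.dot(a, b, transpose_b=True)
    mx.nd.waitall()

print("FullyConnected: {:.7f} s/call".format(timeit.timeit(fc, number=1000) / 1000))
print("dot           : {:.7f} s/call".format(timeit.timeit(dot, number=1000) / 1000))
```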
Since Skylake has AVX512, I'm guessing this FC slowdown might be in an AVX512 kernel. Let me reproduce that.
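(Aside: a quick way to check on Linux whether an instance's CPU reports AVX-512, which is what distinguishes the c5 Skylake boxes from the E5-2686 v4 one above; my addition, not from the thread.)

```python
# Linux-only: look for AVX-512 flags in /proc/cpuinfo.
with open("/proc/cpuinfo") as f:
    flags = next(line for line in f if line.startswith("flags"))
print("has avx512f:", "avx512f" in flags.split())
```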
Can confirm that this issue is specific to AVX512 kernels.
Results

Default [slower]
MKL Workaround [Faster]
To reproduce
Hi, thanks for checking. Let me try to reproduce this on my end and fix it.
@aaraujom any update?
Hi @ChaiBapchya - I was able to reproduce the issue on my end. DNNL is missing some kernels for very specific problem sizes, affecting the gemm sizes in this thread. We will address this issue in DNNL.
Mostly fixed by the v1.6.x release of oneDNN. There is still a speed discrepancy but it's not as bad.
Revert to oneDNN v1.3.
@ChaiBapchya - Will those improvements work for you?
Hi, we benchmarked gemm performance for those sizes internally using oneDNN vs oneMKL directly and here is what we got:
@ChaiBapchya Is it possible to compare the performance of FC in the two versions? Comparing FC with dot is not so fair to me. As you may know, they use different execution paths, FCompute vs. FComputeEx.
Closing since the fix is included with the new MKLDNN.
@pengzhao-intel The public MXNet v1.7.0 uses MKLDNN v1.3 and hence won't have this optimization, right?
Thanks for raising the question. The China team is on a public holiday this week.
@ChaiBapchya, you are right; we also have to include 1.6.4 (#19251) to make it happen.
Problem
Not sure how much we care about MKL support, but to the extent it still appears in the build system, operator support should be consistent.
When compiled with MKL present (MKL is found in `/opt/intel`), MXNet calls MKL for `dot` and `batch_dot` and DNNL for `fully_connected`. These are all GEMM operators; why is it inconsistent? This is making Sockeye decoding 22% slower (see below).

This inconsistency did not matter much in MXNet 1.5.0 because MKLDNN would delegate to MKL. However, aa1074d upgraded to MKLDNN 1.0, which hid the ability of MKLDNN to delegate to MKL: oneapi-src/oneDNN@3049150. (MKLDNN has since been renamed DNNL.)

Since MKLDNN only hid support for delegating to MKL, it's possible to restore delegation (see workaround).
Testing
Tested with MXNet cfb474b. Compiled with mostly-default cmake settings:

Then when I run

You can see DNNL is called for `FullyConnected` while MKL is called for `dot` and `batch_dot`.

Performance impact
I timed Sockeye decoding. Commit aa1074d made decoding 22% slower (416.878s, up from 342.037s for b5d07e3) even with MKL installed in `/opt/intel/`.

(Default compilation is `cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..`; workaround compilation is below.)

Tested on a Skylake Xeon, c5.9xlarge (Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz) with `OMP_NUM_THREADS=4`.

Workaround
Since DNNL hid support for delegating to MKL, it's still possible to turn delegation back on.

which compiles but then triggers a link error at runtime:

`OSError: /home/ubuntu/mxnet/build/3rdparty/mkldnn/src/libmkldnn.so.1: undefined symbol: cblas_gemm_s8u8s32_pack`

So I kludged it with

`export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so`

and was then able to use MXNet at runtime. There's probably a cleaner way of fixing the linkage.

Recommended fix
When compiled with MKL, MXNet should call MKL directly from `FullyConnected` like it already does for `dot` and `batch_dot`.