inner product perf issue w/ transposed weight on single thread #632

Closed
pinzhenx opened this issue Jan 14, 2020 · 11 comments

@pinzhenx
Contributor

pinzhenx commented Jan 14, 2020

./benchdnn --matmul --stag=ab --wtag=ab --mode=p m16n1040k1040
total perf: min(ms):0.0354004 avg(ms):0.046758

./benchdnn --matmul --stag=ab --wtag=ba --mode=p m16n1040k1040
total perf: min(ms):0.110352 avg(ms):0.129114
@rsdubtso rsdubtso added performance sighting Suspicious library behavior. Should be promoted to a bug when confirmed and removed question labels Jan 14, 2020
@pinzhenx
Contributor Author

pinzhenx commented Jan 14, 2020

Sorry for the confusion the original issue may have caused. Let me elaborate.

This actually came from an inner_product case that we compared against mkldnn. With the weight format forced to ab, DNNL appeared somewhat slower than mkldnn when running on a single thread.

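| threads | mkldnn (ms) | dnnl (ms) |
|---------|-------------|-----------|
| 1       | 0.49        | 0.61      |
| 2       | 0.33        | 0.32      |
| 3       | 0.24        | 0.25      |
| 4       | 0.20        | 0.21      |
| 5       | 0.18        | 0.17      |
| 6       | 0.16        | 0.15      |
| 7       | 0.15        | 0.15      |
| 8       | 0.14        | 0.14      |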

chart

# Test
Iteration: 1000
mkldnn_verbose,exec,inner_product,gemm:jit,forward_training,fsrc:nc fwei:oi fbia:x fdst:nc,,mb16ic1040oc1040,0.425049 
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_training,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,,,mb16ic1040oc1040,0.584961
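For anyone who wants to compare the two verbose lines above programmatically, here is a minimal sketch. It assumes that the last comma-separated field is the execution time in ms and that the field before it is the problem descriptor, which holds for both lines quoted here. `parse_verbose` is a hypothetical helper, not part of either library.

```python
def parse_verbose(line):
    """Return (problem descriptor, exec time in ms) from one verbose line.

    Assumption: the two trailing comma-separated fields are the problem
    descriptor and the time, as in the lines quoted in this comment.
    """
    _head, prb, time_ms = line.strip().rsplit(",", 2)
    return prb, float(time_ms)


mkldnn_line = (
    "mkldnn_verbose,exec,inner_product,gemm:jit,forward_training,"
    "fsrc:nc fwei:oi fbia:x fdst:nc,,mb16ic1040oc1040,0.425049 "
)
dnnl_line = (
    "dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_training,"
    "src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_f32::blocked:a:f0 "
    "dst_f32::blocked:ab:f0,,,mb16ic1040oc1040,0.584961"
)

prb_old, t_old = parse_verbose(mkldnn_line)
prb_new, t_new = parse_verbose(dnnl_line)

# Only compare timings when both lines describe the same problem.
assert prb_old == prb_new

slowdown = t_new / t_old
print(f"{prb_new}: {t_old} ms -> {t_new} ms ({slowdown:.2f}x)")
```

Both libraries report the same `gemm:jit` implementation here, so the ratio reflects a difference inside that path rather than a different primitive being selected.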

# Environment
export LD_PRELOAD=/opt/intel/compilers_and_libraries/linux/lib/intel64_lin/libiomp5.so
export KMP_AFFINITY=granularity=fine,compact,1,0
v1.2.0 (commit 4ea278bf2089e7c798203762ffca976fcc109b51)
v0.20.5 (commit 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)

Since we couldn't find a way to reproduce exactly the same case with benchdnn 0.2x, I used matmul in the original report, which was quite misleading.

@pinzhenx pinzhenx changed the title Matmul performance on ba weight? inner product perf issue w/ transposed weight on single thread Jan 14, 2020
@pinzhenx
Contributor Author

Found a similar issue: #525

BTW, the case above was extracted from dlrm

@vpirogov
Member

@pinzhenx, the fix for #525 is promoted. Could you please validate that it solves your problem?

@vpirogov vpirogov self-assigned this Jan 16, 2020
@pinzhenx
Contributor Author

pinzhenx commented Jan 17, 2020

Hi @vpirogov
It's working fine. Closing this issue.

@pinzhenx pinzhenx reopened this Jan 17, 2020
@pinzhenx
Contributor Author

pinzhenx commented Jan 17, 2020

Same issue in kernel dispatching. Here's another shape, mb16ic100oc1040:

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_training,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,,,mb16ic100oc1040
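| threads | mkldnn 0.20.5 (ms) | dnnl 1.2.0 (ms) | delta  |
|---------|--------------------|-----------------|--------|
| 1       | 0.0340785          | 0.0495767       | 45.48% |
| 2       | 0.0276547          | 0.0279831       | 1.19%  |
| 3       | 0.0201197          | 0.0204975       | 1.88%  |
| 4       | 0.0167322          | 0.0176753       | 5.64%  |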
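The delta column above appears to be the relative slowdown of dnnl 1.2.0 against mkldnn 0.20.5, i.e. `(dnnl - mkldnn) / mkldnn`. A short sketch that recomputes it from the reported timings (the formula is inferred from the numbers, not stated in the thread):

```python
# (threads, mkldnn 0.20.5 time, dnnl 1.2.0 time) as reported above.
rows = [
    (1, 0.0340785, 0.0495767),
    (2, 0.0276547, 0.0279831),
    (3, 0.0201197, 0.0204975),
    (4, 0.0167322, 0.0176753),
]


def delta_pct(old, new):
    """Relative change of `new` against `old`, in percent."""
    return (new - old) / old * 100.0


deltas = [delta_pct(old, new) for _, old, new in rows]

for (threads, _, _), d in zip(rows, deltas):
    print(f"{threads} thread(s): {d:.2f}%")
```

Only the single-thread row shows a large gap. The multi-thread rows stay within a few percent.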

@aaraujom
Contributor

Hi @pinzhenx - Just to confirm, is this on an AVX512-enabled system?

@vpirogov vpirogov assigned aaraujom and unassigned vpirogov Jan 23, 2020
@pinzhenx
Contributor Author

@aaraujom yes

@aaraujom
Contributor

Hi @pinzhenx - Would a fix for master branch be good enough for you? I have a fix and it should be available in master soon.


On master (e6a24ce):

$ OMP_NUM_THREADS=1 ./benchdnn --ip --mode=p --stag=ab --wtag=ab mb16ic100oc1040
Output template: perf,%engine%,%name%,%prb%,%Gops%,%Gfreq%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,,--ip --stag=ab --wtag=ab mb16ic100oc1040,0.003328,0,0.0683594,48.6839,0.0694891,47.8924
tests:1 passed:0 skipped:0 mistrusted:0 unimplemented:0 failed:0 listed:0
total perf: min(ms):0.0683594 avg(ms):0.0694891

After fix:

$ OMP_NUM_THREADS=1 ./benchdnn --ip --mode=p --stag=ab --wtag=ab mb16ic100oc1040
Output template: perf,%engine%,%name%,%prb%,%Gops%,%Gfreq%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,,--ip --stag=ab --wtag=ab mb16ic100oc1040,0.003328,0,0.0456543,72.8957,0.0464355,71.6694
tests:1 passed:0 skipped:0 mistrusted:0 unimplemented:0 failed:0 listed:0
total perf: min(ms):0.0456543 avg(ms):0.0464355
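As a sanity check on the two benchdnn reports above, the Gops and Gflops columns can be reproduced from the problem descriptor and the min time. This is a sketch that assumes the operation count is `2 * mb * ic * oc` (one multiply and one add per weight element per batch row, bias ignored). That assumption matches the reported 0.003328 Gops. `ip_flops` and `gflops` are hypothetical helper names.

```python
import re


def ip_flops(prb):
    """Operation count for an inner product descriptor like 'mb16ic100oc1040'.

    Assumption: 2 * mb * ic * oc, which matches the Gops column above.
    """
    mb, ic, oc = map(int, re.fullmatch(r"mb(\d+)ic(\d+)oc(\d+)", prb).groups())
    return 2 * mb * ic * oc


def gflops(prb, time_ms):
    """Throughput in Gflops for a given execution time in milliseconds."""
    return ip_flops(prb) / (time_ms * 1e-3) / 1e9


prb = "mb16ic100oc1040"
before_ms = 0.0683594  # min(ms) on master (e6a24ce)
after_ms = 0.0456543   # min(ms) after the fix

speedup = before_ms / after_ms

print(f"Gops:    {ip_flops(prb) / 1e9}")
print(f"before:  {gflops(prb, before_ms):.4f} Gflops")
print(f"after:   {gflops(prb, after_ms):.4f} Gflops")
print(f"speedup: {speedup:.2f}x")
```

The recomputed values agree with the `%-Gflops%` column (48.6839 before, 72.8957 after), so the fix gives roughly a 1.5x single-thread speedup on this shape.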

@WilliamTambellini
Contributor

Hi @aaraujom Is it too late for that fix to be in 1.2 ? Tks, W.

@aaraujom
Contributor

aaraujom commented Jan 31, 2020

Hi @WilliamTambellini, we can do a patch release from the v1.2 branch containing this fix. Would that work for you?

@vpirogov vpirogov added this to the v1.3 milestone Feb 3, 2020
@pinzhenx
Contributor Author

@aaraujom Sorry for my late reply. I think it's good enough. Thanks so much!
