inner product perf issue w/ transposed weight on single thread #632

pinzhenx · 2020-01-14T05:54:39Z

./benchdnn --matmul --stag=ab --wtag=ab --mode=p m16n1040k1040
total perf: min(ms):0.0354004 avg(ms):0.046758

./benchdnn --matmul --stag=ab --wtag=ba --mode=p m16n1040k1040
total perf: min(ms):0.110352 avg(ms):0.129114


pinzhenx · 2020-01-14T11:08:54Z

Sorry for the confusion the original issue may have caused. Let me elaborate.

The problem actually came from inner_product. When we compared DNNL against mkldnn with the weight format forced to ab, DNNL was noticeably slower on a single thread.

threads	mkldnn (ms)	dnnl (ms)
1	0.49	0.61
2	0.33	0.32
3	0.24	0.25
4	0.20	0.21
5	0.18	0.17
6	0.16	0.15
7	0.15	0.15
8	0.14	0.14

# Test
Iteration: 1000
mkldnn_verbose,exec,inner_product,gemm:jit,forward_training,fsrc:nc fwei:oi fbia:x fdst:nc,,mb16ic1040oc1040,0.425049 
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_training,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,,,mb16ic1040oc1040,0.584961
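To compare the two verbose lines above directly, here is a minimal Python sketch. It assumes only what those two lines show: the last comma-separated field is the execution time in ms and the field before it is the problem descriptor. The helper name `parse_verbose_line` is hypothetical and not part of either library.

```python
def parse_verbose_line(line):
    """Return (library_tag, problem, time_ms) from one verbose 'exec' line.

    Assumption: the last comma-separated field is the time in ms and the
    one before it is the problem descriptor, as in the two lines above.
    """
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) < 3 or fields[1] != "exec":
        raise ValueError("not a verbose exec line: %r" % line)
    return fields[0], fields[-2], float(fields[-1])


verbose_lines = [
    "mkldnn_verbose,exec,inner_product,gemm:jit,forward_training,"
    "fsrc:nc fwei:oi fbia:x fdst:nc,,mb16ic1040oc1040,0.425049 ",
    "dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_training,"
    "src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_f32::blocked:a:f0 "
    "dst_f32::blocked:ab:f0,,,mb16ic1040oc1040,0.584961",
]

parsed = [parse_verbose_line(l) for l in verbose_lines]
times = {lib: t for lib, _, t in parsed}

# Ratio > 1 means the dnnl call took longer than the mkldnn call.
ratio = times["dnnl_verbose"] / times["mkldnn_verbose"]
print("dnnl / mkldnn time ratio: %.3f" % ratio)
```

For these two single samples the ratio comes out at roughly 1.38, in line with the single-thread gap in the table.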

# Environment
export LD_PRELOAD=/opt/intel/compilers_and_libraries/linux/lib/intel64_lin/libiomp5.so
export KMP_AFFINITY=granularity=fine,compact,1,0
dnnl v1.2.0 (commit 4ea278bf2089e7c798203762ffca976fcc109b51)
mkldnn v0.20.5 (commit 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)

Since we couldn't find a way to reproduce exactly the cases above with benchdnn from 0.2x, I used matmul in the original report, which was quite misleading.

pinzhenx · 2020-01-14T11:51:45Z

Found a similar issue: #525

BTW, the case above was extracted from dlrm

vpirogov · 2020-01-16T22:27:37Z

@pinzhenx, the fix for #525 is promoted. Could you please validate that it solves your problem?

pinzhenx · 2020-01-17T02:09:46Z

Hi @vpirogov
It's working fine. Closing this issue.

pinzhenx · 2020-01-17T08:34:11Z

Same kernel dispatching issue with another shape, mb16ic100oc1040:

dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_training,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,,,mb16ic100oc1040

threads	mkldnn 0.20.5	dnnl 1.2.0	delta
1	0.0340785	0.0495767	45.48%
2	0.0276547	0.0279831	1.19%
3	0.0201197	0.0204975	1.88%
4	0.0167322	0.0176753	5.64%
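The delta column is consistent with the relative slowdown of dnnl over mkldnn, i.e. (dnnl - mkldnn) / mkldnn. A short Python sketch that recomputes it from the timings in the table (the function name `slowdown_pct` is illustrative only):

```python
# (threads, mkldnn 0.20.5 time, dnnl 1.2.0 time), copied from the table above.
rows = [
    (1, 0.0340785, 0.0495767),
    (2, 0.0276547, 0.0279831),
    (3, 0.0201197, 0.0204975),
    (4, 0.0167322, 0.0176753),
]


def slowdown_pct(baseline, new):
    """Relative slowdown of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


deltas = {threads: slowdown_pct(old, new) for threads, old, new in rows}

for threads, pct in deltas.items():
    print("%d thread(s): %+.2f%%" % (threads, pct))
```

Only the single-thread row shows a large gap; with 2 to 4 threads the difference stays under 6%.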

aaraujom · 2020-01-23T22:30:21Z

Hi @pinzhenx - Just to confirm, is this on an AVX512-enabled system?

pinzhenx · 2020-01-24T01:57:49Z

@aaraujom yes

aaraujom · 2020-01-31T18:27:18Z

Hi @pinzhenx - Would a fix on the master branch be good enough for you? I have a fix and it should be available in master soon.

On master (e6a24ce):

$ OMP_NUM_THREADS=1 ./benchdnn --ip --mode=p --stag=ab --wtag=ab mb16ic100oc1040
Output template: perf,%engine%,%name%,%prb%,%Gops%,%Gfreq%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,,--ip --stag=ab --wtag=ab mb16ic100oc1040,0.003328,0,0.0683594,48.6839,0.0694891,47.8924
tests:1 passed:0 skipped:0 mistrusted:0 unimplemented:0 failed:0 listed:0
total perf: min(ms):0.0683594 avg(ms):0.0694891

After fix:

$ OMP_NUM_THREADS=1 ./benchdnn --ip --mode=p --stag=ab --wtag=ab mb16ic100oc1040
Output template: perf,%engine%,%name%,%prb%,%Gops%,%Gfreq%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,,--ip --stag=ab --wtag=ab mb16ic100oc1040,0.003328,0,0.0456543,72.8957,0.0464355,71.6694
tests:1 passed:0 skipped:0 mistrusted:0 unimplemented:0 failed:0 listed:0
total perf: min(ms):0.0456543 avg(ms):0.0464355
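As a sanity check on the two runs above, the sketch below recomputes the Gops and Gflops columns and the resulting speedup. It assumes an operation count of 2 * mb * ic * oc for the inner product, which matches the 0.003328 Gops that benchdnn reports for this shape. The helper names are hypothetical.

```python
import re


def ip_gops(prb):
    """Giga-operations for an 'mb<N>ic<N>oc<N>' problem.

    Assumption: 2 * mb * ic * oc operations (one multiply and one add per
    weight element per batch row), consistent with the reported 0.003328.
    """
    m = re.fullmatch(r"mb(\d+)ic(\d+)oc(\d+)", prb)
    if m is None:
        raise ValueError("unexpected problem descriptor: %r" % prb)
    mb, ic, oc = map(int, m.groups())
    return 2 * mb * ic * oc / 1e9


def gflops(gops, time_ms):
    """Throughput in Gflops given Gops and a time in milliseconds."""
    return gops / (time_ms * 1e-3)


gops = ip_gops("mb16ic100oc1040")
before_ms, after_ms = 0.0683594, 0.0456543  # min(ms) before and after the fix

gflops_before = gflops(gops, before_ms)
gflops_after = gflops(gops, after_ms)
speedup = before_ms / after_ms

print("Gops: %.6f" % gops)
print("Gflops before: %.4f, after: %.4f" % (gflops_before, gflops_after))
print("speedup: %.2fx" % speedup)
```

The recomputed throughput agrees with the %-Gflops% values in the logs (about 48.68 before and 72.90 after), a single-thread speedup of roughly 1.5x on this shape.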

WilliamTambellini · 2020-01-31T19:38:54Z

Hi @aaraujom, is it too late for that fix to be in 1.2? Thanks, W.

aaraujom · 2020-01-31T23:15:42Z

Hi @WilliamTambellini, we can do a patch release from the v1.2 branch containing this fix. Would that work for you?

pinzhenx · 2020-02-11T01:47:16Z

@aaraujom Sorry for my late reply. I think it's good enough. Thanks so much!

pinzhenx added the question label Jan 14, 2020

rsdubtso added the performance and sighting (Suspicious library behavior. Should be promoted to a bug when confirmed) labels and removed the question label Jan 14, 2020

pinzhenx changed the title from "Matmul performance on ba weight?" to "inner product perf issue w/ transposed weight on single thread" Jan 14, 2020

vpirogov self-assigned this Jan 16, 2020

pinzhenx closed this as completed Jan 17, 2020

pinzhenx reopened this Jan 17, 2020

vpirogov assigned aaraujom and unassigned vpirogov Jan 23, 2020

vpirogov added this to the v1.3 milestone Feb 3, 2020

pinzhenx closed this as completed Feb 11, 2020
