Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BYOC][DNNL] Improve performance of DNNL BYOC dense operator #11513

Merged
merged 10 commits into from
Jun 10, 2022

Conversation

billishyahao
Copy link
Contributor

@billishyahao billishyahao commented May 31, 2022

This patch is to enhance the performance of DNNL BYOC dense operators by 1) introducing gelu fusion and 2) introducing alter dense weight layout. (Implemented after merging PR #11345, Thanks @apeskov )

Why do we introduce gelu fusion:
For the model family of BERT, GELU (Gaussian Error Linear Unit) activation is used heavily so if we perform gelu fusion in those models, then we gain a better performance boost.

Why do we introduce automatically packed dense and its altered weight layout:
Format tag::ab (aka. tag::NC) is not the best format selected by DNNL inner_product primitive. It is a drawback in current DNNL BYOC module.

For what model it fit in:
Dense intensity type such as Bert family

With this patch, I benchmarked the inference performance of a kind of vision-tranformer called PCPVT (https://arxiv.org/abs/2104.13840) on ICX-8352Y. Here is some boost data:

32 cores Latency (dev)
baseline byoc 11.45ms
byoc w/ patch 7.93ms

Thanks for contributing to TVM! Please refer to guideline https://tvm.apache.org/docs/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers by @ them in the pull request thread.

@crazydemo
Copy link
Contributor

I think this PR may have some overlap with @mengceng15's work.
Here are some lint issue, and UT is needed.

@mengceng15
Copy link
Contributor

I think this PR may have some overlap with @mengceng15's work. Here are some lint issue, and UT is needed.

Yes. I worked on BYOC oneDNN dense several months ago.

Since it is for models including bert, maybe batch matmul support is also needed (with matmul primitive of oneDNN)?

@billishyahao
Copy link
Contributor Author

I think this PR may have some overlap with @mengceng15's work. Here are some lint issue, and UT is needed.

Hi @crazydemo , Thanks for your suggestion. I have added some testcases in UI module. It seems that the end-of-the-line style of test_dnnl.py is CRLF(windows mode). I reformat it by dos2unix to make it identical to other dnnl file. Feel free to leave comments here.

@billishyahao
Copy link
Contributor Author

I think this PR may have some overlap with @mengceng15's work. Here are some lint issue, and UT is needed.

Yes. I worked on BYOC oneDNN dense several months ago.

Since it is for models including bert, maybe batch matmul support is also needed (with matmul primitive of oneDNN)?

I think so. I implemented matmul primitive too in my local environment and may publish it soon.

@billishyahao billishyahao force-pushed the enhance_dnnl_dense branch 2 times, most recently from 847f871 to 77beff9 Compare June 8, 2022 02:44
@billishyahao billishyahao changed the title [BYOC][DNNL] Enhance performance of DNNL BYOC dense operator [BYOC][DNNL] Improve performance of DNNL BYOC dense operator Jun 8, 2022
@billishyahao
Copy link
Contributor Author

Hi @masahi , @comaniac , @trevor-m , @mbaret , Could you take a look at this pr? Thanks!

@linlifan
Copy link

linlifan commented Jun 9, 2022

LGTM

@billishyahao
Copy link
Contributor Author

@masahi Please take a look at this PR. Thanks!

@apeskov
Copy link
Contributor

apeskov commented Jun 9, 2022

One important comment about performance. Just to point out.

In this patch you are using mechanic of auto detection proper layout inside of dnnl_json_runtime. It works correctly and dense primitive will use optimal layout. But it will execute weight reordering each inference call. This reordering significantly break performance (still better than previously, but less than possible).

To avoid weight reordering it should be done once during Init. For that you need change dense weight pattern from wildcard to is_constant.

@billishyahao
Copy link
Contributor Author

One important comment about performance. Just to point out.

In this patch you are using mechanic of auto detection proper layout inside of dnnl_json_runtime. It works correctly and dense primitive will use optimal layout. But it will execute weight reordering each inference call. This reordering significantly break performance (still better than previously, but less than possible).

To avoid weight reordering it should be done once during Init. For that you need change dense weight pattern from wildcard to is_constant.

Hi @apeskov , the following is a clip of dnnl verbose log:

onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,attr-scratchpad:user ,,mb49ic512oc512,0.0400391 onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,attr-scratchpad:user ,,mb49ic512oc1024,0.0717773 onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user ,,mb49ic512oc512,0.0351562 onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user attr-post-ops:eltwise_gelu_erf ,,mb49ic512oc2048,0.215088 onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user ,,mb49ic2048oc512,0.227051 onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,attr-scratchpad:user ,,mb49ic512oc512,0.0339355 onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,attr-scratchpad:user ,,mb49ic512oc1024,0.072998 onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user ,,mb49ic512oc512,0.0349121 onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user attr-post-ops:eltwise_gelu_erf ,,mb49ic512oc2048,0.226807 onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user ,,mb49ic2048oc512,0.231934

I don't observe the reorder primitive executed before or after inner_product. I think current mechanism still work?

@apeskov
Copy link
Contributor

apeskov commented Jun 9, 2022

@billishyahao Thanks for verbose log and quick response!

Looks like it works for you. But I'm a little bit surprised. My previous experiments with BERT (quantised version) show that reordering is happen... I will recheck.

@yangulei
Copy link
Contributor

yangulei commented Jun 9, 2022

@billishyahao I remember that you introduced a mechanism to query optimal layout and alter op layout before the graph is consumed by DNNL, just like we do for Conv. Did you cancel that and do the layout transform in dnnl json runtime? If so, I agree with @apeskov and wondering why no layout transform of the weights.

@billishyahao
Copy link
Contributor Author

@billishyahao I remember that you introduced a mechanism to query optimal layout and alter op layout before the graph is consumed by DNNL, just like we do for Conv. Did you cancel that and do the layout transform in dnnl json runtime? If so, I agree with @apeskov and wondering why no layout transform of the weights.

Yes, I do revert the change about altering op layout after I saw the PR #11345 from @apeskov .

@billishyahao
Copy link
Contributor Author

@apeskov @yangulei I addressed the above comments. Feel free to comment more.

@billishyahao
Copy link
Contributor Author

Hi @masahi , want to check if approval can be granted for this patch? Thanks in advance:-)

src/relay/backend/contrib/dnnl/codegen.cc Outdated Show resolved Hide resolved
@yangulei
Copy link
Contributor

@billishyahao I remember that you introduced a mechanism to query optimal layout and alter op layout before the graph is consumed by DNNL, just like we do for Conv. Did you cancel that and do the layout transform in dnnl json runtime? If so, I agree with @apeskov and wondering why no layout transform of the weights.

Yes, I do revert the change about altering op layout after I saw the PR #11345 from @apeskov .

I think we need them both. Query and alter layout can do the reordering of the weights in build time to ensure optimal performance in run time.

@billishyahao
Copy link
Contributor Author

@billishyahao I remember that you introduced a mechanism to query optimal layout and alter op layout before the graph is consumed by DNNL, just like we do for Conv. Did you cancel that and do the layout transform in dnnl json runtime? If so, I agree with @apeskov and wondering why no layout transform of the weights.

Yes, I do revert the change about altering op layout after I saw the PR #11345 from @apeskov .

I think we need them both. Query and alter layout can do the reordering of the weights in build time to ensure optimal performance in run time.

Sure. @yangulei Let me add this code mentioned in the following change.

@billishyahao
Copy link
Contributor Author

@masahi Thanks for the approval. Shall we go on to merge this pr?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants