[BYOC][DNNL] Improve performance of DNNL BYOC dense operator #11513
Conversation
I think this PR may have some overlap with @mengceng15's work.
Yes. I worked on BYOC oneDNN dense several months ago. Since this is for models including BERT, maybe batch_matmul support is also needed (with the matmul primitive of oneDNN)?
Hi @crazydemo, thanks for your suggestion. I have added some test cases in the unit-test module. It seems that the end-of-line style of test_dnnl.py was CRLF (Windows mode); I reformatted it with dos2unix to make it consistent with the other DNNL files. Feel free to leave comments here.
I think so. I implemented the matmul primitive too in my local environment and may publish it soon.
LGTM
@masahi Please take a look at this PR. Thanks!
One important comment about performance, just to point out. In this patch you are using a mechanism that auto-detects the proper layout inside dnnl_json_runtime. It works correctly and the dense primitive will use the optimal layout, but it will execute a weight reordering on each inference call. This reordering significantly hurts performance (still better than before, but less than possible). To avoid it, the weight reordering should be done once during build.
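To illustrate the point above, here is a minimal, hypothetical sketch using the oneDNN 2.x C++ API (illustrative only, not the actual dnnl_json_runtime code): the weights are reordered into the layout the primitive prefers once at build time, so the per-inference call contains no reorder.

```cpp
#include <dnnl.hpp>

using namespace dnnl;

// Holds a dense primitive plus its weights already converted to the
// layout the primitive expects.
struct DenseLayer {
  inner_product_forward prim;
  memory cached_weights;
};

// Build step: executed once, e.g. when the runtime module is initialized.
// `user_weights` is assumed to be plain row-major (format_tag::ab) f32 data.
DenseLayer BuildDense(const engine& eng, stream& strm, memory user_weights,
                      memory::dim batch, memory::dim in_feat, memory::dim out_feat) {
  auto src_md = memory::desc({batch, in_feat}, memory::data_type::f32,
                             memory::format_tag::ab);
  auto dst_md = memory::desc({batch, out_feat}, memory::data_type::f32,
                             memory::format_tag::ab);
  // format_tag::any lets oneDNN pick the optimal (possibly blocked) layout.
  auto wei_md = memory::desc({out_feat, in_feat}, memory::data_type::f32,
                             memory::format_tag::any);

  auto ip_desc = inner_product_forward::desc(prop_kind::forward_inference,
                                             src_md, wei_md, dst_md);
  auto ip_pd = inner_product_forward::primitive_desc(ip_desc, eng);

  memory weights = user_weights;
  if (ip_pd.weights_desc() != user_weights.get_desc()) {
    // The reorder runs exactly once, here, instead of inside every Run() call.
    weights = memory(ip_pd.weights_desc(), eng);
    reorder(user_weights, weights).execute(strm, user_weights, weights);
    strm.wait();
  }
  return DenseLayer{inner_product_forward(ip_pd), weights};
}

// Inference step: no reorder, just the primitive execution.
void Run(DenseLayer& layer, stream& strm, const memory& src, const memory& dst) {
  layer.prim.execute(strm, {{DNNL_ARG_SRC, src},
                            {DNNL_ARG_WEIGHTS, layer.cached_weights},
                            {DNNL_ARG_DST, dst}});
  strm.wait();
}
```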
Hi @apeskov, the following is a clip of the DNNL verbose log:
I don't observe a reorder primitive executed before or after inner_product. I think the current mechanism still works?
@billishyahao Thanks for the verbose log and the quick response! It looks like it works for you, but I'm a little bit surprised. My previous experiments with BERT (quantised version) showed that reordering does happen... I will recheck.
@billishyahao I remember that you introduced a mechanism to query the optimal layout and alter the op layout before the graph is consumed by DNNL, just like we do for …
Yes, I did revert the change about altering the op layout after I saw PR #11345 from @apeskov.
Hi @masahi, I want to check whether approval can be granted for this patch. Thanks in advance :-)
I think we need them both. Querying and altering the layout can do the reordering of the weights at build time to ensure optimal performance at run time.
Sure, @yangulei. Let me add the code mentioned in the following change.
@masahi Thanks for the approval. Shall we go on to merge this PR?
This patch enhances the performance of the DNNL BYOC dense operator by 1) introducing GELU fusion and 2) introducing an altered dense weight layout (implemented after merging PR #11345, thanks @apeskov).
Why do we introduce GELU fusion:
In the BERT model family, the GELU (Gaussian Error Linear Unit) activation is used heavily, so if we fuse GELU into dense in those models, we gain a noticeable performance boost.
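As a rough illustration (a hypothetical sketch using the oneDNN 2.x C++ API, not the code in this patch), a fused dense+GELU can be expressed as an inner_product primitive with an erf-based GELU eltwise post-op, so no separate activation kernel runs after the matrix product:

```cpp
#include <dnnl.hpp>

using namespace dnnl;

// Create an inner_product primitive with GELU fused as a post-op.
inner_product_forward MakeFusedDenseGelu(const engine& eng,
                                         const memory::desc& src_md,
                                         const memory::desc& wei_md,
                                         const memory::desc& dst_md) {
  post_ops ops;
  // oneDNN 2.x signature: (scale, algorithm, alpha, beta);
  // alpha/beta are unused for the erf-based GELU.
  ops.append_eltwise(1.0f, algorithm::eltwise_gelu_erf, 0.0f, 0.0f);

  primitive_attr attr;
  attr.set_post_ops(ops);

  auto ip_desc = inner_product_forward::desc(prop_kind::forward_inference,
                                             src_md, wei_md, dst_md);
  auto pd = inner_product_forward::primitive_desc(ip_desc, attr, eng);
  return inner_product_forward(pd);
}
```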
Why do we introduce automatically packed dense and its altered weight layout:
Format tag::ab (a.k.a. tag::NC) is not the best weight format for the DNNL inner_product primitive; always forcing it is a drawback of the current DNNL BYOC module.
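A minimal sketch of what "automatically packed" means, assuming the oneDNN 2.x C++ API (illustrative only, not this patch's code): passing format_tag::any for the weights lets the inner_product primitive descriptor report the packed layout it prefers, which an alter-layout step can then apply at build time instead of keeping plain tag::ab.

```cpp
#include <dnnl.hpp>

using namespace dnnl;

// Query the weight layout oneDNN would choose for a dense of the given shape.
memory::desc QueryOptimalDenseWeightLayout(const engine& eng,
                                           memory::dim batch,
                                           memory::dim in_feat,
                                           memory::dim out_feat) {
  auto src_md = memory::desc({batch, in_feat}, memory::data_type::f32,
                             memory::format_tag::ab);
  auto dst_md = memory::desc({batch, out_feat}, memory::data_type::f32,
                             memory::format_tag::ab);
  // "any" lets oneDNN choose a blocked/packed layout for the weights.
  auto wei_md = memory::desc({out_feat, in_feat}, memory::data_type::f32,
                             memory::format_tag::any);

  auto ip_desc = inner_product_forward::desc(prop_kind::forward_inference,
                                             src_md, wei_md, dst_md);
  auto pd = inner_product_forward::primitive_desc(ip_desc, eng);

  // The returned descriptor is the layout the weights should be pre-packed
  // into, e.g. by an alter-layout pass before the graph reaches the runtime.
  return pd.weights_desc();
}
```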
Which models does it fit:
Dense-intensive workloads such as the BERT family.
With this patch, I benchmarked the inference performance of a vision transformer called PCPVT (https://arxiv.org/abs/2104.13840) on an ICX-8352Y. Here is some of the speedup data: