
[BYOC][DNNL] Enable layer normalization in DNNL byoc. #11508

Merged 4 commits into apache:main on Jun 8, 2022

Conversation

@billishyahao (Contributor) commented on May 30, 2022

This patch enables layer normalization in DNNL BYOC by providing an out-of-the-box rewrite pattern that combines the constituent operators into a single Relay layer-normalization operator, together with its implementation in the DNNL JSON codegen.
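For readers unfamiliar with the mechanism, here is a simplified, hypothetical sketch of how such a fusion can be expressed with TVM's Relay dataflow pattern API. The class and function names are illustrative; the actual pattern in this patch may differ (for example, it also has to handle the epsilon constant and other decompositions of the variance).

```python
# A simplified sketch (not the exact pattern in this patch) of fusing the
# decomposed mean/subtract/variance/normalize chain back into nn.layer_norm.
from tvm import relay
from tvm.relay.dataflow_pattern import (
    DFPatternCallback, is_constant, is_op, rewrite, wildcard,
)


class LayerNormRewrite(DFPatternCallback):
    def __init__(self):
        super().__init__()
        self.data = wildcard()
        self.gamma = is_constant()
        self.beta = is_constant()
        mu = is_op("mean")(self.data)
        diff = is_op("subtract")(self.data, mu)
        var = is_op("mean")(is_op("multiply")(diff, diff))
        denom = is_op("sqrt")(is_op("add")(var, is_constant()))
        norm = is_op("divide")(diff, denom)
        self.pattern = is_op("add")(is_op("multiply")(norm, self.gamma), self.beta)

    def callback(self, pre, post, node_map):
        data = node_map[self.data][0]
        gamma = node_map[self.gamma][0]
        beta = node_map[self.beta][0]
        # Epsilon handling is omitted for brevity; axis=-1 matches the
        # transformer-style layout seen in the example below.
        return relay.nn.layer_norm(data, gamma, beta, axis=-1)


def fuse_layer_norm(mod):
    mod["main"] = rewrite(LayerNormRewrite(), mod["main"])
    return mod
```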

After applying the rewrite pattern, the following DNNL function appears in the partitioned module:

def @tvmgen_default_dnnl_main_108(%dnnl_108_i0: Tensor[(1, 784, 128), float32], Inline=1, Compiler="dnnl", global_symbol="tvmgen_default_dnnl_main_108", Primitive=1) -> Tensor[(1, 784, 128), float32] {
  nn.layer_norm(%dnnl_108_i0, meta[relay.Constant][56] /* ty=Tensor[(128), float32] */, meta[relay.Constant][57] /* ty=Tensor[(128), float32] */) /* ty=Tensor[(1, 784, 128), float32] */
}

Once the DNNL_VERBOSE flag is enabled (e.g. by setting the DNNL_VERBOSE=1 environment variable), more information is shown in the log, as below:

onednn_verbose,exec,cpu,layer_normalization,simple_layer_normalization:any,forward_inference,data_f32::blocked:abc:f0 stats_undef::undef::f0 diff_undef::undef::f0,,flags:CH,1x784x128,0.0551758

With this patch, I benchmarked the inference performance of a vision transformer called PCPVT (https://arxiv.org/abs/2104.13840) on an ICX-8352Y. It gains up to a 1.18x speedup. Here are the numbers:

| Configuration (32 cores) | Latency  |
| ------------------------ | -------- |
| baseline BYOC            | 11.45 ms |
| BYOC w/ patch            | 9.68 ms  |


@crazydemo (Contributor) commented:

Thanks for your contribution to BYOC-DNNL. My suggestions are listed below:

  1. I wonder whether running layernorm through the DNNL codegen is actually faster than running the consecutive ops through the native codegen. Could you please provide some performance numbers?
  2. Lint has failed. Please run task_lint.sh to check the code style.
  3. A unit test is required. You can add your test cases in tests/python/contrib/test_dnnl.py to ensure the functionality of the enabled ops (a sketch is given after this list).
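To illustrate point 3, here is a hypothetical, self-contained sketch of such a test. partition_for_dnnl is the DNNL BYOC partitioning helper; the shapes and tolerances are illustrative rather than taken from test_dnnl.py, and running the DNNL path requires a TVM build with the DNNL codegen enabled.

```python
# A hypothetical sketch of a layer_norm test in the spirit of
# tests/python/contrib/test_dnnl.py (shapes and tolerances are illustrative).
import numpy as np
import tvm
import tvm.testing
from tvm import relay
from tvm.contrib import graph_executor
from tvm.relay.op.contrib.dnnl import partition_for_dnnl


def test_dnnl_layer_norm(shape=(1, 784, 128), dtype="float32"):
    data = relay.var("data", shape=shape, dtype=dtype)
    gamma = relay.const(np.random.uniform(size=shape[-1]).astype(dtype))
    beta = relay.const(np.random.uniform(size=shape[-1]).astype(dtype))
    out = relay.nn.layer_norm(data, gamma, beta, axis=-1)
    mod = tvm.IRModule.from_expr(relay.Function([data], out))
    inp = np.random.uniform(size=shape).astype(dtype)

    def build_and_run(module):
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build(module, target="llvm")
        rt = graph_executor.GraphModule(lib["default"](tvm.cpu()))
        rt.set_input("data", inp)
        rt.run()
        return rt.get_output(0).numpy()

    ref = build_and_run(mod)                      # native TVM codegen
    got = build_and_run(partition_for_dnnl(mod))  # DNNL BYOC path
    tvm.testing.assert_allclose(got, ref, rtol=1e-5, atol=1e-5)
```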


dnnl::memory::dims data_shape = nodes_[data_entry.id_].GetOpShape()[data_entry.index_];

float epsilon = std::stof(node.GetAttr<std::vector<std::string>>("epsilon")[0]);
Contributor (inline review comment):
original "nn.layer_norm" has not only epsilon argument. At least axis, center and scale. By this code you assume that they always equal axis = -1, center=true and scale=true.

Could you please add support of all attributes or verify their values on codegen stage.

Contributor Author (reply):

I have added an ICHECK for this case, and will update later to support all the other attributes.
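For reference, a hypothetical Relay-level version of such a check could look like the sketch below; the actual patch performs this validation with an ICHECK in the C++ codegen, and the function name here is illustrative.

```python
# Hypothetical Relay-level guard: only offload nn.layer_norm calls whose
# attributes match what the DNNL codegen currently supports.
def supported_by_dnnl_layer_norm(attrs, args):
    ndim = len(args[0].checked_type.shape)
    normalizes_last_axis = int(attrs.axis) in (-1, ndim - 1)
    return normalizes_last_axis and bool(attrs.center) and bool(attrs.scale)
```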

@apeskov (Contributor) commented on Jun 1, 2022

@crazydemo Answering your question about performance.

> I wonder if we can get better performance via running layernorm on dnnl codegen than running consecutive ops on native codegen. Could you please provide some performance numbers?

Yes, there is a performance benefit. At the very least, they use different memory-access approaches. Consecutive ops with the LLVM codegen produce a sequence of fused kernels like the following:

  • mean: one pass through the src tensor
  • sub: one pass through the src and dst tensors
  • power + mean: one pass through src
  • add + sqrt + div + mul + add: one pass through src and dst

In total, that is 6 traversals of the data tensor for the TVM codegen. DNNL implements it as a single kernel and does only 4 passes through the memory buffers (or 3 in the case of in-place memory).
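As an illustration only (not actual TVM codegen output), the four fused kernels above can be sketched in numpy, with the memory traffic each one implies noted in the comments:

```python
# Illustrative numpy sketch of the four fused kernels listed above; the
# comments count tensor-sized traversals, giving 6 passes in total.
import numpy as np


def decomposed_layer_norm(x, gamma, beta, eps=1e-5, axis=-1):
    mu = x.mean(axis=axis, keepdims=True)                  # kernel 1: read x                 (1 pass)
    centered = x - mu                                      # kernel 2: read x, write centered (2 passes)
    var = (centered ** 2).mean(axis=axis, keepdims=True)   # kernel 3: read centered          (1 pass)
    return gamma * (centered / np.sqrt(var + eps)) + beta  # kernel 4: read centered, write   (2 passes)
```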

On multi-core systems (Xeon servers and others), the normalization op is memory-bound, so reducing memory accesses becomes even more important.

@billishyahao (Contributor Author) commented:

Hi @apeskov, please take a look at the latest version here. Feel free to comment further.

@billishyahao force-pushed the enable_dnnl_ln branch 2 times, most recently from 355766e to 4c34e00, on June 5, 2022 at 08:00.

@billishyahao (Contributor Author) commented:

Hi @masahi, please take a look.

@billishyahao (Contributor Author) commented:

Hi @comaniac, @trevor-m, @mbaret, could you take a look at this PR? Thanks!

@masahi (Member) commented on Jun 8, 2022

You need to resolve the conflict in test_dnnl.py, but since it will be modified in #11513, let's merge #11513 first.

@billishyahao (Contributor Author) commented:

> You need to resolve the conflict in test_dnnl.py, but since it will be modified in #11513, let's merge #11513 first.

Hi @masahi, thanks for pointing this out. I have resolved the conflict in test_dnnl.py.

@masahi merged commit 9817338 into apache:main on Jun 8, 2022.
Kathryn-cat pushed a commit to Kathryn-cat/tvm that referenced this pull request on Jun 10, 2022:
* Enable layer normalization in DNNL BYOC.

* Added a unit test for layer norm and made the code compatible after introducing TensorRequisite (PR-11345).

* Fix lint issue.

* Fix clang-format issue.