-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve NHWC depthwise convolution for AArch64 #6095
Conversation
Hi @FrozenGene , The aim is to reduce the tuning time. The point is that sometimes we are constrained by the number of registers available in AArch64, so trying out different splits might only increase the tuning time, without giving any benefit. So the idea was to have a "default" schedule which mimics ACL implementation and then introduce (the minimal set of) tuning knobs + tensorization to speed things up. What do you think? If you want to add tuning knobs in this PR, I will try to do the tuning analysis today |
Hi @giuseros Thanks for the work. I fully understand your purpose and smoothy development path. As this schedule will be the default NHWC depthwise convolution, my opinion is we should try to achieve a good performance as far as we could achieve. Notably I don't mean we mush achieve like ACL ultimate performance then we could merge, optimization is not one-shot deal. But here I think we could enable auto tvm to help us to achieve better performance. I think it is worthy introducing into this pr.
I agree we should reduce tuning knob and improve tuning time experience, but if it could help us improve performance, I think we should introduce it in, otherwise we could avoid it. |
1ca776e
to
cc43c78
Compare
Hi @FrozenGene , Let me thank you for the review and the pointers (the Halide paper is quite interesting)! I tried most of your suggestions and all I got was a tiny improvement on Since this looks like it will take more time for more analysis I would prefer if we could take this PR in as is and I can follow up with further improvements in the future. |
Hi @tqchen , Is this an issue with my changes or with the CI? It seems to point to an |
I agree. We could limit our tuning knob if we find it does no effect. |
Seems not related with your change. Others meet this CI error too. |
Hi @FrozenGene ,
If this won't be approved by tonight, I will turn into a draft to pick up after I come back from holidays (i.e., 15 days). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just one minor comment.
I probably enabled a CUDA test that was making the CI hang. I reverted the test hoping that this was the issue |
@FrozenGene , @anijain2305 , |
We created a default schedule (no auto-tuning or tensorization) named depthwise_conv2d_nhwc which does a decent job at optimizing depthwise for NHWC layouts (on aarch64). Change-Id: I01e32903f6c1950623f33eae18484e70244fe0af
Change-Id: I15080e7f12b16e6c6aba99a04e42023845eeabf1
Change-Id: If12a6d05dce9153861550ddef1ee5216809dd1e1
Change-Id: I7e2062a40358bf111c0366a449945eb077fb2e30
Change-Id: I4b82c58b167e40b0b7747d28293bbb488c505dd9
Change-Id: Idf8eeaaface5eb7799109cd00f437e404778b9cd
Change-Id: Iac16a8daea1268f0eb331fe4ec18a62408106cf9
Change-Id: I1412f22ad9864273d77a7bf38a6768694339b7f0
Change-Id: Ica3eff8f9f0fd4c6f32f7ae80adc922f8b16cec9
Change-Id: Icbaafcb39e892a5d1a4685133c1699e4d1a8e07e
2dc9a22
to
6ef6572
Compare
Hi @FrozenGene , @anijain2305 , Thanks! |
Thanks @giuseros Merged now |
* Improve NHWC depthwise convolution for aarch64 We created a default schedule (no auto-tuning or tensorization) named depthwise_conv2d_nhwc which does a decent job at optimizing depthwise for NHWC layouts (on aarch64). Change-Id: I01e32903f6c1950623f33eae18484e70244fe0af * Add tuning knobs in depthwise schedule Change-Id: I15080e7f12b16e6c6aba99a04e42023845eeabf1 * Introduce padding policy Change-Id: If12a6d05dce9153861550ddef1ee5216809dd1e1 * Vectorize padding Change-Id: I7e2062a40358bf111c0366a449945eb077fb2e30 * Legalize depthwise convolution (2x improvement) and fix tuning issue Change-Id: I4b82c58b167e40b0b7747d28293bbb488c505dd9 * Adding assert on padding Change-Id: Idf8eeaaface5eb7799109cd00f437e404778b9cd * Fix python linting Change-Id: Iac16a8daea1268f0eb331fe4ec18a62408106cf9 * Removing commented code Change-Id: I1412f22ad9864273d77a7bf38a6768694339b7f0 * Revert test file to make CI pass Change-Id: Ica3eff8f9f0fd4c6f32f7ae80adc922f8b16cec9 * Enabling only arm_cpu tests Change-Id: Icbaafcb39e892a5d1a4685133c1699e4d1a8e07e * Rebasing Change-Id: Ibb23f1d4e0d0107e4e3b3571437161cdc2ee2909
* Improve NHWC depthwise convolution for aarch64 We created a default schedule (no auto-tuning or tensorization) named depthwise_conv2d_nhwc which does a decent job at optimizing depthwise for NHWC layouts (on aarch64). Change-Id: I01e32903f6c1950623f33eae18484e70244fe0af * Add tuning knobs in depthwise schedule Change-Id: I15080e7f12b16e6c6aba99a04e42023845eeabf1 * Introduce padding policy Change-Id: If12a6d05dce9153861550ddef1ee5216809dd1e1 * Vectorize padding Change-Id: I7e2062a40358bf111c0366a449945eb077fb2e30 * Legalize depthwise convolution (2x improvement) and fix tuning issue Change-Id: I4b82c58b167e40b0b7747d28293bbb488c505dd9 * Adding assert on padding Change-Id: Idf8eeaaface5eb7799109cd00f437e404778b9cd * Fix python linting Change-Id: Iac16a8daea1268f0eb331fe4ec18a62408106cf9 * Removing commented code Change-Id: I1412f22ad9864273d77a7bf38a6768694339b7f0 * Revert test file to make CI pass Change-Id: Ica3eff8f9f0fd4c6f32f7ae80adc922f8b16cec9 * Enabling only arm_cpu tests Change-Id: Icbaafcb39e892a5d1a4685133c1699e4d1a8e07e * Rebasing Change-Id: Ibb23f1d4e0d0107e4e3b3571437161cdc2ee2909
* Improve NHWC depthwise convolution for aarch64 We created a default schedule (no auto-tuning or tensorization) named depthwise_conv2d_nhwc which does a decent job at optimizing depthwise for NHWC layouts (on aarch64). Change-Id: I01e32903f6c1950623f33eae18484e70244fe0af * Add tuning knobs in depthwise schedule Change-Id: I15080e7f12b16e6c6aba99a04e42023845eeabf1 * Introduce padding policy Change-Id: If12a6d05dce9153861550ddef1ee5216809dd1e1 * Vectorize padding Change-Id: I7e2062a40358bf111c0366a449945eb077fb2e30 * Legalize depthwise convolution (2x improvement) and fix tuning issue Change-Id: I4b82c58b167e40b0b7747d28293bbb488c505dd9 * Adding assert on padding Change-Id: Idf8eeaaface5eb7799109cd00f437e404778b9cd * Fix python linting Change-Id: Iac16a8daea1268f0eb331fe4ec18a62408106cf9 * Removing commented code Change-Id: I1412f22ad9864273d77a7bf38a6768694339b7f0 * Revert test file to make CI pass Change-Id: Ica3eff8f9f0fd4c6f32f7ae80adc922f8b16cec9 * Enabling only arm_cpu tests Change-Id: Icbaafcb39e892a5d1a4685133c1699e4d1a8e07e * Rebasing Change-Id: Ibb23f1d4e0d0107e4e3b3571437161cdc2ee2909
* Improve NHWC depthwise convolution for aarch64 We created a default schedule (no auto-tuning or tensorization) named depthwise_conv2d_nhwc which does a decent job at optimizing depthwise for NHWC layouts (on aarch64). Change-Id: I01e32903f6c1950623f33eae18484e70244fe0af * Add tuning knobs in depthwise schedule Change-Id: I15080e7f12b16e6c6aba99a04e42023845eeabf1 * Introduce padding policy Change-Id: If12a6d05dce9153861550ddef1ee5216809dd1e1 * Vectorize padding Change-Id: I7e2062a40358bf111c0366a449945eb077fb2e30 * Legalize depthwise convolution (2x improvement) and fix tuning issue Change-Id: I4b82c58b167e40b0b7747d28293bbb488c505dd9 * Adding assert on padding Change-Id: Idf8eeaaface5eb7799109cd00f437e404778b9cd * Fix python linting Change-Id: Iac16a8daea1268f0eb331fe4ec18a62408106cf9 * Removing commented code Change-Id: I1412f22ad9864273d77a7bf38a6768694339b7f0 * Revert test file to make CI pass Change-Id: Ica3eff8f9f0fd4c6f32f7ae80adc922f8b16cec9 * Enabling only arm_cpu tests Change-Id: Icbaafcb39e892a5d1a4685133c1699e4d1a8e07e * Rebasing Change-Id: Ibb23f1d4e0d0107e4e3b3571437161cdc2ee2909
* Improve NHWC depthwise convolution for aarch64 We created a default schedule (no auto-tuning or tensorization) named depthwise_conv2d_nhwc which does a decent job at optimizing depthwise for NHWC layouts (on aarch64). Change-Id: I01e32903f6c1950623f33eae18484e70244fe0af * Add tuning knobs in depthwise schedule Change-Id: I15080e7f12b16e6c6aba99a04e42023845eeabf1 * Introduce padding policy Change-Id: If12a6d05dce9153861550ddef1ee5216809dd1e1 * Vectorize padding Change-Id: I7e2062a40358bf111c0366a449945eb077fb2e30 * Legalize depthwise convolution (2x improvement) and fix tuning issue Change-Id: I4b82c58b167e40b0b7747d28293bbb488c505dd9 * Adding assert on padding Change-Id: Idf8eeaaface5eb7799109cd00f437e404778b9cd * Fix python linting Change-Id: Iac16a8daea1268f0eb331fe4ec18a62408106cf9 * Removing commented code Change-Id: I1412f22ad9864273d77a7bf38a6768694339b7f0 * Revert test file to make CI pass Change-Id: Ica3eff8f9f0fd4c6f32f7ae80adc922f8b16cec9 * Enabling only arm_cpu tests Change-Id: Icbaafcb39e892a5d1a4685133c1699e4d1a8e07e * Rebasing Change-Id: Ibb23f1d4e0d0107e4e3b3571437161cdc2ee2909
High level description of this contribution
We created a default schedule (no auto-tuning or tensorization) named
depthwise_conv2d_nhwc
which does a decent job at optimizing depthwisefor NHWC layouts (on AArch64 architectures).
The schedule lives in :
topi/python/topi/arm_cpu/depthwise_conv2d.py
While we register the strategy in :
python/tvm/relay/op/strategy/arm_cpu.py
This contribution is based on the following RFC: https://discuss.tvm.ai/t/rfc-improve-depthwise-convolution-for-nhwc-layouts-on-aarch64-targets/7360