-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Improve quantized convolution performance for armv8 architectures #5754
[RFC] Improve quantized convolution performance for armv8 architectures #5754
Conversation
CC: @u99127 @anijain2305 |
Thanks for the great work! I have some quick question:
@qnn_conv2d_legalize.register('arm_cpu')
def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
# ARM prefers the dtypes to be same.
if is_aarch64_arm():
return helper_change_dtypes_to_be_same(attrs, inputs, types, relay.qnn.op.conv2d)
return helper_no_fast_int8_hw_legalization(attrs, inputs, types, relay.qnn.op.conv2d) It disables us using
is_aarch64 = "aarch64" in str(isa.target)
if is_aarch64 and data.dtype in ["int8", "uint8"]:
strategy.add_implementation(
wrap_compute_conv2d(topi.arm_cpu.compute_conv2d_NHWC_quantized),
wrap_topi_schedule(topi.arm_cpu.schedule_conv2d_NHWC_quantized),
name="compute_conv2d_NHWC_quantized.arm_cpu")
else:
strategy.add_implementation(
wrap_compute_conv2d(topi.arm_cpu.conv2d_nhwc_spatial_pack),
wrap_topi_schedule(topi.arm_cpu.schedule_conv2d_nhwc_spatial_pack),
name="conv2d_nhwc_spatial_pack.arm_cpu") This is our design purpose of strategy. I suspect there is some workload our spatial pack could perform better. This situation is the same as Winograd, we could perform winograd and default template and choose better. |
cc @ajtulloch |
Hi @FrozenGene ,
I will reply on the code in the following comment. |
@giuseros Glad to see we have the same thought we should let autotvm select the best. Autoscheduler reley on the legalization pass to generate smlal inst(After auto scheduler is released, let us make it better together.) One information I missed before, my testing rasp 3b+ os is Ubuntu 64 bits, not 32 bits, so the target is aarch64 too. I mention auto scheduler is not to question your work (your work is very great!) and is orthogonal as you said. I just mention that we use smlal inst on A53 (aarch64 os mentioned before) we could get nice performance too. So I want to know on low-end arm cpu, whether smlal is better than this (as fb qnnpack blog said: The default microkernel uses the fewest possible instructions and thus delivers the best performance on low-end cores, which can execute only one NEON instruction per cycle.). So I wish we could test several arm cpus to proove our this work work well all aarch64 cores (low-end core, high-end core). Secondly, I suggest let us test mobilenet v2 too. To see that whether our pr could work well across various models. Your work is very great but I wish let us use more data and result to make it more convincing. |
Hi @FrozenGene ,
|
Hi @FrozenGene
|
@giuseros I think add the algorithm name could be one way to handle it. For example, we could add it in the
Yes, our NHWC schedule on arm cpu doesn't be complete. After our careful testing, NHWC is also perform better than NCHW on arm cpu using Ansor (aka auto scheduler) too. So this prompts us we could improve our AutoTVM NHWC schedule on arm cpu too. As the result I show in the post, we use auto schedule is to leverage NHWC layout and One background of auto scheduler: In auto scheduler, we only need tvm.compute, then we could generate schedule automatically, so we could try NHWC / NCHW easily. So there is no spatial pack schedule template concept in the auto scheduler world in fact.
So we should keep both algorithm better, right?
I think add algorithm name mentioned before maybe could help to solve it.
Ok. I buy it in. After legalization pass we discussed is solved, I am glad to do code review carefully and handle this pr. |
Hi @FrozenGene , Maybe we could add a new pass that runs based on the strategy? Or we can hack in |
@giuseros what you mean run based on the strategy? in alter_op_layout, we could extract workload[0] to get strategy, however could you help me to double check whether our autotvm tuning will use alter_op_layout pass?(i.e. O3), I have forgot a little bit. If so, maybe we could change the dtype here according to strategy. cc @anijain2305 any better idea too? |
So I mean to add a This doesn't seem possible directly in the To reply to your question, yes, the |
I think this is one interesting pass. Like we have However, this pass seems do most of the same thing in legalize (change dtype). So our legalization pass should complete this work according to different strategy. |
@giuseros I suddenly think of auto scheduler will have one environment value. So the change of legalization won't affect auto scheduler. We could check the value of this environment value for auto scheduler and use |
Hi @FrozenGene , For Armv8-A and NHWC we don't convert during the legalization step, but during the I agree that a better solution, where the legalization changes depending on the strategy, would be better. However, I don't think the legalization step has got enough information to know the strategy (for now). What do you think? |
I think it is ok. |
Hi @FrozenGene , So for now, we have to content with the |
I think I could accept this way. |
@FrozenGene @giuseros If QNN Legalization is causing issues, we can remove QNN legalization for ARM CPUs altogether and move the logic to Alter Op layout. Alter op layout might become more complicated (like we might have to handle uint8 x int8 input and kernel dtype in alter op layout now). Just an idea if consolidating things at one place makes life easier. |
@anijain2305 , thanks for the review! About getting rid of the legalization, I would not do that for now. It is in my backlog to go back to this issue and try to retrieve the strategy from the legalization pass. This should give us more optimization options. If that turns out to be not possible, then yes, I would remove the pass and do everything in the |
b6dc7c5
to
c151f90
Compare
@FrozenGene Can you please review when you get time? |
Yep. I could review it tomorrow. |
Signed-off-by: Giuseppe Rossini <[email protected]> Change-Id: I3a3d29f5332dd9b3354e8e0dfb24677a521f9c8f
Change-Id: I33853279e39c849ae1b555a9c91d7557985a0a35
Change-Id: Ieee22f032e595dabfc1616ab33466fcbf8d94365
Change-Id: I435d4d7bca7500db99547f4401fdc0d0995a1ff4
Change-Id: I2fc1ad8453e9020072ab967c849df5390c2967b5
Change-Id: I0a67a49a7849f52ef7d57b9292ce9125bbb7cb2c
Change-Id: I91b67fabd475e90a9b75f2dd5ecfee851265e0bb
Change-Id: I9a03040a8c40a6cd2658ed14c3751e05a8e19f2b
Change-Id: Ice34101e358e3ce8ebfb12c58f73e910ba5de8e8
Change-Id: Id9273688b2620e1ea849ab01b4c46af8fbf37fd0
Change-Id: Ia1755a0af7b6d159072d9f0c93c932c481101e48
Change-Id: I3333186bbc2fe4054b58ce15d910e3be7b315482
c151f90
to
9057c8b
Compare
Change-Id: Ifb5f1f33af7512fe67c6b049b20a42a0bb2d26c9
Change-Id: I25ccc844d9cee23766096e1daddb6180abc413a6
Hi @FrozenGene , Could you double check please, and let me know if this has got anything to do with my changes? Thanks |
It actually seems related to: #5827 |
Hi @FrozenGene , @anijain2305 , Thanks |
for the CI, maybe you could force trigger it or you could comment it (and contact @jroesch ) and explain the reason? |
I push an empty commit to retrigger the CI - https://coderwall.com/p/vkdekq/git-commit-allow-empty |
Change-Id: Id37706fb7cf77a87a3cc817ecf8046297d9ca95a
@anijain2305 could you have a look another round? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @giuseros @anijain2305 MERGED NOW. |
…es (apache#5754) * Improve quantized conv2d performance for armv8 Signed-off-by: Giuseppe Rossini <[email protected]> Change-Id: I3a3d29f5332dd9b3354e8e0dfb24677a521f9c8f * Add ASF header to conv2d_gemm.py Change-Id: I33853279e39c849ae1b555a9c91d7557985a0a35 * Run clang-format-10 on c++ files Change-Id: Ieee22f032e595dabfc1616ab33466fcbf8d94365 * Fix pylint errors/warnings Change-Id: I435d4d7bca7500db99547f4401fdc0d0995a1ff4 * Fix pylint errors/warnings in topi Change-Id: I2fc1ad8453e9020072ab967c849df5390c2967b5 * Fix legalizations tests for aarch64 Change-Id: I0a67a49a7849f52ef7d57b9292ce9125bbb7cb2c * Reintroduce conv2d_nhwc_spatial_pack.arm_cpu and int16 cast Change-Id: I91b67fabd475e90a9b75f2dd5ecfee851265e0bb * Switch type of legalization depending on the strategy used Change-Id: I9a03040a8c40a6cd2658ed14c3751e05a8e19f2b * Revert last commit Change-Id: Ice34101e358e3ce8ebfb12c58f73e910ba5de8e8 * Fix the auto-tuner by registering the correct schedules Change-Id: Id9273688b2620e1ea849ab01b4c46af8fbf37fd0 * Address review comments Change-Id: Ia1755a0af7b6d159072d9f0c93c932c481101e48 * Improve usability and readability of conv2d_gemm_weight_transform Change-Id: I3333186bbc2fe4054b58ce15d910e3be7b315482 * Change variable name to weight in Conv2DGemmWeightTransformRel Change-Id: Ifb5f1f33af7512fe67c6b049b20a42a0bb2d26c9 * Fix clang-10 linting errors Change-Id: I25ccc844d9cee23766096e1daddb6180abc413a6 * Trigger tests Change-Id: Id37706fb7cf77a87a3cc817ecf8046297d9ca95a
…es (apache#5754) * Improve quantized conv2d performance for armv8 Signed-off-by: Giuseppe Rossini <[email protected]> Change-Id: I3a3d29f5332dd9b3354e8e0dfb24677a521f9c8f * Add ASF header to conv2d_gemm.py Change-Id: I33853279e39c849ae1b555a9c91d7557985a0a35 * Run clang-format-10 on c++ files Change-Id: Ieee22f032e595dabfc1616ab33466fcbf8d94365 * Fix pylint errors/warnings Change-Id: I435d4d7bca7500db99547f4401fdc0d0995a1ff4 * Fix pylint errors/warnings in topi Change-Id: I2fc1ad8453e9020072ab967c849df5390c2967b5 * Fix legalizations tests for aarch64 Change-Id: I0a67a49a7849f52ef7d57b9292ce9125bbb7cb2c * Reintroduce conv2d_nhwc_spatial_pack.arm_cpu and int16 cast Change-Id: I91b67fabd475e90a9b75f2dd5ecfee851265e0bb * Switch type of legalization depending on the strategy used Change-Id: I9a03040a8c40a6cd2658ed14c3751e05a8e19f2b * Revert last commit Change-Id: Ice34101e358e3ce8ebfb12c58f73e910ba5de8e8 * Fix the auto-tuner by registering the correct schedules Change-Id: Id9273688b2620e1ea849ab01b4c46af8fbf37fd0 * Address review comments Change-Id: Ia1755a0af7b6d159072d9f0c93c932c481101e48 * Improve usability and readability of conv2d_gemm_weight_transform Change-Id: I3333186bbc2fe4054b58ce15d910e3be7b315482 * Change variable name to weight in Conv2DGemmWeightTransformRel Change-Id: Ifb5f1f33af7512fe67c6b049b20a42a0bb2d26c9 * Fix clang-10 linting errors Change-Id: I25ccc844d9cee23766096e1daddb6180abc413a6 * Trigger tests Change-Id: Id37706fb7cf77a87a3cc817ecf8046297d9ca95a
RFC
This PR is based on the following RFC: https://discuss.tvm.ai/t/rfc-improve-quantized-convolution-performance-for-armv8-architectures/6920
High level description of the submission
The main algorithm lives in:
The Weight transform touches different files (since it is computed at compile time):
Strategies (mapping relay-node -> compute+schedules) are defined here: