[TOPI][AutoTVM] NHWC conv2d templates for ARM #3859

zhenhuaw-me · 2019-08-30T03:33:38Z

Per #3754 and #3141 (comment), we are enabling NHWC conv2d templates for ARM as a nearly final solution. The benefits include:

Enable the NHWC schedule directly. Previously, we needed to transpose between NCHW and NHWC.
AutoTVM can now tune NHWC directly. Previously, we needed to build an NCHW network to tune.
A potential performance advantage for NHWC, which is known to the community.
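To make the layout difference concrete, here is a minimal pure-Python sketch of how the same tensor element is addressed in a flat NCHW buffer versus a flat NHWC buffer, and what the transpose between them has to do. The helper names (`offset_nchw`, `offset_nhwc`, `nchw_to_nhwc`) are hypothetical and are not part of TOPI; this only illustrates the indexing, not the schedule in this PR.

```python
# Illustrative only: flat-buffer offsets for the two layouts.
# None of these helpers exist in TVM/TOPI; the names are made up for this sketch.

def offset_nchw(n, c, h, w, C, H, W):
    """Flat offset of element (n, c, h, w) in an NCHW buffer."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, h, w, c, H, W, C):
    """Flat offset of element (n, h, w, c) in an NHWC buffer."""
    return ((n * H + h) * W + w) * C + c


def nchw_to_nhwc(buf, N, C, H, W):
    """Copy a flat NCHW buffer into a new flat NHWC buffer."""
    out = [None] * (N * C * H * W)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    src = offset_nchw(n, c, h, w, C, H, W)
                    dst = offset_nhwc(n, h, w, c, H, W, C)
                    out[dst] = buf[src]
    return out


# Tiny example: 1 image, 2 channels, 2x2 pixels.
# In NCHW, channel 0 holds values 0..3 and channel 1 holds values 4..7.
nchw = list(range(8))
nhwc = nchw_to_nhwc(nchw, N=1, C=2, H=2, W=2)
print(nhwc)  # the two channel values of each pixel are now adjacent
```

In NHWC the channel values of one pixel are contiguous in memory, which is the property a schedule can exploit; whether that turns into a speedup still depends on the schedule.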

Joint work with @FrozenGene and @etaf.

This is a draft to loop in people who may be interested, @anijain2305 @tmoreau89. I will loop in more people when the PR is ready, thank you. :)

anijain2305 · 2019-08-30T03:52:57Z

@jackwish Very glad to see this :)
Does this also support int8 computation?

zhenhuaw-me · 2019-08-30T06:26:54Z

Hi Animesh, thanks for the interest. INT8 support will be added later; we'd like to improve the draft schedule first :)

tmoreau89 · 2019-08-30T18:33:31Z

Great to see this draft @jackwish ! A question I have regarding code maintainability: should we consider reusing some common infrastructure between x86 and ARM?

kevinthesun · 2019-08-30T21:25:32Z

@jackwish Glad to see this!

As we discussed in #3754, the NHWC layout can further boost conv2d on arm_cpu. In this case, we can use NHWC by default as the backend schedule for arm_cpu. Combined with the enhanced alter op layout pass, we can boost models, such as MXNet models, which use NCHW as the data layout.

tmoreau89 · 2019-08-30T21:39:59Z

topi/python/topi/arm_cpu/conv2d.py

@@ -132,177 +143,52 @@ def _callback(op):
    traverse_inline(s, outs[0].op, _callback)
    return s

+@autotvm.register_topi_schedule(schedule_conv2d_nhwc, 'arm_cpu', ['direct'])
+def schedule_conv2d_nchw_arm_cpu(cfg, outs):


rename to schedule_conv2d_nhwc_arm_cpu

Yes, thank you!

FrozenGene · 2019-08-31T02:06:01Z

@tmoreau89 Yes, I agree. We could consider opening an RFC to discuss this. Much of the code and many of the schedules can be unified between x86 and ARM. However, there will be some differences, especially tensorize for producing special instructions.

zhenhuaw-me · 2019-08-31T14:53:40Z

> Great to see this draft @jackwish ! A question I have regarding code maintainability: should we consider reusing some common infrastructure between x86 and ARM?

Yes @tmoreau89, it is really a great idea to share code between x86 and ARM (or any other CPU ISA), and we have had some internal discussions about it previously. Personally, I am thinking of having a module that handles generic CPU schedules, topi.cpu for example, and letting the arch-dependent code live in topi.{isa}. We can discuss this further; I was planning to draft an RFC next week. What do you say? :)

zhenhuaw-me · 2019-08-31T14:55:12Z

> @jackwish Glad to see this!
>
> As we discussed in #3754, the NHWC layout can further boost conv2d on arm_cpu. In this case, we can use NHWC by default as the backend schedule for arm_cpu. Combined with the enhanced alter op layout pass, we can boost models, such as MXNet models, which use NCHW as the data layout.

Yes @kevinthesun, we are working on this to improve performance, and also to enable the functionality.

tmoreau89 · 2019-08-31T17:45:32Z

@jackwish an RFC sounds like a great idea. I agree, there will be differences between backends (tensorization for ARM 8-bit and bitserial, vectorization for AVX, etc.), but overall the schedules shouldn't diverge too much. It will make maintaining the code bases more tenable and minimize technical debt in the long run.

zhenhuaw-me · 2019-09-06T08:59:38Z

Hi guys, I am a bit busy this week, will come back to this (for the RFC part) later :)

yzhliu · 2019-09-09T03:05:11Z

Is NHWC generally faster than NCHW on ARM CPUs? Should we transform NCHW to NHWC in alter_layout?

zhenhuaw-me · 2019-09-09T03:41:06Z

Hi Yizhi, thank you for asking. NHWC is not generally faster than NCHW; it still depends on the schedule. I guess we can use AutoTVM to decide which layout is preferred for specific workloads :)

jwfromm · 2019-09-10T22:10:41Z

@jackwish, are you sure NHWC isn't faster than NCHW on ARM? In my experience I've consistently observed speeds up to 5X faster and haven't found any cases where NCHW is preferable. I'm definitely in support of automatically altering to NHWC for arm_cpu workloads.

zhenhuaw-me · 2019-09-11T01:55:54Z

> @jackwish, are you sure NHWC isn't faster than NCHW on ARM? In my experience I've consistently observed speeds up to 5X faster and haven't found any cases where NCHW is preferable. I'm definitely in support of automatically altering to NHWC for arm_cpu workloads.

Hi @jwfromm, sorry if my comments were confusing. I was saying that NHWC is not always faster than NCHW. To achieve better performance, a sound schedule is needed.

It's really impressive to obtain 5X performance for NHWC over NCHW, which I think this PR cannot achieve (sad story)... Would you please share more about your design, if that is possible? I am sure it must be inspiring and insightful!

jwfromm · 2019-09-18T20:38:41Z

I've been using the NHWC schedule written by @cowanmeg, which can be found in the quantized_end2end branch of her TVM fork (https://github.com/cowanmeg/tvm/tree/quantized_end2end). Maybe she can chime in on what works well for NHWC.

zhenhuaw-me · 2019-09-19T01:55:58Z

> I've been using the NHWC schedule written by @cowanmeg, which can be found in the quantized_end2end branch of her TVM fork (https://github.com/cowanmeg/tvm/tree/quantized_end2end). Maybe she can chime in on what works well for NHWC.

Thank you @jwfromm. I have taken a look at the code (hopefully the right one), and it is very similar to this PR. Still, I am very curious about the 5X performance. Would you please share the details of how you got it (I will try @cowanmeg's schedule too)? As mobilenet1 with NCHW gets 121 ms on Raspberry Pi 3B (TVM benchmark page), 5X means ~25 ms, which I think outperforms many fast quantization optimization approaches.

jwfromm · 2019-09-24T20:43:48Z

Hi Jack,

I haven't tried benchmarking mobilenet. Because it uses separable convolutions instead of the more typical convolution layers in the models I've worked with, it's very reasonable that the data layout format might have a much smaller impact. Most recently I found that SqueezeNet takes 780ms to run in NCHW format and 182ms to run in NHWC format. It might be worth checking out your schedules on a squeezenet or vggnet (lots of standard convolutions).
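For reference, the figures quoted so far in this thread work out as follows. This is plain arithmetic on the numbers reported above (780 ms vs. 182 ms for SqueezeNet, and the 121 ms MobileNet NCHW baseline divided by a hypothetical 5X); it is not a new measurement, and the `speedup` helper is only illustrative.

```python
def speedup(baseline_ms, new_ms):
    """Ratio of baseline latency to new latency (higher means faster)."""
    return baseline_ms / new_ms


# SqueezeNet numbers reported in this thread: NCHW 780 ms, NHWC 182 ms.
squeezenet_speedup = speedup(780.0, 182.0)

# MobileNet NCHW baseline of 121 ms, if a 5X speedup were achieved.
mobilenet_at_5x_ms = 121.0 / 5.0

print(f"SqueezeNet NHWC over NCHW: {squeezenet_speedup:.2f}x")
print(f"MobileNet at a hypothetical 5X: {mobilenet_at_5x_ms:.1f} ms")
```

So the SqueezeNet report corresponds to roughly a 4.3X gain, and 5X on MobileNet would mean about 24 ms, in line with the "~25 ms" estimate above.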

ZihengJiang · 2019-09-26T22:03:17Z

Hey @jackwish, nice work on supporting the NHWC data layout. It would be great if you could share some performance numbers for common models on ARM CPUs.

zhenhuaw-me · 2019-09-27T01:53:47Z

> Hi Jack,
>
> I haven't tried benchmarking mobilenet. Because it uses separable convolutions instead of the more typical convolution layers in the models I've worked with, it's very reasonable that the data layout format might have a much smaller impact. Most recently I found that SqueezeNet takes 780ms to run in NCHW format and 182ms to run in NHWC format. It might be worth checking out your schedules on a squeezenet or vggnet (lots of standard convolutions).

Hi @jwfromm, thanks for your data. I have tried that schedule with MobileNetV1, but didn't get a performance advantage like 5X. I think the performance gain also depends on the network architecture :)

zhenhuaw-me · 2019-09-27T01:57:36Z

> Hey @jackwish , nice work for supporting NHWC data layout. It would be great if you could give some performance number of common models on ARM CPUs.

Thanks for the reminder @ZihengJiang, we will share some performance data when this is ready for review. :) We have been working on an internal project, so progress on this PR could be delayed for some days :(

anijain2305 · 2019-10-29T16:27:06Z

Hi @jackwish , any idea when this PR can be in good shape :)

zhenhuaw-me · 2019-10-31T01:33:38Z

> Hi @jackwish , any idea when this PR can be in good shape :)

Hi @anijain2305, thank you for asking. This PR was pending due to an NHWC build failure on ARM targets (w/o this patch). I didn't look into it in detail, as I have been busy with the quantization and OpenCL backends. I will re-check later this week :)

zhenhuaw-me · 2019-11-03T10:45:41Z

Rebased now. Collecting benchmarking data.

PS. Though we expected NHWC to have the potential for better performance on CPU than NCHW, the result depends on how the schedule is designed. Say, a naive NHWC schedule cannot outperform a carefully designed NCHW schedule.
We (the AliOS team at Alibaba) have not observed better NHWC performance with spatial pack before, as we have not put much effort into it; we were focusing on other schedules. So this patch may not outperform the original NCHW spatial pack schedule. But it is still a good try :)

As some frontends (tflite for example) are using NHWC as the default layout, we are enabling NHWC schedule templates in TOPI and AutoTVM.

zhenhuaw-me · 2019-11-21T10:56:13Z

Rewrote the schedule. For the workload (input/output HW 64, input/output channels 64, kernel size 1), we now have comparable performance on a Raspberry Pi 3B+ with AArch64 Ubuntu (~5 ms). Currently tuning a larger workload.

So basically this is ready for review; please have a look, @tmoreau89 @ZihengJiang.

Notes:

Kernel prepack is not enabled yet. We'd like to hold that until depthwise NHWC is done and AlterOpLayout is simplified.
The schedule won't be used unless the NHWC-to-NCHW layout transformation in AlterOpLayout is disabled. To try it out, put return None in AlterOpLayout.
AlterOpLayout will be updated when we can run the whole MobileNet in NHWC.
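As a rough sanity check on the workload quoted above, the following sketch counts the multiply-accumulates of that 1x1 convolution and converts the ~5 ms figure into an approximate throughput. Batch size 1 and stride 1 are assumptions (the comment does not state them), and 5 ms is the approximate number from the comment, so the result is only indicative. The function name is hypothetical.

```python
def conv2d_macs(out_h, out_w, in_channels, out_channels, kernel_h, kernel_w, batch=1):
    """Multiply-accumulate count of a dense conv2d (layout does not change it)."""
    return batch * out_h * out_w * out_channels * in_channels * kernel_h * kernel_w


# Workload from the comment: HW 64, channels 64, kernel size 1.
# Assumed here: batch size 1, stride 1 (so output HW stays 64).
macs = conv2d_macs(64, 64, 64, 64, 1, 1)
flops = 2 * macs            # one multiply plus one add per MAC
latency_s = 5e-3            # "~5 ms" as reported, approximate
gflops = flops / latency_s / 1e9

print(f"MACs: {macs}")
print(f"Approximate throughput: {gflops:.2f} GFLOP/s")
```

The MAC count is identical for NCHW and NHWC; only the memory access pattern, and therefore the achievable throughput, differs between layouts.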

zhenhuaw-me · 2019-11-21T10:59:59Z

For the CI sanity check, I failed to reproduce it locally :(. And I think the warning (shown below) should be ignored, because we are really using cfg['compat'].val as an integer rather than a bool.

python3 -m pylint topi/python/topi --rcfile=/workspace/tests/lint/pylintrc
Using config file /workspace/tests/lint/pylintrc
************* Module topi.arm_cpu.conv2d_spatial_pack
topi/python/topi/arm_cpu/conv2d_spatial_pack.py:314: [R1706(consider-using-ternary), schedule_conv2d_spatial_pack_nhwc] Consider using ternary (oco if cfg['compat'].val else owo)
topi/python/topi/arm_cpu/conv2d_spatial_pack.py:331: [R1706(consider-using-ternary), schedule_conv2d_spatial_pack_nhwc] Consider using ternary (oco if cfg['compat'].val else owo)
--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, -0.00)

I suggest disabling that warning, @tqchen. Anyway, if disabling it globally is too strong, I will disable it in this file only.
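The kind of code that triggers this warning can be sketched as below. This is an assumption about the shape of the flagged lines, inferred from the suggested ternary in the log: pylint's R1706 check recognizes the `[false_value, true_value][condition]` form, which is also how one selects by an integer index. The function names and the string stand-ins for the axes are hypothetical, and the actual lines in conv2d_spatial_pack.py may differ.

```python
def select_by_index(knob, owo, oco):
    """Treat the knob as an integer index: 0 -> owo, 1 -> oco."""
    # pylint reads this as an old-style ternary and reports R1706;
    # the inline comment silences it for this line only.
    return [owo, oco][knob]  # pylint: disable=R1706


def select_by_ternary(knob, owo, oco):
    """What pylint suggests: treat the knob as a truth value."""
    return oco if knob else owo


# The two forms agree for knob values 0 and 1 ...
print(select_by_index(0, "owo", "oco"), select_by_ternary(0, "owo", "oco"))
print(select_by_index(1, "owo", "oco"), select_by_ternary(1, "owo", "oco"))

# ... but not for other integers: indexing fails, the ternary silently picks oco.
try:
    select_by_index(2, "owo", "oco")
    index_error_raised = False
except IndexError:
    index_error_raised = True
print(index_error_raised, select_by_ternary(2, "owo", "oco"))
```

This difference is why an integer-valued knob and a boolean condition are not interchangeable, and why suppressing the message per line or per file is a reasonable alternative to rewriting the expression.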

FrozenGene · 2019-11-21T11:13:47Z

@jackwish Disabling it in the current file is acceptable.

topi/python/topi/arm_cpu/conv2d_spatial_pack.py

zhenhuaw-me · 2019-11-22T03:36:17Z

Thanks @FrozenGene for your quick review, comments addressed.

tmoreau89 · 2019-11-22T06:24:07Z

@jackwish great work, thank you for pushing this PR through since it was opened. I'm glad the performance is on par on a small kernel. Do you mind checking tuned performance for e2e resnet18 and/or mobilenet? One thing to keep track of is the schedule space. If the schedule space for your implementation turns out to be very large, it might make tuning quite expensive. One way to get around this is to use the XGBoost tuner and limit the number of trials to something like 1000.

tmoreau89 · 2019-11-22T06:25:29Z

Other than that, I'd recommend adding more comments in the schedule code. Particularly in terms of decisions that affect compute declaration/schedule space definition. For instance, why did you list the different dimension reorderings etc. This will help future maintainers, or folks who'd want to write future schedule templates for ARM.

zhenhuaw-me · 2019-11-22T07:00:57Z

Hi @tmoreau89, thanks for your comments!

> Do you mind checking tuned performance for e2e resnet18 and/or mobilenet?

I'd like to provide e2e performance too, though it seems pretty hard before we have a dedicated depthwise NHWC schedule. I will tune some workloads from MobileNet.

> One thing to keep track of is the schedule space.

The tuning space is a bit larger than for NCHW, but still manageable. When evaluating the performance, I used 1000 trials with an early stop of 500 for both NCHW and NHWC. The best result is obtained at about the 500th iteration, even with a sufficiently large workload (image size 128 with input/output channels 128).

> I'd recommend adding more comments in the schedule code.

For sure, I agree with this. Most of the TVM schedule primitives are straightforward (to people who have experience), while the reorderings are basically for locality in general across different workloads.
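The interaction between a trial budget of 1000 and an early stop of 500 mentioned above can be modeled with a small loop. This is a simplified pure-Python sketch, not AutoTVM's implementation: the `tune` function, the integer cost list, and the stopping rule (stop once no improvement has been seen for `early_stopping` trials) are assumptions made for illustration.

```python
def tune(costs, n_trial=1000, early_stopping=500):
    """Scan measured costs with a trial budget and early stopping.

    Stops after n_trial measurements, or once the best cost has not
    improved for early_stopping consecutive trials.
    Returns (best_cost, best_iter, trials_run).
    """
    best_cost, best_iter = float("inf"), -1
    trials_run = 0
    for i, cost in enumerate(costs):
        if i >= n_trial:
            break
        trials_run = i + 1
        if cost < best_cost:
            best_cost, best_iter = cost, i
        if i >= best_iter + early_stopping:
            break
    return best_cost, best_iter, trials_run


# Synthetic costs (arbitrary units): improve steadily until trial 500, then plateau.
costs = [max(500, 1000 - i) for i in range(2000)]

full_budget = tune(costs)                       # budget ends before early stop fires
short_patience = tune(costs, early_stopping=200)  # early stop fires first
print(full_budget, short_patience)
```

With the best result found around trial 500, an early stop of 500 and a budget of 1000 end at about the same point, so a shorter patience is what actually saves measurements in this model.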

tmoreau89 · 2019-12-04T03:41:35Z

@jackwish any updates on the e2e mobilenet tuning results? thanks!

zhenhuaw-me · 2019-12-05T06:10:29Z

> @jackwish any updates on the e2e mobilenet tuning results? thanks!

@tmoreau89 I am very sorry for the delay; we have been working on enabling some models internally. I will update the data as soon as it is available.

tqchen · 2019-12-22T05:00:30Z

What is the status of the PR, is it ready to be merged?

FrozenGene · 2019-12-26T11:48:02Z

@jackwish Do you have spare time to continue handling this? It would be nicer if we could support the NHWC layout for depthwise convolution in this PR, because this would help users tune the whole MobileNet when using TFLite. I hope we can get it in ASAP.

FrozenGene · 2019-12-26T11:50:18Z

@jackwish If you don't have time to benchmark end-to-end results against NCHW, I think it doesn't matter; we could support this functionality first and continue to evolve it. @tqchen what is your opinion on this?

tqchen · 2019-12-26T17:37:30Z

Given the PR stands complete, I will merge it in for now. It would be great if we could follow up with a thread for benchmarks :) Thanks @jackwish @FrozenGene @tmoreau89 @snowolfhawk @jwfromm @yzhliu

zhenhuaw-me · 2019-12-27T08:38:48Z

Thanks @tqchen for your help in merging this PR. The long process has been interrupted many times due to busy internal work; I am sorry for that and for the related to-do items. Currently, I have lost access to the development environment as I have left the team, but I may still be available to help improve related schedules as an individual. One thing to note for this PR is that the NHWC->NCHW layout transform needs to be disabled to tune with this NHWC schedule. But please don't rush into it, as I have not tested it with the latest code.

Thank you everyone.
Best regards

* [AutoTVM][TOPI] NHWC conv2d templates (spatial pack) for ARM

  As some frontends (tflite for example) are using NHWC as the default layout, we are enabling NHWC schedule templates in TOPI and AutoTVM.

* some comments fix


This was referenced Aug 30, 2019

[TFLite] Convert TFLite NCHW to NHWC #3141

Merged

[Relay][Legalize][ARM_CPU] Handling NHWC layout for arm_cpu. #3754

Merged

tmoreau89 mentioned this pull request Aug 30, 2019

[Relay] Bitserial ops #3844

Merged

tmoreau89 reviewed Aug 30, 2019


zhenhuaw-me mentioned this pull request Sep 3, 2019

[Test] enable NHWC of relay.testing.mobilenet #3886

Merged

tqchen assigned ZihengJiang Sep 13, 2019

zhenhuaw-me mentioned this pull request Sep 19, 2019

[TOPI] Move conv2d spatial pack schedule to dedicated file #3972

Merged

zhenhuaw-me marked this pull request as ready for review November 3, 2019 10:29

[AutoTVM][TOPI] NHWC conv2d templates (spatial pack) for ARM

e83f8e6

As some frontends (tflite for example) are using NHWC as the default layout, we are enabling NHWC schedule templates in TOPI and AutoTVM.

zhenhuaw-me changed the title ~~[WIP] [AutoTVM][TOPI] NHWC conv2d templates for ARM~~ [AutoTVM][TOPI] NHWC conv2d templates for ARM Nov 21, 2019

zhenhuaw-me changed the title ~~[AutoTVM][TOPI] NHWC conv2d templates for ARM~~ [TOPI][AutoTVM] NHWC conv2d templates for ARM Nov 21, 2019

zhenhuaw-me requested a review from tmoreau89 November 21, 2019 11:01

FrozenGene reviewed Nov 21, 2019


some comments fix

db4fe6e

FrozenGene mentioned this pull request Dec 26, 2019

[AutoTVM] Tuning fails for an NHWC network on Arm CPU #4542

Closed

tqchen approved these changes Dec 26, 2019


tqchen merged commit 672b090 into apache:master Dec 26, 2019

zhenhuaw-me deleted the topi/conv-nhwc branch December 27, 2019 08:39

tqchen mentioned this pull request Jan 15, 2020

[AutoTVM][TOPI] AutoTVM support for NHWC con2d #3858

Closed

anijain2305 mentioned this pull request Apr 15, 2020

Schedule Transferability between Intel and ARM CPU targets #5340

Closed


zhiics mentioned this pull request Sep 15, 2020

TVM v0.7 Release Note Candidate #6486

Closed
