
AVX schedule for conv_NCHW[x]c #1143

Merged · 7 commits · May 14, 2018

Conversation

yzhliu (Member) commented May 4, 2018

  • Add compute and schedule for conv_NCHWc
  • Add a schedule lookup table for resnet-v1-50/101/152
  • Allow generating a default AVX schedule for other workloads

In my tests, it brings around a 40% speedup for ResNet. You can try this patch with the latest NNVM. @kevinthesun @yidawang @FrozenGene @masahi

A couple of things need to be verified before merging:
It looks like the fold_scale_axis optimization conflicts with the kernel packing from OIHW to OIHW[x]i[y]o, which leads to disagreement with MXNet's output. ResNet & VGG work fine with this PR (because alter_conv2d_layout replaces conv2d with conv2d_NCHWc, skipping the optimization in fold_scale_axis), but if we remove the alter_conv2d_layout function and run ResNet50-v2 with opt_level=3, we'll see the problem.
Moreover, squeeze_net's output (https://mxnet.incubator.apache.org/api/python/gluon/model_zoo.html) is not correct; I am not sure whether it is caused by the same problem.
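
For readers unfamiliar with these layouts, below is a minimal NumPy sketch (not the PR's TVM code; the block sizes ic_bn/oc_bn and helper names are only illustrative) of the packing that conv2d_NCHWc relies on, i.e. data NCHW -> NCHW[x]c and kernel OIHW -> OIHW[x]i[y]o:

import numpy as np

def pack_data_nchwc(data, ic_bn):
    # NCHW -> NCHW[x]c: split channels into blocks of ic_bn and move the block innermost
    n, c, h, w = data.shape
    assert c % ic_bn == 0
    return data.reshape(n, c // ic_bn, ic_bn, h, w).transpose(0, 1, 3, 4, 2)

def pack_kernel_oihw_io(kernel, ic_bn, oc_bn):
    # OIHW -> OIHW[x]i[y]o: block both input and output channels
    o, i, kh, kw = kernel.shape
    assert o % oc_bn == 0 and i % ic_bn == 0
    return (kernel.reshape(o // oc_bn, oc_bn, i // ic_bn, ic_bn, kh, kw)
                  .transpose(0, 2, 4, 5, 3, 1))

data = np.random.rand(1, 64, 56, 56).astype("float32")
kernel = np.random.rand(64, 64, 3, 3).astype("float32")
print(pack_data_nchwc(data, ic_bn=8).shape)                 # (1, 8, 56, 56, 8)
print(pack_kernel_oihw_io(kernel, ic_bn=8, oc_bn=8).shape)  # (8, 8, 3, 3, 8, 8)

Keeping a contiguous innermost block of oc_bn output channels is what lets the convolution's inner loop map cleanly onto AVX vector instructions.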

yzhliu (Member, Author) commented May 9, 2018

I can confirm the fold_scale_axis behavior is expected, and the implementation here is correct.
squeeze_net is another story; it is unrelated to the changes here.

Ready to merge.

yzhliu changed the title from "[WIP] AVX schedule for conv_NCHW[x]c" to "AVX schedule for conv_NCHW[x]c" on May 9, 2018
@@ -81,9 +122,63 @@ def _declaration_conv(data, kernel, stride, padding, layout, out_dtype):
raise ValueError("not support this layout {} yet".format(layout))


@conv2d_alter_layout.register("cpu")
def _alter_conv2d_layout(attrs, inputs, tinfos):
Member commented:

Let us move the alter registry to NNVM for now. We can still make use of the generic function system. We should put more thought into this once we merge NNVM into TVM.

Member commented:

OK, I take this comment back after discussing with Yizhi.

tqchen (Member) commented May 9, 2018

@ZihengJiang can you do a review?

yzhliu (Member, Author) commented May 9, 2018

Can you do a review as well? @kevinthesun @yidawang @Laurawly

tvm.placeholder((num_filter, ic, kh, kw), dtype=out_dtype),
stride, padding, out_dtype)
sch = _get_schedule(wkl)
return _AVX_SCH_TO_DECL_FUNC[type(sch)](wkl, data, kernel)
Contributor commented:

Can we also pass sch into _AVX_SCH_TO_DECL_FUNC so that it doesn't need to call _get_schedule again? This will help later for global schedules.
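
A minimal, self-contained sketch of the suggested change (the schedule classes and function bodies here are hypothetical stand-ins, not the PR's actual code): resolve the schedule once and hand it to the function picked from the dispatch table.

from collections import namedtuple

AVXConvCommonFwd = namedtuple("AVXConvCommonFwd", ["ic_bn", "oc_bn"])
AVXConv1x1Fwd = namedtuple("AVXConv1x1Fwd", ["ic_bn", "oc_bn"])

def _decl_common(wkl, sch, data, kernel):
    # `sch` arrives as an argument; no second _get_schedule(wkl) lookup needed
    return ("common", sch.ic_bn, sch.oc_bn)

def _decl_1x1(wkl, sch, data, kernel):
    return ("1x1", sch.ic_bn, sch.oc_bn)

_AVX_SCH_TO_DECL_FUNC = {AVXConvCommonFwd: _decl_common, AVXConv1x1Fwd: _decl_1x1}

def _get_schedule(wkl):
    # stand-in for the real workload-keyed lookup table
    return AVXConvCommonFwd(ic_bn=8, oc_bn=8)

def declaration_conv(wkl, data, kernel):
    sch = _get_schedule(wkl)  # resolved exactly once
    return _AVX_SCH_TO_DECL_FUNC[type(sch)](wkl, sch, data, kernel)

The PR's final commit ("pass schedule object to compute and schedule") takes this approach.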


wkl = _get_workload(original_data, original_kernel, stride, padding, conv_out.dtype)
sch = _get_schedule(wkl)
_AVX_SCH_TO_SCH_FUNC[type(sch)](s, wkl, data_vec,
Contributor commented:

Same here. Maybe pass sch into _AVX_SCH_TO_SCH_FUNC.

tqchen (Member) commented May 12, 2018

Please let me know when the changes are approved and we can merge it in.

yzhliu (Member, Author) commented May 13, 2018

@kevinthesun confirm?

kevinthesun (Contributor)

lgtm

FrozenGene (Member)

@yzhliu Thanks. I will test it further. However, I am busy implementing NNVM's SAME padding now, which will also affect the schedules (because I will pass 4 padding arguments, not only top/left but also bottom/right). I will try to make the current schedules work with it, and then test them together with your new changes. 👍
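
For context, a minimal sketch (not FrozenGene's implementation) of how TensorFlow-style SAME padding yields four values, so that top/bottom and left/right padding can differ by one:

import math

def same_padding(in_size, kernel, stride):
    out_size = math.ceil(in_size / stride)
    total = max((out_size - 1) * stride + kernel - in_size, 0)
    before = total // 2       # top (or left)
    after = total - before    # bottom (or right), may be one larger
    return before, after

pad_top, pad_bottom = same_padding(224, kernel=3, stride=2)
pad_left, pad_right = same_padding(224, kernel=3, stride=2)
print(pad_top, pad_bottom, pad_left, pad_right)  # 0 1 0 1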

tqchen merged commit a3f1630 into apache:master on May 14, 2018
FrozenGene (Member) commented Jul 3, 2018

@yzhliu I found one potential issue. When the out_filter is 1001, the result is not right.
The workload is:
Workload(in_dtype='float32', out_dtype='float32', height=1, width=1, in_filter=1024, out_filter=1001, hkernel=1, wkernel=1, tpad=0, lpad=0, bpad=0, rpad=0, hstride=1, wstride=1)

When we enter this function:

def _get_default_schedule(wkl, simd_width):
    HPAD, WPAD = wkl.tpad + wkl.bpad, wkl.lpad + wkl.rpad
    HSTR, WSTR = wkl.hstride, wkl.wstride
    out_height = (wkl.height + HPAD - wkl.hkernel) // HSTR + 1
    out_width = (wkl.width + WPAD - wkl.wkernel) // WSTR + 1

    oc_bn = 1
    for bn in range(simd_width, 0, -1):
        if wkl.out_filter % bn == 0:
            oc_bn = bn
            break

simd_width is fp32_vec_len, i.e. 8. Since wkl.out_filter is 1001, the resulting oc_bn will be 7.
When we compute the conv tensor, we will get this:
Tensor(shape=[1, 143, 1, 1, 7], op.name=conv2d_NCHWc)

However, this result is not right. If we turn off AlterOpLayout or force oc_bn to 1, we get the correct result.

Do we not handle odd values of oc_bn, or is something else going on?
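
Replaying the quoted block-size search for out_filter=1001 (a standalone sketch, not new TVM code) shows where that shape comes from: the largest divisor of 1001 that is <= 8 is 7, so the 1001 output channels are packed into 143 blocks of 7.

simd_width = 8      # fp32_vec_len on AVX2
out_filter = 1001

oc_bn = 1
for bn in range(simd_width, 0, -1):
    if out_filter % bn == 0:
        oc_bn = bn
        break

print(oc_bn, out_filter // oc_bn)  # 7 143 -> Tensor(shape=[1, 143, 1, 1, 7], ...)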

yzhliu (Member, Author) commented Jul 3, 2018

Not sure what happens; a simple test of vectorizing over an axis of size 7 actually produces the correct result. I'll check later.

FrozenGene (Member)

@yzhliu I think I should sync the status with you. It is not related to your changes. The issue is Softmax. Softmax's FCorrectLayout is just

.set_attr<FCorrectLayout>("FCorrectLayout", ElemwiseArbitraryLayout<1, 1>)

However, Softmax should keep the original input layout, because the computation depends on the layout defined in the pre-trained model. We should add a SoftmaxCorrectLayout function to handle it.
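
A small NumPy illustration (not NNVM code) of why an arbitrary layout must not be propagated through Softmax: once the channels are split into NCHW[x]c blocks, normalizing along the old channel axis no longer sums over all classes.

import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

logits = np.random.rand(1, 8, 1, 1).astype("float32")            # NCHW, 8 "classes"
nchw_out = softmax(logits, axis=1)                               # sums over all 8 channels

packed = logits.reshape(1, 2, 4, 1, 1).transpose(0, 1, 3, 4, 2)  # NCHW4c
packed_out = softmax(packed, axis=1)                             # sums over 2 chunks only

restored = packed_out.transpose(0, 1, 4, 2, 3).reshape(1, 8, 1, 1)
print(np.allclose(nchw_out, restored))                           # False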

tqchen pushed a commit to tqchen/tvm that referenced this pull request Jul 6, 2018
* add conv2d_NCHWc compute template

* add conv_NCHWc compute decl and schedules

* allow default avx schedule

* fix lint

* remove unused schedule

* remove import nnvm.reg

* pass schedule object to compute and schedule
yzhliu (Member, Author) commented Jul 6, 2018

@FrozenGene Gotcha, thanks. Would you mind sending a PR to fix it?

FrozenGene (Member) commented Jul 7, 2018

@yzhliu In fact, we have modified a lot of code for CoreML support and also for the schedules. This work was done inside the company and we have our own rules about open source, so I need to ask my manager about open-sourcing it (including the issue we are discussing here). Personally, I would like to send a pull request to fix it.

BTW, I found that the fold_scale_axis optimization has a problem with the shuffleseg model. I investigated it, and it seems fold_scale_axis does not handle convolution with groups != 1: when groups != 1, we get a shape-check error when applying this optimization. Do you have any ideas?

yzhliu (Member, Author) commented Jul 8, 2018

No worries, I can make the PR. In my understanding, for now fold_scale_axis is only well defined for NCHW and groups=1 convolution.

FrozenGene (Member) commented Jul 9, 2018

@yzhliu OK, understood. Could you tell me why we don't support groups != 1? If it is very difficult to support, I will turn it off for now, for example by adding a param.groups == 1 check to the condition in the Conv2DScaleAxisForward / Conv2DScaleAxisBackward functions:

 // only optimize for nchw for now
  if (param.kernel_layout == "OIHW" && (*in_info)[0].axis == 1 && param.groups == 1) {

sergei-mironov pushed a commit to sergei-mironov/tvm that referenced this pull request Aug 8, 2018