
AVX schedule for conv_NCHW[x]c #1143

Merged · 7 commits · May 14, 2018

Conversation

yzhliu (Member) commented May 4, 2018

  • Add compute and schedule for conv_NCHWc
  • Add a schedule lookup table for resnet-v1-50/101/152
  • Allow generating a default AVX schedule for other workloads

In my tests, it brings around a 40% speedup for ResNet. You can try this patch with the latest NNVM. @kevinthesun @yidawang @FrozenGene @masahi

A couple of things need to be verified before merging:
It looks like the fold_scale_axis optimization conflicts with the kernel packing from OIHW to OIHW[x]i[y]o, which leads to disagreement with MXNet's output. ResNet & VGG work fine with this PR (because alter_conv2d_layout replaces conv2d with conv2d_NCHWc, skipping the optimization in fold_scale_axis), but if we remove the alter_conv2d_layout function and run ResNet50-v2 with opt_level=3, we'll see the problem.
Moreover, squeeze_net's output (https://mxnet.incubator.apache.org/api/python/gluon/model_zoo.html) is not correct; I am not sure whether it is caused by the same problem.
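
For readers unfamiliar with these layouts, below is a minimal NumPy sketch (not the PR's TVM code; the block sizes ic_bn/oc_bn and helper names are only illustrative) of the packing that conv2d_NCHWc relies on, i.e. data NCHW -> NCHW[x]c and kernel OIHW -> OIHW[x]i[y]o:

import numpy as np

def pack_data_nchwc(data, ic_bn):
    # NCHW -> NCHW[x]c: split channels into blocks of ic_bn and move the block innermost
    n, c, h, w = data.shape
    assert c % ic_bn == 0
    return data.reshape(n, c // ic_bn, ic_bn, h, w).transpose(0, 1, 3, 4, 2)

def pack_kernel_oihw_io(kernel, ic_bn, oc_bn):
    # OIHW -> OIHW[x]i[y]o: block both input and output channels
    o, i, kh, kw = kernel.shape
    assert o % oc_bn == 0 and i % ic_bn == 0
    return (kernel.reshape(o // oc_bn, oc_bn, i // ic_bn, ic_bn, kh, kw)
                  .transpose(0, 2, 4, 5, 3, 1))

data = np.random.rand(1, 64, 56, 56).astype("float32")
kernel = np.random.rand(64, 64, 3, 3).astype("float32")
print(pack_data_nchwc(data, ic_bn=8).shape)                 # (1, 8, 56, 56, 8)
print(pack_kernel_oihw_io(kernel, ic_bn=8, oc_bn=8).shape)  # (8, 8, 3, 3, 8, 8)

Keeping a contiguous innermost block of oc_bn output channels is what lets the convolution's inner loop map cleanly onto AVX vector instructions.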

yzhliu (Member, Author) commented May 9, 2018

I can confirm the fold_scale_axis behavior is expected, and the implementation here is correct.
squeeze_net is another story; it is unrelated to the changes here.

Ready to merge.

yzhliu changed the title from "[WIP] AVX schedule for conv_NCHW[x]c" to "AVX schedule for conv_NCHW[x]c" on May 9, 2018
@@ -81,9 +122,63 @@ def _declaration_conv(data, kernel, stride, padding, layout, out_dtype):
raise ValueError("not support this layout {} yet".format(layout))


@conv2d_alter_layout.register("cpu")
def _alter_conv2d_layout(attrs, inputs, tinfos):
Member commented:

Let us move the alter registry to NNVM for now. We can still make use of the generic function system. We should put more thought into this once we merge NNVM into TVM.

Member commented:

OK, I take this comment back after discussing with Yizhi.

tqchen (Member) commented May 9, 2018

@ZihengJiang can you do a review?

yzhliu (Member, Author) commented May 9, 2018

Can you do a review as well? @kevinthesun @yidawang @Laurawly

tvm.placeholder((num_filter, ic, kh, kw), dtype=out_dtype),
stride, padding, out_dtype)
sch = _get_schedule(wkl)
return _AVX_SCH_TO_DECL_FUNC[type(sch)](wkl, data, kernel)
Contributor commented:

Can we also pass sch into _AVX_SCH_TO_DECL_FUNC so that it doesn't need to call _get_schedule again? This will help later for global schedules.
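
A minimal, self-contained sketch of the suggested change (the schedule classes and function bodies here are hypothetical stand-ins, not the PR's actual code): resolve the schedule once and hand it to the function picked from the dispatch table.

from collections import namedtuple

AVXConvCommonFwd = namedtuple("AVXConvCommonFwd", ["ic_bn", "oc_bn"])
AVXConv1x1Fwd = namedtuple("AVXConv1x1Fwd", ["ic_bn", "oc_bn"])

def _decl_common(wkl, sch, data, kernel):
    # `sch` arrives as an argument; no second _get_schedule(wkl) lookup needed
    return ("common", sch.ic_bn, sch.oc_bn)

def _decl_1x1(wkl, sch, data, kernel):
    return ("1x1", sch.ic_bn, sch.oc_bn)

_AVX_SCH_TO_DECL_FUNC = {AVXConvCommonFwd: _decl_common, AVXConv1x1Fwd: _decl_1x1}

def _get_schedule(wkl):
    # stand-in for the real workload-keyed lookup table
    return AVXConvCommonFwd(ic_bn=8, oc_bn=8)

def declaration_conv(wkl, data, kernel):
    sch = _get_schedule(wkl)  # resolved exactly once
    return _AVX_SCH_TO_DECL_FUNC[type(sch)](wkl, sch, data, kernel)

The PR's final commit ("pass schedule object to compute and schedule") takes this approach.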


wkl = _get_workload(original_data, original_kernel, stride, padding, conv_out.dtype)
sch = _get_schedule(wkl)
_AVX_SCH_TO_SCH_FUNC[type(sch)](s, wkl, data_vec,
Contributor commented:

Same here. Maybe pass sch into _AVX_SCH_TO_SCH_FUNC.

tqchen (Member) commented May 12, 2018

Please let me know when the changes are approved and we can merge it in.

yzhliu (Member, Author) commented May 13, 2018

@kevinthesun confirm?

kevinthesun (Contributor)

lgtm

FrozenGene (Member)

@yzhliu Thanks. I will test it further. However, I am busy implementing NNVM's SAME padding now, which will also affect the schedules (because I will pass 4 padding arguments, not only top/left but also bottom/right). I will try to make the current schedules work with it, and then test them together with your new changes. 👍
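
For context, a minimal sketch (not FrozenGene's implementation) of how TensorFlow-style SAME padding yields four values, so that top/bottom and left/right padding can differ by one:

import math

def same_padding(in_size, kernel, stride):
    out_size = math.ceil(in_size / stride)
    total = max((out_size - 1) * stride + kernel - in_size, 0)
    before = total // 2       # top (or left)
    after = total - before    # bottom (or right), may be one larger
    return before, after

pad_top, pad_bottom = same_padding(224, kernel=3, stride=2)
pad_left, pad_right = same_padding(224, kernel=3, stride=2)
print(pad_top, pad_bottom, pad_left, pad_right)  # 0 1 0 1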

tqchen merged commit a3f1630 into apache:master on May 14, 2018
FrozenGene (Member) commented Jul 3, 2018

@yzhliu I found one potential issue. When the out_filter is 1001, the result is not right.
The workload is:
Workload(in_dtype='float32', out_dtype='float32', height=1, width=1, in_filter=1024, out_filter=1001, hkernel=1, wkernel=1, tpad=0, lpad=0, bpad=0, rpad=0, hstride=1, wstride=1)

When we enter this function:

def _get_default_schedule(wkl, simd_width):
    HPAD, WPAD = wkl.tpad + wkl.bpad, wkl.lpad + wkl.rpad
    HSTR, WSTR = wkl.hstride, wkl.wstride
    out_height = (wkl.height + HPAD - wkl.hkernel) // HSTR + 1
    out_width = (wkl.width + WPAD - wkl.wkernel) // WSTR + 1

    oc_bn = 1
    for bn in range(simd_width, 0, -1):
        if wkl.out_filter % bn == 0:
            oc_bn = bn
            break

simd_width is fp32_vec_len, i.e. 8. Since wkl.out_filter is 1001, the resulting oc_bn will be 7.
When we compute the conv tensor, we will get this:
Tensor(shape=[1, 143, 1, 1, 7], op.name=conv2d_NCHWc)

However, this result is not right. If we turn off AlterOpLayout or force oc_bn to 1, we get the correct result.

Do we not handle odd values of oc_bn, or is something else going on?
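
Replaying the quoted block-size search for out_filter=1001 (a standalone sketch, not new TVM code) shows where that shape comes from: the largest divisor of 1001 that is <= 8 is 7, so the 1001 output channels are packed into 143 blocks of 7.

simd_width = 8      # fp32_vec_len on AVX2
out_filter = 1001

oc_bn = 1
for bn in range(simd_width, 0, -1):
    if out_filter % bn == 0:
        oc_bn = bn
        break

print(oc_bn, out_filter // oc_bn)  # 7 143 -> Tensor(shape=[1, 143, 1, 1, 7], ...)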

yzhliu (Member, Author) commented Jul 3, 2018

Not sure what happens; a simple test of vectorizing over an axis of size 7 actually produces the correct result. I'll check later.

FrozenGene (Member)

@yzhliu I think I should sync the status with you. It is not related to your changes. The issue is Softmax. Softmax's FCorrectLayout is just

.set_attr<FCorrectLayout>("FCorrectLayout", ElemwiseArbitraryLayout<1, 1>)

However, Softmax should keep the original input layout, because the computation depends on the layout defined in the pre-trained model. We should add a SoftmaxCorrectLayout function to handle it.
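
A small NumPy illustration (not NNVM code) of why an arbitrary layout must not be propagated through Softmax: once the channels are split into NCHW[x]c blocks, normalizing along the old channel axis no longer sums over all classes.

import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

logits = np.random.rand(1, 8, 1, 1).astype("float32")            # NCHW, 8 "classes"
nchw_out = softmax(logits, axis=1)                               # sums over all 8 channels

packed = logits.reshape(1, 2, 4, 1, 1).transpose(0, 1, 3, 4, 2)  # NCHW4c
packed_out = softmax(packed, axis=1)                             # sums over 2 chunks only

restored = packed_out.transpose(0, 1, 4, 2, 3).reshape(1, 8, 1, 1)
print(np.allclose(nchw_out, restored))                           # False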

tqchen pushed a commit to tqchen/tvm that referenced this pull request Jul 6, 2018
* add conv2d_NCHWc compute template

* add conv_NCHWc compute decl and schedules

* allow default avx schedule

* fix lint

* remove unused schedule

* remove import nnvm.reg

* pass schedule object to compute and schedule
yzhliu (Member, Author) commented Jul 6, 2018

@FrozenGene Gotcha, thanks. Would you mind sending a PR to fix it?

FrozenGene (Member) commented Jul 7, 2018

@yzhliu In fact, we have modified a lot of code for CoreML support and also for the schedules. This work was done inside the company and we have our own rules about open source, so I need to ask my manager about open-sourcing it (including the issue we are discussing here). Personally, I would like to send a pull request to fix it.

BTW, I found that the fold_scale_axis optimization has a problem with the shuffleseg model. I investigated it, and it seems fold_scale_axis does not handle convolution with groups != 1: when groups != 1, we get a shape-check error when applying this optimization. Do you have any ideas?

yzhliu (Member, Author) commented Jul 8, 2018

No worries, I can make the PR. In my understanding, for now fold_scale_axis is only well defined for NCHW and groups=1 convolution.

FrozenGene (Member) commented Jul 9, 2018

@yzhliu OK, understood. Could you tell me why we don't support groups != 1? If it is very difficult to support, I will turn it off for now, for example by adding a param.groups == 1 check to the condition in the Conv2DScaleAxisForward / Conv2DScaleAxisBackward functions:

 // only optimize for nchw for now
  if (param.kernel_layout == "OIHW" && (*in_info)[0].axis == 1 && param.groups == 1) {

sergei-mironov pushed a commit to sergei-mironov/tvm that referenced this pull request Aug 8, 2018