AVX schedule for conv_NCHW[x]c #1143
Conversation
I can confirm. Ready to merge.
```
@@ -81,9 +122,63 @@ def _declaration_conv(data, kernel, stride, padding, layout, out_dtype):
        raise ValueError("not support this layout {} yet".format(layout))


@conv2d_alter_layout.register("cpu")
def _alter_conv2d_layout(attrs, inputs, tinfos):
```
Let us move the alter registry to NNVM for now. We can still make use of the generic function system. We should put more thought into this once we merge NNVM into TVM.
OK, I take this comment back after discussing with yizhi.
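For reference, the generic function system mentioned above works roughly like the sketch below. `tvm.target.generic_func` is the real TVM decorator; the `conv2d_alter_layout` name just mirrors the one in the diff, and the function bodies are placeholders.

```python
import tvm

@tvm.target.generic_func
def conv2d_alter_layout(attrs, inputs, tinfos):
    # default implementation: keep the original layout untouched
    return None

@conv2d_alter_layout.register("cpu")
def _alter_conv2d_layout(attrs, inputs, tinfos):
    # cpu-specific override; the real one rewrites conv2d into conv2d_NCHWc
    print("cpu override called")
    return None

# Dispatch is driven by the current target: inside an llvm target scope the
# "cpu" registration is picked.
with tvm.target.create("llvm"):
    conv2d_alter_layout(None, None, None)
```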
@ZihengJiang can you do a review?
Can you do a review as well? @kevinthesun @yidawang @Laurawly
topi/python/topi/x86/conv2d.py (outdated)

```python
        tvm.placeholder((num_filter, ic, kh, kw), dtype=out_dtype),
        stride, padding, out_dtype)
    sch = _get_schedule(wkl)
    return _AVX_SCH_TO_DECL_FUNC[type(sch)](wkl, data, kernel)
```
Can we also pass `sch` into `_AVX_SCH_TO_DECL_FUNC`, so that `_AVX_SCH_TO_DECL_FUNC` doesn't need to call `_get_schedule` again? This will help later for global schedules.
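A minimal, self-contained sketch of that suggestion (the schedule type, dict, and function names below are stand-ins for the ones in the PR): resolve the schedule once and pass it to the dispatched function, so the callee no longer calls `_get_schedule` itself.

```python
from collections import namedtuple

AVXConvCommonFwd = namedtuple("AVXConvCommonFwd", ["oc_bn"])  # stand-in schedule type

def _decl_conv_common(sch, wkl, data, kernel):
    # the callee now receives the resolved schedule directly
    return ("decl", sch.oc_bn, wkl, data, kernel)

_AVX_SCH_TO_DECL_FUNC = {AVXConvCommonFwd: _decl_conv_common}

def declaration_conv(wkl, data, kernel):
    sch = AVXConvCommonFwd(oc_bn=16)       # stands in for _get_schedule(wkl)
    return _AVX_SCH_TO_DECL_FUNC[type(sch)](sch, wkl, data, kernel)

print(declaration_conv("wkl", "data", "kernel"))
```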
topi/python/topi/x86/conv2d.py (outdated)

```python
    wkl = _get_workload(original_data, original_kernel, stride, padding, conv_out.dtype)
    sch = _get_schedule(wkl)
    _AVX_SCH_TO_SCH_FUNC[type(sch)](s, wkl, data_vec,
```
Same here. Maybe pass sch into _AVX_SCH_TO_SCH_FUNC.
Please let me know when the changes are approved and we can merge it in.
@kevinthesun confirm?
lgtm
@yzhliu Thanks. I will test it further. However, I am busy implementing NNVM's SAME padding now, which will also affect the scheduling (because I will pass 4 padding arguments, not only top/left but also bottom/right). I will try to make the current scheduling work and add it, and then test it against your new changes. 👍
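For context, a small sketch of how the four SAME-padding values (top/bottom, left/right) are usually derived per spatial dimension. This is the standard TensorFlow-style formula, not code from this PR:

```python
def same_padding(in_size, kernel, stride):
    """Return (pad_before, pad_after) for one spatial dimension under SAME padding."""
    out_size = (in_size + stride - 1) // stride          # ceil(in_size / stride)
    pad_total = max((out_size - 1) * stride + kernel - in_size, 0)
    pad_before = pad_total // 2                          # top or left
    pad_after = pad_total - pad_before                   # bottom or right
    return pad_before, pad_after

# Example: 224 input, 3x3 kernel, stride 2 -> asymmetric padding (0, 1),
# which is why top/left and bottom/right need to be passed separately.
print(same_padding(224, 3, 2))   # (0, 1)
```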
@yzhliu I found one potential issue. When out_filter is 1001, the result is not right. We enter this function:

```python
def _get_default_schedule(wkl, simd_width):
    HPAD, WPAD = wkl.tpad + wkl.bpad, wkl.lpad + wkl.rpad
    HSTR, WSTR = wkl.hstride, wkl.wstride
    out_height = (wkl.height + HPAD - wkl.hkernel) // HSTR + 1
    out_width = (wkl.width + WPAD - wkl.wkernel) // WSTR + 1

    oc_bn = 1
    for bn in range(simd_width, 0, -1):
        if wkl.out_filter % bn == 0:
            oc_bn = bn
            break
```

simd_width is fp32_vec_len, i.e. 8, and wkl.out_filter is 1001, so oc_bn ends up being 7. However, the result is not right: if we turn off AlterOpLayout or force oc_bn to 1, we get the correct result. Do we not handle odd values of oc_bn, or is something else going on?
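For reference, a quick check of the block-size selection in that loop (1001 = 7 × 11 × 13, so counting down from 8 the first divisor found is 7):

```python
# Reproduce the oc_bn selection for out_filter = 1001 with simd_width = 8.
out_filter, simd_width = 1001, 8
oc_bn = next(bn for bn in range(simd_width, 0, -1) if out_filter % bn == 0)
print(oc_bn)   # 7, since 8, 6, 5, 4, 3, 2 do not divide 1001
```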
Not sure what happens; a simple test vectorizing over an axis of size 7 actually produces the correct result. I'll check later.
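A minimal sanity check along those lines, written against the TVM schedule API of that era; this is an illustrative test, not the one actually run:

```python
import numpy as np
import tvm

# Elementwise op whose inner axis has size 7, vectorized on the CPU.
n, c = 16, 7
A = tvm.placeholder((n, c), name="A")
B = tvm.compute((n, c), lambda i, j: A[i, j] * 2.0, name="B")

s = tvm.create_schedule(B.op)
s[B].vectorize(B.op.axis[1])                     # vector width 7

f = tvm.build(s, [A, B], target="llvm -mcpu=core-avx2")
ctx = tvm.cpu(0)
a = tvm.nd.array(np.random.rand(n, c).astype("float32"), ctx)
b = tvm.nd.array(np.zeros((n, c), dtype="float32"), ctx)
f(a, b)
np.testing.assert_allclose(b.asnumpy(), a.asnumpy() * 2.0, rtol=1e-5)
```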
@yzhliu I think I should sync the status with you. It is not related to your changes; the issue is Softmax. Softmax's FCorrectLayout currently just follows whatever layout its input comes in with. However, Softmax should keep the input layout as the original one, because the computation depends on the layout defined in the pre-trained model. We should add a SoftmaxCorrectLayout function to handle it.
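To illustrate the kind of mismatch being described (a toy numpy example, not NNVM code): softmax taken over the channel axis no longer matches once the tensor has been repacked into a split-channel layout, unless the data is converted back first.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

x = np.random.rand(1, 8, 2, 2).astype("float32")       # NCHW, C = 8

# Reference: softmax over the channel axis in the original NCHW layout.
ref = softmax(x, axis=1)

# Repack to NCHW4c and apply softmax over the innermost 4c axis, the way a
# layout-oblivious op effectively would.
x_packed = x.reshape(1, 2, 4, 2, 2).transpose(0, 1, 3, 4, 2)   # N, C//4, H, W, 4c
wrong = softmax(x_packed, axis=4)

# Convert back to NCHW and compare: the results disagree, because the
# normalization only ran over 4 channels at a time.
wrong_nchw = wrong.transpose(0, 1, 4, 2, 3).reshape(1, 8, 2, 2)
print(np.allclose(ref, wrong_nchw))   # False
```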
* add conv2d_NCHWc compute template
* add conv_NCHWc compute decl and schedules
* allow default avx schedule
* fix lint
* remove unused schedule
* remove import nnvm.reg
* pass schedule object to compute and schedule
@FrozenGene gotcha. Thanks, would you mind shooting a PR to fix it?
@yzhliu In fact, we have modified a lot of code for CoreML support and also for the schedules. This work was done inside the company, and we have our own rules about open source, so I need to ask my manager about open-sourcing things like the fix we are discussing; I personally would like to send a pull request for it. BTW, I find that the fold_scale_axis optimization has a problem with the shuffleseg model. I investigated it, and it seems fold_scale_axis doesn't handle convolution with groups != 1: when groups != 1 we hit a shape-check error once the optimization is applied. Do you have any ideas?
No worries, I can make the PR. In my understanding, for now fold_scale_axis is only well defined for NCHW and groups=1 convolution.
@yzhliu OK, I see. Could you tell me why we don't support groups != 1? If supporting it is very difficult, I will just turn the optimization off for that case for now, for example by adding a condition:

```cpp
// only optimize for nchw for now
if (param.kernel_layout == "OIHW" && (*in_info)[0].axis == 1 && param.groups == 1) {
```
In my test, it brings around 40% speedup for ResNet. You can try this patch with the latest NNVM. @kevinthesun @yidawang @FrozenGene @masahi

But we need to verify something before merging: it looks like the fold_scale_axis optimization conflicts with the kernel packing from OIHW to OIHW[x]i[y]o, which leads to disagreement with MXNet's output. ResNet & VGG work fine with this PR (because alter_conv2d_layout replaces conv2d with conv2d_NCHWc, skipping the optimization in fold_scale_axis), but if we remove the alter_conv2d_layout function and run ResNet50-v2 with opt_level=3, we'll see the problem. Moreover, squeeze_net's output (https://mxnet.incubator.apache.org/api/python/gluon/model_zoo.html) is not correct; not sure whether it is caused by the same problem.
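A hedged sketch of how such a comparison could be set up with the NNVM compiler of that time. The tiny network below is only a stand-in for the real models mentioned above; `nnvm.compiler.build_config` and `opt_level` are the standard knobs that control whether AlterOpLayout and fold_scale_axis run.

```python
import nnvm.symbol as sym
import nnvm.compiler

# A toy conv + softmax network standing in for the real models discussed above.
data = sym.Variable("data")
net = sym.conv2d(data=data, channels=16, kernel_size=(3, 3), padding=(1, 1), name="conv")
net = sym.flatten(net)
net = sym.softmax(net)

shape_dict = {"data": (1, 3, 32, 32)}
target = "llvm -mcpu=core-avx2"

# opt_level=3 enables AlterOpLayout / fold_scale_axis ...
with nnvm.compiler.build_config(opt_level=3):
    graph_opt, lib_opt, params_opt = nnvm.compiler.build(net, target, shape=shape_dict)

# ... while a lower opt_level skips them, giving a reference to compare against.
with nnvm.compiler.build_config(opt_level=1):
    graph_ref, lib_ref, params_ref = nnvm.compiler.build(net, target, shape=shape_dict)
```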