[X86] [NNVM] [TOPI] Implement NCHWc Winograd convolutions #2111

Closed
ajtulloch wants to merge 1 commit from the NCHWc-Winograd-X86 branch

Conversation

ajtulloch (Contributor) commented Nov 14, 2018

This is the implementation alluded to in
https://discuss.tvm.ai/t/improved-direct-winograd-nchwc-cpu-implementation-with-resnet-50-results/

It is a pretty standard Winograd implementation, modified for NCHWc
layout. It achieves reasonable speedups (up to 2x vs current
implementation) on a number of ResNet 3x3 layers on SKL and AVX.

Benchmarks:

  • Single-threaded, GCE Skylake instance, ~120GFLOP/s peak fp32 throughput.
  • 1000 iterations of AutoTVM tuning per task.
Workload                               Baseline (GFLOP/s)   Winograd (eff GFLOP/s)
3x3s1p1, 1x64x56x56 -> 1x64x56x56      104                  193
3x3s1p1, 1x128x28x28 -> 1x128x28x28    110                  221
3x3s1p1, 1x256x14x14 -> 1x256x14x14    107                  182
3x3s1p1, 1x512x7x7 -> 1x512x7x7        93                   176
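
For context, a minimal sketch of how an effective-GFLOP/s figure like those above is conventionally read: direct-convolution FLOPs divided by measured time (this convention and the per-layer arithmetic are illustrative assumptions, not stated in the PR).

    # Effective throughput for the first workload above, assuming the usual convention
    # of counting direct-convolution FLOPs (2 * N * Cin * Cout * Hout * Wout * KH * KW)
    # even though Winograd performs fewer multiplications.
    N, Cin, Cout, H, W, KH, KW = 1, 64, 64, 56, 56, 3, 3
    direct_flops = 2 * N * Cin * Cout * H * W * KH * KW    # 231,211,008 ~ 0.23 GFLOP
    eff_gflops = 193                                        # from the table above
    runtime_ms = direct_flops / (eff_gflops * 1e9) * 1e3    # ~1.2 ms per convolution
    print(round(runtime_ms, 2))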

TODO:

  • Unit tests.
  • Modifications to tune_nnvm_x86.py.
  • Implement parallelism.
  • More benchmarking results.

masahi (Member) commented Nov 15, 2018

@ajtulloch you don't need another _contrib_conv2d_NCHWc_winograd_without_weight_transform symbol. You can do weight transform any way you like inside alter_op_layout. See how ARM winograd does this here.

You shouldn't need to change anything in nnvm to enable winograd for x86.

ajtulloch (Contributor, Author)

@masahi how can I avoid adding _contrib_conv2d_NCHWc_winograd_without_weight_transform? This is an operator that takes a 5D NCHWc input tensor and a pre-transformed 6D transformed_weight tensor, and does a 3x3 Winograd convolution. I don't see how to express that with either _contrib_conv2d_NCHWc (need to modify decl to use Winograd method) or _contrib_conv2d_winograd_without_weight_transform (doesn't accept NCHWc input tensor).

I agree that we could probably remove the _contrib_conv2d_NCHWc_winograd_weight_transform by expressing it as a composition of conv2d_winograd_weight_transform + reshape + transpose.

But just as NCHW Winograd convolution required new NNVM ops, and NCHWc convolution required new NNVM ops, we need new ops for NCHWc Winograd AFAICT.
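
For illustration, a minimal sketch of the tensor shapes being described (the channel block sizes, the Winograd tile size, and the exact ordering of the 6D transformed weight below are assumptions, not the PR's definitions):

    # 5D NCHWc input plus a pre-transformed 6D weight for a 3x3 Winograd convolution.
    ic_bn, oc_bn = 16, 16                  # assumed channel block sizes
    alpha = 4 + 3 - 1                      # assumed F(4x4, 3x3) => 6x6 transform tiles
    data_shape = (1, 64 // ic_bn, 56, 56, ic_bn)   # NCHW16c, e.g. the 1x64x56x56 workload
    transformed_weight_shape = (alpha, alpha,      # one plausible 6D layout (assumption)
                                64 // oc_bn, 64 // ic_bn, ic_bn, oc_bn)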

@ajtulloch force-pushed the NCHWc-Winograd-X86 branch 2 times, most recently from 91a968e to 327c7b3 on November 15, 2018 21:30
masahi (Member) commented Nov 15, 2018

Have you tried _contrib_conv2d_winograd_without_weight_transform with NCHWc inputs? I have a home-grown NCHWc winograd + NNVM integration internally. _contrib_conv2d_winograd_without_weight_transform worked for me. It does accept 5d inputs.

As long as you have a correct

    new_attrs['layout'] = 'NCHW%dc' % ic_bn
    new_attrs['out_layout'] = 'NCHW%dc' % oc_bn

setup in _alter_conv2d_layout, I think it should work.

For the filter shape, the current solution in nnvm is not to do a weight shape check during shape inference (see here). So you can use a filter transform with any layout.
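
As a rough sketch of the flow masahi describes (only the symbol names and the layout/out_layout attributes come from this thread; the attribute copying, block sizes, and tile size are assumptions):

    import nnvm.symbol as sym

    def _alter_conv2d_layout(attrs, inputs, tinfos):
        # Assumed channel block sizes and Winograd tile size, for illustration only.
        ic_bn, oc_bn, tile_size = 16, 16, 4
        data, weight = inputs

        new_attrs = {k: attrs[k] for k in attrs.keys()}   # copy existing attrs
        new_attrs['layout'] = 'NCHW%dc' % ic_bn
        new_attrs['out_layout'] = 'NCHW%dc' % oc_bn
        new_attrs['tile_size'] = tile_size

        # Pre-transform the weight once at alter-layout time, then reuse the existing
        # without-weight-transform symbol; skipping the weight shape check makes the
        # transformed layout acceptable to shape inference.
        transformed = sym.contrib.conv2d_winograd_weight_transform(weight, tile_size=tile_size)
        return sym.contrib.conv2d_winograd_without_weight_transform(data, transformed, **new_attrs)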

@ajtulloch force-pushed the NCHWc-Winograd-X86 branch 2 times, most recently from 06d6c5d to a9bdc2e on November 15, 2018 22:19
ajtulloch (Contributor, Author)

@masahi doesn't that logic apply to having duplicated conv2d/conv2d_NCHWc ops in NNVM as well? I don't see why it's reasonable to have conv2d/conv2d_NCHWc/conv2d_winograd_without_weight_transform but not also conv2d_NCHWc_winograd_without_weight_transform.

masahi (Member) commented Nov 15, 2018

I see, honestly I don't know why a new sym.contrib.conv2d_NCHWc symbol was introduced, even though it uses the same param / infer_shape/ correct_layout as sym.conv2d. @yzhliu Can you comment?

My opinion is that since we already have everything on nnvm side to enable NCHWc winograd, we don't need another duplicated symbol.

ajtulloch (Contributor, Author) commented Nov 16, 2018

@masahi I'd tend to agree, but I can also understand the appeal of being able to say 'conv2d and conv2d_winograd ops take a 4D NCHW input and produce 5D NCHWc output, conv2d_NCHWc and conv2d_NCHWc_winograd ops take a 5D NCHWc input and produce 5D NCHWc output', instead of essentially 'conv2d nodes take any dimension input and produce any dimension output'. If there's a desire from maintainers to consolidate all the convolution node types in NNVM/Relay, that's reasonable, but I'd argue it's orthogonal to this patch.

masahi (Member) commented Nov 16, 2018

FYI, the recently added CUDA int8 convolution uses NCHW4c layout with the sym.conv2d symbol (see here). And as I mentioned, conv2d_winograd can take a filter transform of arbitrary layout because no weight shape check is done.

So things are not consistent already. I think it's a good time to discuss this issue as we transition to Relay. @tqchen @merrymercy

ajtulloch (Contributor, Author)

@yzhliu, @yidawang this might be of interest to you folks.

@ajtulloch changed the title from "[X86] [NNVM] [TOPI] [WIP] Implement NCHWc Winograd convolutions" to "[X86] [NNVM] [TOPI] Implement NCHWc Winograd convolutions" Nov 16, 2018
@ajtulloch force-pushed the NCHWc-Winograd-X86 branch 2 times, most recently from 71aaacf to e0087df on November 16, 2018 03:57
yidawang (Contributor)

This looks great, thanks for the contribution @ajtulloch! We will test it out on our side.

ajtulloch (Contributor, Author)

@yidawang the parallelism part isn't well optimized (we need to conditionally enable parallelism depending on the granularity at which we compute the previous stages), but I think everything else should be good to go. Note the caveats mentioned in https://discuss.tvm.ai/t/improved-direct-winograd-nchwc-cpu-implementation-with-resnet-50-results/1017/16?u=ajtulloch. BTW, I don't have good heuristics for avoiding those cases right now, so maybe just brute-force disabling it for CIn x COut >= 256 * 256 or something would be reasonable.

yidawang (Contributor)

@ajtulloch Perhaps this paper has some insight: "Optimizing N-Dimensional, Winograd-Based Convolution for Manycore CPUs".

tqchen (Member) commented Nov 20, 2018

@ajtulloch is this PR ready for review? please request reviewers

ajtulloch (Contributor, Author) commented Nov 20, 2018

Just rebased - @merrymercy, @yidawang, @yzhliu, @masahi would you be interested in reviewing these changes?

@merrymercy self-assigned this Nov 21, 2018
def div_round_up(a, b):
    # Ceiling division: how many b-sized tiles are needed to cover a.
    return (a + b - 1) // b

# assert all(k == 3 for k in (KH, KW))
Member

enable this check?

policy='candidate', candidate=[
    [n, coo, cii, oh_m, ow_m, eps, ciii, nu, vc],
    [n, coo, oh_m, ow_m, eps, cii, ciii, nu, vc],
    # [n, coo, cii, oh_m, ow_m, ciii, nu, eps, vc],
Member

delete these candidates?

s[U].pragma(eps, 'debug_skip_region')
else:
pass
# r_kh, r_kw = s[U].op.reduce_axis
Member

add some basic schedule here



cfg.define_knob('input_tile_compute_location', [0, 1, 2, 3])
if cfg['input_tile_compute_location'].val == 0:
merrymercy (Member) commented Nov 25, 2018

This pattern appears many times. It will be good if we can wrap it into something like

cfg.define_annotate('input_tile_compute_location', [n, cii, oh_m, ow_m], policy='parallel_location')
axes = cfg['input_tile_compute_location'].apply(s, V, [n, cii, oh_m, ow_m], producer=input_tile)

Contributor

@merrymercy ,

I spent some time looking at how to replace the knob with something more elegant. It is used in several places, and the operations the knob controls differ, so I don't see how to generalize it.

  • The knob directs compute_at & vectorize, but also fuse & compute_at.
  • If you have ideas please elaborate; otherwise, can we leave it as a knob? (See the sketch below.)
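
For reference, a minimal sketch of the knob pattern under discussion (stage and axis names follow the snippets above; collapsing the if/elif chain into a list of candidate compute_at locations is an assumption about the surrounding code):

    # One integer knob picks where the producer stage is computed; merrymercy's
    # suggestion is to fold this recurring pattern into a cfg.define_annotate policy.
    cfg.define_knob('input_tile_compute_location', [0, 1, 2, 3])
    loc = cfg['input_tile_compute_location'].val
    candidate_axes = [n, cii, oh_m, ow_m]          # axes of the consumer stage V (assumed)
    s[input_tile].compute_at(s[V], candidate_axes[loc])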

parallel_axis = s[M].fuse(n, coo)
s[V].compute_at(s[M], ow_m)

# s[M].parallel(parallel_axis)
merrymercy (Member) commented Nov 25, 2018

Uncomment this.

I know that for the single-thread case this parallel annotation will produce redundant code and harm performance.
I think we should always annotate parallel in the schedule, but add some checks in
https://github.com/dmlc/tvm/blob/9473dca266e307cf1f9faece219af111686ca946/python/tvm/schedule.py#L548-L552
to skip annotating parallel when targeting the single-thread case.

Contributor

@merrymercy ,

I believe here we can just return in the single-thread case.

  • Could you hint at how to check this right here? Where can the thread count (or similar info) be extracted from? (See the sketch below.)
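
A minimal sketch of the guard being asked about; how to obtain the thread count at schedule-construction time is exactly the open question, so it is left as a plain parameter rather than an invented API call:

    def maybe_parallel(s, stage, axis, num_threads):
        # Only emit the parallel annotation when more than one thread is available;
        # in the single-thread case it produces redundant code (see the comment above).
        if num_threads > 1:
            s[stage].parallel(axis)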

@@ -414,6 +414,83 @@ NNVM_REGISTER_OP(_contrib_conv2d_winograd_without_weight_transform)

DMLC_REGISTER_PARAMETER(WinogradConv2DParam);

NNVM_REGISTER_OP(_contrib_conv2d_NCHWc_winograd_weight_transform)
merrymercy (Member) commented Nov 25, 2018

I think it is better to support the NCHWc layout in the original symbol rather than creating a new symbol, as discussed by @masahi. Registering and maintaining new symbols is harder than adding an "if" in the old symbol.
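
A purely illustrative sketch of the kind of dispatch this would mean inside a single symbol (both helper parameters are hypothetical placeholders, not existing TOPI/NNVM functions):

    # Branch on layout/algorithm inside one conv2d entry point instead of registering
    # a separate _contrib_conv2d_NCHWc_winograd_* symbol.
    def conv2d_compute(data, weight, layout, use_winograd,
                       nchwc_winograd_fn, nchwc_direct_fn):
        if layout.startswith('NCHW') and layout != 'NCHW' and use_winograd:
            return nchwc_winograd_fn(data, weight)    # weight assumed pre-transformed
        return nchwc_direct_fn(data, weight)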

merrymercy (Member)

The TODO item "Modifications to tune_nnvm_x86.py" is not finished. Please add some examples to the tutorial.

@yzhliu added the status: need update label Nov 26, 2018
tqchen (Member) commented Nov 29, 2018

@ajtulloch can you please act on @merrymercy 's comment?

@@ -0,0 +1,202 @@
"""Example code to do convolution."""
Contributor

We might also want to update the AutoTVM x86 tutorial to add a Winograd section.

tqchen (Member) commented Dec 6, 2018

ping ping @ajtulloch

ajtulloch (Contributor, Author)

Ah sorry folks, I missed the notifications. Will absolutely address this by the end of the weekend at the latest, my apologies.

yzhliu (Member) commented Dec 10, 2018

@masahi regarding sym.contrib.conv2d_NCHWc: it exists only because the schedule and compute are slightly different. I agree we can merge it into the regular conv2d implementation.

yzhliu (Member) commented Dec 18, 2018

@ajtulloch would you like to follow up on the changes?

tqchen (Member) commented Jan 31, 2019

ping @ajtulloch can we land this PR for the 0.5?

tqchen (Member) commented Mar 12, 2019

@ajtulloch can you try to get this in? :)

FrozenGene (Member)

@ajtulloch Could you update this work and port it to Relay? I also find it gives a better performance improvement on ARM than the default Winograd schedule. I suggest we add it for ARM CPU as well, for example under a name like 'winograd_nchwc'.

cbalint13 (Contributor)

@ajtulloch ,
@merrymercy , @masahi , @yidawang , @tqchen

  • I am interested in getting this PR in; may I take over it? (I can only do it in another PR.)

tqchen (Member) commented May 24, 2019

ping @ajtulloch. @cbalint13, let us wait for two days and if there is no update, you are more than welcome to take over this PR.

tqchen (Member) commented Jun 3, 2019

@cbalint13 please feel free to take over this PR, as long as you acknowledge Andrew in the new one.

tqchen (Member) commented Jul 23, 2019

Superseded by #3553, thanks to @cbalint13.

@tqchen closed this Jul 23, 2019
cbalint13 (Contributor)

@ajtulloch , @tqchen ,

#3553 is only part of the integration; I will continue with @ajtulloch's NCHWc rebase.
