[OpenCL] Add vectorization to cuda conv2d_nhwc schedule #8636
Conversation
Force-pushed from dbbc678 to 5fa169a
Force-pushed from 5fa169a to 1a931a1
@echuraev Thanks!
Another thing: you're testing on an OpenCL device, while the schedule is in the topi/cuda dir. It would be very nice if you could help move this schedule to the topi/gpu dir.
@jcf94 thank you for your comments. I applied your comments and moved conv2d_nhwc to topi/gpu.
Yes, I ran the topology from #8146 on my Samsung Galaxy A71 and got about a 6-7x performance boost.
Force-pushed from a44c81e to ae90d88
@jcf94 also, I suppose that we should update the statistics in tophub and rename
CC-ing @masahi - would this also benefit Vulkan FP16 codegen? And in general, do we want to maintain TOPI definitions for both CUDA and GPU, or rename it all to GPU to show that it encompasses more than just Nvidia backends? That would, I believe, be the topic of a larger RFC. @echuraev could you confirm that making those changes to the schedule templates doesn't lead to regressions in the CUDA codegen on Nvidia GPUs? Thank you!
Thanks @echuraev, this is really interesting. With the new commit, I was able to generate a packed float16x2 instruction for AMD Vega or newer GPUs, which is supposed to be 2x faster than normal fp32 or fp16 instructions: https://gist.github.com/masahi/2de1a7dc87e2068ffb50ba6135273f95#file-conv2d_nhwc_float16x2-s-L495-L496. I'm going to experiment further.
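For readers following along, here is a minimal sketch (hypothetical, not the schedule touched by this PR) of the mechanism behind that observation: vectorizing the innermost axis of a TE schedule turns the corresponding loads and stores into vector (ramp) accesses, which the GPU backends can lower to packed types such as float16x2 when the target supports them. The function name and split factors below are illustrative assumptions.

```python
# Minimal, illustrative example: vectorizing the innermost axis so the
# backend can emit vector loads/stores (e.g. float16x2 on targets with
# packed fp16 support). Not the conv2d_nhwc schedule from this PR.
import tvm
from tvm import te

def demo_vectorized_copy(n=1024, dtype="float16", vec=2):
    A = te.placeholder((n,), dtype=dtype, name="A")
    B = te.compute((n,), lambda i: A[i] * tvm.tir.const(2, dtype), name="B")
    s = te.create_schedule(B.op)

    bx, tx = s[B].split(B.op.axis[0], factor=64)   # block / thread tiles
    tx, vi = s[B].split(tx, factor=vec)            # peel off a vector-lane axis
    s[B].bind(bx, te.thread_axis("blockIdx.x"))
    s[B].bind(tx, te.thread_axis("threadIdx.x"))
    s[B].vectorize(vi)                             # inner accesses become vector

    return tvm.lower(s, [A, B], simple_mode=True)

print(demo_vectorized_copy())
```

The lowered TIR shows ramp-style vector accesses on the inner axis, which is what lets the CUDA/OpenCL/ROCm code generators emit vector types.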
CC @mei-ye on the above comment - investigating if using vector types can lead to improved FP16 codegen on AMD GPUs.
Force-pushed from e19c67a to 256f524
@mbrookhart I wasn't able to reproduce the regression on an NVIDIA 3070. As a possible solution, I can create a separate
Force-pushed from 256f524 to 6d9d210
@jcf94, @masahi, @tmoreau89 could you please take a look at this PR once again?
So it appears @mbrookhart has encountered a regression, and that regression wasn't reproduced by @echuraev on a slightly different GPU. A possible solution, to be extra safe, would be to introduce a new set of TOPI schedules for vectorized GPU operator implementations. WDYT @jwfromm and @mbrookhart?
Sorry, my mistake. I didn't write an update. @mbrookhart remeasured autotvm tuning on the latest code from my branch and got the same numbers as on the main branch. Now the table can be updated:
So we don't have a regression on the latest code.
I think it is good to go as long as there is no concern about regression. Does the new schedule subsume the old schedule in its search space?
Yes, it does. When the vectorization factor is equal to 1, the generated code should be the same as with the old schedule.
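To illustrate that point, here is a hedged AutoTVM-style sketch (not the PR's actual conv2d_nhwc template) of a vectorization-factor knob: when the tuner picks 1, the vectorized inner loop has extent 1 and the lowered code matches the scalar schedule, so the old schedule remains a point in the new search space. The template name, knob name, and candidate values are assumptions made for demonstration.

```python
# Illustrative AutoTVM template with a vectorization-factor knob.
# A factor of 1 reproduces the old, unvectorized schedule.
import tvm
from tvm import te, autotvm

@autotvm.template("demo/vectorized_copy")
def vectorized_copy(n):
    A = te.placeholder((n,), name="A", dtype="float32")
    B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
    s = te.create_schedule(B.op)

    cfg = autotvm.get_config()
    cfg.define_knob("vec", [1, 2, 4])          # 1 == old scalar schedule

    bx, tx = s[B].split(B.op.axis[0], factor=128)
    tx, vi = s[B].split(tx, factor=cfg["vec"].val)
    s[B].bind(bx, te.thread_axis("blockIdx.x"))
    s[B].bind(tx, te.thread_axis("threadIdx.x"))
    s[B].vectorize(vi)                         # effectively a no-op when the factor is 1

    return s, [A, B]
```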
cc @jcf94
@jcf94 can you please take another look at this? It's currently being blocked by your request for changes.
@jcf94 Could you please take a look once again?
@jcf94 - please update the PR review based on the latest comments and updates to the PR. I will dismiss your review in 24 hours so we can unblock this PR, given that we haven't heard back from you.
Adding vectorization significantly improved performance. About 6-7x boost.
Force-pushed from 6d9d210 to 828cdfa
Changes were requested on Aug 5th by @jcf94. On the 6th, Egor addressed the changes. We then pinged 23, 17, 10, 6, and 3 days ago but got no response. As such, in order to unblock this PR, I will dismiss the review.
I've dismissed the review that was blocking progress. In my view, we've provided sufficient pings to notify the reviewer who requested changes. @jwfromm I'll leave it to you to merge the PR when convenient.
Thanks @echuraev, this is now merged.
* main: (80 commits)
  Introduce centralised name transformation functions (apache#9088)
  [OpenCL] Add vectorization to cuda conv2d_nhwc schedule (apache#8636)
  [6/6] Arm(R) Ethos(TM)-U NPU codegen integration with `tvmc` (apache#8854)
  [microTVM] Add wrapper for creating project using a MLF (apache#9090)
  Fix typo (apache#9156)
  [Hotfix][Testing] Wait for RPCServer to be established (apache#9150)
  Update find cublas so it search default path if needed. (apache#9149)
  [TIR][LowerMatchBuffer] Fix lowering strides when source region has higher dimension than the buffer (apache#9145)
  Fix flaky NMS test by making sure scores are unique (apache#9140)
  [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc (apache#9038)
  [LLVM] Make changes needed for opaque pointers (apache#9138)
  Arm(R) Ethos(TM)-U NPU codegen integration (apache#8849)
  [CI] Split Integration tests out of first phase of pipeline (apache#9128)
  [Meta Schedule][M3b] Runner (apache#9111)
  Fix Google Mock differences between Ubuntu 18.04 and 16.04 (apache#9141)
  [TIR] add loop partition hint pragma (apache#9121)
  fix things (apache#9146)
  [Meta Schedule][M3a] SearchStrategy (apache#9132)
  [Frontend][PyTorch] support for quantized conv_transpose2d op (apache#9133)
  [UnitTest] Parametrized test_conv2d_int8_intrinsics (apache#9143)
  ...
* Add vectorization to cuda conv2d_nhwc schedule: adding vectorization significantly improved performance. About 6-7x boost.
* Apply comment
* Move schedule to topi/gpu dir
* Add vectorization to inner loop
* Update values of vectorization factor