Auto TensorCore CodeGen #4234
Conversation
Good job! But there is still something to be improved.
include/tvm/ir_pass.h
Outdated
 * buffer assignment of input and outputs.
 * \return Transformed stmt.
 */
Stmt TensorCore(Stmt stmt,
Should we change the pass name? I think TensorCore is too general and confusing.
Done.
src/pass/tensor_core.cc
Outdated
  return false;
}

// Match C = Cast(A*B)+C, where A & B are fp16/int8 local buffers,
TensorCores calculate C = Cast(A) * Cast(B) + C. We'd better match the same thing if possible.
Yeah, understood; you already mentioned this in the RFC. Sorry we forgot to fix this part.
TensorCores calculate C = Cast(A) * Cast(B) + C. We'd better match the same thing if possible.

We were focusing on combining with tensor intrinsics. Some comments and feedback from the RFC and the former pull request haven't been resolved yet. We will fix them soon.
Done.
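As a toy illustration of the pattern being discussed, here is a sketch of matching C = Cast(A) * Cast(B) + C on a hypothetical mini-AST. These node classes are invented for illustration; the real MMAMatcher in src/pass/tensor_core.cc works on TVM's IR, not on anything like this.

```python
# Hypothetical mini-AST; invented for illustration, not TVM's actual IR.
class Buffer:
    def __init__(self, name, dtype):
        self.name, self.dtype = name, dtype

class Cast:
    def __init__(self, value, dtype):
        self.value, self.dtype = value, dtype

class Mul:
    def __init__(self, a, b):
        self.a, self.b = a, b

class Add:
    def __init__(self, a, b):
        self.a, self.b = a, b

def matches_mma(store, expr):
    """Match store = Cast(A) * Cast(B) + store, with fp16/int8 operands."""
    if not (isinstance(expr, Add) and expr.b is store):
        return False
    mul = expr.a
    if not isinstance(mul, Mul):
        return False
    # Both operands must be casts of fp16/int8 buffers.
    ok = lambda c: (isinstance(c, Cast) and isinstance(c.value, Buffer)
                    and c.value.dtype in ("float16", "int8"))
    return ok(mul.a) and ok(mul.b)

a = Buffer("A", "float16")
b = Buffer("B", "float16")
c = Buffer("C", "float32")
expr = Add(Mul(Cast(a, "float32"), Cast(b, "float32")), c)
print(matches_mma(c, expr))  # -> True: C = Cast(A) * Cast(B) + C
```

The sketch matches the casts on each operand individually, which is the shape the reviewer asks for, rather than a single cast around the product.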
  }
}

class MMAMatcher: public IRVisitor {
Can you please add comments to these classes and methods?
Done.
tvm.testing.assert_allclose(c_np, c_tvm.asnumpy(), rtol=1e-3)

evaluator = func.time_evaluator(func.entry_name, ctx, number=100)
print('Time cost of this operator: %f' % evaluator(a_tvm, b_tvm, c_tvm).mean)
I have tested the performance on my Titan V GPU, and it seems we cannot reach satisfying performance. In some scenarios, usually large matmuls, we even see similar speed to the non-TensorCore schedule. Perhaps we should add more optimizations, such as using storage_align to reduce bank conflicts.
Done. There is indeed a performance boost, especially on large shapes, after applying storage_align. Thanks!
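To illustrate why padding the shared-memory row stride helps, here is a small self-contained model of shared-memory bank addressing (assuming 32 banks, 4-byte bank words, and 2-byte fp16 elements). The numbers are illustrative; the actual layout in the generated kernel depends on the schedule.

```python
# Toy model of shared-memory bank conflicts; assumes 32 banks of 4-byte
# words, with fp16 (2-byte) elements, i.e. 2 elements per bank word.
BANKS = 32
ELEMS_PER_WORD = 2

def bank(row, col, stride):
    # Linear element offset -> 4-byte word -> bank index.
    return ((row * stride + col) // ELEMS_PER_WORD) % BANKS

def max_conflict(stride, n_rows=32, col=0):
    """Worst-case number of accesses landing on the same bank when
    32 threads each read one element of a column across n_rows rows."""
    hits = {}
    for r in range(n_rows):
        b = bank(r, col, stride)
        hits[b] = hits.get(b, 0) + 1
    return max(hits.values())

print(max_conflict(stride=64))      # -> 32: every row hits bank 0
print(max_conflict(stride=64 + 8))  # -> 4: padded stride spreads accesses
```

With a stride that is a multiple of the bank count (in element terms), a column walk hits the same bank every row; padding the stride by a few elements, which is what storage_align does to the shared-memory buffer, spreads the accesses across banks.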
python/tvm/build_module.py
Outdated
@@ -387,6 +387,7 @@ def lower(sch,
binds, arg_list = get_binds(args, compact, binds)

# Phase 1
stmt = ir_pass.TensorCore(stmt, sch, binds)
We need to check that the current target is CUDA before calling this.
Thanks! We will add that.
Done.
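A minimal sketch of the kind of guard meant here. The helper name and the string-based target check are hypothetical; in the PR the check was ultimately moved into tensor_core.cc rather than living in build_module.py.

```python
# Hypothetical sketch: only apply the TensorCore rewrite when the build
# target is CUDA; all other targets pass the stmt through unchanged.
def maybe_rewrite_for_tensor_core(stmt, target, rewrite_pass):
    if target is not None and target.startswith("cuda"):
        return rewrite_pass(stmt)
    return stmt  # non-CUDA targets are untouched

# Stand-in for the real pass, just to show the control flow.
mark = lambda s: s + " [tensorcore]"
print(maybe_rewrite_for_tensor_core("stmt", "cuda", mark))  # rewritten
print(maybe_rewrite_for_tensor_core("stmt", "llvm", mark))  # unchanged
```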
Overall LGTM, just a few comments.
@autotvm.template
def test_gemm_nn(N, L, M, dtype, layout):
    if (layout == "NN"):
Could you document the layout a bit?
Sure.
Done. Added more comments to the final tutorial.
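For illustration, here is one plausible reading of the layout flag for a GEMM C[N, M] = A x B with reduction length L: the first letter gives A's memory layout, the second gives B's, with 'N' meaning not transposed and 'T' transposed. The helper below is an invented sketch, not part of the tutorial's API, and the exact convention in the tutorial may differ.

```python
# Hypothetical helper showing what the "NN"/"NT"/"TN"/"TT" layout flag
# could mean for operand shapes (assumed convention, not the tutorial API).
def gemm_shapes(N, L, M, layout):
    """Return (shape_A, shape_B) for C[N, M] = A x B, reduction length L."""
    shape_a = (N, L) if layout[0] == "N" else (L, N)  # 'T' stores A transposed
    shape_b = (L, M) if layout[1] == "N" else (M, L)  # 'T' stores B transposed
    return shape_a, shape_b

print(gemm_shapes(32, 16, 8, "NN"))  # -> ((32, 16), (16, 8))
print(gemm_shapes(32, 16, 8, "TT"))  # -> ((16, 32), (8, 16))
```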
tvm.testing.assert_allclose(c_np, c_tvm.asnumpy(), rtol=1e-3)

evaluator = func.time_evaluator(func.entry_name, ctx, number=100)
print('Time cost of this operator: %f' % evaluator(a_tvm, b_tvm, c_tvm).mean)
Are the tuned results the same ones reflected in the RFC? Is this template flexible enough to achieve good performance for other shapes after tuning?
Yes, the perf numbers in the RFC were tuned with this tutorial script. The template is not customized for specific shapes, so it should be flexible enough to apply to other shapes. But we do see poor performance on large shapes, as @Hzfengsy commented above. We found two reasons:
- Bank conflicts in shared memory, which can be reduced by storage_align as @Hzfengsy suggested.
- "vthread" in this template is fixed at 1 instead of being a tunable knob. We do so because the inject_virtual_thread pass does not support intrinsics.
The fix for the first issue will be pushed to this PR soon, but for the second one we are still trying to figure out a solution.
With storage_align, now the tuned results are better than the ones in the RFC.
}

// Do the pattern matching
bool mma_sync_match_(const Provide* op, BufferInfo store_buffer) {
Dumb question: does this function look for wmma and replace it with mma.sync?
This part is still in the analysis phase. We do the mma pattern match here and record some matrix info.
And yes, in the end we'll replace the whole AST block with an mma.sync intrinsic call.
Awesome. If it is not too much trouble, can you dump the PTX into a gist and paste the link here? I can review it and suggest a few changes if needed.
Awesome. If it is not too much trouble, can you dump the PTX into a gist and paste the link here? I can review it and suggest a few changes if needed.

Thanks for the kindness. BTW, what do you expect to get by looking at the PTX assembly? For CUDA, a single mma.sync C API call is lowered to several PTX instructions by nvcc, so maybe you are worried there may be some misuse?
That is one reason. Here is why I want the PTX dump:
- I want to see if the schedule is good or not. If not, I can suggest how it should be. Maybe we can change a few things to get it right.
- To see if there are any low-throughput PTX instructions that are causing slowdown.
- To see how shared memory is being used. mma.sync requires its input operands (a, b and c) to be laid out in a specific pattern, which can cause shared memory bank conflicts.
As for your question: what do you mean by C API? Is it the C wrapper around the intrinsic?
We generate CUDA code instead of PTX, so it's nvcc that decides which PTX instructions to use. Yes, we are continuously optimizing the schedule, and any better schedule is welcome, but that is beyond the scope of this pull request. The main goal here is not to deliver the best schedules, but to deliver the feature, along with a tutorial that shows how to use it and proves it can achieve good enough performance in some cases (at least reproducing the results in the RFC). If you are interested, we look forward to collaborating with you to deliver better and better schedules. Thanks!
Awesome. If it is not too much trouble, can you dump the PTX into a gist and paste the link here? I can review it and suggest a few changes if needed.

Hi Aditya, I have sent you the generated CUDA code as well as the PTX compiled with nvcc via a message at https://discuss.tvm.ai/. Could you please take a look? Thank you!
src/api/api_pass.cc
Outdated
@@ -94,6 +94,13 @@ TVM_REGISTER_API("ir_pass.StorageFlatten")
  }
});

TVM_REGISTER_API("ir_pass.RewriteForTensorCore")
.set_body([](TVMArgs args, TVMRetValue *ret) {
We can use set_body_typed here.
OK, fixed.
y, x = s[C].op.axis
k = s[C].op.reduce_axis[0]

# storage_align params
Please document how these params are chosen.
Done. Added in the document above.
tuner = autotvm.tuner.XGBTuner(task)
with tvm.build_config():
    tuner.tune(n_trial=1000,
We need to comment out these lines to skip running on CI; otherwise it takes a long time. See
https://github.com/apache/incubator-tvm/blob/master/tutorials/autotvm/tune_relay_cuda.py#L257-L260
Done. The running time dropped from 34s to 0.1s, thanks.
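As an illustrative sketch of the CI-skip pattern used in TVM tutorials: the expensive tuning call is guarded (or commented out) so CI only exercises the cheap path, and a pre-tuned log is applied instead. All names in this sketch are hypothetical stand-ins.

```python
# Hypothetical sketch of skipping expensive tuning on CI; the strings
# stand in for real calls like tuner.tune(...) and apply_history_best(...).
def run_tutorial(do_tuning=False):
    steps = []
    if do_tuning:
        # The expensive part: takes minutes or more on real hardware.
        steps.append("tuner.tune(n_trial=1000)")
    # Always apply a saved log so the tutorial still builds and runs fast.
    steps.append("apply_history_best('matmul.log')")
    steps.append("build_and_run")
    return steps

print(run_tutorial())      # CI path: tuning skipped
print(run_tutorial(True))  # local path: full tuning first
```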
Thank you! LGTM
Hi @Laurawly, could you please review the updates? Thank you!
Thanks @Laurawly @adityaatluri @minminsun @yangjunpro @vinx13 @Hzfengsy
* Add Auto TensorCore TensorCore Unit Test
* Rebase to tvm master branch & Add auto tensor core
* Code Refine
* Add tensor core switch by pragma
* Add pragma in tensor core example code
* Get real tile size to replace hard coded 16
* support more than 2 dimensions (e.g. batchmatmul) for buffer bind scope
* support batch matmul
* Move cuda env check to tensor_core.cc
* Coderefine for tensor_core.cc
* Refine comments
* Some refinements of code and comment
* Update TensorCore UT to pass the CPU test
* remove redundant code
* matmul's storage align for different layout
* Add support for different position of type cast
* Add formal tutorial for auto tensorcore codegen
* move tensorcore check up to tutorial code
* code and doc refine
* comment out tune_and_evaluate in tutorial
* fix cpplint error
Hi @minminsun, I saw the slides from the TVM meetup in Shanghai and you guys showed tensor core performance on the Turing architecture for
Yes, we plan to open a PR to merge the code after it gets cleaned up.
This pull request is for RFC #4105
We have re-implemented our solution. The new implementation is built on top of tensor intrinsics from #4052 and #4136.
Any feedback and comments are welcome.