[RFC] Auto TensorCore CodeGen #4105
Comments
Awesome solution! Just curious: for shapes that are worse than cuDNN/cuBLAS, what kind of tuning is used?
We haven't spent much effort on performance tuning yet. For cases with bad performance we plan to do profiling first to figure out the causes. One possible way of optimization is to manually modify the generated code. If the manual optimization really works and is general enough, we can try to implement it in the schedule.
Good point! We have had some internal discussions about whether we need to automatically search the schedule space over both TensorCore and non-TensorCore kernels, since the TensorCore implementation may not beat the non-TensorCore version for every shape. This is one of the planned features, and further comments and inputs are welcome. One possible solution is to expose TensorCore as another schedule configuration knob and let the auto-tuner decide whether to turn it on. Another potential solution is to decide in the IR pass, with heuristics, whether a certain shape is likely to perform better with TensorCore. There are pros and cons to both solutions. With the former, the tuning space becomes somewhat larger. With the latter, the tuning space stays almost the same because the decision is made internally in the IR pass, but we become dependent on the accuracy of the heuristics; although the hardware nature of TensorCore makes it fairly clear whether a shape is TensorCore-friendly, there is still a possibility that we choose a low-performance kernel.
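For instance, a rough sketch of how the knob-based option could look inside an AutoTVM schedule template (the knob name "use_tensor_core" is illustrative, not an existing AutoTVM option):

```python
from tvm import autotvm

# Inside an AutoTVM schedule template (sketch only; "use_tensor_core" is an
# illustrative knob name, not an existing AutoTVM option).
cfg = autotvm.get_config()
cfg.define_knob("use_tensor_core", [0, 1])

if cfg["use_tensor_core"].val:
    # apply the TensorCore-friendly schedule (warp-level tiling, wmma, ...)
    pass
else:
    # fall back to the regular CUDA schedule
    pass
```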
Thanks for the RFC, also cross-linking to #4052.
Non-standard buffer allocation
We are moving toward using special memory scopes to annotate special memory (e.g. mma). The use of new_expr is something we plan to deprecate. Here is an alternative solution: introduce a new scope for the special memory needed for lowering, so that a special rule can be used to generate the corresponding memory. Of course there could be additional hints needed to lower the allocation code; you can likely embed that additional information in a special AttrStmt outside the allocation scope.
Place of Pattern Matching
Right now, from reading the RFC, it seems the early pattern matching is done before flattening and depends on the compute structure. I wonder if we could de-couple this: with some annotations, run some of the rewriting after storage flatten. Of course the low-level code does not enjoy the benefit of multi-dimensional indices, but the access pattern can still be detected by DetectLinearEquation. One possible limitation I see in the current approach is whether we can support operations like conv2d, as we would need to explicitly express the compute in this form (which is fine for now).
Complement and Combine with Tensor Intrinsics based TensorCore support
It would be great to hear more thoughts from @Hzfengsy @minminsun about how we can combine the tensor-intrinsics-based approach with the more automatic pattern-detector one. Our philosophy has always been to enable manual scheduling options that give us a way to specify the search space, then build automation on top. This lets us take a spectrum of approaches: use the more manual one if necessary, and build more diverse automated solutions. Our eventual goal would still be to unify all tensorization support under tensor intrinsics and build automation on top. One idea would be to still declare the lowering rules via tensor intrinsics, but reuse the pattern-matching techniques in this RFC to rewrite to hints that apply the tensor intrinsics. This way we can organically combine the two ideas.
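To illustrate the DetectLinearEquation point above, a minimal sketch (using the Python binding tvm.arith.detect_linear_equation; illustrative, not code from the RFC) of recovering strides from a flattened access index:

```python
import tvm
from tvm import te, arith

i = te.var("i")
j = te.var("j")

# After storage flatten, a 2-D access like A[i][j] on a 1024-wide buffer
# becomes a 1-D index such as i * 1024 + j. DetectLinearEquation recovers
# the per-variable coefficients (the strides) plus a constant base term.
coeffs = arith.detect_linear_equation(i * 1024 + j, [i, j])
print(coeffs)  # expected: stride 1024 for i, stride 1 for j, and base 0
```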
This is really impressive work, congrats!
Our fp16 TensorCore kernels are tuned on V100 with CUDA toolkit 9.0 and driver 396.44. The int8 TensorCore kernels are tuned on T4 with CUDA toolkit 10.1 and driver 418.39. On different GPUs, the performance of tuned kernels can be different.
Thanks @tqchen and @Hzfengsy for your valuable feedback. We are trying out some of your suggestions and will have further discussions with you after we have made some evaluations and trials.
I doubt that "using TensorCores will decrease precision" if the inputs are already in fp16 or int8. We did try to add an "enable_tensor_core" option in tvm.build_config, but it seems that build_config can't be passed through AutoTVM building. Any suggestion on where to add this option is welcome. But I think eventually we will not need this option, once the implementation is proven to be robust enough. For example, in TensorFlow, MatMul/Conv on fp16 data uses the TensorCore kernels of cuBLAS/cuDNN by default.
Thanks for correcting my understanding. So it seems the TensorCore operation is more like c = float(a)*float(b) + c rather than c = float(a*b) + c.
We had a meeting with @Hzfengsy today and discussed the differences and similarities of our solutions. They differ in the front-end: our solution tries to be as transparent as possible to make it easy to use, while #4095 provides more controllability to the user (schedule developer). They actually target different users, so we think both solutions can co-exist. But we both agreed that the intrinsics in the back-end should be combined. As to fragment allocation, we are OK with changing from new_expr to introducing new scopes, but currently the new scope introduced in #4052 is not enough for the codegen of fragment allocation if it is extended to support different warp tile sizes and data layouts (col_major/row_major). One possible but not so elegant solution we proposed is to extend the scopes to also include tile size and data layout. @Hzfengsy is also trying to figure out a solution here. We will have more discussions on this.
I have a proposal to minimize the invasiveness in TVM and also fundamentally support TensorCore in TVM. It sits between the methodologies of #4052 and this RFC.
Sorry for the late reply. We were occupied with refactoring our implementation to combine with #4052. Thanks a lot for your proposal. Generating PTX or even SASS assembly is really an interesting topic and we may investigate and discuss it later. As to the TensorCore CodeGen, I think the data structure may not be the only pain point. The root cause is the programming model of TensorCore, in which the threads inside a warp are no longer individual threads, and high-level information such as matrix_a/b, row/col_major and the strides of a buffer is required in low-level operations. So I guess generating PTX directly may not relieve these pains. @Hzfengsy, what do you think?
We propose a solution for TensorCore CodeGen with significant transparency, flexibility and usability. In this solution, the algorithm description and schedule for TensorCore CodeGen are no different from those of normal CUDA CodeGen. All the information needed by the wmma API, such as matrix_a/matrix_b/accumulator, row_major/col_major, warp tile size and so on, is automatically derived from the AST. Of course, not every algorithm and schedule is suitable for TensorCore computation. This solution checks for that and falls back to normal CUDA CodeGen for those that are not qualified for TensorCore CodeGen.
In this solution, 3 IRVisitors and 1 IRMutator are added.
BodyVisitor, which is called by ScheduleAnalyser, visits the body stmt of the original ComputeOp to get the access indices of the input matrices if the computation is recognized as a matrix multiply. ScheduleAnalyser compares the access indices with the axis/reduce_axis of the ComputeOp to figure out whether an input matrix is matrix_a or matrix_b, row_major or col_major.
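For illustration, a minimal matmul compute (a sketch, not the exact code from this RFC) showing the access indices that ScheduleAnalyser compares against the axes:

```python
import tvm
from tvm import te

n, m, l = 1024, 1024, 1024
A = te.placeholder((n, l), name="A", dtype="float16")
B = te.placeholder((l, m), name="B", dtype="float16")
k = te.reduce_axis((0, l), name="k")

# C's output axes are (i, j) and the reduce axis is k. A shares C's row
# axis i, so it plays the role of matrix_a; accessed as A[i, k] it is
# row_major. B shares C's column axis j, so it is matrix_b; accessed as
# B[k, j] it is row_major. Swapping an operand's index order (e.g. A[k, i])
# would make that operand col_major.
C = te.compute(
    (n, m),
    lambda i, j: te.sum(
        A[i, k].astype("float32") * B[k, j].astype("float32"), axis=k
    ),
    name="C",
)
```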
MMAMatcher does pattern matching on the AST stmt. The pattern it tries to find is the following:

If matched, a, b and c are recorded as fragment registers, which are important inputs to the next visitor.
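For reference, the same pattern can be written out with tvm.tir nodes (a standalone sketch; in the real matcher a, b and c are buffer accesses rather than free variables):

```python
from tvm import tir

# Stand-ins for the fp16 operands and the fp32 accumulator; in the real
# matcher these are buffer accesses, not free variables.
a = tir.Var("a", "float16")
b = tir.Var("b", "float16")
c = tir.Var("c", "float32")

# The tree shape MMAMatcher keys on: Add(c, Mul(Cast(a), Cast(b))).
mma_like = c + tir.Cast("float32", a) * tir.Cast("float32", b)
print(mma_like)
```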
BufferAnalyser, the last visitor, gathers the remaining information needed by TensorCoreIRMutator: the strides of the source/destination buffers for the wmma load/store matrix operations, the warp tile size for fragment allocation (also used to check whether the schedule is qualified for TensorCore), the loops that need to be rescaled once normal load/store and compute operations are replaced by TensorCore operations, and so on.
TensorCoreIRMutator mutates the AST stmt for TensorCore CodeGen. The subtree matched by MMAMatcher is replaced with an "mma_sync" extern call. Loads/stores of fragments are replaced with "load/store_matrix_sync" extern calls, with the thread index unified within a warp. Thread index unification, i.e. changing the index of every thread to that of the first thread of its warp, is done by ThreadIdxMutator on the subtree.
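As a plain-Python sketch of the index arithmetic (assuming 32-thread warps and that threadIdx.x enumerates the threads so that each consecutive group of 32 forms a warp):

```python
WARP_SIZE = 32  # assumption: one warp = 32 threads

def unify_within_warp(tx):
    # Rewrite a thread index so that every lane uses the index of the first
    # thread of its warp; the wmma load/store/mma_sync calls then compute
    # identical addresses and are effectively issued once per warp.
    return tx // WARP_SIZE * WARP_SIZE

# Lanes 32..63 (the second warp) all map to thread 32.
assert all(unify_within_warp(t) == 32 for t in range(32, 64))
```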
The TensorCore IR passes are applied before StorageFlatten because they need the stride/shape and the index of specific dimensions before these are flattened into one. Before StorageFlatten, an allocation is represented by the Realize IR node, which has no new_expr member as the Allocate IR node has. So we added one to the Realize IR node to carry the expression for fragment allocation and pass it on to the Allocate IR node. We noticed the comment about deprecating new_expr when merging with the latest TVM codebase. We would like to ask for a reconsideration of this decision, because it is really useful for some non-standard buffer allocations.
This solution is evaluated on a sample schedule of Matmul, which is based on AutoTVM. It supports fp16 and int8 data types, and three kinds of data layouts: NN, NT, TN.
On some model layers, we have already achieved better performance than CUBLAS/CUDNN:
[performance chart] FP16 on V100, CUDA 9.0, Driver 396.44
[performance chart] Int8 on T4, CUDA 10.1, Driver 418.39
There are also many shapes on which CUBLAS/CUDNN is much better. The performance tuning is still ongoing.
Thanks!
-- Minmin Sun, Lanbo Li, Chenfan Jia and Jun Yang of Alibaba PAI team