Auto TensorCore CodeGen #4234
Conversation
Good job! But there is still something to be improved.
include/tvm/ir_pass.h
Outdated
 * buffer assignment of input and outputs.
 * \return Transformed stmt.
 */
Stmt TensorCore(Stmt stmt,
Should we change the pass name? I think TensorCore is too general and confusing.
Done.
src/pass/tensor_core.cc
Outdated
  return false;
}

// Match C = Cast(A*B)+C, where A & B are fp16/int8 local buffers,
TensorCores calculate C = Cast(A) * Cast(B) + C. We'd better match the same thing if possible.
Yeah, understood; you already mentioned this in the RFC. Sorry we forgot to fix this part.
TensorCores calculate C = Cast(A) * Cast(B) + C. We'd better match the same thing if possible.

We were focusing on combining with tensor intrinsics. Some comments and feedback from the RFC and the former pull request haven't been resolved yet. We will fix them soon.
Done.
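As a toy illustration of the pattern being discussed, here is a sketch of matching C = Cast(A) * Cast(B) + C on a hypothetical mini-AST. These node classes are invented for illustration; the real MMAMatcher in src/pass/tensor_core.cc works on TVM's IR, not on anything like this.

```python
# Hypothetical mini-AST; invented for illustration, not TVM's actual IR.
class Buffer:
    def __init__(self, name, dtype):
        self.name, self.dtype = name, dtype

class Cast:
    def __init__(self, value, dtype):
        self.value, self.dtype = value, dtype

class Mul:
    def __init__(self, a, b):
        self.a, self.b = a, b

class Add:
    def __init__(self, a, b):
        self.a, self.b = a, b

def matches_mma(store, expr):
    """Match store = Cast(A) * Cast(B) + store, with fp16/int8 operands."""
    if not (isinstance(expr, Add) and expr.b is store):
        return False
    mul = expr.a
    if not isinstance(mul, Mul):
        return False
    # Both operands must be casts of fp16/int8 buffers.
    ok = lambda c: (isinstance(c, Cast) and isinstance(c.value, Buffer)
                    and c.value.dtype in ("float16", "int8"))
    return ok(mul.a) and ok(mul.b)

a = Buffer("A", "float16")
b = Buffer("B", "float16")
c = Buffer("C", "float32")
expr = Add(Mul(Cast(a, "float32"), Cast(b, "float32")), c)
print(matches_mma(c, expr))  # -> True: C = Cast(A) * Cast(B) + C
```

The sketch matches the casts on each operand individually, which is the shape the reviewer asks for, rather than a single cast around the product.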
  }
}

class MMAMatcher: public IRVisitor {
Can you please add comments to these classes and methods?
Done.
tvm.testing.assert_allclose(c_np, c_tvm.asnumpy(), rtol=1e-3)

evaluator = func.time_evaluator(func.entry_name, ctx, number=100)
print('Time cost of this operator: %f' % evaluator(a_tvm, b_tvm, c_tvm).mean)
I have tested the performance on my Titan V GPU, and it seems we cannot reach satisfying performance. In some scenarios, usually large matmuls, we even see similar speed to the non-TensorCore schedule. Perhaps we should add more optimizations, such as using storage_align to reduce bank conflicts.
Done. There is indeed a performance boost, especially on large shapes, after applying storage_align. Thanks!
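To illustrate why padding the shared-memory row stride helps, here is a small self-contained model of shared-memory bank addressing (assuming 32 banks, 4-byte bank words, and 2-byte fp16 elements). The numbers are illustrative; the actual layout in the generated kernel depends on the schedule.

```python
# Toy model of shared-memory bank conflicts; assumes 32 banks of 4-byte
# words, with fp16 (2-byte) elements, i.e. 2 elements per bank word.
BANKS = 32
ELEMS_PER_WORD = 2

def bank(row, col, stride):
    # Linear element offset -> 4-byte word -> bank index.
    return ((row * stride + col) // ELEMS_PER_WORD) % BANKS

def max_conflict(stride, n_rows=32, col=0):
    """Worst-case number of accesses landing on the same bank when
    32 threads each read one element of a column across n_rows rows."""
    hits = {}
    for r in range(n_rows):
        b = bank(r, col, stride)
        hits[b] = hits.get(b, 0) + 1
    return max(hits.values())

print(max_conflict(stride=64))      # -> 32: every row hits bank 0
print(max_conflict(stride=64 + 8))  # -> 4: padded stride spreads accesses
```

With a stride that is a multiple of the bank count (in element terms), a column walk hits the same bank every row; padding the stride by a few elements, which is what storage_align does to the shared-memory buffer, spreads the accesses across banks.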
python/tvm/build_module.py
Outdated
@@ -387,6 +387,7 @@ def lower(sch,
binds, arg_list = get_binds(args, compact, binds)

# Phase 1
stmt = ir_pass.TensorCore(stmt, sch, binds)
We need to check that the current target is CUDA before calling this.
Thanks! We will add that.
Done.
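A minimal sketch of the kind of guard meant here. The helper name and the string-based target check are hypothetical; in the PR the check was ultimately moved into tensor_core.cc rather than living in build_module.py.

```python
# Hypothetical sketch: only apply the TensorCore rewrite when the build
# target is CUDA; all other targets pass the stmt through unchanged.
def maybe_rewrite_for_tensor_core(stmt, target, rewrite_pass):
    if target is not None and target.startswith("cuda"):
        return rewrite_pass(stmt)
    return stmt  # non-CUDA targets are untouched

# Stand-in for the real pass, just to show the control flow.
mark = lambda s: s + " [tensorcore]"
print(maybe_rewrite_for_tensor_core("stmt", "cuda", mark))  # rewritten
print(maybe_rewrite_for_tensor_core("stmt", "llvm", mark))  # unchanged
```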
Overall LGTM, just a few comments.
@autotvm.template
def test_gemm_nn(N, L, M, dtype, layout):
    if (layout == "NN"):
Could you document the layout a bit?
Sure.
Done. Added more comments to the final tutorial.
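For illustration, here is one plausible reading of the layout flag for a GEMM C[N, M] = A x B with reduction length L: the first letter gives A's memory layout, the second gives B's, with 'N' meaning not transposed and 'T' transposed. The helper below is an invented sketch, not part of the tutorial's API, and the exact convention in the tutorial may differ.

```python
# Hypothetical helper showing what the "NN"/"NT"/"TN"/"TT" layout flag
# could mean for operand shapes (assumed convention, not the tutorial API).
def gemm_shapes(N, L, M, layout):
    """Return (shape_A, shape_B) for C[N, M] = A x B, reduction length L."""
    shape_a = (N, L) if layout[0] == "N" else (L, N)  # 'T' stores A transposed
    shape_b = (L, M) if layout[1] == "N" else (M, L)  # 'T' stores B transposed
    return shape_a, shape_b

print(gemm_shapes(32, 16, 8, "NN"))  # -> ((32, 16), (16, 8))
print(gemm_shapes(32, 16, 8, "TT"))  # -> ((16, 32), (8, 16))
```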
tvm.testing.assert_allclose(c_np, c_tvm.asnumpy(), rtol=1e-3)

evaluator = func.time_evaluator(func.entry_name, ctx, number=100)
print('Time cost of this operator: %f' % evaluator(a_tvm, b_tvm, c_tvm).mean)
Are the tuned results the same ones reflected in the RFC? Is this template flexible enough to achieve good performance for other shapes after tuning?
Yes, the perf numbers in the RFC were tuned with this tutorial script. The template is not customized for specific shapes, so it should be flexible enough to apply to other shapes. But we do see poor performance on large shapes, as @Hzfengsy commented above. We found two reasons:
- Bank conflicts in shared memory, which can be reduced by storage_align as @Hzfengsy suggested.
- "vthread" in this template is fixed at 1 instead of being a tunable knob. We do so because the inject_virtual_thread pass does not support intrinsics.
The fix for the first issue will be pushed to this PR soon, but for the second one we are still trying to figure out a solution.
With storage_align, now the tuned results are better than the ones in the RFC.
}

// Do the pattern matching
bool mma_sync_match_(const Provide* op, BufferInfo store_buffer) {
Dumb question: does this function look for wmma and replace it with mma.sync?
This part is still in the analysis phase. We do the mma pattern match here and record some matrix info.
And yes, in the end we'll replace the whole AST block with an mma.sync intrinsic call.
Awesome. If it is not too much trouble, can you dump the PTX into a gist and paste the link here? I can review it and suggest a few changes if needed.
Awesome. If it is not too much trouble, can you dump the PTX into a gist and paste the link here? I can review it and suggest a few changes if needed.

Thanks for the kindness. BTW, what do you expect to get by looking at the PTX assembly? For CUDA, a single mma.sync C API call is lowered to several PTX instructions by nvcc, so maybe you are worried there may be some misuse?
That is one reason. Here is why I want the PTX dump:
- I want to see if the schedule is good or not. If not, I can suggest how it should be. Maybe we can change a few things to get it right.
- To see if there are any low-throughput PTX instructions that are causing slowdown.
- To see how shared memory is being used. mma.sync requires its input operands (a, b and c) to be laid out in a specific pattern, which can cause shared memory bank conflicts.
As for your question: what do you mean by C API? Is it the C wrapper around the intrinsic?
We generate CUDA code instead of PTX, so it's nvcc that decides which PTX instructions to use. Yes, we are continuously optimizing the schedule, and any better schedule is welcome, but that is beyond the scope of this pull request. The main goal here is not to deliver the best schedules, but to deliver the feature, along with a tutorial that shows how to use it and proves it can achieve good enough performance in some cases (at least reproducing the results in the RFC). If you are interested, we look forward to collaborating with you to deliver better and better schedules. Thanks!
Awesome. If it is not too much trouble, can you dump the PTX into a gist and paste the link here? I can review it and suggest a few changes if needed.

Hi Aditya, I have sent you the generated CUDA code as well as the PTX compiled with nvcc via a message at https://discuss.tvm.ai/. Could you please take a look? Thank you!
src/api/api_pass.cc
Outdated
@@ -94,6 +94,13 @@ TVM_REGISTER_API("ir_pass.StorageFlatten")
  }
});

TVM_REGISTER_API("ir_pass.RewriteForTensorCore")
.set_body([](TVMArgs args, TVMRetValue *ret) {
We can use set_body_typed here.
OK, fixed.
y, x = s[C].op.axis
k = s[C].op.reduce_axis[0]

# storage_align params
Please document how these params are chosen.
Done. Added in the document above.
tuner = autotvm.tuner.XGBTuner(task)
with tvm.build_config():
    tuner.tune(n_trial=1000,
We need to comment out these lines to skip running on CI; otherwise it takes a long time. See
https://github.com/apache/incubator-tvm/blob/master/tutorials/autotvm/tune_relay_cuda.py#L257-L260
Done. The running time dropped from 34s to 0.1s, thanks.
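As an illustrative sketch of the CI-skip pattern used in TVM tutorials: the expensive tuning call is guarded (or commented out) so CI only exercises the cheap path, and a pre-tuned log is applied instead. All names in this sketch are hypothetical stand-ins.

```python
# Hypothetical sketch of skipping expensive tuning on CI; the strings
# stand in for real calls like tuner.tune(...) and apply_history_best(...).
def run_tutorial(do_tuning=False):
    steps = []
    if do_tuning:
        # The expensive part: takes minutes or more on real hardware.
        steps.append("tuner.tune(n_trial=1000)")
    # Always apply a saved log so the tutorial still builds and runs fast.
    steps.append("apply_history_best('matmul.log')")
    steps.append("build_and_run")
    return steps

print(run_tutorial())      # CI path: tuning skipped
print(run_tutorial(True))  # local path: full tuning first
```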
Thank you! LGTM
Hi @Laurawly, could you please review the updates? Thank you!
Thanks @Laurawly @adityaatluri @minminsun @yangjunpro @vinx13 @Hzfengsy
* Add Auto TensorCore TensorCore Unit Test
* Rebase to tvm master branch & Add auto tensor core
* Code Refine
* Add tensor core switch by pragma
* Add pragma in tensor core example code
* Get real tile size to replace hard coded 16
* support more than 2 dimensions (e.g. batchmatmul) for buffer bind scope
* support batch matmul
* Move cuda env check to tensor_core.cc
* Coderefine for tensor_core.cc
* Refine comments
* Some refinements of code and comment
* Update TensorCore UT to pass the CPU test
* remove redundant code
* matmul's storage align for different layout
* Add support for different position of type cast
* Add formal tutorial for auto tensorcore codegen
* move tensorcore check up to tutorial code
* code and doc refine
* comment out tune_and_evaluate in tutorial
* fix cpplint error
Hi @minminsun, I saw the slides from the TVM meetup in Shanghai and you guys showed tensor core performance on the Turing architecture for
Yes, we plan to open a PR to merge the code after it gets cleaned up.
This pull request is for RFC #4105
We have re-implemented our solution. The new implementation is built on top of tensor intrinsics from #4052 and #4136.
Any feedback and comments are welcome.