
Auto TensorCore CodeGen #4234

Merged (21 commits, Nov 9, 2019)
Conversation

minminsun (Contributor):

This pull request is for RFC #4105. We have re-implemented our solution; the new implementation is built on top of the tensor intrinsics from #4052 and #4136. Any feedback and comments are welcome.

minminsun (Contributor, Author):

@Laurawly @Hzfengsy Could you please help review? Thanks!

This was referenced Oct 31, 2019
Hzfengsy (Member) left a comment:

Good job! But there are still a few things to be improved.

* buffer assignment of input and outputs.
* \return Transformed stmt.
*/
Stmt TensorCore(Stmt stmt,
Member:

Should we change the pass name? I think TensorCore is too general and confusing.

Contributor (Author):

Done.

return false;
}

// Match C = Cast(A*B)+C, where A & B are fp16/int8 local buffers,
Member:

TensorCores calculate C = Cast(A) * Cast(B) + C. We'd better match the same thing if possible.
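
For context, a minimal sketch of the two patterns in TVM's compute DSL (0.6-era API; the shapes and names are illustrative, not the PR's code):

import tvm

n = 16
A = tvm.placeholder((n, n), dtype='float16', name='A')
B = tvm.placeholder((n, n), dtype='float16', name='B')
k = tvm.reduce_axis((0, n), name='k')

# Pattern the matcher originally looked for: C = Cast(A * B) + C
# (the cast is applied to the fp16 product).
C1 = tvm.compute((n, n), lambda i, j: tvm.sum(
    (A[i, k] * B[k, j]).astype('float32'), axis=k), name='C1')

# Pattern TensorCores actually compute: C = Cast(A) * Cast(B) + C
# (each operand is cast before the multiply-accumulate).
C2 = tvm.compute((n, n), lambda i, j: tvm.sum(
    A[i, k].astype('float32') * B[k, j].astype('float32'), axis=k), name='C2')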

Contributor:

Yes, we're aware of this, as you already mentioned in the RFC. Sorry for forgetting to fix this part.

Contributor (Author):

> TensorCores calculate C = Cast(A) * Cast(B) + C. We'd better match the same thing if possible.

We were focusing on combining with the tensor intrinsics. Some comments and feedback from the RFC and the former pull request haven't been resolved yet. We will fix this soon.

Contributor (Author):

Done.

}
}

class MMAMatcher: public IRVisitor {
Member:

Can you please add comments to these classes and methods?

Contributor (Author):

Done.

tvm.testing.assert_allclose(c_np, c_tvm.asnumpy(), rtol=1e-3)

evaluator = func.time_evaluator(func.entry_name, ctx, number=100)
print('Time cost of this operator: %f' % evaluator(a_tvm, b_tvm, c_tvm).mean)
Member:

I have tested the performance on my Titan V GPU, and it seems we cannot reach satisfying performance. In some scenarios, usually large matmuls, we even see speed similar to the non-TensorCore schedule. Perhaps we should add more optimizations, such as using storage_align to reduce bank conflicts.
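
A minimal sketch of the suggested storage_align usage (0.6-era schedule API; the shapes, stage name, and pad offset are illustrative, not the PR's exact values):

import tvm

n, l, m = 64, 64, 64
A = tvm.placeholder((n, l), dtype='float16', name='A')
B = tvm.placeholder((l, m), dtype='float16', name='B')
k = tvm.reduce_axis((0, l), name='k')
C = tvm.compute((n, m), lambda i, j: tvm.sum(
    A[i, k].astype('float32') * B[k, j].astype('float32'), axis=k), name='C')

s = tvm.create_schedule(C.op)
AS = s.cache_read(A, 'shared', [C])

# storage_align pads the stride of an axis so that
# stride == factor * k + offset for some integer k. Padding the row
# pitch of the shared stage spreads rows across banks, reducing conflicts.
offset = 8                 # extra fp16 elements per row (assumption)
AS_align = l + offset      # padded row pitch
s[AS].storage_align(AS.op.axis[0], AS_align - 1, AS_align)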

Contributor (Author):

Done. There is indeed a performance boost, especially on large shapes, after applying storage_align. Thanks!

@@ -387,6 +387,7 @@ def lower(sch,
binds, arg_list = get_binds(args, compact, binds)

# Phase 1
stmt = ir_pass.TensorCore(stmt, sch, binds)
Member:

We need to check that the current target is CUDA before calling this.
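
A sketch of the kind of guard being requested, assuming the 0.6-era helpers tvm.gpu() and tvm.contrib.nvcc.have_tensorcore (a later commit moved the actual check into tensor_core.cc):

import tvm
from tvm import ir_pass
from tvm.contrib import nvcc

def rewrite_if_tensorcore(stmt, sch, binds):
    # Only run the rewrite when a CUDA device with TensorCores
    # (compute capability >= 7.0) is available.
    ctx = tvm.gpu(0)
    if ctx.exist and nvcc.have_tensorcore(ctx.compute_version):
        stmt = ir_pass.RewriteForTensorCore(stmt, sch, binds)
    return stmt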

jcf94 (Contributor) commented Nov 1, 2019:

Thanks! We will add that.

Contributor (Author):

Done.

Laurawly (Contributor) left a comment:

Overall LGTM, just a few comments.


@autotvm.template
def test_gemm_nn(N, L, M, dtype, layout):
if (layout == "NN"):
Contributor:

Could you document the layout a bit?
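
For context, a hypothetical documentation sketch for the layout argument, assuming the usual BLAS convention (first letter describes A, second describes B; N = normal, T = transposed); the tutorial's actual convention may differ:

# layout selects the storage layout of the two operands:
#   "NN": A is (N, L), B is (L, M)
#   "NT": A is (N, L), B is (M, L)   # B transposed
#   "TN": A is (L, N), B is (L, M)   # A transposed
#   "TT": A is (L, N), B is (M, L)   # both transposed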

Contributor (Author):

Sure.

Contributor (Author):

Done. Added more comments in the final formal tutorial.

tvm.testing.assert_allclose(c_np, c_tvm.asnumpy(), rtol=1e-3)

evaluator = func.time_evaluator(func.entry_name, ctx, number=100)
print('Time cost of this operator: %f' % evaluator(a_tvm, b_tvm, c_tvm).mean)
Contributor:

Are the tuned results the same ones reflected in the RFC? Is this template flexible enough to achieve good performance for other shapes after tuning?

minminsun (Contributor, Author) commented Nov 4, 2019:

Yes, the performance numbers in the RFC were tuned with this tutorial script. The template is not customized for specific shapes, so it should be flexible enough to apply to other shapes. But we do see poor performance on large shapes, as @Hzfengsy commented above. We found two reasons:

  1. Bank conflicts in shared memory, which can be reduced by storage_align as @Hzfengsy suggested.
  2. "vthread" in this template is fixed at 1 instead of being a tunable knob, because the inject_virtual_thread pass does not support intrinsics (see the sketch below).

The improvement for the first issue will be pushed to this PR soon, but for the second one we are still trying to figure out a solution.
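
A sketch of that trade-off in autotvm terms (cfg is an autotvm ConfigSpace; the candidate lists are illustrative):

# vthread is pinned rather than exposed for tuning, because the
# inject_virtual_thread pass cannot yet handle the wmma intrinsics.
cfg.define_knob('vthread', [1])   # fixed; would be e.g. [1, 2, 4] if tunable
vthread = cfg['vthread'].val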

Contributor (Author):

With storage_align, the tuned results are now better than the ones in the RFC.

}

// Do the pattern matching
bool mma_sync_match_(const Provide* op, BufferInfo store_buffer) {
Contributor:

Dumb question: does this function look for wmma and replace it with mma.sync?

Contributor:

This part is still in the analysis phase. We do the mma pattern matching here and record some matrix info. And yes, in the end we replace the whole AST block with an mma.sync intrinsic call.
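
A sketch of what that replacement call looks like with the tensor intrinsics from #4052 (BA/BB/BC are wmma-scoped buffers declared elsewhere; 256 is the 16x16 fragment size):

import tvm

def mma_sync_call(BA, BB, BC):
    # Emit tvm_mma_sync over 16x16x16 fragments; elem_offset // 256
    # converts an element offset into a fragment index.
    return tvm.make.Evaluate(tvm.call_intrin(
        'handle', 'tvm_mma_sync',
        BC.data, BC.elem_offset // 256,
        BA.data, BA.elem_offset // 256,
        BB.data, BB.elem_offset // 256,
        BC.data, BC.elem_offset // 256))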

Contributor:

Awesome. If it is not too much trouble, can you dump the PTX into a gist and paste the link here? I can review it and suggest a few changes if needed.

Contributor:

> Awesome. If it is not too much trouble, can you dump the PTX into a gist and paste the link here? I can review it and suggest a few changes if needed.

Thanks for the offer. BTW, what do you expect to get by looking at the PTX assembly? For CUDA, a single mma.sync C API call is lowered to several PTX instructions by nvcc, so maybe you are worried there may be some misuse?

Contributor:

That is one reason. Here is why I want the PTX dump:

  1. I want to see whether the schedule is good or not. If not, I can suggest how it should be; maybe we can change a few things to get it right.
  2. To see if there are any low-throughput PTX instructions that are causing slowdowns.
  3. To see how shared memory is being used. mma.sync requires the input operands (a, b and c) to be laid out in a specific pattern, which can cause shared memory bank conflicts.

For your question: what do you mean by C API? Is it the C wrapper around the intrinsic?

Contributor (Author):

We generate CUDA code instead of PTX, so it is nvcc that decides which PTX instructions to use. Yes, we are continuously optimizing the schedule, and any better schedule is welcome, but that is beyond the scope of this pull request. The main goal of this pull request is not to deliver the best schedules, but a feature, together with a tutorial that shows how to use it and proves it can achieve good enough performance in some cases (at least reproducing the results in the RFC). If you are interested, we look forward to cooperating with you to deliver better and better schedules. Thanks!

Contributor (Author):

> Awesome. If it is not too much trouble, can you dump the PTX into a gist and paste the link here? I can review it and suggest a few changes if needed.

Hi Aditya, I have sent you the generated CUDA code as well as the PTX compiled with nvcc via a message at https://discuss.tvm.ai/. Could you please take a look? Thank you!

minminsun (Contributor, Author):

Thanks @Laurawly @Hzfengsy @vinx13 for your review comments and suggestions. We have improved the code accordingly and added a formal tutorial. Do you have any more feedback?

@@ -94,6 +94,13 @@ TVM_REGISTER_API("ir_pass.StorageFlatten")
}
});

TVM_REGISTER_API("ir_pass.RewriteForTensorCore")
.set_body([](TVMArgs args, TVMRetValue *ret) {
Member:

We can use set_body_typed.

Contributor:

OK, fixed.

y, x = s[C].op.axis
k = s[C].op.reduce_axis[0]

# storage_align params
Member:

Please document how these params are chosen.

Contributor (Author):

Done. Added in the document above.


tuner = autotvm.tuner.XGBTuner(task)
with tvm.build_config():
tuner.tune(n_trial=1000,
Member:

We need to comment out these lines to skip running them on CI; otherwise they take a long time. See
https://github.com/apache/incubator-tvm/blob/master/tutorials/autotvm/tune_relay_cuda.py#L257-L260
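
The pattern used in the linked tutorial, roughly (the measure_option and log-file names are illustrative):

# We do not run the tuning on the CI server since it takes too long.
# Uncomment the following lines to run it yourself:
#
# tuner = autotvm.tuner.XGBTuner(task)
# tuner.tune(n_trial=1000,
#            measure_option=measure_option,
#            callbacks=[autotvm.callback.log_to_file('matmul.log')])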

Contributor (Author):

Done. The running time was reduced from 34s to 0.1s, thanks.

@minminsun minminsun requested a review from vinx13 November 8, 2019 04:06
Hzfengsy (Member) left a comment:

Thank you! LGTM

minminsun (Contributor, Author):

Hi @Laurawly, could you please help review the updates? Thank you!

tqchen (Member) commented Nov 9, 2019.
yangjunpro left a comment:

Since this PR has been pending for more than one week and has iterated through several refinements, and there are already approvals from several contributors, we wish to merge it into master and close it. New work can then be started very soon. Thanks.
@tqchen @yzhliu @Laurawly

@tqchen tqchen merged commit d64bf6b into apache:master Nov 9, 2019
tqchen (Member) commented Nov 9, 2019:

Thanks @Laurawly @adityaatluri @minminsun @yangjunpro @vinx13 @Hzfengsy

zxy844288792 pushed a commit to neo-ai/tvm that referenced this pull request Nov 13, 2019
* Add Auto TensorCore TensorCore Unit Test

* Rebase to tvm master branch & Add auto tensor core

* Code Refine

* Add tensor core switch by pragma

* Add pragma in tensor core example code

* Get real tile size to replace hard coded 16

* support more than 2 dimensions (e.g. batchmatmul) for buffer bind scope

* support batch matmul

* Move cuda env check to tensor_core.cc

* Coderefine for tensor_core.cc

* Refine comments

* Some refinements of code and comment

* Update TensorCore UT to pass the CPU test

* remove redundant code

* matmul's storage align for different layout

* Add support for differenct position of type cast

* Add formal tutorial for auto tensorcore codegen

* move tensorcore check up to tutorial code

* code and doc refine

* comment out tune_and_evaluate in tutorial

* fix cpplint error
Laurawly (Contributor):

> Hi @Laurawly, could you please help review the updates? Thank you!

Hi @minminsun, I saw the slides from the TVM meetup in Shanghai, where you showed TensorCore performance on the Turing architecture for int4 and int1 with a great performance boost. I wonder if you plan to upstream that code anytime in the future.

minminsun (Contributor, Author):

> Hi @minminsun, I saw the slides from the TVM meetup in Shanghai, where you showed TensorCore performance on the Turing architecture for int4 and int1 with a great performance boost. I wonder if you plan to upstream that code anytime in the future.

Yes, we plan to open a PR to merge the code after it gets cleaned up.
