[TensorIR] Support for L2 prefetch async copy and pred_guard enabled async in vectorized if_then_else #14329
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
Thanks @LeiWang1999 for the improvement. Could you please add some performance benchmarks, as it is a perf-related PR :)
@Hzfengsy I'll have some updates next week. :)
That makes a lot of sense. Thanks @LeiWang1999 for the bugfix! In addition, I am also curious whether L2 prefetch could really help with performance :-) @cblmemo @andy-yang-1 PTAL
This is an enhanced version of the ptx_ldg32 pass. I didn't consider using async when I wrote ptx_ldg32, but cp.async requires care about the GPU architecture: some architectures cannot use cp.async. Can we distinguish between different GPU architectures?
@andy-yang-1 Yeah, I noticed that there's a pass named ptx_ldg32, but it seems to me that this pass is not complete yet, because ldg32 only loads 4 bytes from global memory, while sometimes we need ldg64 and ldg128 for more efficient loads and stores on the GPU. As for the second concern, I don't think it is something this PR should take into account, partly because support for asynchronous copy already exists in the current TensorIR, and partly because this is not a pass that is enabled by default: users need to annotate and enable it manually, and in the Python interface that can be handled more comfortably, I think.
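(Editor's illustration, not code from this PR: the difference between a 4-byte and a 16-byte access can be sketched in plain CUDA. A `float` load corresponds to an ldg32-style 4-byte access, while a `float4` load lets the compiler emit a single 128-bit, ldg128-style access. The kernel names and sizes below are hypothetical.)

```cuda
#include <cuda_runtime.h>

// One 4-byte load per thread (ldg32-style access).
__global__ void copy_scalar(const float* __restrict__ src, float* dst, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dst[i] = src[i];
}

// One 16-byte load per thread (ldg128-style access); usually uses
// memory bandwidth more efficiently for the same amount of data.
__global__ void copy_vec4(const float4* __restrict__ src, float4* dst, int n4) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n4) dst[i] = src[i];
}
```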
@LeiWang1999 Yeah, you are right. The ptx_ldg32 pass only supports the ldg32 instruction for loading 4 bytes from global memory; I will add support for the wider loads in the future 😉 I confused this with the automatic async work done by Tian; this work does not need to consider the architecture issue. Thank you very much for your feedback!
Microbenchmark performance (the weight vectorized load is asynchronous in both cases):
The L2 results were as expected: in my previous tests I didn't really see any impact on performance either. I leverage the L2 cache in a different way, and I will wait until that is ready before submitting another pull request. Even though this L2 prefetch feature has no effect in these tests, I think it is still worth adding, because the kernels of SOTA libraries like CUTLASS/cuBLAS implement it. This may need more discussion.
Very cool results @LeiWang1999!! Please fix the CI and let’s get this PR in!
I got all tests passing in my branch. However, a pull request that was merged into main two days ago (tvm/src/tir/transforms/lower_async_dma.cc, lines 53 to 57 in c6c89c3) adds some constraints to the DMA-lowering flow. That PR seems to be made for the Hexagon backend, but it also affects CUDA codegen. CC @junrushao @adstraw
The same TIR annotation … The advantage of reusing the … The disadvantage of reusing the …
Force-pushed from 35a2d3f to 630cdd5.
I think this PR is ready to be merged once I resolve the last lint CI failure. (The lint checker CI told me which file does not follow the format rule, but I didn't find any script that can auto-format the file, like clang-format or pep8, so I may need some help.) @junrushao @adstraw I have a tricky fix for the DMA-lowering conflict between the CUDA backend and Hexagon in the latest commit.
@LeiWang1999 To run linters locally, below are some commands you may find useful:

```bash
# Assume you are now at TVM's root directory

# Run all linters
bash ./tests/scripts/task_lint.sh

# Run python's formatter: black. CI uses black 22.12.0.
bash ./tests/lint/git-black.sh

# Run c++'s formatter: clang-format
bash ./tests/lint/git-clang-format.sh

# Run whitespace checking, another linter unhappy with your code
bash ./tests/lint/whitespace.sh
```

You might need to ensure that the versions of black/clang-format are consistent with our CI, which can be a bit annoying at times.
CC @junrushao @Hzfengsy, I think this PR is ready.
Please remove code instead of commenting it out :-)
I'm merging this in as there is no objection :-)
This pull request adds support for the L2 prefetch option of the cp.async instruction, which is available in CUDA 11.4 and later; the implementation follows the approach used in CUTLASS. Additionally, this pull request adds support for asynchronous copy of if_then_else under vectorization and fixes some bugs.
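For context, here is a minimal sketch of what the L2-prefetch variant of cp.async looks like at the CUDA/PTX level, in the spirit of CUTLASS's cp_async helpers. It is an illustration, not the literal code generated by this PR; the function name and the 16-byte copy size are assumptions.

```cuda
#include <cuda_runtime.h>

// Hedged sketch: issue a 16-byte global->shared async copy with the
// L2::128B prefetch hint. Requires sm_80+ and CUDA 11.4+ (PTX ISA 7.4).
// The name `cp_async_l2_prefetch` is made up for this example.
__device__ __forceinline__ void cp_async_l2_prefetch(void* smem_dst, const void* gmem_src) {
  unsigned smem_addr = static_cast<unsigned>(__cvta_generic_to_shared(smem_dst));
  asm volatile("cp.async.cg.shared.global.L2::128B [%0], [%1], 16;\n"
               :
               : "r"(smem_addr), "l"(gmem_src));
  // A matching cp.async.commit_group / cp.async.wait_group is still needed
  // elsewhere before the shared-memory data is consumed.
}
```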
For example, the original async copy injection cannot support a vectorized if_then_else.
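The TVMScript template from the PR description is not reproduced here; in plain CUDA terms, the pattern in question is a predicated, vectorized global-to-shared copy along these lines (kernel name, buffer sizes, and launch shape are made up, assuming 128 threads per block):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: a vectorized (float4, 16-byte) copy from global to
// shared memory, guarded by an if_then_else-style bounds predicate.
// Out-of-bounds lanes must write zeros into shared memory.
__global__ void stage_to_shared(const float4* __restrict__ A, int n4) {
  __shared__ float4 A_shared[128];
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  A_shared[threadIdx.x] = (i < n4) ? A[i] : make_float4(0.f, 0.f, 0.f, 0.f);
  __syncthreads();
  // ... compute on A_shared ...
}
```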
This PR lowers such a guarded, vectorized copy into predicated cp.async form.
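The PR's actual generated code is not shown here; as a hedged CUDA-level sketch, each 16-byte chunk of the vectorized copy is issued as an async copy when its guard holds, and the destination is zero-filled otherwise (the fix discussed below). Names and sizes are again assumptions.

```cuda
#include <cuda_runtime.h>

// Hedged sketch (not the PR's literal output). Launch with 128 threads per
// block so the shared-memory index stays in range.
__global__ void stage_to_shared_async(const float4* __restrict__ A, int n4) {
  __shared__ float4 A_shared[128];
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned dst = static_cast<unsigned>(__cvta_generic_to_shared(&A_shared[threadIdx.x]));
  if (i < n4) {
    // Guard holds: 16-byte asynchronous global->shared copy.
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                 :
                 : "r"(dst), "l"(&A[i]));
  } else {
    // Guard is false: zero-fill the destination.
    A_shared[threadIdx.x] = make_float4(0.f, 0.f, 0.f, 0.f);
  }
  asm volatile("cp.async.commit_group;\n" ::);
  asm volatile("cp.async.wait_group 0;\n" ::);
  __syncthreads();
  // ... compute on A_shared ...
}
```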
The original InjectPTXAsyncCopy pass, by contrast, only supports a serial, 4-byte-aligned assignment form of if_then_else, and its current codegen also has a bug: when the condition is false, the generated code does nothing, whereas the correct behavior is to assign zero to the shared-memory destination; otherwise the value in shared memory is unpredictable and the result will be incorrect. We fixed this with a native shared-memory store in asm.
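A minimal sketch of that kind of fallback, assuming a 16-byte destination: when the guard is false, write zeros straight into the shared-memory slot with a plain st.shared, so the buffer never holds undefined data. This is an illustration, not the PR's literal codegen.

```cuda
#include <cuda_runtime.h>

// Hedged sketch: zero-fill a 16-byte shared-memory slot via inline PTX
// when the async-copy guard is false.
__device__ __forceinline__ void zero_fill_16B_shared(void* smem_dst) {
  unsigned smem_addr = static_cast<unsigned>(__cvta_generic_to_shared(smem_dst));
  asm volatile("st.shared.v4.u32 [%0], {%1, %2, %3, %4};\n"
               :
               : "r"(smem_addr), "r"(0), "r"(0), "r"(0), "r"(0));
}
```

For comparison, CUTLASS's cp_async_zfill achieves a similar effect by passing a src-size of 0 to cp.async, which zero-fills the remainder of the destination.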