
Would you release your implementations of CUDA optimized kernels using TVM? #2

Open · TengFeiHan0 opened this issue May 24, 2021 · 4 comments

TengFeiHan0 commented May 24, 2021

Hi @jwyang, in your paper you said:

> since it's not making use of the highly optimized matrix multiplication libraries in CUDA, its speed is still slow in practice.

> The implementation using the customized CUDA kernel is about 20% faster than the full attention in the same setting, while achieving the theoretical memory complexity. The sliding-chunk approach is the fastest, which is 60% faster than the full attention at the cost of consuming 20% more memory than the theoretical complexity.

Therefore, your code only contains the implementation of the sliding-chunk approach. However, have you ever tried to implement or generate optimized CUDA kernels for newer architectures (sm75+)? In my opinion, introducing tensor-core instructions and highly optimized GEMM libraries (cuBLAS, etc.) could improve the performance of the vision longformer.
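For reference, the TVM route means declaring the attention GEMMs with TVM's tensor-expression API and compiling them for a CUDA target. A minimal sketch with placeholder shapes and a deliberately naive schedule (my illustration, not the repo's actual kernel; API as in TVM 0.7+):

```python
import tvm
from tvm import te

# Declare a plain matmul C = A @ B with TVM's tensor-expression API.
N, M, K = 1024, 1024, 1024
A = te.placeholder((N, K), name="A", dtype="float32")
B = te.placeholder((K, M), name="B", dtype="float32")
k = te.reduce_axis((0, K), name="k")
C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

# Naive schedule: one output row per block, one column per thread.
# Tensor-core tensorization or a cuBLAS offload would replace this part.
s = te.create_schedule(C.op)
i, j = s[C].op.axis
s[C].bind(i, te.thread_axis("blockIdx.x"))
s[C].bind(j, te.thread_axis("threadIdx.x"))

mod = tvm.build(s, [A, B, C], target="cuda")
print(mod.imported_modules[0].get_source())  # inspect the generated CUDA
```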

pzzhang (Contributor) commented May 27, 2021

The following customized CUDA kernel works for both old architectures (as low as sm52) and new ones (we have tried it on V100). It depends on CUDA 10 and is written with TVM. We do not recommend it currently because the sliding-chunk approach has a clear advantage in speed. Moreover, we have not figured out a way to make the customized CUDA kernel work under AMP (automatic mixed precision), which is another reason we are not using it right now.

Try the customized CUDA kernel

First, download the customized CUDA kernel files, extract them, and put them in the corresponding places in the code folder. PS: sorry, the previous link to the zip file was wrong.

You can test the correctness of your installation with this script.

Use the config option MODEL.VIT.MSVIT.ATTN_TYPE = longformer_cuda to run the vision longformer with this customized CUDA kernel.
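With a yacs-style config system this option can be overridden from the command line; the script and config file names below are placeholders, not necessarily the repo's actual entry point:

```bash
# Hypothetical invocation; substitute your actual training script and config.
python tools/train.py --config-file configs/msvit.yaml \
    MODEL.VIT.MSVIT.ATTN_TYPE longformer_cuda
```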

Environment requirements

If you use a virtual environment, you need the following setup inside it:

conda install cudatoolkit=10.0
python -m pip install CausalDomainTransfer/src # to prepare the tvm package
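After installation, a quick sanity check (my own snippet, not part of the repo) is to confirm that the installed tvm package imports and can see a CUDA device:

```python
import tvm

# Print the TVM version and check for a visible CUDA device.
# CUDA 10-era TVM exposes tvm.gpu(0); newer releases use tvm.cuda(0).
print(tvm.__version__)
print("CUDA device found:", tvm.gpu(0).exist)
```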

If you use Docker, you can use the following image:

pengchuanzhang/maskrcnn:py3.7-cuda10.0-pytorch1.7
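A typical way to start an interactive container from this image (the GPU flag and mount path below are placeholders; adjust them to your setup):

```bash
# Requires Docker 19.03+ for --gpus; older setups used nvidia-docker run.
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace \
    pengchuanzhang/maskrcnn:py3.7-cuda10.0-pytorch1.7
```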

You may need to set the environment variable

export MKL_THREADING_LAYER='GNU'

due to a known PyTorch-MKL threading issue.

pzzhang (Contributor) commented Jun 2, 2021

@TengFeiHan0 Did you successfully run the customized cuda kernel?

TengFeiHan0 (Author) commented:

> @TengFeiHan0 Did you successfully run the customized cuda kernel?

Sorry for the late reply. Recently I have mainly been focusing on your sliding-chunk approach. I didn't use the customized CUDA kernel because you have clearly shown the difference between the TVM CUDA kernel and the sliding-chunk approach. Thank you for your quick reply.

jameslahm commented:

@pzzhang The link to the customized CUDA kernel files is inaccessible now. Would you mind updating it? Thanks a lot!
