
Would you release your implementations of CUDA optimized kernels using TVM? #2

Open · TengFeiHan0 opened this issue May 24, 2021 · 4 comments

TengFeiHan0 commented May 24, 2021

Hi @jwyang, in your paper you said:

> since it's not making use of the highly optimized matrix multiplication libraries in CUDA, its speed is still slow in practice.

> The implementation using the customized CUDA kernel is about 20% faster than the full attention in the same setting, while achieving the theoretical memory complexity. The sliding-chunk approach is the fastest, which is 60% faster than the full attention at the cost of consuming 20% more memory than the theoretical complexity.

Therefore, your code only contains the implementation of the sliding-chunk approach. However, have you ever tried to implement or generate optimized CUDA kernels for newer architectures (sm75+)? In my opinion, introducing tensor-core instructions and highly optimized GEMM libraries (cuBLAS, etc.) could improve the performance of the vision longformer.
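For reference, the TVM route means declaring the attention GEMMs with TVM's tensor-expression API and compiling them for a CUDA target. A minimal sketch with placeholder shapes and a deliberately naive schedule (my illustration, not the repo's actual kernel; API as in TVM 0.7+):

```python
import tvm
from tvm import te

# Declare a plain matmul C = A @ B with TVM's tensor-expression API.
N, M, K = 1024, 1024, 1024
A = te.placeholder((N, K), name="A", dtype="float32")
B = te.placeholder((K, M), name="B", dtype="float32")
k = te.reduce_axis((0, K), name="k")
C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

# Naive schedule: one output row per block, one column per thread.
# Tensor-core tensorization or a cuBLAS offload would replace this part.
s = te.create_schedule(C.op)
i, j = s[C].op.axis
s[C].bind(i, te.thread_axis("blockIdx.x"))
s[C].bind(j, te.thread_axis("threadIdx.x"))

mod = tvm.build(s, [A, B, C], target="cuda")
print(mod.imported_modules[0].get_source())  # inspect the generated CUDA
```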

pzzhang (Contributor) commented May 27, 2021

The following customized CUDA kernel works for both old architectures (as low as sm52) and new ones (we have tried it on V100). It depends on CUDA 10 and is written with TVM. We do not recommend it currently because the sliding-chunk approach has a clear advantage in speed. Moreover, we have not figured out a way to make the customized CUDA kernel work under AMP (automatic mixed precision), which is another reason we are not using it right now.

Try the customized CUDA kernel

First, download the customized CUDA kernel files, extract them, and put them in the corresponding places in the code folder. PS: sorry, the previous link to the zip file was wrong.

You can test the correctness of your installation with this script.

Use the config option MODEL.VIT.MSVIT.ATTN_TYPE = longformer_cuda to run the vision longformer with this customized CUDA kernel.
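With a yacs-style config system this option can be overridden from the command line; the script and config file names below are placeholders, not necessarily the repo's actual entry point:

```bash
# Hypothetical invocation; substitute your actual training script and config.
python tools/train.py --config-file configs/msvit.yaml \
    MODEL.VIT.MSVIT.ATTN_TYPE longformer_cuda
```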

Environment requirements

If you use a virtual environment, you need the following setup inside it:

conda install cudatoolkit=10.0
python -m pip install CausalDomainTransfer/src # to prepare the tvm package
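After installation, a quick sanity check (my own snippet, not part of the repo) is to confirm that the installed tvm package imports and can see a CUDA device:

```python
import tvm

# Print the TVM version and check for a visible CUDA device.
# CUDA 10-era TVM exposes tvm.gpu(0); newer releases use tvm.cuda(0).
print(tvm.__version__)
print("CUDA device found:", tvm.gpu(0).exist)
```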

If you use Docker, you can use the following image:

pengchuanzhang/maskrcnn:py3.7-cuda10.0-pytorch1.7
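A typical way to start an interactive container from this image (the GPU flag and mount path below are placeholders; adjust them to your setup):

```bash
# Requires Docker 19.03+ for --gpus; older setups used nvidia-docker run.
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace \
    pengchuanzhang/maskrcnn:py3.7-cuda10.0-pytorch1.7
```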

You may need to set the environment variable

export MKL_THREADING_LAYER='GNU'

due to a known PyTorch-MKL threading issue.

pzzhang (Contributor) commented Jun 2, 2021

@TengFeiHan0 Did you successfully run the customized cuda kernel?

TengFeiHan0 (Author) commented:

> @TengFeiHan0 Did you successfully run the customized cuda kernel?

Sorry for the late reply. Recently I have mainly been focusing on your sliding-chunk approach. I didn't use the customized CUDA kernel because you have clearly shown the difference between the TVM CUDA kernel and the sliding-chunk approach. Thank you for your quick reply.

jameslahm commented:

@pzzhang The link to the customized CUDA kernel files is inaccessible now. Would you mind updating it? Thanks a lot!
