Would you release your implementations of CUDA optimized kernels using TVM? #2
Comments
The following customized CUDA kernel works for both old archs (as low as sm52) and new archs (we have tried it on V100). It depends on CUDA 10 and is written with TVM. We do not recommend it currently because the sliding-chunk approach has a clear advantage in speed. Moreover, we have not figured out a way to make the customized CUDA kernel work under AMP, which is another reason we are not using it right now.

Try the customized CUDA kernel
First, download the customized CUDA kernel files, extract them, and put them in the corresponding places in the code folder. PS: sorry that the previous link to the zip file was wrong. You can test the correctness of your installation with this script.

Environment requirements
For virtual environment users, the following setup is needed inside the environment.
For Docker users, the following docker image can be used.
You may need to reset the environment variable due to a known pytorch-mkl issue.
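For readers who cannot access the kernel files, here is a minimal sketch of the kind of TVM tensor-expression definition such a sliding-window kernel is typically built from. This is not the released implementation: the function name `banded_qk`, the 2D single-head shapes, and the window parameter `w` are all illustrative assumptions, and the tuned CUDA schedule is omitted.

```python
import tvm
from tvm import te

# Hypothetical sketch: a banded/diagonaled Q*K^T expressed with TVM's
# tensor-expression API. n = sequence length, d = head dim, w = one-sided
# window size. All names are illustrative, not the authors' released kernel.
def banded_qk(n=512, d=64, w=32, dtype="float32"):
    Q = te.placeholder((n, d), dtype, name="Q")
    K = te.placeholder((n, d), dtype, name="K")
    r = te.reduce_axis((0, d), name="r")
    # S[i, j] = <Q[i], K[i + j - w]> for j in [0, 2w]; zero outside the sequence
    S = te.compute(
        (n, 2 * w + 1),
        lambda i, j: te.sum(
            tvm.tir.if_then_else(
                tvm.tir.all(i + j - w >= 0, i + j - w < n),
                Q[i, r] * K[i + j - w, r],
                tvm.tir.const(0.0, dtype),
            ),
            axis=r,
        ),
        name="S",
    )
    return Q, K, S

# A CUDA schedule (block/thread binding) plus tvm.build(...) would follow;
# that tuned schedule is exactly what the released kernel files would provide.
```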
@TengFeiHan0 Did you successfully run the customized CUDA kernel?
Sorry for the late reply. Recently I have mainly been focusing on your sliding-chunk approach. I didn't use the customized CUDA kernel because you have clearly shown the difference between the TVM CUDA kernel and the sliding-chunk approach. Thank you for your quick reply.
@pzzhang The link to the customized CUDA kernel files is inaccessible now. Would you mind updating it? Thanks a lot!
Hi @jwyang, in your paper you said that
Therefore, your code only contains the implementation of the sliding-chunk approach. However, have you ever tried to implement or generate CUDA-optimized kernels for newer architectures (sm75+)? In my opinion, introducing tensor-core instructions and highly optimized GEMM libraries (cuBLAS, etc.) could improve the performance of the longformer.
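As a rough illustration of this point (not code from this repo), routing the chunked attention matmuls through half precision or TF32 already engages tensor cores via cuBLAS in PyTorch; the shapes and tensor names below are made up:

```python
import torch

# Illustrative shapes only: (batch * heads, chunks, window, head_dim)
q = torch.randn(12, 16, 256, 64, device="cuda", dtype=torch.float16)
k = torch.randn(12, 16, 256, 64, device="cuda", dtype=torch.float16)

# fp16 batched GEMMs dispatch to cuBLAS and use tensor cores on sm70+;
# allow_tf32 additionally enables TF32 tensor cores for fp32 matmuls on sm80+.
torch.backends.cuda.matmul.allow_tf32 = True
scores = torch.matmul(q, k.transpose(-1, -2))  # (12, 16, 256, 256)
```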