
Kernel error for running example.py #1

Closed
zhuole1025 opened this issue Nov 8, 2024 · 5 comments

Comments

@zhuole1025

Hi, thanks for this amazing work! When running the given example, I got the following error: `RuntimeError: CUDA error: no kernel image is available for execution on the device` (at `/data/nunchaku/src/kernels/awq/gemv_awq.cu:312`). I have followed the installation instructions using torch 2.4.1.

@sxtyzhangzk
Collaborator

Hi, may I ask which GPU you are using? We currently support sm_86 (Ampere, RTX 3090/A6000) and sm_89 (Ada, RTX 4090). The kernel may run on sm_80 (A100), but expect a significant performance drop. If you want to try it on an A100, you can edit `setup.py` and change `arch=compute_86,code=sm_86` to `arch=compute_80,code=sm_80`.
Unfortunately, we don't support Turing (RTX 20 series) and earlier architectures, since we depend on FlashAttention. Hopper (H100) also does not work due to the lack of an INT4 TensorCore.
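For anyone checking their own card, the support matrix described above can be sketched as a small helper. This is a hypothetical snippet, not part of nunchaku; on a live machine you would pass in `torch.cuda.get_device_capability()` to get the `(major, minor)` tuple:

```python
# Hypothetical helper: map a CUDA compute capability (major, minor)
# to the support status described in this thread. Not part of nunchaku.
SUPPORTED = {(8, 6), (8, 9)}   # Ampere (RTX 3090 / A6000), Ada (RTX 4090)
PARTIAL = {(8, 0)}             # A100: may run, with a significant perf drop

def support_status(capability):
    """Return a human-readable support status for an sm_XY capability."""
    if capability in SUPPORTED:
        return "supported"
    if capability in PARTIAL:
        return "partial (expect a significant performance drop)"
    if capability < (8, 0):
        return "unsupported (pre-Ampere: FlashAttention dependency)"
    return "unsupported (e.g. Hopper: no INT4 TensorCore)"

# On a machine with PyTorch + CUDA, you would call:
#   status = support_status(torch.cuda.get_device_capability())
print(support_status((8, 9)))  # → supported
print(support_status((9, 0)))  # → unsupported (e.g. Hopper: no INT4 TensorCore)
```

The tuple comparison `capability < (8, 0)` works because Python compares tuples element-wise, so sm_75 (Turing) sorts below sm_80.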

@zhuole1025
Author

Thanks for your explanation. Such a pity since I am using H100..

@Ph0rk0z

Ph0rk0z commented Nov 9, 2024

Can you add an option for xformers or SDPA? If you used the AWQ kernel that supports older cards, that's all it would take. Are the weights just standard GEMV, or is it a custom format?

@bghira

bghira commented Nov 11, 2024

You may as well use a different inference engine if you're talking about using more generic kernels.

@Ph0rk0z

Ph0rk0z commented Nov 12, 2024

There aren't a lot of kernels to choose from, sadly. Custom kernels seem like the way to go for these transformer-based models, just as on the LLM side. Unfortunately, everyone is using Ampere as the baseline. AWQ does have kernels working for earlier architectures, and there are other attention mechanisms available.
