`__CUDA_ARCH__` macro is unreliable #6529
Comments
As an additional note, I attempted to compile llama.cpp's latest commit (855f544) with the […]
The problem is that you are compiling the llama.cpp code for compute capability 5.2, which is the default for CUDA 12, but the code needs compute capability 6.1 or higher. In llama.cpp proper the code is compiled either for the compute capability of the GPU in the system (make) or for compute capabilities 5.2, 6.1, and 7.0 (cmake). Otherwise the CPU code will, at runtime, select a kernel for which no device code was compiled. The fix is to modify whichever command you're using for compilation so that it sets the correct CUDA architecture, e.g. via […]
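For concreteness, a sketch of what such a build invocation could look like. The exact flag the comment names is cut off above, and the option names below (`LLAMA_CUDA`, `CMAKE_CUDA_ARCHITECTURES`, `CMAKE_ARGS`) are my assumptions based on the build options llama.cpp and llama-cpp-python exposed around this time:

```sh
# CMake build of llama.cpp itself, targeting Ampere (compute capability 8.6,
# e.g. RTX 3090) explicitly. Flag names are assumptions, not from the thread.
cmake -B build -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build

# When installing through llama-cpp-python, the same CMake options can be
# forwarded via the CMAKE_ARGS environment variable:
CMAKE_ARGS="-DLLAMA_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=86" \
    pip install llama-cpp-python --force-reinstall --no-cache-dir
```

The key point is that `__CUDA_ARCH__` is fixed per compiled target, so the fix is always on the build side, not in the driver or runtime.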
@JohannesGaessler - That was the problem! There was an issue with my environment variables in the build pipeline, so the project was failing to pass the correct make flags from the start. Thanks for the help, closing this issue now.
I discovered this issue while trying to use llama.cpp through llama-cpp-python, but it looks like the root cause may reside in llama.cpp itself. During execution I get errors complaining that

```
/llama.cpp/ggml-cuda/convert.cu:64: ERROR: CUDA kernel dequantize_block_q8_0_f16 has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
```

This was previously reported, but closed shortly after with no clarity on the fix. It primarily happens when trying to leverage functionary v2 for tool selection; chatting normally works fine. I am using 2 x 3090 graphics cards (not linked via NVLink) with driver version 550.54.15 and CUDA version 12.4 (update 1) on Debian x86_64.

Despite being on a relatively new driver with the latest CUDA version, the `__CUDA_ARCH__` macro reports the version as 520, which causes functionality designed for Pascal and higher (`CC_PASCAL` = 600) to fail. If I'm not mistaken, this value should be 860 for 3090-series cards, but it clearly isn't. I used this code to confirm the `__CUDA_ARCH__` macro value: