`__CUDA_ARCH__` macro is unreliable #6529
Comments
As an additional note, I attempted to compile llama.cpp's latest commit (855f544) with the […]
The problem is that you are compiling the llama.cpp code for compute capability 5.2, which is the default for CUDA 12, but the code needs compute capability 6.1 or higher. In llama.cpp proper the code is compiled either for the compute capability of the GPU in the system (make) or for compute capabilities 5.2, 6.1, and 7.0 (cmake). Otherwise the CPU code will, at runtime, select a kernel for which no device code was compiled. The fix is to modify whichever command you're using for compilation so that it sets the correct CUDA architecture, e.g. via […]
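For concreteness, a sketch of what such a build invocation could look like. The exact flag the comment names is cut off above, and the option names below (`LLAMA_CUDA`, `CMAKE_CUDA_ARCHITECTURES`, `CMAKE_ARGS`) are my assumptions based on the build options llama.cpp and llama-cpp-python exposed around this time:

```sh
# CMake build of llama.cpp itself, targeting Ampere (compute capability 8.6,
# e.g. RTX 3090) explicitly. Flag names are assumptions, not from the thread.
cmake -B build -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build

# When installing through llama-cpp-python, the same CMake options can be
# forwarded via the CMAKE_ARGS environment variable:
CMAKE_ARGS="-DLLAMA_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=86" \
    pip install llama-cpp-python --force-reinstall --no-cache-dir
```

The key point is that `__CUDA_ARCH__` is fixed per compiled target, so the fix is always on the build side, not in the driver or runtime.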
@JohannesGaessler - That was the problem! There was an issue with my environment variables in the build pipeline, so the project was failing to pass the correct make flags from the start. Thanks for the help, closing this issue now.
I discovered this issue while trying to use llama.cpp through llama-cpp-python, but it looks like the root cause may reside in llama.cpp itself. During execution I get errors complaining that

```
/llama.cpp/ggml-cuda/convert.cu:64: ERROR: CUDA kernel dequantize_block_q8_0_f16 has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
```

This was previously reported, but closed shortly after with no clarity on the fix. It primarily happens when trying to leverage functionary v2 for tool selection; chatting normally works fine. I am using 2 x 3090 graphics cards (not linked via NVLink) with driver version 550.54.15 and CUDA version 12.4 (update 1) on Debian x86_64.

Despite being on a relatively new driver with the latest CUDA version, the `__CUDA_ARCH__` macro reports the version as 520, which causes functionality designed for Pascal and higher (`CC_PASCAL` = 600) to fail. If I'm not mistaken, this value should be 860 for 3090-series cards, but it clearly isn't. I used this code to confirm the `__CUDA_ARCH__` macro value: