CUDA error when trying to offload layers to AMD GPU using rocBlas/hipBlas #2799
Comments
When compiling rocBLAS, did you compile it for gfx1012? When running cmake to create/configure the build folder for llama.cpp, you need to pass that target as well.
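For reference, a minimal sketch of such a configure step, assuming the LLAMA_HIPBLAS and AMDGPU_TARGETS flags that llama.cpp builds of this era used, and a ROCm clang toolchain on PATH (adjust compiler paths to your install):

cmake .. -G Ninja -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1012 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++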
Yes, I compiled for gfx1012, but I didn't pass that cmake option to llama.cpp. I did that now and the error disappeared, but it doesn't generate any tokens; it doesn't respond at all after loading the model. After pressing CTRL+C a few times, I see this:
Note: the model seems to be loaded into VRAM successfully; almost all of the VRAM was used.
I should probably mention that when I don't offload layers, nothing happens either.
And the CPU peaks in both cases.
Using:
I think that the Debug build is being built, and it's extremely slow. You can use the Ninja Multi-Config generator and build as sketched below; main.exe will then be in the build\bin\Release folder. Alternatively, pass the build type to the single-config Ninja generator.
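A sketch of both routes, assuming the same hipBLAS flags as in the configure step above:

rem Multi-config: the configuration is chosen at build time
cmake .. -G "Ninja Multi-Config"
cmake --build . --config Release

rem Single-config: bake the build type into the configure step
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build .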
CMakeLists.txt checks for MSVC explicitly instead of … Also, once you have this working, can you test adding Lines 92 to 100 in 04f4b1e for more performance?
Okay, I think that has done it. Using codellama-7b with the website-in-10-steps command:
and using the second suggestion:
And both combined, using the same command but with llama-2-13b-chat:
Thank you 👍
@jan-tennert ROCm on Windows? That's awesome. Maybe you could share the test binaries?
Do you mean the llama binaries? I recompiled ROCm using #1087 (comment), because it isn't officially supported, then followed #1087 (comment) to compile llama.cpp.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Hey, I successfully compiled llama.cpp on Windows using #1087 (comment) and #1087 (comment), but I'm getting this error when trying to offload layers to the GPU:
CUDA error 98 at H:/SD/llama.cpp/ggml-cuda.cu:6036: invalid device function.
Full log
H:\SD\llama.cpp\build\bin>main -m falcon-7b-Q4_0-GGUF.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 35
main: build = 1069 (232caf3)
main: seed = 1692998941
ggml_init_cublas: found 1 ROCm devices:
Device 0: Radeon RX 5500 XT, compute capability 10.1
llama_model_loader: loaded meta data with 19 key-value pairs and 196 tensors from falcon-7b-Q4_0-GGUF.gguf (version GGUF V1 (latest))
llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 4544, 65024, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 7: blk.1.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 8: blk.1.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 9: blk.1.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 10: blk.1.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 11: blk.1.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 13: blk.2.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 14: blk.2.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 15: blk.2.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 16: blk.2.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 17: blk.2.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 18: blk.2.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 19: blk.3.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 20: blk.3.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 21: blk.3.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 22: blk.3.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 23: blk.3.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 24: blk.3.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 25: blk.4.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 26: blk.4.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 27: blk.4.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 28: blk.4.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 29: blk.4.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 30: blk.4.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 31: blk.5.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 32: blk.5.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 33: blk.5.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 34: blk.5.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 35: blk.5.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 36: blk.5.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 37: blk.6.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 38: blk.6.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 39: blk.6.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 40: blk.6.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 41: blk.6.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 42: blk.6.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 43: blk.7.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 44: blk.7.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 45: blk.7.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 46: blk.7.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 47: blk.7.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 48: blk.7.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 49: blk.8.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 50: blk.8.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 51: blk.8.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 52: blk.8.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 53: blk.8.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 54: blk.8.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 55: blk.9.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 56: blk.9.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 57: blk.9.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 58: blk.9.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 59: blk.9.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 60: blk.9.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 61: blk.10.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 62: blk.10.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 63: blk.10.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 64: blk.10.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 65: blk.10.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 66: blk.10.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 67: blk.11.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 68: blk.11.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 69: blk.11.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 70: blk.11.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 71: blk.11.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 72: blk.11.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 73: blk.12.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 74: blk.12.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 75: blk.12.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 76: blk.12.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 77: blk.12.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 78: blk.12.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 79: blk.13.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 80: blk.13.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 81: blk.13.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 82: blk.13.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 83: blk.13.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 84: blk.13.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 85: blk.14.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 86: blk.14.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 87: blk.14.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 88: blk.14.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 89: blk.14.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 90: blk.14.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 91: blk.15.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 92: blk.15.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 93: blk.15.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 94: blk.15.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 95: blk.15.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 96: blk.15.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 97: blk.16.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 98: blk.16.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 99: blk.16.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 100: blk.16.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 101: blk.16.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 102: blk.16.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 103: blk.17.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 104: blk.17.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 105: blk.17.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 106: blk.17.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 107: blk.17.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 108: blk.17.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 109: blk.18.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 110: blk.18.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 111: blk.18.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 112: blk.18.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 113: blk.18.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 114: blk.18.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 115: blk.19.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 116: blk.19.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 117: blk.19.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 118: blk.19.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 119: blk.19.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 120: blk.19.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 121: blk.20.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 122: blk.20.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 123: blk.20.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 124: blk.20.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 125: blk.20.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 126: blk.20.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 127: blk.21.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 128: blk.21.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 129: blk.21.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 130: blk.21.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 131: blk.21.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 132: blk.21.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 133: blk.22.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 134: blk.22.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 135: blk.22.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 136: blk.22.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 137: blk.22.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 138: blk.22.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 139: blk.23.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 140: blk.23.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 141: blk.23.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 142: blk.23.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 143: blk.23.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 144: blk.23.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 145: blk.24.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 146: blk.24.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 147: blk.24.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 148: blk.24.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 149: blk.24.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 150: blk.24.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 151: blk.25.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 152: blk.25.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 153: blk.25.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 154: blk.25.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 155: blk.25.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 156: blk.25.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 157: blk.26.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 158: blk.26.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 159: blk.26.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 160: blk.26.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 161: blk.26.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 162: blk.26.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 163: blk.27.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 164: blk.27.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 165: blk.27.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 166: blk.27.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 167: blk.27.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 168: blk.27.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 169: blk.28.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 170: blk.28.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 171: blk.28.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 172: blk.28.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 173: blk.28.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 174: blk.28.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 175: blk.29.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 176: blk.29.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 177: blk.29.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 178: blk.29.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 179: blk.29.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 180: blk.29.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 181: blk.30.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 182: blk.30.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 183: blk.30.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 184: blk.30.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 185: blk.30.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 186: blk.30.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 187: blk.31.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 188: blk.31.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 189: blk.31.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 190: blk.31.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 191: blk.31.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 192: blk.31.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 193: output_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 194: output_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 195: output.weight q4_0 [ 4544, 65024, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: falcon.context_length u32
llama_model_loader: - kv 3: falcon.tensor_data_layout str
llama_model_loader: - kv 4: falcon.embedding_length u32
llama_model_loader: - kv 5: falcon.feed_forward_length u32
llama_model_loader: - kv 6: falcon.block_count u32
llama_model_loader: - kv 7: falcon.attention.head_count u32
llama_model_loader: - kv 8: falcon.attention.head_count_kv u32
llama_model_loader: - kv 9: falcon.attention.layer_norm_epsilon f32
llama_model_loader: - kv 10: tokenizer.ggml.model str
llama_model_loader: - kv 11: tokenizer.ggml.merges arr
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
llama_model_loader: - kv 13: tokenizer.ggml.scores arr
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 17: general.quantization_version u32
llama_model_loader: - kv 18: general.file_type u32
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q4_0: 130 tensors
llm_load_print_meta: format = GGUF V1 (latest)
llm_load_print_meta: arch = falcon
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 65024
llm_load_print_meta: n_merges = 64784
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_ctx = 512
llm_load_print_meta: n_embd = 4544
llm_load_print_meta: n_head = 71
llm_load_print_meta: n_head_kv = 1
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_gqa = 71
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 18176
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model size = 7.22 B
llm_load_print_meta: general.name = Falcon
llm_load_print_meta: BOS token = 11 '<|endoftext|>'
llm_load_print_meta: EOS token = 11 '<|endoftext|>'
llm_load_print_meta: LF token = 193 '
'
llm_load_tensors: ggml ctx size = 0.06 MB
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: mem required = 158.56 MB (+ 4.00 MB per state)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 3719 MB
.....................................................................................
llama_new_context_with_model: kv self size = 4.00 MB
llama_new_context_with_model: compute buffer total size = 146.16 MB
llama_new_context_with_model: VRAM scratch buffer: 144.75 MB
system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
CUDA error 98 at H:/SD/llama.cpp/ggml-cuda.cu:6036: invalid device function
If I don't offload layers, I don't see any tokens being generated, and if I use this command:
main -m falcon-7b-Q4_0-GGUF.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
I get this:
OS: Windows 11
GPU: AMD RX 5500 XT