CUDA error when trying to offload layers to AMD GPU using rocBlas/hipBlas #2799

Closed
4 tasks done
jan-tennert opened this issue Aug 25, 2023 · 9 comments

@jan-tennert

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Hey, I successfully compiled llama.cpp on Windows using #1087 (comment) and #1087 (comment), but I'm getting this error when trying to offload layers to the GPU:
CUDA error 98 at H:/SD/llama.cpp/ggml-cuda.cu:6036: invalid device function.

Full log

H:\SD\llama.cpp\build\bin>main -m falcon-7b-Q4_0-GGUF.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 35
main: build = 1069 (232caf3)
main: seed = 1692998941
ggml_init_cublas: found 1 ROCm devices:
Device 0: Radeon RX 5500 XT, compute capability 10.1
llama_model_loader: loaded meta data with 19 key-value pairs and 196 tensors from falcon-7b-Q4_0-GGUF.gguf (version GGUF V1 (latest))
llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 4544, 65024, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 7: blk.1.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 8: blk.1.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 9: blk.1.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 10: blk.1.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 11: blk.1.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 13: blk.2.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 14: blk.2.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 15: blk.2.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 16: blk.2.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 17: blk.2.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 18: blk.2.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 19: blk.3.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 20: blk.3.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 21: blk.3.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 22: blk.3.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 23: blk.3.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 24: blk.3.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 25: blk.4.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 26: blk.4.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 27: blk.4.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 28: blk.4.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 29: blk.4.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 30: blk.4.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 31: blk.5.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 32: blk.5.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 33: blk.5.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 34: blk.5.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 35: blk.5.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 36: blk.5.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 37: blk.6.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 38: blk.6.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 39: blk.6.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 40: blk.6.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 41: blk.6.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 42: blk.6.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 43: blk.7.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 44: blk.7.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 45: blk.7.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 46: blk.7.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 47: blk.7.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 48: blk.7.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 49: blk.8.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 50: blk.8.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 51: blk.8.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 52: blk.8.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 53: blk.8.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 54: blk.8.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 55: blk.9.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 56: blk.9.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 57: blk.9.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 58: blk.9.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 59: blk.9.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 60: blk.9.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 61: blk.10.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 62: blk.10.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 63: blk.10.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 64: blk.10.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 65: blk.10.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 66: blk.10.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 67: blk.11.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 68: blk.11.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 69: blk.11.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 70: blk.11.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 71: blk.11.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 72: blk.11.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 73: blk.12.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 74: blk.12.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 75: blk.12.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 76: blk.12.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 77: blk.12.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 78: blk.12.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 79: blk.13.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 80: blk.13.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 81: blk.13.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 82: blk.13.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 83: blk.13.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 84: blk.13.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 85: blk.14.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 86: blk.14.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 87: blk.14.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 88: blk.14.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 89: blk.14.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 90: blk.14.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 91: blk.15.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 92: blk.15.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 93: blk.15.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 94: blk.15.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 95: blk.15.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 96: blk.15.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 97: blk.16.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 98: blk.16.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 99: blk.16.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 100: blk.16.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 101: blk.16.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 102: blk.16.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 103: blk.17.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 104: blk.17.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 105: blk.17.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 106: blk.17.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 107: blk.17.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 108: blk.17.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 109: blk.18.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 110: blk.18.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 111: blk.18.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 112: blk.18.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 113: blk.18.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 114: blk.18.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 115: blk.19.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 116: blk.19.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 117: blk.19.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 118: blk.19.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 119: blk.19.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 120: blk.19.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 121: blk.20.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 122: blk.20.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 123: blk.20.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 124: blk.20.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 125: blk.20.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 126: blk.20.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 127: blk.21.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 128: blk.21.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 129: blk.21.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 130: blk.21.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 131: blk.21.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 132: blk.21.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 133: blk.22.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 134: blk.22.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 135: blk.22.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 136: blk.22.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 137: blk.22.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 138: blk.22.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 139: blk.23.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 140: blk.23.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 141: blk.23.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 142: blk.23.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 143: blk.23.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 144: blk.23.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 145: blk.24.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 146: blk.24.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 147: blk.24.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 148: blk.24.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 149: blk.24.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 150: blk.24.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 151: blk.25.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 152: blk.25.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 153: blk.25.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 154: blk.25.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 155: blk.25.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 156: blk.25.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 157: blk.26.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 158: blk.26.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 159: blk.26.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 160: blk.26.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 161: blk.26.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 162: blk.26.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 163: blk.27.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 164: blk.27.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 165: blk.27.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 166: blk.27.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 167: blk.27.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 168: blk.27.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 169: blk.28.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 170: blk.28.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 171: blk.28.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 172: blk.28.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 173: blk.28.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 174: blk.28.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 175: blk.29.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 176: blk.29.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 177: blk.29.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 178: blk.29.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 179: blk.29.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 180: blk.29.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 181: blk.30.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 182: blk.30.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 183: blk.30.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 184: blk.30.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 185: blk.30.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 186: blk.30.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 187: blk.31.attn_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 188: blk.31.attn_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 189: blk.31.attn_qkv.weight q4_0 [ 4544, 4672, 1, 1 ]
llama_model_loader: - tensor 190: blk.31.attn_output.weight q4_0 [ 4544, 4544, 1, 1 ]
llama_model_loader: - tensor 191: blk.31.ffn_up.weight q4_0 [ 4544, 18176, 1, 1 ]
llama_model_loader: - tensor 192: blk.31.ffn_down.weight q4_0 [ 18176, 4544, 1, 1 ]
llama_model_loader: - tensor 193: output_norm.weight f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 194: output_norm.bias f32 [ 4544, 1, 1, 1 ]
llama_model_loader: - tensor 195: output.weight q4_0 [ 4544, 65024, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: falcon.context_length u32
llama_model_loader: - kv 3: falcon.tensor_data_layout str
llama_model_loader: - kv 4: falcon.embedding_length u32
llama_model_loader: - kv 5: falcon.feed_forward_length u32
llama_model_loader: - kv 6: falcon.block_count u32
llama_model_loader: - kv 7: falcon.attention.head_count u32
llama_model_loader: - kv 8: falcon.attention.head_count_kv u32
llama_model_loader: - kv 9: falcon.attention.layer_norm_epsilon f32
llama_model_loader: - kv 10: tokenizer.ggml.model str
llama_model_loader: - kv 11: tokenizer.ggml.merges arr
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
llama_model_loader: - kv 13: tokenizer.ggml.scores arr
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 17: general.quantization_version u32
llama_model_loader: - kv 18: general.file_type u32
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q4_0: 130 tensors
llm_load_print_meta: format = GGUF V1 (latest)
llm_load_print_meta: arch = falcon
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 65024
llm_load_print_meta: n_merges = 64784
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_ctx = 512
llm_load_print_meta: n_embd = 4544
llm_load_print_meta: n_head = 71
llm_load_print_meta: n_head_kv = 1
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_gqa = 71
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 18176
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model size = 7.22 B
llm_load_print_meta: general.name = Falcon
llm_load_print_meta: BOS token = 11 '<|endoftext|>'
llm_load_print_meta: EOS token = 11 '<|endoftext|>'
llm_load_print_meta: LF token = 193 '
'
llm_load_tensors: ggml ctx size = 0.06 MB
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: mem required = 158.56 MB (+ 4.00 MB per state)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloading v cache to GPU
llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 3719 MB
.....................................................................................
llama_new_context_with_model: kv self size = 4.00 MB
llama_new_context_with_model: compute buffer total size = 146.16 MB
llama_new_context_with_model: VRAM scratch buffer: 144.75 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
CUDA error 98 at H:/SD/llama.cpp/ggml-cuda.cu:6036: invalid device function

If I don't offload layers, I don't see any tokens being generated, and if I use this command:
main -m falcon-7b-Q4_0-GGUF.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
I get this:

[...]
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:Assertion failed: !isinf(x), file H:/SD/llama.cpp/ggml.c, line 10398
Assertion failed: !isinf(x), file H:/SD/llama.cpp/ggml.c, line 10398
Assertion failed: !isinf(x), file H:/SD/llama.cpp/ggml.c, line 10398
Assertion failed: !isinf(x), file H:/SD/llama.cpp/ggml.c, line 10398
Assertion failed: !isinf(x), file H:/SD/llama.cpp/ggml.c, line 10398
Assertion failed: !isinf(x), file H:/SD/llama.cpp/ggml.c, line 10398

OS: Windows 11
GPU: AMD RX 5500 XT

@Engininja2
Contributor

When compiling rocblas, did you compile it for gfx1012?

When running cmake to create/configure the build folder for llama.cpp, you need to pass it -DAMDGPU_TARGETS=gfx1012
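
For reference, "invalid device function" usually means the GPU kernels were not compiled for the card's actual architecture. Here is a minimal HIP sketch to print the architecture string the runtime reports for each device, assuming the HIP SDK's hipcc and runtime headers; it is a standalone illustrative check, not part of llama.cpp:

// arch_check.cpp - hypothetical standalone check, not part of llama.cpp
// build: hipcc arch_check.cpp -o arch_check
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        fprintf(stderr, "no HIP devices found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, i);
        // gcnArchName is the target the runtime reports, e.g. "gfx1012" for an RX 5500 XT
        printf("device %d: %s, arch = %s\n", i, prop.name, prop.gcnArchName);
    }
    return 0;
}

The reported string should match the value passed to -DAMDGPU_TARGETS and the targets rocBLAS was built for.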

@jan-tennert
Author

jan-tennert commented Aug 25, 2023

When compiling rocblas, did you compile it for gfx1012?

When running cmake to create/configure the build folder for llama.cpp, you need to pass it -DAMDGPU_TARGETS=gfx1012

Yes, I compiled it for gfx1012, but I didn't pass that cmake option to llama.cpp. I did that now and the error disappeared, but it doesn't generate any tokens; it doesn't respond at all after loading the model. After pressing CTRL+C a few times, I see this:

llama_print_timings:        load time =  2499.85 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 30747.88 ms

Note: the model seems to be loaded into VRAM successfully; almost all of the VRAM was used.

@jan-tennert
Author

I should probably mention that nothing happens when I don't offload layers either.

@jan-tennert
Author

And CPU usage peaks in both cases.

@jan-tennert
Author

jan-tennert commented Aug 26, 2023

Using -nommq seems to fix the issue where nothing happens, but token generation is incredibly slow (using -ngl 35):

llama_print_timings:        load time =  9858.21 ms
llama_print_timings:      sample time =     1.49 ms /     5 runs   (    0.30 ms per token,  3353.45 tokens per second)
llama_print_timings: prompt eval time = 78408.08 ms /    98 tokens (  800.08 ms per token,     1.25 tokens per second)
llama_print_timings:        eval time = 35383.75 ms /     5 runs   ( 7076.75 ms per token,     0.14 tokens per second)
llama_print_timings:       total time = 120927.22 ms

@Engininja2
Contributor

Engininja2 commented Aug 26, 2023

I think a debug build is being produced, which is why it's extremely slow.

You can use the Ninja Multi-Config generator and build it like this; main.exe will then be in the build\bin\Release folder:

cmake -B build -G "Ninja Multi-Config" -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1012
cmake --build build --config Release

Or pass the build type to the single-config Ninja generator:

cmake -B build -G "Ninja" -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1012 -DCMAKE_BUILD_TYPE=Release
cmake --build build

CMakeLists.txt checks for MSVC explicitly instead of GENERATOR_IS_MULTI_CONFIG, so it isn't setting the default build type to Release in that case. Edit: I was wrong, it's not the generator part of the check; it's that the Visual Studio cmake sets Debug as the initial build type before getting to that point.

Also, once you have this working, could you test adding || defined(__gfx1012__) to __dp4a after defined(__gfx1030__) for more performance? The fallback generated by the compiler is slow.

llama.cpp/ggml-cuda.cu, lines 92 to 100 at 04f4b1e:

static __device__ __forceinline__ int __dp4a(const int a, const int b, int c) {
#if defined(__gfx906__) || defined(__gfx908__) || defined(__gfx90a__) || defined(__gfx1030__)
c = __builtin_amdgcn_sdot4(a, b, c, false);
#elif defined(__gfx1100__)
c = __builtin_amdgcn_sudot4( true, a, true, b, c, false);
#elif defined(__gfx1010__) || defined(__gfx900__)
int tmp1;
int tmp2;
asm("\n \
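
To spell that suggestion out, the edited guard at the top of that block would look roughly like this (a sketch of the proposed change, not a merged patch):

// suggested tweak: also route gfx1012 to the hardware dot-product builtin
#if defined(__gfx906__) || defined(__gfx908__) || defined(__gfx90a__) || defined(__gfx1030__) || defined(__gfx1012__)
    c = __builtin_amdgcn_sdot4(a, b, c, false);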

@jan-tennert
Author

jan-tennert commented Aug 26, 2023

Okay, I think that has done it. Using codellama-7b with the "website in 10 steps" command:

llama_print_timings:        load time =  2093.96 ms
llama_print_timings:      sample time =    75.83 ms /   332 runs   (    0.23 ms per token,  4377.98 tokens per second)
llama_print_timings: prompt eval time =  1254.33 ms /    19 tokens (   66.02 ms per token,    15.15 tokens per second)
llama_print_timings:        eval time = 15299.99 ms /   331 runs   (   46.22 ms per token,    21.63 tokens per second)
llama_print_timings:       total time = 16708.70 ms

and using the second suggestion:

llama_print_timings:        load time =  2115.14 ms
llama_print_timings:      sample time =    90.21 ms /   400 runs   (    0.23 ms per token,  4433.95 tokens per second)
llama_print_timings: prompt eval time =   615.75 ms /    19 tokens (   32.41 ms per token,    30.86 tokens per second)
llama_print_timings:        eval time = 16886.86 ms /   399 runs   (   42.32 ms per token,    23.63 tokens per second)
llama_print_timings:       total time = 17683.55 ms

And both combined using the same command, but with llama-2-13b-chat:

llama_print_timings:        load time = 11362.33 ms
llama_print_timings:      sample time =    35.87 ms /   160 runs   (    0.22 ms per token,  4461.05 tokens per second)
llama_print_timings: prompt eval time =   965.75 ms /    19 tokens (   50.83 ms per token,    19.67 tokens per second)
llama_print_timings:        eval time = 47745.10 ms /   159 runs   (  300.28 ms per token,     3.33 tokens per second)
llama_print_timings:       total time = 48783.27 ms

thank you 👍

@BarfingLemurs
Contributor

@jan-tennert ROCm on Windows? That's awesome, maybe you could share the test binaries?

@jan-tennert
Author

@jan-tennert ROCm on Windows? That's awesome, maybe you could share the test binaries?

Do you mean the llama.cpp binaries? I recompiled ROCm using #1087 (comment) because it isn't officially supported, then followed #1087 (comment) to compile llama.cpp.
You can install the HIP SDK for Windows here: https://www.amd.com/en/developer/rocm-hub/hip-sdk.html (it is required for recompiling ROCm).
