
Crippled Performance on Multi GPU Due to Loading onto RAM #4498

Closed
Noobville1345 opened this issue Dec 16, 2023 · 8 comments

@Noobville1345

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [Yes] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [Yes] I carefully followed the README.md.
  • [Yes] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [Yes] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

High prompt-processing and inference speed when using 2x RTX 3090s (with expected variability depending on the model).

Current Behavior

Extremely low t/s when using 2x RTX 3090s (7 t/s on TinyLlama Q2_K).

Environment and Context

Fresh installs of both llama.cpp (self-compiled with cuBLAS) and the most recent pull of oobabooga. No matter the setup, I end up with extremely low tokens per second whenever I split the model across two separate GPUs.

  • Physical (or virtual) hardware you are using:

2x RTX 3090 (both running at x8 PCIe lanes), 64 GB DDR4 RAM

Operating System:

Windows 10

Failure Information (for bugs)

Low tokens per second; the loader seems to be preferentially using RAM or even my SSD instead of my GPUs, even though the model is fully loaded onto them. A 400-megabyte model should not be triggering 2.8 GB of shared GPU memory.

Steps to Reproduce

Step 1. Load up Oobabooga or llamacpp.

Step 2. Use default settings, except set tensor split to 18,17 and context to 30k (to mimic Mixtral); see the sketch after these steps.

Step 3. Connect the model to the external API.

Step 4. Send the message "Hi" using the standard Alpaca template.
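
For reference, here is a minimal llama-cpp-python sketch approximating these settings (this is an assumption about the bindings oobabooga wraps, not the exact reproduction; the model path matches the one in the log below):

```python
from llama_cpp import Llama

# Illustrative sketch only: mirrors the repro settings above via llama-cpp-python.
llm = Llama(
    model_path="models/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",  # same file as in the log
    n_gpu_layers=-1,        # offload every layer to the GPUs
    tensor_split=[18, 17],  # proportional split across the two 3090s
    n_ctx=31232,            # ~30k context, matching n_ctx in the log
)

out = llm("Hi", max_tokens=64)
print(out["choices"][0]["text"])
```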

Failure Logs

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 20 key-value pairs and 201 tensors from models\tinyllama-1.1b-chat-v0.3.Q2_K.gguf (version GGUF V2)
llama_model_loader: - tensor 0: token_embd.weight q2_K [ 2048, 32003, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 7: blk.0.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 10: blk.1.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 11: blk.1.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 13: blk.1.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 14: blk.1.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 15: blk.1.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 16: blk.1.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 17: blk.1.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 18: blk.1.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 19: blk.2.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 20: blk.2.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 21: blk.2.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 22: blk.2.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 23: blk.2.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 24: blk.2.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 25: blk.2.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 26: blk.2.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 27: blk.2.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 28: blk.3.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 29: blk.3.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 30: blk.3.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 31: blk.3.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 32: blk.3.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 33: blk.3.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 34: blk.3.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 35: blk.3.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 36: blk.3.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 37: blk.4.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 38: blk.4.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 39: blk.4.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 40: blk.4.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 41: blk.4.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 42: blk.4.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 43: blk.4.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 44: blk.4.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 45: blk.4.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 46: blk.5.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 47: blk.5.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 48: blk.5.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 49: blk.5.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 50: blk.5.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 51: blk.5.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 52: blk.5.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 53: blk.5.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 54: blk.5.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 55: blk.6.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 56: blk.6.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 57: blk.6.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 58: blk.6.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 59: blk.6.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 60: blk.6.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 61: blk.6.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 62: blk.6.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 63: blk.6.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 64: blk.7.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 65: blk.7.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 66: blk.7.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 67: blk.7.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 68: blk.7.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 69: blk.7.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 70: blk.7.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 71: blk.7.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 72: blk.7.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 73: blk.8.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 74: blk.8.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 75: blk.8.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 76: blk.8.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 77: blk.8.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 78: blk.8.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 79: blk.8.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 80: blk.8.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 81: blk.8.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 82: blk.9.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 83: blk.9.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 84: blk.9.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 85: blk.9.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 86: blk.9.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 87: blk.9.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 88: blk.9.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 89: blk.9.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 90: blk.9.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 91: blk.10.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 92: blk.10.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 93: blk.10.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 94: blk.10.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 95: blk.10.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 96: blk.10.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 97: blk.10.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 98: blk.10.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 99: blk.10.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 100: blk.11.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 101: blk.11.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 102: blk.11.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 103: blk.11.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 104: blk.11.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 105: blk.11.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 106: blk.11.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 107: blk.11.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 108: blk.11.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 109: blk.12.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 110: blk.12.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 111: blk.12.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 112: blk.12.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 113: blk.12.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 114: blk.12.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 115: blk.12.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 116: blk.12.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 117: blk.12.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 118: blk.13.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 119: blk.13.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 120: blk.13.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 121: blk.13.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 122: blk.13.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 123: blk.13.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 124: blk.13.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 125: blk.13.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 126: blk.13.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 127: blk.14.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 128: blk.14.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 129: blk.14.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 130: blk.14.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 131: blk.14.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 132: blk.14.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 133: blk.14.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 134: blk.14.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 135: blk.14.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 136: blk.15.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 137: blk.15.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 138: blk.15.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 139: blk.15.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 140: blk.15.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 141: blk.15.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 142: blk.15.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 143: blk.15.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 144: blk.15.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 145: blk.16.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 146: blk.16.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 147: blk.16.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 148: blk.16.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 149: blk.16.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 150: blk.16.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 151: blk.16.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 152: blk.16.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 153: blk.16.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 154: blk.17.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 155: blk.17.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 156: blk.17.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 157: blk.17.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 158: blk.17.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 159: blk.17.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 160: blk.17.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 161: blk.17.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 162: blk.17.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 163: blk.18.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 164: blk.18.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 165: blk.18.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 166: blk.18.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 167: blk.18.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 168: blk.18.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 169: blk.18.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 170: blk.18.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 171: blk.18.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 172: blk.19.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 173: blk.19.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 174: blk.19.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 175: blk.19.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 176: blk.19.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 177: blk.19.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 178: blk.19.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 179: blk.19.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 180: blk.19.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 181: blk.20.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 182: blk.20.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 183: blk.20.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 184: blk.20.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 185: blk.20.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 186: blk.20.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 187: blk.20.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 188: blk.20.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 189: blk.20.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 190: blk.21.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 191: blk.21.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 192: blk.21.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 193: blk.21.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 194: blk.21.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 195: blk.21.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 196: blk.21.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 197: blk.21.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 198: blk.21.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 199: output_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 200: output.weight q6_K [ 2048, 32003, 1, 1 ]
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = py007_tinyllama-1.1b-chat-v0.3
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.block_count u32 = 22
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 10
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32003] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32003] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32003] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type q2_K: 45 tensors
llama_model_loader: - type q3_K: 110 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 262/32003 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32003
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = mostly Q2_K
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 459.11 MiB (3.50 BPW)
llm_load_print_meta: general.name = py007_tinyllama-1.1b-chat-v0.3
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.08 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 20.59 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors: VRAM used: 438.60 MiB
......................................................................................
llama_new_context_with_model: n_ctx = 31232
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size = 671.00 MiB, K (f16): 335.50 MiB, V (f16): 335.50 MiB
llama_build_graph: non-view tensors processed: 466/466
llama_new_context_with_model: compute buffer total size = 2028.32 MiB
llama_new_context_with_model: VRAM scratch buffer: 2025.00 MiB
llama_new_context_with_model: total VRAM used: 2463.61 MiB (model: 438.60 MiB, context: 2025.00 MiB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
2023-12-16 07:36:01 INFO:LOADER: llama.cpp
2023-12-16 07:36:01 INFO:TRUNCATION LENGTH: 31232
2023-12-16 07:36:01 INFO:INSTRUCTION TEMPLATE: Alpaca
2023-12-16 07:36:01 INFO:Loaded the model in 1.46 seconds.
Output generated in 25.54 seconds (6.81 tokens/s, 174 tokens, context 1227, seed 1278067900)

Attached is a screenshot summary.


@Ph0rk0z

Ph0rk0z commented Dec 18, 2023

Your KV cache is on CPU :(
llama_new_context_with_model: KV self size = 671.00 MiB, K (f16): 335.50 MiB, V (f16): 335.50 MiB

Some of the API has changed.
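
For later readers: the API change referred to here is presumably the then-new `offload_kqv` option in `llama_context_params`, which asks llama.cpp to keep the KV cache in VRAM. A hedged sketch, assuming a llama-cpp-python build recent enough to expose it:

```python
from llama_cpp import Llama

# Assumption: offload_kqv exists in this llama-cpp-python build (it was added
# around this time) and maps to llama_context_params.offload_kqv in llama.cpp.
llm = Llama(
    model_path="models/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
    n_gpu_layers=-1,
    tensor_split=[18, 17],
    n_ctx=31232,
    offload_kqv=True,  # keep the KV cache on the GPUs rather than in system RAM
)
```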

@sumukshashidhar

I can confirm this: for some reason it's dipping into shared GPU memory, despite all layers being offloaded and leftover VRAM.

@mindkrypted

I experience the same extremely poor performance due to Windows' shared GPU memory being used.
Win10, 2x RTX 3090, AMD 5950X, 128 GB RAM

Same as the others reported: with small models where all layers fit in VRAM, it still uses shared memory.

When squeezing as many layers as possible onto the GPUs with llama.cpp or koboldcpp:
70B models: avg of 1.5 t/s at Q5 (Miqu)
120B models: avg of 0.7 t/s at Q4 (Miqu 2x70B, Goliath)
155B model: avg of 0.6 t/s at Q4 (TheProfessor)
155B model with only one GPU: 0.5 t/s at Q4
155B model with CPU only: avg of 0.3 t/s at Q4

Any other apps using shared memory get their performance tanked by a lot: games, exllama, GPTQ, Stable Diffusion. At least with those we can use NVIDIA's control panel option and set "prefer no system fallback".

I noticed that here (#4256) and here (#4742) it's intentionally being used.

@ggerganov @JohannesGaessler Please consider adding a flag to disable Shared GPU Memory usage.

Thanks :)

@JohannesGaessler
Collaborator

"Shared memory" in the context of the linked PRs refers to the fast on-chip memory that is shared between threads in a CUDA block. It has nothing to do with whatever Windows is doing that swaps VRAM to RAM.

@JohannesGaessler
Collaborator

The llama.cpp CUDA code does not manually move data between VRAM and RAM beyond the bare minimum. Also, there is nothing in the CUDA code that treats Windows differently from Linux, where there are not and never have been any issues with VRAM<->RAM swapping. This is 100% a driver issue, and I don't know what we as application developers are supposed to do in order to fix it. Supposedly the latest NVIDIA Windows drivers let you turn it off.

Honestly, my recommendation is that you just install Linux anyway, because even without this driver issue the performance on Windows is significantly worse.

@mindkrypted

@JohannesGaessler Thanks for taking the time to review my comment.

This is 100% a driver issue and I don't know what we as application developers are supposed to do in order to fix it.

According to what I've been able to find using Google, Microsoft's WDDM implementation doesn't allow the shared GPU memory functionality to be disabled.

Supposedly the latest NVIDIA Windows drivers let you turn it off.

"prefer no system fallback" is a workaround from Nvidia, which if used, makes the app crash instead of overflowing into the shared GPU memory.

To be honest, I have no idea how complex it is, or even whether it's possible. Still, I'd be curious to know whether you've had the opportunity, at some point, to try a solution that avoids shared memory and see how it works out.

  • (I'd be more than happy to contribute; unfortunately, that's not something I can do. My C++ skills and my knowledge of drivers and low-level CUDA are close to none.)

Honestly my recommendation is that you just install Linux anyways because even without this driver issue the performance on Windows is significantly worse.

(I'll soon add a drive to dual-boot and test Linux for comparison.)
Still, there are plenty of people who would be quite happy to run larger models on CPU+GPUs with usable t/s under Windows, as it's much more accessible.

Thanks,

Contributor

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024
Contributor

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 2, 2024