
Crippled Performance on Multi GPU Due to Loading onto RAM #4498

Closed
Noobville1345 opened this issue Dec 16, 2023 · 8 comments

@Noobville1345

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [Yes] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [Yes] I carefully followed the README.md.
  • [Yes] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [Yes] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

High prompt-processing and inference speed when using 2x RTX 3090s (with expected variability depending on the model).

Current Behavior

Extremely low t/s when using 2x RTX 3090s (7 t/s on TinyLlama Q2_K).

Environment and Context

Fresh installs of both llama.cpp (self-compiled with cuBLAS) and the most recent pull of oobabooga. No matter the setup, I end up with extremely low tokens per second whenever I split the model across two separate GPUs.

  • Physical (or virtual) hardware you are using:

2x RTX 3090 (both running at x8 PCIe lanes), 64 GB DDR4 RAM

Operating System:

Windows 10

Failure Information (for bugs)

Low tokens per second; the loader seems to be preferentially using RAM or even my SSD instead of my GPUs, even though the model is fully loaded onto them. A 400-megabyte model should not be triggering 2.8 GB of shared GPU memory.

Steps to Reproduce

Step 1. Load up Oobabooga or llamacpp.

Step 2. Use default settings, except set tensor split to 18,17 and context to 30k (to mimic Mixtral); see the sketch after these steps.

Step 3. Connect the model to the external API.

Step 4. Send the message "Hi" using the standard Alpaca template.
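
For reference, here is a minimal llama-cpp-python sketch approximating these settings (this is an assumption about the bindings oobabooga wraps, not the exact reproduction; the model path matches the one in the log below):

```python
from llama_cpp import Llama

# Illustrative sketch only: mirrors the repro settings above via llama-cpp-python.
llm = Llama(
    model_path="models/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",  # same file as in the log
    n_gpu_layers=-1,        # offload every layer to the GPUs
    tensor_split=[18, 17],  # proportional split across the two 3090s
    n_ctx=31232,            # ~30k context, matching n_ctx in the log
)

out = llm("Hi", max_tokens=64)
print(out["choices"][0]["text"])
```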

Failure Logs

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 20 key-value pairs and 201 tensors from models\tinyllama-1.1b-chat-v0.3.Q2_K.gguf (version GGUF V2)
llama_model_loader: - tensor 0: token_embd.weight q2_K [ 2048, 32003, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 7: blk.0.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 10: blk.1.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 11: blk.1.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 13: blk.1.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 14: blk.1.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 15: blk.1.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 16: blk.1.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 17: blk.1.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 18: blk.1.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 19: blk.2.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 20: blk.2.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 21: blk.2.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 22: blk.2.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 23: blk.2.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 24: blk.2.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 25: blk.2.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 26: blk.2.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 27: blk.2.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 28: blk.3.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 29: blk.3.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 30: blk.3.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 31: blk.3.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 32: blk.3.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 33: blk.3.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 34: blk.3.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 35: blk.3.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 36: blk.3.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 37: blk.4.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 38: blk.4.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 39: blk.4.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 40: blk.4.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 41: blk.4.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 42: blk.4.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 43: blk.4.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 44: blk.4.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 45: blk.4.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 46: blk.5.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 47: blk.5.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 48: blk.5.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 49: blk.5.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 50: blk.5.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 51: blk.5.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 52: blk.5.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 53: blk.5.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 54: blk.5.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 55: blk.6.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 56: blk.6.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 57: blk.6.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 58: blk.6.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 59: blk.6.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 60: blk.6.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 61: blk.6.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 62: blk.6.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 63: blk.6.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 64: blk.7.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 65: blk.7.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 66: blk.7.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 67: blk.7.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 68: blk.7.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 69: blk.7.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 70: blk.7.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 71: blk.7.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 72: blk.7.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 73: blk.8.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 74: blk.8.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 75: blk.8.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 76: blk.8.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 77: blk.8.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 78: blk.8.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 79: blk.8.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 80: blk.8.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 81: blk.8.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 82: blk.9.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 83: blk.9.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 84: blk.9.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 85: blk.9.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 86: blk.9.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 87: blk.9.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 88: blk.9.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 89: blk.9.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 90: blk.9.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 91: blk.10.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 92: blk.10.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 93: blk.10.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 94: blk.10.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 95: blk.10.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 96: blk.10.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 97: blk.10.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 98: blk.10.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 99: blk.10.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 100: blk.11.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 101: blk.11.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 102: blk.11.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 103: blk.11.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 104: blk.11.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 105: blk.11.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 106: blk.11.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 107: blk.11.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 108: blk.11.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 109: blk.12.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 110: blk.12.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 111: blk.12.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 112: blk.12.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 113: blk.12.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 114: blk.12.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 115: blk.12.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 116: blk.12.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 117: blk.12.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 118: blk.13.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 119: blk.13.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 120: blk.13.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 121: blk.13.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 122: blk.13.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 123: blk.13.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 124: blk.13.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 125: blk.13.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 126: blk.13.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 127: blk.14.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 128: blk.14.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 129: blk.14.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 130: blk.14.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 131: blk.14.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 132: blk.14.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 133: blk.14.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 134: blk.14.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 135: blk.14.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 136: blk.15.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 137: blk.15.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 138: blk.15.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 139: blk.15.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 140: blk.15.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 141: blk.15.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 142: blk.15.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 143: blk.15.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 144: blk.15.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 145: blk.16.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 146: blk.16.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 147: blk.16.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 148: blk.16.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 149: blk.16.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 150: blk.16.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 151: blk.16.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 152: blk.16.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 153: blk.16.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 154: blk.17.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 155: blk.17.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 156: blk.17.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 157: blk.17.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 158: blk.17.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 159: blk.17.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 160: blk.17.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 161: blk.17.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 162: blk.17.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 163: blk.18.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 164: blk.18.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 165: blk.18.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 166: blk.18.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 167: blk.18.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 168: blk.18.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 169: blk.18.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 170: blk.18.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 171: blk.18.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 172: blk.19.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 173: blk.19.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 174: blk.19.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 175: blk.19.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 176: blk.19.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 177: blk.19.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 178: blk.19.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 179: blk.19.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 180: blk.19.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 181: blk.20.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 182: blk.20.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 183: blk.20.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 184: blk.20.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 185: blk.20.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 186: blk.20.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 187: blk.20.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 188: blk.20.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 189: blk.20.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 190: blk.21.attn_q.weight q2_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 191: blk.21.attn_k.weight q2_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 192: blk.21.attn_v.weight q3_K [ 2048, 256, 1, 1 ]
llama_model_loader: - tensor 193: blk.21.attn_output.weight q3_K [ 2048, 2048, 1, 1 ]
llama_model_loader: - tensor 194: blk.21.ffn_gate.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 195: blk.21.ffn_up.weight q3_K [ 2048, 5632, 1, 1 ]
llama_model_loader: - tensor 196: blk.21.ffn_down.weight q3_K [ 5632, 2048, 1, 1 ]
llama_model_loader: - tensor 197: blk.21.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 198: blk.21.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 199: output_norm.weight f32 [ 2048, 1, 1, 1 ]
llama_model_loader: - tensor 200: output.weight q6_K [ 2048, 32003, 1, 1 ]
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = py007_tinyllama-1.1b-chat-v0.3
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.block_count u32 = 22
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 10
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32003] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32003] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32003] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type q2_K: 45 tensors
llama_model_loader: - type q3_K: 110 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 262/32003 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32003
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = mostly Q2_K
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 459.11 MiB (3.50 BPW)
llm_load_print_meta: general.name = py007_tinyllama-1.1b-chat-v0.3
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.08 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 20.59 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors: VRAM used: 438.60 MiB
......................................................................................
llama_new_context_with_model: n_ctx = 31232
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size = 671.00 MiB, K (f16): 335.50 MiB, V (f16): 335.50 MiB
llama_build_graph: non-view tensors processed: 466/466
llama_new_context_with_model: compute buffer total size = 2028.32 MiB
llama_new_context_with_model: VRAM scratch buffer: 2025.00 MiB
llama_new_context_with_model: total VRAM used: 2463.61 MiB (model: 438.60 MiB, context: 2025.00 MiB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
2023-12-16 07:36:01 INFO:LOADER: llama.cpp
2023-12-16 07:36:01 INFO:TRUNCATION LENGTH: 31232
2023-12-16 07:36:01 INFO:INSTRUCTION TEMPLATE: Alpaca
2023-12-16 07:36:01 INFO:Loaded the model in 1.46 seconds.
Output generated in 25.54 seconds (6.81 tokens/s, 174 tokens, context 1227, seed 1278067900)

Attached is a screenshot summary.


@Ph0rk0z

Ph0rk0z commented Dec 18, 2023

Your KV cache is on CPU :(
llama_new_context_with_model: KV self size = 671.00 MiB, K (f16): 335.50 MiB, V (f16): 335.50 MiB

Some of the API has changed.
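
For later readers: the API change referred to here is presumably the then-new `offload_kqv` option in `llama_context_params`, which asks llama.cpp to keep the KV cache in VRAM. A hedged sketch, assuming a llama-cpp-python build recent enough to expose it:

```python
from llama_cpp import Llama

# Assumption: offload_kqv exists in this llama-cpp-python build (it was added
# around this time) and maps to llama_context_params.offload_kqv in llama.cpp.
llm = Llama(
    model_path="models/tinyllama-1.1b-chat-v0.3.Q2_K.gguf",
    n_gpu_layers=-1,
    tensor_split=[18, 17],
    n_ctx=31232,
    offload_kqv=True,  # keep the KV cache on the GPUs rather than in system RAM
)
```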

@sumukshashidhar

I can confirm this: for some reason it's dipping into shared GPU memory, despite all layers being offloaded and leftover VRAM.

@mindkrypted

I experience the same extremely poor performance due to Windows' shared GPU memory being used.
Win10, 2x RTX 3090, AMD 5950X, 128 GB RAM

Same as the others reported: with small models where all layers fit in VRAM, it still uses shared memory.

When squeezing as many layers as possible onto the GPUs with llama.cpp or koboldcpp:
70B models: avg of 1.5 t/s at Q5 (Miqu)
120B models: avg of 0.7 t/s at Q4 (Miqu 2x70B, Goliath)
155B model: avg of 0.6 t/s at Q4 (TheProfessor)
155B model with only one GPU: 0.5 t/s at Q4
155B model with CPU only: avg of 0.3 t/s at Q4

Any other apps using shared memory get their performance tanked by a lot: games, exllama, GPTQ, Stable Diffusion. At least with those we can use NVIDIA's control panel option and set "prefer no system fallback".

I noticed that here (#4256) and here (#4742) it's intentionally being used.

@ggerganov @JohannesGaessler Please consider adding a flag to disable Shared GPU Memory usage.

Thanks :)

@JohannesGaessler
Collaborator

"Shared memory" in the context of the linked PRs refers to the fast on-chip memory that is shared between threads in a CUDA block. It has nothing to do with whatever Windows is doing that swaps VRAM to RAM.

@JohannesGaessler
Collaborator

The llama.cpp CUDA code does not manually move data between VRAM and RAM beyond the bare minimum. Also, there is nothing in the CUDA code that treats Windows differently from Linux, where there are not and never have been any issues with VRAM<->RAM swapping. This is 100% a driver issue, and I don't know what we as application developers are supposed to do in order to fix it. Supposedly the latest NVIDIA Windows drivers let you turn it off.

Honestly, my recommendation is that you just install Linux anyway, because even without this driver issue the performance on Windows is significantly worse.

@mindkrypted

@JohannesGaessler Thanks for taking the time to review my comment.

This is 100% a driver issue and I don't know what we as application developers are supposed to do in order to fix it.

According to what I've been able to find using Google, Microsoft's WDDM implementation doesn't allow the shared GPU memory functionality to be disabled.

Supposedly the latest NVIDIA Windows drivers let you turn it off.

"prefer no system fallback" is a workaround from Nvidia, which if used, makes the app crash instead of overflowing into the shared GPU memory.

To be honest, I have no idea how complex it is, or even whether it's possible. Still, I'd be curious to know whether you've had the opportunity, at some point, to try a solution that avoids shared memory and see how it works out.

  • (I'd be more than happy to contribute; unfortunately, that's not something I can do. My C++ skills and my knowledge of drivers and low-level CUDA are close to none.)

Honestly my recommendation is that you just install Linux anyways because even without this driver issue the performance on Windows is significantly worse.

(I'll soon add a drive to dual-boot and test Linux for comparison.)
Still, there are plenty of people who would be quite happy to run larger models on CPU+GPUs with usable t/s under Windows, as it's much more accessible.

Thanks,

Contributor

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024
Contributor

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 2, 2024